Wednesday, August 15, 2012

Rasch Model Convergence Black Box

When this audit started over two years ago, I never planned to travel inside the Rasch model algorithms. Surely by taking a close look from outside (the black box audit tool) one could be satisfied that no pixie dust was needed to accomplish the proclaimed feats. The fact that several claims have been discounted in the past ten years, after being applied to NCLB standardized paper tests, makes me question the basic process of estimating item difficulty and student ability measures. (This doubting has nothing to do with the successful application of the Rasch model in many other areas.)

A number of things have raised my doubts. Normal test results cannot be given special value by converting from normal to logit values. The perfect Rasch model is a curve, not a straight line. Laddering across several grades is now in question (in part IMHO because the tests are still scored at the lowest levels of thinking, a guessing exercise, rather than giving students the option to report what they actually trust: Knowledge and Judgment Scoring). Even some psychometricians still doubt the claims made for the Rasch model.

The black box audit tool that I have been using relates what goes on inside the full Rasch model to the normal world. Conversions of normal to logit and logit to normal add nothing to the audit tool. And nothing has been found amiss. A better audit tool is needed now that we are inside the full Rasch model.

The various ways of estimating student ability and item difficulty measures make use of the original student marks on the test, the student score, the item difficulty, and their distributions (mean and standard deviation). One end result from the transposed Rasch partial credit model is that the estimated individual latent student ability expresses, in one term, what is presented by Knowledge and Judgment Scoring in two terms: quantity and quality. Quantity and quality are directly related to student marks. Estimated latent student ability is dependent upon an estimate of how student marks are related to item difficulty: convergence.
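To make "convergence" concrete, here is a minimal sketch of one common family of estimation methods: alternating Newton-Raphson updates of student ability and item difficulty for the dichotomous Rasch model, with item difficulties re-centered on zero after each pass. It is not the audit tool used in this post, nor the exact routine of any particular Rasch program. The function names, the step limit, the tolerance, and the tiny marks table are all made up for illustration.

```python
import math

def rasch_p(ability, difficulty):
    """Probability of a right mark under the dichotomous Rasch model."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def clamp(step, limit):
    """Limit the size of one update so early passes cannot overshoot badly."""
    return max(-limit, min(limit, step))

def estimate_rasch(marks, max_iter=500, tol=1e-6, max_step=1.0):
    """Alternate ability and difficulty updates until the changes are tiny.

    marks: list of rows (students), each a list of 0/1 item marks.
    Returns (abilities, difficulties, passes_used), all measures in logits.
    """
    n_students, n_items = len(marks), len(marks[0])
    student_scores = [sum(row) for row in marks]
    item_scores = [sum(row[i] for row in marks) for i in range(n_items)]

    # Zero and perfect scores have no finite logit estimate.
    if any(r in (0, n_items) for r in student_scores) or \
       any(s in (0, n_students) for s in item_scores):
        raise ValueError("remove zero and perfect scores before estimating")

    # Starting values: the log-odds of the raw scores.
    abilities = [math.log(r / (n_items - r)) for r in student_scores]
    difficulties = [math.log((n_students - s) / s) for s in item_scores]

    passes_used = 0
    for passes_used in range(1, max_iter + 1):
        largest_change = 0.0

        # Move each ability toward the value whose expected score
        # equals the student's observed score.
        for n in range(n_students):
            ps = [rasch_p(abilities[n], b) for b in difficulties]
            info = sum(p * (1.0 - p) for p in ps)
            step = clamp((student_scores[n] - sum(ps)) / info, max_step)
            abilities[n] += step
            largest_change = max(largest_change, abs(step))

        # Move each difficulty the same way: if more right marks are
        # expected than were observed, the item is harder than estimated.
        for i in range(n_items):
            ps = [rasch_p(a, difficulties[i]) for a in abilities]
            info = sum(p * (1.0 - p) for p in ps)
            step = clamp((sum(ps) - item_scores[i]) / info, max_step)
            difficulties[i] += step
            largest_change = max(largest_change, abs(step))

        # Relocate the item difficulty mean to zero.
        mean_b = sum(difficulties) / n_items
        difficulties = [b - mean_b for b in difficulties]

        if largest_change < tol:
            break

    return abilities, difficulties, passes_used

# Made-up marks for five students on four items (illustration only).
MARKS = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 1],
]

abilities, difficulties, passes_used = estimate_rasch(MARKS)
print("passes:", passes_used)
print("student abilities (logits):", [round(a, 2) for a in abilities])
print("item difficulties (logits):", [round(b, 2) for b in difficulties])
```

Only the item difficulties are anchored to a mean of zero in this sketch, so the student ability mean is free to move from pass to pass while the two sets of estimates settle against each other.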

This new black box audit tool relates just these internal values. The two EDS distributions, from the previous post, for student ability and item difficulty, radiate from different starting points and expand at different rates. The span from -2 to +2 logits covers the expected score range from 12% to 88%.
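The 12% to 88% range follows directly from the standard logistic conversion between logits and expected scores. A short sketch, with function names of my own choosing, shows the arithmetic:

```python
import math

def logit_to_expected_score(logit):
    """Convert a logit measure to an expected score (proportion right)."""
    return 1.0 / (1.0 + math.exp(-logit))

def score_to_logit(score):
    """Convert a proportion right (strictly between 0 and 1) to logits."""
    return math.log(score / (1.0 - score))

# Equal one-logit steps across the span discussed above.
for logit in (-2, -1, 0, 1, 2):
    print(f"{logit:+d} logits -> {logit_to_expected_score(logit):.0%}")
```

Equal steps on the logit scale give unequal steps on the normal score scale: the steps are largest near 50% and shrink toward the ends, which is the S-shaped curve.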

The relative locations for student abilities and item difficulties increasingly change from the positive to the negative regions of the logit scale. This may be intended, or it may just be an artifact that psychometricians find acceptable. It presents a problem for state departments of education that claim their tests are so difficult that passing can be set around 40% on NCLB standardized tests. (I would think this region would be of very little interest in the classroom, where the customary passing point is 60% and the average test score is 75%. It is of no interest at all when state departments of education hide it by reporting only passing rates and not test scores.)

I read the individual student ability and item difficulty plots, from the internal audit black box tool, to mean that the lower the item difficulty, the further it is reported above the comparable student ability. This would make a test using these items easier than expected in this region, the region of super low cut scores.

Distribution statistics do not reveal individual student and item performances. Instead they show that, as the item difficulty mean of -0.44 was relocated very near to zero, the student ability mean drifted from 0.39 to 0.71 logits.

The rate of expansion for student ability was larger than for item difficulty. This did little to match the two distributions. The expansion of the two logit distributions was very linear. Yet the end result of the analysis, in normal values, was S-shaped curves.

If the distribution statistics are acceptable, does it matter what happens to individual students? I think so. This is an example of why research techniques do not always work in the application environment.

The best defense students have is to actually know their subject or master the skill to the point that passing is assured. Trying to pass with one point over the line is as flawed a student method as is the state department of education method of setting the cut score before the test is scored to see what actually happened. Application has more serious individual requirements than research.

One auditing method that I have yet to see used is for teachers to include a ranking of their students with the answer sheets. An even better method IMHO is for students to rank themselves by electing Knowledge and Judgment Scoring and reporting what they trust (using all levels of thinking) rather than taking the traditional forced-choice test based on the lowest levels of thinking (meaningless guessing in the cut score region). [Meaningless to everyone except psychometricians only interested in ranking performance.]
