When this audit started over two years ago, I never planned to travel inside the Rasch model algorithms. Surely, by taking a close look from the outside (the black box audit tool), one could be satisfied that no pixie dust was needed to accomplish the proclaimed feats. The fact that several claims have been discounted in the past ten years, after being applied to NCLB standardized paper tests, makes me question the basic process of estimating item difficulty and student ability measures. (This doubting has nothing to do with the successful application of the Rasch model in many other areas.)
A number of things have raised my doubts. Normal test results cannot be given special value simply by converting them from normal to logit values. The perfect Rasch model is a curve, not a straight line. Laddering across several grades is now in question (in part, IMHO, because the tests are still scored at the lowest levels of thinking, a guessing exercise, rather than giving students the option to report what they actually trust: Knowledge and Judgment Scoring). Even some psychometricians still doubt the claims made for the Rasch model.
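To make the first point concrete, here is a minimal sketch of the normal-to-logit conversion and its inverse (my own illustration in Python, not part of the audit tool). The conversion is a one-to-one relabeling of percent-right scores; it adds no information, and its inverse is the S-shaped curve rather than a straight line:

```python
import math

def normal_to_logit(percent_right):
    """Convert a normal (percent-right) score to a logit value."""
    p = percent_right / 100.0
    return math.log(p / (1.0 - p))

def logit_to_normal(logit):
    """Convert a logit value back to a normal (percent-right) score."""
    return 100.0 / (1.0 + math.exp(-logit))

# A round trip changes nothing; the conversion adds no special value.
for score in (40, 50, 60, 75):
    logit = normal_to_logit(score)
    print(f"{score}% right = {logit:+.2f} logits = {logit_to_normal(logit):.0f}% right")
```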
The black box audit tool that I have been using relates what goes on inside the full Rasch model to the normal world. Conversions from normal to logit and from logit to normal add nothing to the audit tool, and nothing has been found amiss. A better audit tool is needed now that we are inside the full Rasch model.
The various ways of estimating student ability and item difficulty measures make use of the original student marks on the test, the student score, the item difficulty, and their distributions (mean and standard deviation). One end result from the transposed Rasch partial credit model is that the estimated individual latent student ability expresses, in one term, what Knowledge and Judgment Scoring presents in two terms: quantity and quality. Quantity and quality are directly related to student marks. Estimated latent student ability, by contrast, depends upon an estimate of how student marks are related to item difficulty: convergence.
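To make "convergence" concrete, here is a bare-bones sketch of joint estimation for the simple dichotomous Rasch model. It is my own Python illustration, not the partial credit model or the actual software used in this audit, and it assumes extreme all-right and all-wrong records have already been removed. Student abilities and item difficulties are adjusted back and forth until the expected marks match the observed marks:

```python
import math

def p_right(ability, difficulty):
    """Rasch probability of a right mark."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def rasch_jmle(marks, max_iter=100, tol=0.001):
    """Joint estimation sketch: marks[n][i] is 1 for a right mark, 0 for a wrong one.
    Returns (student abilities, item difficulties) in logits."""
    n_students, n_items = len(marks), len(marks[0])
    abilities = [0.0] * n_students
    difficulties = [0.0] * n_items

    for _ in range(max_iter):
        max_change = 0.0
        # Adjust each student ability toward the observed student score.
        for n in range(n_students):
            expected = [p_right(abilities[n], d) for d in difficulties]
            residual = sum(marks[n]) - sum(expected)
            info = sum(e * (1.0 - e) for e in expected)
            change = residual / info
            abilities[n] += change
            max_change = max(max_change, abs(change))
        # Adjust each item difficulty toward the observed item score.
        for i in range(n_items):
            expected = [p_right(b, difficulties[i]) for b in abilities]
            residual = sum(row[i] for row in marks) - sum(expected)
            info = sum(e * (1.0 - e) for e in expected)
            change = -residual / info
            difficulties[i] += change
            max_change = max(max_change, abs(change))
        # Anchor the item difficulty mean at zero, the usual Rasch convention.
        mean_d = sum(difficulties) / n_items
        difficulties = [d - mean_d for d in difficulties]
        if max_change < tol:
            break  # converged: expected and observed scores now agree
    return abilities, difficulties
```

The point of the sketch is only that the estimated latent ability is not read directly from the marks; it emerges from this back-and-forth adjustment against the item difficulties.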
This new black box audit tool relates just these internal values. The two EDS distributions for student ability and item difficulty, from the previous post, radiate from different starting points and expand at different rates. The span from -2 to +2 logits covers the expected score range from 12% to 88%.
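The 12% to 88% figure is easy to check with the same logistic conversion (my own arithmetic, not output from the audit tool):

```python
import math

def expected_score(logit):
    """Rasch expected score (percent right) at a given ability-minus-difficulty distance."""
    return 100.0 / (1.0 + math.exp(-logit))

print(f"{expected_score(-2):.0f}%")  # about 12%
print(f"{expected_score(+2):.0f}%")  # about 88%
```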
The relative locations of student abilities and item difficulties change increasingly as one moves from the positive to the negative region of the logit scale. This may be intended, or it may just be an acceptable artifact to psychometricians. It presents a problem for state departments of education that claim their tests are so difficult that passing can be set around 40% on NCLB standardized tests. (I would think this region would be of very little interest in the classroom, where the customary passing point is 60% and the average test score is 75%. It is of no interest at all when state departments of education hide it by reporting only passing rates without reporting test scores.)
I read the plots of individual student abilities and item difficulties, from the internal black box audit tool, to mean that item difficulty is reported increasingly higher than the comparable student ability as item difficulty decreases. This would make a test built from these items easier than expected in this region, the region of super low cut scores.
Distribution statistics do not reveal individual student and item performances. Instead they show that, as the item difficulty mean of -0.44 was relocated very near to zero, the student ability mean drifted from 0.39 to 0.71 logits. The rate of expansion for student ability was larger than that for item difficulty. This did little to match the two distributions. The expansion of the two logit distributions was very linear, yet the end result of the analysis, in normal values, was S-shaped curves.
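That last contrast is what I would expect from the logistic conversion alone. A small sketch (again my own illustration, not the analysis itself) shows how equally spaced logit values, a straight line, come out as an S-shaped run of normal values, with the steps compressed at both ends:

```python
import math

def logit_to_normal(logit):
    """Convert a logit value to a normal (percent-right) value."""
    return 100.0 / (1.0 + math.exp(-logit))

# Equally spaced logits (linear) ...
logits = [-4.0 + 0.5 * k for k in range(17)]
# ... become unequally spaced percents (S-shaped): big steps near the
# middle of the scale, small steps toward either end.
for lo, hi in zip(logits, logits[1:]):
    step = logit_to_normal(hi) - logit_to_normal(lo)
    print(f"{lo:+.1f} to {hi:+.1f} logits: step of {step:.1f} percent")
```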
The best defense students have is to actually know their subject, or master the skill, to the point that passing is assured. Trying to pass with one point over the line is as flawed a method for a student as it is for a state department of education to set the cut score before the test is scored to see what actually happened. Application has more serious individual requirements than research.
One auditing method that I have yet to see used is for teachers to include a ranking of their students with the answer sheets. An even better method, IMHO, is for students to rank themselves by electing Knowledge and Judgment Scoring and reporting what they trust (using all levels of thinking) rather than taking the traditional forced-choice test based on the lowest levels of thinking (meaningless guessing in the cut score region). [Meaningless to everyone except psychometricians interested only in ranking performance.]