Wednesday, July 20, 2011

Winsteps - Score Distributions


Winsteps requires score distributions to be very similar to fit the Rasch model when equating. You can do two things to a distribution of scores. You can shift the location of the distribution by subtracting from or adding a constant to each measure, or you can stretch or shrink the distribution by multiplying or dividing with a constant. Winsteps uses one or both of these adjustments when equating.

It is impractical, if not impossible, to have one set of students mark answers to all of the questions needed for a test bank on one test. This problem is solved, in theory, by administering several tests. Each test contains a set of common items. In theory, these common items will be equally difficult on every test.

Score distributions have many statistics: mean, median, mode, skew, kurtosis and standard deviation (SD). Winsteps uses the SD as the most meaningful way to compare distributions. Combine two very similar distributions (the common items SD of test A/SD of test B is near 1) by shifting the mean of one distribution to match the other distribution. A constant is added to or subtracted from each measure to put test B into the frame of reference of test A.
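The shift described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not what Winsteps runs internally: the function names and the common-item measures are made up for the example, and the shift constant is assumed to be the difference between the common-item means on the two tests.

```python
from statistics import mean, stdev


def common_item_sd_ratio(common_a, common_b):
    """SD of the common items on test A divided by their SD on test B."""
    return stdev(common_a) / stdev(common_b)


def shift_into_frame_a(measures_b, common_a, common_b):
    """Add one constant to every test B measure so that the common-item
    means agree (assumed here to be the shift constant)."""
    shift = mean(common_a) - mean(common_b)
    return [m + shift for m in measures_b]


# Hypothetical common-item measures (logits) from two administrations.
common_a = [-1.0, 0.0, 1.0]
common_b = [-0.5, 0.5, 1.5]

ratio = common_item_sd_ratio(common_a, common_b)
shifted_b = shift_into_frame_a(common_b, common_a, common_b)

print("SD ratio:", ratio)
print("Test B common items in the test A frame:", shifted_b)
```

With these made-up values the SD ratio is 1, so a shift alone is enough to line up the two sets of common items.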

If the two distributions are not very similar, extreme items can be liberally discarded to obtain a better match for Winsteps in estimating Rasch model measures. This is not directly comparable to discarding values based on right mark counts using CTT. Counts and measures are not the same thing (see previous post).

Winsteps reports student raw scores in perfect alignment with student abilities (Table 17.1). But it reports item difficulties in a fuzzy array (Table 13.1). A range of item difficulty raw score counts can yield the same measure. Two difficult items can be worth the same as three easy items in measures.

If the two distributions are still not very similar, they can be combined by both shifting the mean, as above, and by stretching or shrinking. The ratio obtained by dividing the common item SD for one test by the other is the required constant. Measures in one of the distributions are multiplied or divided by this constant to put them into the frame of reference of the other distribution.
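A minimal sketch of the combined shift and stretch follows. It assumes the common linear form in which the SD ratio is the multiplier and the shift is chosen so that the common-item means agree afterward. The function names and the measures are hypothetical, and Winsteps may order or apply the constants differently.

```python
from statistics import mean, stdev


def linking_constants(common_a, common_b):
    """Return (slope, intercept) that carry test B measures into the
    test A frame of reference, using only the common items."""
    slope = stdev(common_a) / stdev(common_b)           # stretch or shrink
    intercept = mean(common_a) - slope * mean(common_b)  # shift
    return slope, intercept


def rescale_into_frame_a(measures_b, slope, intercept):
    """Apply the same stretch and shift to every test B measure."""
    return [slope * m + intercept for m in measures_b]


# Hypothetical common-item measures (logits): test A is twice as spread out.
common_a = [-2.0, 0.0, 2.0]
common_b = [-0.5, 0.5, 1.5]

slope, intercept = linking_constants(common_a, common_b)
rescaled_b = rescale_into_frame_a(common_b, slope, intercept)

print("slope:", slope, "intercept:", intercept)
print("Test B common items in the test A frame:", rescaled_b)
```

After the transformation the common items have the same mean and SD on both tests, which is all this adjustment can promise. Whether the adjustment is appropriate is the judgment call discussed in the rest of this post.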

When to add or subtract, or to multiply or divide, is determined by what activity you are engaged in (item calibration, test banking, cut score, or application) as well as how the two test score distributions match. Psychometricians tend to think along the lines that they are sampling from one big population when calibrating items and when applying the standardized test. Many statistics are set up with the normal curve, the know-nothing curve (the curve obtained by marking the answer sheet without looking at the test), as the background reference. (This idea is mostly false in NCLB standardized testing where there is a strong demand for higher scores every year. The students of this year are, hopefully, better prepared than those of past years. They should not be members of the same population. If they are, there is no progress.)

If the higher scoring students on test B had been less able, they would have scored the same as those on test A. Also, if the lower scoring test B students had been more able, they would have scored the same as those on test A. So, in theory, adjust accordingly.

Several states have made the argument that they are making their tests more difficult. Therefore lower student scores should be increased to match those from earlier years (or the cut score should be lowered).

But application is more complicated than the above chart. There are more than just two outcomes. This is true using CTT or IRT methods. Because IRT measures are derived values (from both student right counts and item difficulty wrong counts) they do not maintain a direct relationship with counts (see item difficulty, Winsteps Table 13.1 above). The same student mark data can yield opposite effects using CTT or IRT methods. The following four outcomes must be considered fully within only one method at a time.

The two un-shaded outcomes result from the common items average scores not being in sync with the total test scores. This can be avoided by discarding data that leads to results that do not match the expectations of the Rasch model.

The two shaded outcomes make sense when calibrating and test banking from a common population. These two outcomes are open to question during application.

If there is reason to believe the benchmark test A and the application test B are really sampling the same population, then the given adjustment stands when test B yields both total and common item average scores higher than test A. If not, and the application test B has a significantly higher average test score than the benchmark test A, then lowering the test B scores or raising the cut score seems incorrect. We have two different populations. The current one has performed better than the previous one.

The same reasoning applies when test B yields both total and common item average scores that are lower than test A. The more difficult test results require increasing the student scores or lowering the cut score. But this makes little sense. Common items do not change in difficulty. Students change in ability. We are not sampling the same population. The current one is not performing as well as the previous one. If this trend were followed to the extreme, student scores would become adjusted higher or cut scores lower until randomly created results (mark the answer sheet without looking at the test) would pass most students. This is the end game for psychometricians, politicians, administrators and teachers when functioning at the lowest levels of thinking and there is little meaningful relationship between the test and the domain it is reported to be assessing.

Winsteps does an adequate job of item calibration, test banking, and equating (it has a zillion refinements that I do not know enough about to appreciate). How these are used is a matter of judgment on the part of those who control the assessment process. These people must be held to high professional standards by appropriate audits and transparency. A distinction needs to be kept in mind between the requirements of research, application, and natural experiments. NCLB assessments now span all of these.

A strong relationship needs to be made between the test and what it is assessing. A current example (developed to fill the entrepreneurial vacuum created by high school diplomas of questionable value) is the ACT WorkKeys test. What skills have students learned in high school that prepare them to do specific, well defined, tasks commonly needed in the workplace? The questions are presented as a sampling of select domains at all levels of thinking in Applied Mathematics, Reading for Information, and Locating Information. Doing well on the test is a prediction of success in the selected domains at all levels of thinking. Knowledge and Judgment Scoring (KJS) has similar properties: students, teachers and employers can know what can be trusted as the basis for further learning and instruction at all levels of thinking.

I have learned, in the last 12 months, that there is a difference between counting things and measuring them. Counting is measuring only if all items being counted have the exact same properties. This brings my audit of the Rasch model to a close. In the process, Winsteps has become a friend that adds finer detail to Power Up Plus (PUP) when using the Partial Credit Rasch Model.
