## Wednesday, July 20, 2011

### Winsteps - Score Distributions

30

Winsteps requires score distributions to be very similar to fit the Rasch model when equating. You can do two things to a distribution of scores. You can shift the location of the distribution by subtracting from or adding a constant to each measure, or you can stretch or shrink the distribution by multiplying or dividing with a constant. Winsteps uses one or both of these adjustments when equating.

It is impractical, if not impossible, to have one set of students mark answers to all of the questions needed for a test bank on one test. This problem is solved, in theory, by administering several tests. Each test contains a set of common items. In theory, these common items will be equally difficult on every test.

Score distributions have many statistics: mean, median, mode, skew, kurtosis and standard deviation (SD). Winsteps uses the SD as the most meaningful way to compare distributions. Combine two very similar distributions (the common items SD of test A/SD of test B is near 1) by shifting the mean of one distribution to match the other distribution. A constant is added to or subtracted from each measure to put test B into the frame of reference of test A.
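The shift described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not how Winsteps is actually driven; the function names and the sample common-item measures are made up for the example.

```python
from statistics import mean, stdev

def sd_ratio(common_a, common_b):
    """Ratio of common-item SDs (test A / test B); near 1 means 'very similar'."""
    return stdev(common_a) / stdev(common_b)

def shift_to_reference(measures_b, common_a, common_b):
    """Add one constant to every test B measure so the common-item means match."""
    shift = mean(common_a) - mean(common_b)
    return [m + shift for m in measures_b]

# Hypothetical common-item difficulty measures (logits) from each test.
common_a = [-1.0, 0.0, 1.0]
common_b = [-0.5, 0.5, 1.5]

ratio = sd_ratio(common_a, common_b)          # same spread, so no stretching needed
shifted = shift_to_reference([0.5, 2.0], common_a, common_b)
print(ratio, shifted)
```

Here the two sets of common items have the same spread, so only the location of test B needs to move.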

If the two distributions are not very similar, extreme items can be liberally discarded to obtain a better match for Winsteps in estimating Rasch model measures. This is not directly comparable to discarding values based on right mark counts using CTT. Counts and measures are not the same thing (see previous post).

Winsteps reports student raw scores in perfect alignment with student abilities, Table 17.1. But it reports item difficulties in a fuzzy array, Table 13.1. A range of item difficulty raw score counts can yield the same measure. Two difficult items can be worth the same as three easy items in measures.

If the two distributions are still not very similar, they can be combined by both shifting the mean, as above, and by stretching or shrinking. The ratio obtained by dividing the common item SD for one test by the other is the required constant. Measures in one of the distributions are multiplied or divided by this constant to put them into the frame of reference of the other distribution.
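A minimal sketch of the combined shift and stretch, assuming the usual linear form (center on the common-item mean of test B, multiply by the SD ratio, then move to the common-item mean of test A). The function name and data are hypothetical, and the choice of which test serves as the reference is left to the reader.

```python
from statistics import mean, stdev

def rescale_to_reference(measures_b, common_a, common_b):
    """Put test B measures into test A's frame of reference.

    scale is the ratio of common-item SDs (A / B); the shift comes from
    the common-item means. Both are assumptions about the linear form.
    """
    scale = stdev(common_a) / stdev(common_b)
    mean_a, mean_b = mean(common_a), mean(common_b)
    return [(m - mean_b) * scale + mean_a for m in measures_b]

# Hypothetical common-item measures: test A is twice as spread out as test B.
common_a = [-2.0, 0.0, 2.0]
common_b = [0.0, 1.0, 2.0]

# Rescaling test B's own common items should reproduce test A's values.
rescaled = rescale_to_reference(common_b, common_a, common_b)
print(rescaled)
```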

When to add or subtract, or to multiply or divide, is determined by what activity you are engaged in (item calibration, test banking, cut score, or application) as well as how the two test score distributions match. Psychometricians tend to think along the lines that they are sampling from one big population when calibrating items and when applying the standardized test. Many statistics are set up with the normal curve, the know-nothing curve (the curve obtained by marking the answer sheet without looking at the test), as the background reference. (This idea is mostly false in NCLB standardized testing where there is a strong demand for higher scores every year. The students of this year are, hopefully, better prepared than those of past years. They should not be members of the same population. If they are, there is no progress.)

If the higher scoring students on test B had been less able, they would have scored the same as those on test A. Also if the lower scoring test B students had been more able they would have scored the same as those on test A. So, in theory, adjust accordingly.

Several states have made the argument that they are making their tests more difficult. Therefore lower student scores should be increased to match those from earlier years (or the cut score should be lowered).

But application is more complicated than the above chart. There are more than just two outcomes. This is true using CTT or IRT methods. Because IRT measures are derived values (from both student right counts and item difficulty wrong counts) they do not maintain a direct relationship with counts (see item difficulty, Winsteps Table 13.1 above). The same student mark data can yield opposite effects using CTT or IRT methods. The following four outcomes must be considered fully within only one method at a time.

The two un-shaded outcomes result from the common items average scores not being in sync with the total test scores. This can be avoided by discarding data that leads to results that do not match the expectations of the Rasch model.

The two shaded outcomes make sense when calibrating and test banking from a common population. These two outcomes are open to question during application.

If there is reason to believe the benchmark test A and the application test B are really sampling the same population, then the given adjustment stands when test B yields both total and common item average scores higher than test A. If not, and the application test B has a significantly higher average test score than the benchmark test A, then lowering the test B scores or raising the cut score seems incorrect. We have two different populations. The current one has performed better than the previous one.

The same reasoning applies when test B yields both total and common item average scores that are lower than test A. The more difficult test results require increasing the student scores or lowering the cut score. But this makes little sense. Common items do not change in difficulty. Students change in ability. We are not sampling the same population. The current one is not performing as well as the previous one. If this trend were followed to the extreme, student scores would be adjusted higher or cut scores lower until randomly created results (mark the answer sheet without looking at the test) would pass most students. This is the end game for psychometricians, politicians, administrators and teachers when functioning at the lowest levels of thinking and there is little meaningful relationship between the test and the domain it is reported to be assessing.

Winsteps does an adequate job of item calibration, test banking, and equating (it has a zillion refinements that I do not know enough about to appreciate). How these are used is a matter of judgment on the part of those who control the assessment process. These people must be held to high professional standards by appropriate audits and transparency. A distinction needs to be kept in mind between the requirements of research, application, and natural experiments. NCLB assessments now span all of these.

A strong relationship needs to be made between the test and what it is assessing. A current example (developed to fill the entrepreneurial vacuum created by high school diplomas of questionable value) is the ACT WorkKeys test. What skills have students learned in high school that prepare them to do specific, well defined, tasks commonly needed in the workplace? The questions are presented as a sampling of select domains at all levels of thinking in Applied Mathematics, Reading for Information, and Locating Information. Doing well on the test is a prediction of success in the selected domains at all levels of thinking. Knowledge and Judgment Scoring (KJS) has similar properties: students, teachers and employers can know what can be trusted as the basis for further learning and instruction at all levels of thinking.

I have learned, in the last 12 months, that there is a difference between counting things and measuring them. Counting is measuring only if all items being counted have the exact same properties. This brings my audit of the Rasch model to a close. In the process, Winsteps has become a friend that adds finer detail to Power Up Plus (PUP) when using the Partial Credit Rasch Model.

## Wednesday, July 13, 2011

### Winsteps - Basic Relationships

29

Before proceeding with equating, it is important to have in mind just what is being equated, by Winsteps using item response theory (IRT), or by traditional, classical test theory (CTT). Psychometricians, politicians, administrators, teachers, and students look at test data in different ways. Psychometricians are concerned with how well the data matches some ideal concept, the Rasch model for Winsteps, or a normal distribution. Administrators and politicians are concerned with average test scores.

Good teachers see how well individual students respond to instruction when students are free to report what they know, and trust, and what they have yet to learn. Students have a wide range of interests from the inattentive passive pupil to the self-correcting scholar. Multiple-choice test scores do a very poor job of reflecting the variation in student performance when only the right marks are counted. The scores produce a ranking that is still commonly accepted without question.

The measure ogive for a complete test represents a powerful relationship between student ability and item difficulty. A student with an ability equal to an item's difficulty has a 50:50 chance of marking a right answer, and this holds all along the line. Easy questions require little ability. Difficult questions require high ability. With CTT, this relationship only occurs for an item with a difficulty equal to the average test score when marked by students with an ability also equal to the average test score. With Winsteps, this unique point is the zero point on the student ability and item difficulty measures scale. It transforms into an expected student score of 50%.
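The 50:50 relationship follows directly from the dichotomous Rasch model, where the chance of a right mark depends only on the difference between ability and difficulty (both in logits). A small sketch, with a function name of my own choosing:

```python
import math

def rasch_probability(ability, difficulty):
    """Chance of a right mark under the dichotomous Rasch model (logit units)."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

# Ability equal to difficulty gives 50:50 anywhere on the scale.
p_zero = rasch_probability(0.0, 0.0)
p_high = rasch_probability(2.5, 2.5)

# An able student on an easy item, and a weak student on a hard item.
p_easy = rasch_probability(1.0, -1.0)
p_hard = rasch_probability(-1.0, 1.0)
print(p_zero, p_high, p_easy, p_hard)
```

The two mismatched cases mirror each other: the probabilities for the easy and hard pairings add up to 1.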

Winsteps sets the item difficulty measures for the three charted tests at zero measures. That means student ability measures are lower than item difficulty measures on an impossible test. Student ability measures are higher than item difficulty measures on a traditional classroom test.

Student ability measures are based on the relative difficulty of the items marked correctly. Item difficulty measures are based on the relative student ability to mark correctly (hence the cyclic math used to estimate measures). Each student receives an individualized ability estimated measure based on the interrelated average student and item performances.
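The cyclic math can be sketched as alternating updates: adjust each ability from the current difficulties, then each difficulty from the current abilities, and repeat. The code below is a bare-bones joint estimation loop written for illustration only. It is not the Winsteps algorithm in detail; the step cap, the cycle count, and the tiny data set are my own assumptions, and it skips the handling of perfect and zero scores.

```python
import math

def prob(ability, difficulty):
    return 1.0 / (1.0 + math.exp(difficulty - ability))

def clamp(step, limit=1.0):
    """Cap each change so early cycles cannot overshoot (an assumed safeguard)."""
    return max(-limit, min(limit, step))

def estimate_measures(responses, cycles=50):
    """responses: rows of 0/1 marks per student; no perfect or zero scores."""
    n_students, n_items = len(responses), len(responses[0])
    ability = [0.0] * n_students
    difficulty = [0.0] * n_items
    for _ in range(cycles):
        # Student abilities from the current item difficulties.
        for s in range(n_students):
            p = [prob(ability[s], d) for d in difficulty]
            info = sum(q * (1 - q) for q in p)
            ability[s] += clamp((sum(responses[s]) - sum(p)) / info)
        # Item difficulties from the current student abilities.
        for i in range(n_items):
            p = [prob(a, difficulty[i]) for a in ability]
            info = sum(q * (1 - q) for q in p)
            observed = sum(row[i] for row in responses)
            difficulty[i] -= clamp((observed - sum(p)) / info)
        # Keep the average item difficulty at zero measures.
        center = sum(difficulty) / n_items
        difficulty = [d - center for d in difficulty]
    return ability, difficulty

# Made-up marks: 5 students by 4 items.
marks = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 1],
]
ability, difficulty = estimate_measures(marks)
print([round(a, 2) for a in ability])
print([round(d, 2) for d in difficulty])
```

The item marked right by only one student ends up with the highest difficulty measure, and the item difficulties sum to zero, matching the zero point used for the charted tests.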

A measure is not the sum or average of counts. Counts make a variable look uniform when it is not: one point for each right answer to questions of variable difficulty (CTT). Measures assess the value of the variable being counted (IRT). You can buy melons of variable sizes at \$2 each, by count, or you can buy melons at 10 cents a pound, by weight measure.

Students who cannot read the test or understand the questions are placed in the impossible position of gambling for a passing score. On a four-option test, they receive a handicap of 25%, on average. A normal distribution around this point rarely produces a passing score. Even though the same relationship between student ability and item difficulty holds the full length of the ogive, that does not mean very low scores represent any meaningful measurement of student performance. The results of a randomly generated test at the 25% performance level are utter nonsense. Equating these scores to some higher level of performance does not make them any more meaningful.
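How rarely pure guessing produces a passing score can be checked with the binomial distribution. The test length (40 items) and the pass mark (24 right, or 60%) below are assumptions picked for illustration; they do not come from any particular test.

```python
from math import comb

def chance_of_passing(n_items, n_options, pass_mark):
    """Probability of at least pass_mark right marks by blind guessing."""
    p = 1 / n_options
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(pass_mark, n_items + 1)
    )

n_items, n_options, pass_mark = 40, 4, 24   # hypothetical test
expected_right = n_items / n_options        # the 25% handicap, on average
chance = chance_of_passing(n_items, n_options, pass_mark)
print(expected_right, chance)
```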

Computerized adaptive testing (CAT) functions at the 50% performance level. It administers questions that closely match the ability of each student. This is very efficient. It takes less time and fewer items than when using a paper test. High ability students are not bothered with easy questions. Low ability students are not forced to come up with the “best answer” on items they have no idea of how to answer. Each student receives an individualized test based on the average performance of other students.
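One simple way to picture the matching is to pick, at each step, the unused item whose difficulty measure is closest to the current ability estimate. Real CAT engines add many refinements; the item bank, names, and ability values here are invented for the sketch.

```python
def next_item(ability, item_bank, used):
    """Return the name of the unused item whose difficulty is nearest to ability."""
    unused = [name for name in item_bank if name not in used]
    return min(unused, key=lambda name: abs(item_bank[name] - ability))

# Hypothetical item bank: item name -> difficulty measure in logits.
bank = {"easy": -2.0, "medium": 0.0, "hard": 2.0}

first = next_item(0.3, bank, used=set())          # estimate near the middle
second = next_item(1.4, bank, used={"medium"})    # estimate rose after a right mark
print(first, second)
```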

Knowledge and Judgment Scoring (KJS) also starts at the 50% performance level when knowledge and judgment are given equal weight. High ability students can zip through easy items and low ability students can skip impossible items. Both can yield high quality results: few if any wrong marks. With KJS each student individualizes the test to report what is known and what has yet to be learned (quantity and quality are combined into a test score). Each student receives an individualized test based on each student’s own judgment.
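A rough sketch of why KJS starts at 50%, assuming equal weighting means 2 points for a right mark, 1 point for an omit, and 0 points for a wrong mark. These point values are my reading of "equal weight" and are not spelled out in this post; the function names are made up.

```python
def kjs_score(right, wrong, omitted):
    """Percent score with knowledge and judgment weighted equally (assumed 2/1/0 points)."""
    n = right + wrong + omitted
    points = 2 * right + 1 * omitted + 0 * wrong
    return 100.0 * points / (2 * n)

def rms_score(right, wrong, omitted):
    """Traditional right mark scoring: only right marks count."""
    return 100.0 * right / (right + wrong + omitted)

blank_sheet = kjs_score(right=0, wrong=0, omitted=20)     # nothing marked
careful = kjs_score(right=10, wrong=0, omitted=10)        # reports only what is trusted
guesser = kjs_score(right=10, wrong=10, omitted=0)        # same right count, no judgment
print(blank_sheet, careful, guesser, rms_score(10, 10, 0))
```

Under these assumed values, two students with the same right count receive different KJS scores depending on how many wrong marks they made, while RMS cannot tell them apart.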

Both KJS and traditional right mark scoring (RMS) produce average classroom scores of 75%. The difference is that with KJS both the student and the test maker know what the student knows, and how well the student knows it, at the time of the test and long afterwards. With RMS we only get a ranking increasingly contaminated with chance at lower scores. The same set of questions can be used with both KJS and RMS to get a desired average test score. With the exception of KJS, the lower the quantitative test scores, the lower the quality of the results.

Score distributions subject to equating can carry far more information than the simple normal bell-curve. An hour test in a remedial general studies biology course often yielded four modes: a large mode centered at 60% from students enrolled pass/fail; a smaller mode at 75% from attentive passive pupils needing a C out of the course; a smaller mode at 90% from self-correcting scholars; and a very small mode at 100% from students who, in later years, were tested out with credit (if an A or B) the first week of the course. Classroom and standardized tests have different characteristics because they are designed to assess for different reasons. The classroom test monitors learning. The standardized test, when limited to right count scoring, is a ranking device that obtains only a small portion of the information available when using KJS, or as Bond and Fox (2007), Chapter 7, call it, the Partial Credit Rasch Model.