## Wednesday, July 13, 2011

### Winsteps - Basic Relationships

29

Before proceeding with equating, it is important to keep in mind just what is being equated, whether by Winsteps using item response theory (IRT) or by traditional classical test theory (CTT). Psychometricians, politicians, administrators, teachers, and students look at test data in different ways. Psychometricians are concerned with how well the data match some ideal concept: the Rasch model for Winsteps, or a normal distribution. Administrators and politicians are concerned with average test scores.

Good teachers see how well individual students respond to instruction when students are free to report what they know, and trust, and what they have yet to learn. Students have a wide range of interests from the inattentive passive pupil to the self-correcting scholar. Multiple-choice test scores do a very poor job of reflecting the variation in student performance when only the right marks are counted. The scores produce a ranking that is still commonly accepted without question.

The measure ogive for a complete test represents a powerful relationship between student ability and item difficulty. All along this line, a student whose ability equals an item’s difficulty has a 50:50 chance of marking a right answer. Easy questions require little ability. Difficult questions require high ability. With CTT, this relationship only occurs for an item with a difficulty equal to the average test score when marked by students with an ability also equal to the average test score. With Winsteps, this unique point is the zero point on the student ability and item difficulty measures scale. It transforms into an expected student score of 50%.
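The 50:50 relationship can be sketched with the dichotomous Rasch model, in which the chance of a right mark depends only on the difference between ability and difficulty, both in logits. The function names and the three item difficulties below are illustrative assumptions, not values from the charted tests.

```python
import math

def rasch_probability(ability, difficulty):
    """Chance of a right mark under the dichotomous Rasch model (logits)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def expected_score(ability, difficulties):
    """Expected fraction of right marks on a test: one point on the ogive."""
    probs = [rasch_probability(ability, d) for d in difficulties]
    return sum(probs) / len(probs)

# When ability equals difficulty, the chance is 50% anywhere on the scale.
for level in (-2.0, 0.0, 2.0):
    print(level, rasch_probability(level, level))

# Hypothetical test whose item difficulties are centered on zero:
# a student at zero measures is expected to score about 50%.
print(expected_score(0.0, [-1.0, 0.0, 1.0]))
```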

Winsteps sets the mean item difficulty measure for each of the three charted tests at zero. That means student ability measures are lower than item difficulty measures on an impossible test. Student ability measures are higher than item difficulty measures on a traditional classroom test.

Student ability measures are based on the relative difficulty of the items marked correctly. Item difficulty measures are based on the relative ability of the students who mark them correctly (hence the cyclic math used to estimate measures). Each student receives an individualized estimated ability measure based on the interrelated average student and item performances.
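The cyclic math can be illustrated with a minimal sketch in the spirit of joint maximum likelihood estimation: abilities are adjusted with the difficulties held fixed, then difficulties are adjusted with the abilities held fixed, and the cycle repeats. This is not Winsteps’ actual code; the function name, the Newton-style update, and the small response matrix are assumptions made for illustration, and the sketch assumes no student or item has a zero or perfect score.

```python
import math

def estimate_measures(responses, iterations=50):
    """Alternate ability and difficulty updates; all values are in logits.

    responses: one row per student, each row a list of 0/1 marks.
    """
    n_students, n_items = len(responses), len(responses[0])
    abilities = [0.0] * n_students
    difficulties = [0.0] * n_items

    def prob(ability, difficulty):
        return 1.0 / (1.0 + math.exp(difficulty - ability))

    for _ in range(iterations):
        # Student step: move each ability toward the value at which the
        # expected score equals the observed score.
        for s in range(n_students):
            probs = [prob(abilities[s], d) for d in difficulties]
            residual = sum(responses[s]) - sum(probs)
            information = sum(q * (1.0 - q) for q in probs)
            abilities[s] += residual / information
        # Item step: the same idea, with the roles reversed.
        for i in range(n_items):
            probs = [prob(a, difficulties[i]) for a in abilities]
            residual = sum(row[i] for row in responses) - sum(probs)
            information = sum(q * (1.0 - q) for q in probs)
            difficulties[i] -= residual / information
        # Anchor the scale: mean item difficulty is set to zero.
        mean_d = sum(difficulties) / n_items
        difficulties = [d - mean_d for d in difficulties]
    return abilities, difficulties

# Hypothetical marks: six students by four items.
marks = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 1, 1, 0],
]
abilities, difficulties = estimate_measures(marks)
print([round(a, 2) for a in abilities])
print([round(d, 2) for d in difficulties])
```

Each estimate depends on all the others, which is why the measures have to be found by repeated passes rather than by a single sum or average.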

A measure is not the sum or average of counts. Counts make a variable look uniform when it is not: one point for each right answer to questions of variable difficulty (CTT). Measures assess the value of the variable being counted (IRT). You can buy melons of variable sizes at \$2 each, by count, or you can buy melons at 10 cents a pound, by weight measure.
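One way to see the difference between a count and a measure is to convert percent-right scores to logits. The sketch below assumes the simplest case, in which every item has the same difficulty, so that a score maps to a measure through the log-odds alone; the function name is illustrative.

```python
import math

def percent_to_logit(percent_right):
    """Log-odds of a right mark for a given percent-right score."""
    p = percent_right / 100.0
    return math.log(p / (1.0 - p))

# The same 10-point gain in count covers different distances in measure.
for low, high in ((50, 60), (80, 90)):
    gain = percent_to_logit(high) - percent_to_logit(low)
    print(f"{low}% -> {high}%: {gain:.2f} logits")
```

Under this assumption, the step from 80% to 90% spans about twice as many logits as the step from 50% to 60%, even though both are ten points by count.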

Students who cannot read the test or understand the questions are placed in the impossible position of gambling for a passing score. On a four-option test, they receive a handicap of 25%, on average. A normal distribution around this point rarely produces a passing score. Even though the same relationship between student ability and item difficulty holds the full length of the ogive, that does not mean very low scores represent any meaningful measurement of student performance. The results of a randomly generated test at the 25% performance level are utter nonsense. Equating these scores to some higher level of performance does not make them any more meaningful.
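The odds facing a student who can only guess can be worked out from the binomial distribution. The 50-item test length and the passing mark of 30 right (60%) in this sketch are assumptions chosen for illustration.

```python
from math import comb

def chance_of_passing(n_items, n_options, passing_marks):
    """Probability of at least passing_marks right marks by pure guessing."""
    p = 1.0 / n_options
    return sum(
        comb(n_items, k) * p**k * (1.0 - p) ** (n_items - k)
        for k in range(passing_marks, n_items + 1)
    )

n_items, n_options = 50, 4
print("expected right marks:", n_items / n_options)  # the 25% handicap
print("chance of 30 or more right:", chance_of_passing(n_items, n_options, 30))
```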

Computerized adaptive testing (CAT) functions at the 50% performance level. It administers questions that closely match the ability of each student. This is very efficient. It takes less time and fewer items than a paper test. High ability students are not bothered with easy questions. Low ability students are not forced to come up with the “best answer” on items they have no idea how to answer. Each student receives an individualized test based on the average performance of other students.
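The matching step can be sketched as choosing, from the unused items, the one whose difficulty lies closest to the current ability estimate. Real CAT systems add refinements such as exposure control and stopping rules; the item bank and function name here are hypothetical.

```python
def next_item(ability_estimate, item_bank, administered):
    """Return the unused item whose difficulty is nearest the ability estimate."""
    candidates = {
        name: difficulty
        for name, difficulty in item_bank.items()
        if name not in administered
    }
    return min(candidates, key=lambda name: abs(candidates[name] - ability_estimate))

# Hypothetical bank: item name -> difficulty in logits.
bank = {"q1": -1.5, "q2": -0.5, "q3": 0.3, "q4": 1.2, "q5": 2.0}

print(next_item(0.4, bank, administered=set()))   # closest to 0.4 logits
print(next_item(0.4, bank, administered={"q3"}))  # next closest once q3 is used
```

An item matched this closely gives the student a chance of a right mark near 50%, which is what keeps the test at the 50% performance level.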

Knowledge and Judgment Scoring (KJS) also starts at the 50% performance level when knowledge and judgment are given equal weight. High ability students can zip through easy items and low ability students can skip impossible items. Both can yield high quality results: few if any wrong marks. With KJS each student individualizes the test to report what is known and what has yet to be learned (quantity and quality are combined into a test score). Each student receives an individualized test based on each student’s own judgment.
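A minimal sketch of how such a score might be computed, assuming one common equal-weight scheme: two points for a right mark, one point for an omitted item, and none for a wrong mark. The point values, function name, and example counts are assumptions for illustration, not taken from the text above.

```python
def kjs_report(right, omitted, wrong):
    """Score a test with knowledge and judgment weighted equally (assumed scheme)."""
    total = right + omitted + wrong
    marked = right + wrong
    score = (2 * right + omitted) / (2 * total)    # all omitted -> 50%
    quantity = right / total                       # share of items marked right
    quality = right / marked if marked else None   # share of marks that are right
    return {"score": score, "quantity": quantity, "quality": quality}

# A student who marks nothing starts at 50%.
print(kjs_report(right=0, omitted=20, wrong=0))

# A student who marks 14 of 20 items and gets 12 right.
print(kjs_report(right=12, omitted=6, wrong=2))
```

Under this assumed scheme, the second student’s 12 right marks would earn 60% by right mark scoring, while the KJS score also credits the judgment shown in skipping six items.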

Both KJS and traditional right mark scoring (RMS) produce average classroom scores of 75%. The difference is that with KJS both the student and the test maker know what the student knows, and how well the student knows it, at the time of the test and long afterwards. With RMS we only get a ranking increasingly contaminated with chance at lower scores. The same set of questions can be used with both KJS and RMS to get a desired average test score. With the exception of KJS, the lower the quantitative test scores, the lower the quality of the results.

Score distributions subject to equating can carry far more information than the simple normal bell curve. An hour test in a remedial general studies biology course often yielded four modes: a large mode centered at 60% from students enrolled pass/fail; a smaller mode at 75% from attentive passive pupils needing a C out of the course; a smaller mode at 90% from self-correcting scholars; and a very small mode at 100% from students who in later years tested out with credit (if they earned an A or B) during the first week of the course. Classroom and standardized tests have different characteristics because they are designed to assess for different purposes. The classroom test monitors learning. The standardized test, when limited to right count scoring, is a ranking device that obtains only a small portion of the information available when using KJS, or as Bond and Fox (2007), Chapter 7, call it, the Partial Credit Rasch Model.