Wednesday, October 3, 2012

Rasch Model Student Ability and CTT Quality


How the Rasch model IRT latent student ability value is related to the classical test theory (CTT) PUP quality score (% Right) has not been fully examined. The following discussion reviews the black box results from Fall8850a.txt (50 students and 47 items with no extreme items) and then examines the final audit sheets from normal and transformed analyses. It ends with a comparison of the distributions of latent student ability and CTT quality. The objective is to follow individual students and items through the process of Rasch model IRT analysis. There is no problem with the average values.

We need to know not only what happened but also how it happened to fully understand the process and to obtain one or more meaningful and useful views. Fifty students were asked to report what they trusted using multiple-choice questions. They were scored zero points for a wrong answer (poor judgment), one point for an omit (good judgment not to guess and mark a wrong answer), and two points for a right answer (good judgment to accurately report what they trusted). Knowledge and Judgment Scoring (KJS) shifts the responsibility for knowing from the teacher to the student. It promotes independent scholarship rather than the traditional dependency promoted by scoring a test only for right marks and the teacher then telling students which marks were right (there is no way to know what students really trust when “DUMB” test scores fall below 90%).
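A minimal sketch of this 0-1-2 scoring rule, assuming omitted items are recorded as None (the function name and data layout are illustrative, not PUP's internals):

```python
def kjs_score(responses, key):
    """Knowledge and Judgment Scoring: 0 for a wrong mark (poor judgment),
    1 for an omit (good judgment not to guess), 2 for a right mark."""
    score = 0
    for marked, correct in zip(responses, key):
        if marked is None:        # omit
            score += 1
        elif marked == correct:   # right mark
            score += 2
        # a wrong mark earns nothing
    return score

# 3 right, 1 omit, 1 wrong on a 5-item quiz: 3*2 + 1 + 0 = 7 of 10 points
print(kjs_score(["A", "C", None, "B", "D"], ["A", "C", "D", "B", "A"]))
```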

Winsteps displays student and item performance in dramatic bubble charts. The Person & Item chart shows students in blue and items in red. Transposed results (columns and rows become rows and columns) are shown in an Item & Person chart where students are red and items are blue (basically everything has been turned upside down or end over end except the paint job). Blue student 21 with the highest measure (ability) lands as red student 21 with nearly the lowest measure when transposed. That is what is done. Why it is done comes later.

A plot of input/output logit values shows how the process of convergence changes the locations of latent student abilities (log right/wrong ratio of raw student scores) and item difficulties (log wrong/right ratio of raw item difficulties) so they end up as the values plotted on the bubble charts. The ranges of measures on the input/output charts are the same as on the bubble charts. The end over end tipping from transposing on the bubble charts also occurs on the input/output charts. Student abilities are grouped while items are treated individually (items with the same test score land at different points on the logit scale). When transposed, item difficulties are grouped while students are treated individually (students with the same test score land at different points on the logit scale). In either case, the distribution being examined individually has its mean moved to register at the zero logit location.
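A rough sketch of the input logits described here, assuming raw scores expressed as percent right (my own illustration; the function names are not from Winsteps or PUP):

```python
import math

def student_input_logit(percent_right):
    """Latent student ability input: log of the right/wrong ratio."""
    p = percent_right / 100.0
    return math.log(p / (1.0 - p))

def item_input_logit(percent_right):
    """Item difficulty input: log of the wrong/right ratio."""
    p = percent_right / 100.0
    return math.log((1.0 - p) / p)

def center_on_zero(logits):
    """Move the mean of a distribution to the zero logit location."""
    mean = sum(logits) / len(logits)
    return [value - mean for value in logits]

abilities = [student_input_logit(s) for s in (60, 70, 70, 79, 90)]
print(center_on_zero(abilities))   # centered values now average to zero
```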

The end over end tipping from transposition also shows in the normal black box charts. It is easy to see here that the distribution being held as a reference shows little change during the process of convergence. The distribution being examined individually is widely dispersed. Only the highest and lowest individual values for the same grouped value are shown for clarity. A contour plot line has also been added to show how the individual values would relate to the grouped values if a location correction were made for the fact that all of these individual values have been reduced by the distance their mean was moved to reach the zero logit location during the process of convergence. In general, the individual values are dispersed about the contour line. This makes sense, as they must add up to their original logit mean in the above input/output charts.

The above charts display the values on the final audit sheets for the Fall8850a data. Values from Winsteps Table 17.1 Person Statistics were entered in column four, Student Logit (+) Output. Values from Table 13.1 Item Statistics were entered in column ten, Item Logit (-) Output. Logit input values were derived from the log right/wrong and log wrong/right ratios for students and items. Normal input values are scores expressed as a percent. Normal output values come from the perfect Rasch model algorithm: exp(logit (+) output)/(1 + exp(logit (+) output)). Normal output (+) item values come from subtracting Normal (-) values from 100% (this inverts the normal scale order in the same way as multiplying values on the logit scale by -1). One result of this tabling is that comparable output student ability and item difficulty values that are clustered together add up to 100% (colored on the chart for clarity). This makes sense. A student ability of 79% should align with an item difficulty of 21% (both with a location of 1.32 logits).
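A short check of this conversion, using the logit locations quoted above and below (a sketch only; the function name is mine, the column labels are from the audit sheets):

```python
import math

def normal_from_logit(logit):
    """Perfect Rasch model algorithm: exp(logit) / (1 + exp(logit)), in percent."""
    return 100.0 * math.exp(logit) / (1.0 + math.exp(logit))

# Student and item clustered at the same 1.32 logit location:
student_ability = normal_from_logit(1.32)            # about 79%
item_difficulty = 100.0 - normal_from_logit(1.32)    # (+) scale, about 21%
print(round(student_ability), round(item_difficulty))  # 79 21 -> adds up to 100

# The transposed example below behaves the same way at 0.37 logits:
print(round(normal_from_logit(0.37)), round(100 - normal_from_logit(0.37)))  # 59 41
```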

The same thing happens when the data are transposed except, as noted in the above charts, everything is end over end. Column four is now Item Logit (+) Output from Winsteps Table 17.1 Item Statistics and column ten, Student Logit (-) Output, is from Table 13.1 Person Statistics. Again, an item difficulty of 59% aligns with a student ability of 41% (both with a location of 0.37 logits).

Only normal values can be used to compare IRT results with CTT results. Sorting the above charts by logit input values from individual analyses (right side of each chart) puts the results in order to compare IRT and CTT results. Items 4, 34, and 36 had the same IRT and CTT input difficulties (73%). They had different IRT output values and different CTT quality (% Right) values. The item difficulty quality indicators change in a comparable fashion. (Normally a quality indicator (% Right) is not calculated for CTT item difficulty. It is included here to show how both distributions are treated by CTT and IRT analyses.)

CTT and IRT Quality Indicators (Items at 73% Input)

Method    Item 34    Item 4    Item 36
CTT         83%        85%       93%
IRT         44%        46%       56%

Sorting the transposed analysis by input values groups student abilities. Four students had the same IRT and CTT abilities (70%). They had different IRT output values and CTT quality (% Right) indicators. The point is that these quality indicators behaved the same for student ability and item difficulty and for normal and transposed analyses. 

CTT and IRT Quality Indicators (Students at 70% Input)

Method        Student 26    Student 37    Student 40    Student 44
CTT               81%           88%           88%           95%
IRT               43%           51%           51%           63%
IRT + Mean        68%           76%           76%           88%
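Reading the IRT + Mean row above, the correction appears to be a constant 25 percentage points on the normal scale (68 - 43, 76 - 51, 88 - 63), which I take to be the distance the student mean was moved during convergence; a minimal sketch under that assumption:

```python
# Transposed individual IRT output values for the four students, in percent
irt_output = [43, 51, 51, 63]

# Assumed mean shift, read directly from the table rows above
mean_shift = 25

corrected = [value + mean_shift for value in irt_output]
print(corrected)   # [68, 76, 76, 88], matching the IRT + Mean row
```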

These quality indicators cannot be expected to be the same, as they include different components. CTT divides the number of right answers by the total number of marks a student makes to measure quality (% Right). The number of right marks is an indicator of quantity. The test score is a combination of quantity and quality (PUP uses a 50:50 ratio). Winsteps combines IRT student ability and item difficulty, with the Rasch model algorithm, into one expected value during the JMLE analysis; at the same time it reduces the output value by the distance the mean location must be moved to the zero location point: convergence. CTT only sees mark counts. The perfect Rasch model sees student ability and item difficulty as probabilities ranging from zero to one. A more able student has a higher probability of marking right than a less able student. A more difficult item has a lower probability of being marked right than a less difficult item. This makes sense. A question ranks higher if marked right by more able students. A student ranks higher marking difficult items than marking easier items.
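The two views can be put side by side in a few lines (illustrative function names and numbers; neither PUP's nor Winsteps' internal code is shown in this post):

```python
import math

def ctt_quality(n_right, n_marked):
    """CTT quality (% Right): right answers divided by the marks made."""
    return 100.0 * n_right / n_marked

def rasch_expected(ability_logit, difficulty_logit):
    """Perfect Rasch model: probability (0 to 1) that a student of the given
    ability marks an item of the given difficulty right."""
    x = ability_logit - difficulty_logit
    return math.exp(x) / (1.0 + math.exp(x))

# A student who marks 40 of 50 items and gets 34 right:
print(ctt_quality(34, 40))                     # 85.0 (% Right)

# A more difficult item lowers the Rasch expectation for the same student:
print(round(rasch_expected(0.85, -0.37), 2))   # 0.77
print(round(rasch_expected(0.85, 1.32), 2))    # 0.38
```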

The chart of student ability, from normal and transposed analyses, plots the values for the students in the above table who scored 70% on the test. Following up from the 70% input point, you encounter transposed individual values of 43, 51, and 63% below the grouped, non-transposed value of 76%.



The above selection of students and items was made from a PUP Table 3c. Guttman Mark Matrix. The two selections represented uneventful sets of test performances that seemed to offer the best chance for comparing IRT and CTT. PUP imports the unexpected values from Winsteps Tables 6.5 and 6.6 to color the chart. Coloring clearly shows the behavior of three students who never made the transition from guessing at answers to reporting what they trusted: Order 016, Order 031, and Order 035 with poor judgment scores (wrong) of 24, 13, and 26.

In conclusion, Winsteps does exactly what it is advertised to do. It provides the tools needed for experienced operators to calibrate items for standardized tests and to equate tests. No pixie dust is needed. In contrast, PUP with Knowledge and Judgment Scoring produces classroom-friendly tables any student or teacher can use directly in counseling and in improving instruction and assessment. Winsteps with the Rasch partial credit model can perform the same scoring as is done with Knowledge and Judgment Scoring. The coloring of PUP tables provided by Winsteps adds more detail and makes them even easier to use.

There is no excuse for standardized tests and classroom tests being scored at the lowest levels of thinking. The crime is that testing at the lowest levels of thinking promotes classroom instruction at the same level (please see the post on multiple-choice reborn). This holds for essay, report, and project assessment, as well as for multiple-choice tests. The Winsteps Rasch partial credit model and PUP Knowledge and Judgment Scoring offer students a way out of the current academic trap: learning meaningless stuff for “the test” rather than making meaningful sense of each assignment, which then empowers the self-correcting learner to learn more. The real end goal of education is developing competent self-educating learners. It is not to process meaningless information that is forgotten with each “mind dump” examination.

Personal computers have been readily available now for more than 30 years. Someday we will look back and wonder why it took so long for multiple-choice to be scored as it originally was before academia adopted it: in such a manner that the examinee was free to accurately report rather than to continue an academic lottery used to make meaningless rankings.