Until this point I had not found a way to directly relate the inner workings (not the outputs) of
the partial credit Rasch model and the non-iterative PROX methods for estimating
measures. I have now found an intermediary method for estimating measures that bridges
this gap: the Rasch rating scale method within Winsteps.
Both the Rasch
rating scale method and non-iterative PROX method produce the same measures for
students with the same scores and for items with the same difficulties. Winsteps
transposed values (columns become rows and rows become columns) also group
students and items by common scores and difficulties. You can visualize this by printing out the normal and transposed ability-difficulty tallies (person-item bar charts) and then flipping the transposed bar chart left to right and then tipping it end for end.
These logit values can
be restored to normal by multiplying by -1 to flip the log ratio scale
(which is centered on zero) end for end, and then adding the measure mean (1.78 logits in this Nursing1 example) to shift the values back into their original locations.
Restoring again matches student and item measures. A transposed student ability value of -0.34 is restored to +2.12 (0.34 + 1.78). A transposed item difficulty value of 3.30 is restored to -1.52 (-3.30 + 1.78).
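The flip-and-shift restoration can be written as one line of arithmetic. This is a minimal sketch using the Nursing1 measure mean of 1.78 logits quoted above; the function name is illustrative, not Winsteps syntax:

```python
MEASURE_MEAN = 1.78  # logits, from the Nursing1 example

def restore(transposed_value, mean=MEASURE_MEAN):
    """Flip the transposed logit value end for end, then shift by the mean."""
    return -transposed_value + mean

# Student ability: 0.34 + 1.78
print(round(restore(-0.34), 2))  # 2.12
# Item difficulty: -3.30 + 1.78
print(round(restore(3.30), 2))   # -1.52
```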
Rounding from
1/10 logit to 1/4 logit (Winsteps)
produces a better-looking bar chart, but it introduces a noticeable distortion. This
becomes apparent when restoring transposed Rasch rating scale values. The
original and restored values are the same when manipulating the numbers (flip
the log ratio scale and add the measure mean).
The charts, however, can easily be seen to differ after flipping and tipping by placing normal and transposed charts on one sheet. This distortion is an artifact of rounding numbers in two directions on a logit scale.
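A small sketch shows how the artifact arises. Rounding to the nearest 1/4 logit before flipping the scale, versus after, can land a value in different chart bins, because the quarter-logit grid is anchored at zero while the flip-and-shift moves values off that grid. The starting value here is hypothetical; the measure mean is the 1.78 logits from the Nursing1 example:

```python
MEASURE_MEAN = 1.78  # logits, Nursing1 example

def round_to(value, step):
    """Round a value to the nearest multiple of step (e.g. 1/4 logit)."""
    return round(value / step) * step

x = 0.41  # a hypothetical transposed logit value

# Restore first, then round for the bar chart:
a = round_to(-x + MEASURE_MEAN, 0.25)
# Round for the transposed chart first, then restore:
b = -round_to(x, 0.25) + MEASURE_MEAN

print(a, b)  # the two chart positions differ: 1.25 vs 1.28
```

Rounding in one direction only (always after restoring) would keep the normal and transposed charts in agreement.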
With this in
mind, the Rasch rating scale method adds a neat feature. Instead of just
counting right marks (the traditional forced-choice, DUMB test), a test can be
designed to let students report what they actually know and trust to be of
value: what they understand and find useful at all levels of thinking. This is
the same as Knowledge and Judgment Scoring. The Fall8850a data ranked responses 0, 1, and 2 for wrong (guessing, poor
judgment in reporting what is known and trusted), omit (good judgment in accurately
reporting what is known and trusted), and right (reporting what is known and
trusted).
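The 0/1/2 ranking is easy to sketch. This is a minimal illustration of the scoring rule described for the Fall8850a data; the function name, answer key, and student marks are all hypothetical, not part of that data set:

```python
def score_response(marked, correct):
    """Return 0 for wrong, 1 for omit (good judgment), 2 for right."""
    if marked is None:  # omitted: accurate report of what is known and trusted
        return 1
    return 2 if marked == correct else 0

# One hypothetical student's marks against a 4-item key:
key = ["A", "C", "B", "D"]
marks = ["A", None, "D", "D"]  # right, omit, wrong, right

scores = [score_response(m, k) for m, k in zip(marks, key)]
print(scores, sum(scores))  # [2, 1, 0, 2] 5
```

Under this rule an omit outranks a wrong guess, which is the point: the score rewards judgment about what is known, not just lucky marks.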