Wednesday, July 8, 2015

CTT and IRT - Precision

Precision is calculated differently for CTT (CSEM) and IRT (SEE). In this post I compare the conditional standard error of measurement (CSEM), based on the number of right marks, with the standard error of estimate (SEE), based on the rate of making right marks (a ratio instead of a count). A count of 1 (out of 20) is the same as a ratio of 1:19 or a rate of 1/20. Precision in Rasch IRT is then the inverse of precision in CTT, which aligns each precision estimate with its own distribution (logit for IRT, normal for CTT).
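The count, ratio, and rate views of the same mark can be sketched in a few lines of Python. This is only an illustration, not the calculation behind Table 45; the function name `count_to_views` is made up for this example, and the logit is taken as the natural log of the ratio of right to wrong marks.

```python
import math

def count_to_views(right, total):
    """Return (ratio, rate, logit) for `right` marks out of `total`."""
    wrong = total - right
    ratio = right / wrong      # right:wrong, e.g. 1:19
    rate = right / total       # right out of all marks, e.g. 1/20
    logit = math.log(ratio)    # natural log of the ratio
    return ratio, rate, logit

ratio, rate, logit = count_to_views(1, 20)
print(f"ratio = {ratio:.4f}, rate = {rate:.2f}, logit = {logit:.2f}")
```

The count and the rate stay between fixed limits (0 to 20, or 0 to 1), while the logit of the ratio is free to run in both directions from zero.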
 Table 45

 Table 48
Table 45, my audit tool, relates CTT and IRT using real classroom test results. It does not show the values when calculating the probability of a right mark (Table 45b). I dissected that equation into the difference between ability and difficulty (Table 48a) in measures and in normal values (Table 48b).
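For readers who want to see the equation itself, here is a minimal sketch of the standard dichotomous Rasch probability of a right mark, written once with measures (logits) and once with exponentiated values. The function names are hypothetical, and treating the "normal values" of Table 48b as exponentiated measures is my reading of the later 4.95-to-141 figure, not something the tables state.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a right mark; ability and difficulty in measures (logits)."""
    diff = ability - difficulty
    return math.exp(diff) / (1.0 + math.exp(diff))

def rasch_probability_from_normal(ability_normal, difficulty_normal):
    """Same probability with both values exponentiated (assumed 'normal' scale)."""
    ratio = ability_normal / difficulty_normal
    return ratio / (1.0 + ratio)

# Hypothetical values, not taken from Table 48.
b, d = 2.0, 1.0
print(rasch_probability(b, d))
print(rasch_probability_from_normal(math.exp(b), math.exp(d)))
```

A difference between ability and difficulty in measures becomes a ratio on the exponentiated scale, and both forms give the same probability.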

 Table 49
Table 49 shows the highlights. The difference between item difficulties spaced 3 counts apart (15, 18, and 21) was maintained across all student scores (from 14 to 20): 0.81 for 3 counts below the average item difficulty of 18, and 1.62 for 3 counts above it (upper right). A successful convergence then maintains a constant difference across varying student scores. [Is this the basis for item difficulties being independent of student scores?]

At the average student score (17), the difficulty measure doubled from 0.81, at 3 counts below the mean, to 1.62 at the mean, and doubled again to 3.24 at 3 counts above the mean (center upper left).

The uniform expansion on the logit scale described above (Table 49) yields an even larger expansion on the normal scale. As the converging process continues, the top scores are pushed farther from the test mean score than lesser scores. The top score (and equivalent logit item difficulty) of 4.95 was pushed out to 141 normal units. That is about 10 times the value for an item with a normal difficulty of 15 that is an equal distance below the test mean (lower left).
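The jump from 4.95 logits to 141 normal units matches taking the exponential of the logit value. Assuming that is the conversion used (the helper name `logit_to_normal` is hypothetical), a short sketch shows how each doubling on the logit scale becomes a squaring on the normal scale:

```python
import math

def logit_to_normal(logit):
    """Convert a measure in logits to 'normal units' (assumed to be exp(logit))."""
    return math.exp(logit)

# Logit values quoted from Table 49 in the text above.
for logit in (0.81, 1.62, 3.24, 4.95):
    print(f"{logit:5.2f} logits -> {logit_to_normal(logit):7.2f} normal units")
```

Under this assumption, equal steps in logits multiply the normal value rather than add to it, which is why the high end of the scale stretches so much more than the low end.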
 Chart 92

 Chart 102
Re-plotting Chart 92 (normal scale) on measures (Chart 102) shows how markers below the mean are compressed and markers above the mean are expanded. The uniformly spaced normal markers in Chart 92 are now spaced at increasing distances from left to right in measures in Chart 102.
 Chart 103

 Chart 104
I sequentially summed each item information function and plotted the item characteristic curves (Chart 103, normal scale and Chart 104, measures). High scores (above 17 count/81%/1.73 measures) with low precision drop away from the useful straight line portion of the Rasch model curve for items with low difficulties (high right score counts). This makes sense.
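The sequential summing can be sketched as follows. This uses the standard Rasch item information, p(1 - p), and a made-up list of item difficulties; it is not the data behind Charts 103 and 104.

```python
import math
from itertools import accumulate

def item_information(ability, difficulty):
    """Rasch item information p * (1 - p) at a given ability (both in logits)."""
    p = 1.0 / (1.0 + math.exp(-(ability - difficulty)))
    return p * (1.0 - p)

# Hypothetical item difficulties in measures, for illustration only.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
ability = 0.0

infos = [item_information(ability, d) for d in difficulties]
running = list(accumulate(infos))  # sequential (cumulative) sum of information

for d, info, total in zip(difficulties, infos, running):
    print(f"difficulty {d:5.2f}: information {info:.3f}, running sum {total:.3f}")
```

Each item contributes the most information (0.25) when its difficulty equals the ability, and less as the two move apart.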

 Chart 91
 Chart 105
These item information curves fit on the higher end of the Rasch model as shown in Chart 91. I plotted an estimated location in Chart 105 and included the right counts for scores and items after Winsteps, Table 20.1. [Some equations see actual counts, other equations only see smoothed normal curves.]

 Chart 90
 Chart 106
 Chart 82
Chart 91 from classroom data (mean = 80%) is very similar to Chart 90 from Dummy data (mean = 50%) for a right count of 17. I added precision Dummy data from Table 46 to Chart 90 to obtain a general summary of precision based on the rate of making right marks (Chart 106).  Chart 106 relates measures (ratio) to estimated scores (counts) by way of the Rasch model curve where student ability and item difficulty are 50% right and 50% wrong for each measure location. See Chart 82 for a similar display using a normal scale instead of a logit scale. In both cases, IRT precision is much more stable than CTT precision.

Precision (IRT SEE, Table 46) is 0.44 for 21 items and shrinks as items are added. [50 items = 0.28; 100 = 0.20; 200 = 0.14; and 400 = 0.10 measure (0.03 for 3,000 items).] Only a near-infinite number of items would bring it to zero. [Error variance on 50 items = 0.08; 100 = 0.04; 200 = 0.02; and 400 items = 0.01. Doubling the number of items on a test cuts the error variance in half. SEE = SQRT(Error Variance).]
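The bracketed values for 21 to 400 items are consistent with every item contributing the maximum Rasch information of 0.25 (a 50% chance of a right mark), so that error variance is 1 / (0.25 x number of items). That per-item value is my assumption, not something read off Table 46, and the function names are made up for this sketch:

```python
import math

def error_variance(n_items, info_per_item=0.25):
    """Error variance as the inverse of total information (assumed 0.25 per item)."""
    return 1.0 / (n_items * info_per_item)

def see(n_items, info_per_item=0.25):
    """Standard error of estimate: SEE = SQRT(Error Variance)."""
    return math.sqrt(error_variance(n_items, info_per_item))

for n in (21, 50, 100, 200, 400):
    print(f"{n:4d} items: error variance = {error_variance(n):.2f}, SEE = {see(n):.2f}")
```

Doubling the item count halves the error variance but only shrinks the SEE by a factor of about 1.41, which is why very long tests buy so little extra precision.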

The process of converging two normal scales (student scores and item difficulties) involves changing two normal distributions (0 to infinity, with 50% being the midpoint) into locations (measures) on a logit scale (- infinity to + infinity, with zero (0) being the midpoint). The IRT analysis then inverts the variance (information) to match the combined logit scale distribution (see the comments I added to the bottom of Table 46). An apparent paradox appears only if you ignore the two different scales for CTT (normal) and Rasch IRT (logit). Information must be inverted to match the logit distribution of measure locations.
 Table 46

The Rasch model makes computer adaptive testing (CAT) possible. IMHO it does not justify a strong emphasis on items with a difficulty of 50%. Precision is also limited by the number of items on the test.  Unless the Rasch IRT partial credit model is used, where students report what they actually know, the results are limited to ranking a student by comparing the examinee’s responses to those from a control group that is never truly comparable to the examinee as a consequence of luck on test day (different date, preparation, testing environment and a host of other factors). The results continue to be the “best that psychometricians can do” in making test results “look right” rather than an assessment of what students know and can do (understand and value) as the basis for further learning and instruction.

 Chart 101 * Free Source: Knowledge and Judgment and Partial Credit Model    (Knowledge Factor is not free)
Addendum: Billions of dollars and classroom hours have been wasted in a misguided attempt to improve institutionalized education in the United States of America using traditional forced-choice testing. Doing more of what does not work will not make it work. Doing more at lower levels of thinking will not produce results at higher levels of thinking; instead, IMHO, it makes failure more certain (forced-choice at the bottom of Chart 101). Individually assessing and rewarding higher levels of thinking does produce positive results. Easy ways to do this have now existed for over 30 years! Two are now free.

References:
Culligan, Brent. Date & Location Unknown.  Item Response Theory, Reliability and Standard Error. 10 pages.  http://www.wordengine.jp/research/pdf/IRT_reliability_and_standard_error.pdf

What is the reasoning behind the formulae for the different standard errors of measurement? 3 pages. Downloaded 4/27/2015. http://stats.stackexchange.com/questions/60190/what-is-the-reasoning-behind-the-formulae-for-the-different-standard-errors-of-m