Wednesday, July 8, 2015

CTT and IRT - Precision

Precision is calculated differently for CTT (CSEM) and IRT (SEE). In this post I compare the conditional standard error of measurement (CSEM), based on the number of right marks, with the standard error of estimate (SEE), based on the rate of making right marks (a ratio instead of a count). A count of 1 (out of 20) is the same as a ratio of 1:19 or a rate of 1/20. Precision in Rasch IRT is then the inverse of precision in CTT in order to align precision estimates with their respective distributions (logit and normal).
Table 45

Table 48
Table 45, my audit tool, relates CTT and IRT using real classroom test results. It does not show the values when calculating the probability of a right mark (Table 45b). I dissected that equation into the difference between ability and difficulty (Table 48a) in measures and in normal values (Table 48b).

Table 49
Table 49 shows the highlights. The difference between item difficulties spaced 3 counts apart (15, 18, and 21) was maintained across all student scores (from 14 to 20): 0.81 for 3 counts below the average difficulty of 18, and 1.62 for 3 counts above the average item difficulty (upper right). A successful convergence then maintains a constant difference across varying student scores. [Is this the basis for item difficulties being independent from student scores?]

On the average student score (17), the difficulty measure doubled from 0.81, at 3 counts below the mean, to 1.62 at the mean, and doubled again to 3.24 at 3 counts above the mean (center upper left).

The above uniform expansion on the logit scale (Table 49) yields an even larger expansion on the normal scale. As the converging process continues, the top scores are pushed further from the test mean score than lesser scores. The top score (and equivalent logit item difficulty) of 4.95 was pushed out to 141 normal units. That is about 10 times the value for an item with a normal difficulty of 15 that is an equal distance below the test mean (lower left).
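The growth from logits to normal units can be sketched in a few lines. This is a minimal illustration, assuming the "normal units" are the antilog of the measure, exp(measure); that assumption is not stated in the tables, but it does reproduce the 4.95 to 141 value quoted above. The function name is hypothetical.

```python
import math

def measure_to_normal_units(measure):
    """Antilog of a logit measure (assumed meaning of 'normal units')."""
    return math.exp(measure)

# Equal steps on the logit scale become ever larger steps after the antilog.
for measure in (-4.95, -2.0, -1.0, 0.0, 1.0, 2.0, 4.95):
    units = measure_to_normal_units(measure)
    print(f"{measure:+.2f} measures -> {units:8.2f} normal units")
```

Each added logit multiplies the result by the same factor, so markers above zero spread out while markers below zero bunch together.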
Chart 92

Chart 102
Re-plotting Chart 92, normal scale, on measures (Chart 102) shows how markers below the mean are compressed and markers above the mean are expanded. The uniformly spaced normal markers in Chart 92 are now spaced in increasing distance from left to right in measures in Chart 102.
Chart 103

Chart 104
I sequentially summed each item information function and plotted the item characteristic curves (Chart 103, normal scale and Chart 104, measures). High scores (above 17 count/81%/1.73 measures) with low precision drop away from the useful straight line portion of the Rasch model curve for items with low difficulties (high right score counts). This makes sense.

Chart 91
Chart 105
These item information curves fit on the higher end of the Rasch model as shown in Chart 91. I plotted an estimated location in Chart 105 and included the right counts for scores and items after Winsteps, Table 20.1. [Some equations see actual counts, other equations only see smoothed normal curves.]

Chart 90
Chart 106
Chart 82
Chart 91 from classroom data (mean = 80%) is very similar to Chart 90 from Dummy data (mean = 50%) for a right count of 17. I added precision Dummy data from Table 46 to Chart 90 to obtain a general summary of precision based on the rate of making right marks (Chart 106).  Chart 106 relates measures (ratio) to estimated scores (counts) by way of the Rasch model curve where student ability and item difficulty are 50% right and 50% wrong for each measure location. See Chart 82 for a similar display using a normal scale instead of a logit scale. In both cases, IRT precision is much more stable than CTT precision.

Precision (IRT SEE, Table 46) is 0.44 measures for 21 items and improves only slowly as items are added. [50 items = 0.28; 100 = 0.20; 200 = 0.14; and 400 = 0.10 measure (0.03 for 3,000 items).] Only a near infinite number of items would bring it to zero. [Error variance on 50 items = 0.08; 100 = 0.04; 200 = 0.02; and 400 items = 0.01. Doubling the number of items on a test cuts the error variance in half. SEE = SQRT(Error Variance).]
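The values for 21 to 400 items can be checked with a short sketch. It assumes every item sits at the 50% right, 50% wrong point of the Rasch curve, so each item contributes p(1 - p) = 0.25 of information; the function name is hypothetical and this is not how Winsteps reports SEE for real data.

```python
import math

def see_at_target(n_items, p=0.5):
    """Rasch SEE when every item has the same probability p of a right mark."""
    information = n_items * p * (1.0 - p)  # sum of p(1 - p) over all items
    error_variance = 1.0 / information     # information inverted
    return math.sqrt(error_variance)       # SEE = SQRT(Error Variance)

for n in (21, 50, 100, 200, 400):
    see = see_at_target(n)
    print(f"{n:3d} items: error variance {see ** 2:.2f}, SEE {see:.2f}")
```

Under this assumption the error variance is 4 divided by the number of items, which is why doubling the test length halves it.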

The process of converging two normal scales (student scores and item difficulties) involves changing two normal distributions (0 to infinity with 50% being the midpoint) into locations (measures) on a logit scale (- infinity to + infinity with zero (0) being the midpoint). The IRT analysis then inverts the variance (information) to match the combined logit scale distribution (see comments I added to the bottom of Table 46). The apparent paradox appears if you ignore the two different scales for CTT (normal) and Rasch IRT (logit). Information must be inverted to match the logit distribution of measure locations.
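The inversion can be illustrated side by side. This is a sketch under two labeled assumptions: CSEM is computed with Lord's binomial error formula, and the Rasch SEE treats all 21 items as having one shared difficulty so that p = right / n. Table 46 may use different calculations; the function names are hypothetical.

```python
import math

N_ITEMS = 21  # test length used with Table 46

def ctt_csem(right, n=N_ITEMS):
    """CSEM in raw-score counts (Lord's binomial error formula, assumed)."""
    return math.sqrt(right * (n - right) / (n - 1))

def rasch_see(right, n=N_ITEMS):
    """SEE in logits, assuming equal item difficulties so p = right / n."""
    p = right / n
    information = n * p * (1.0 - p)
    return 1.0 / math.sqrt(information)

for right in (10, 14, 17, 20):
    print(f"{right:2d} right: CSEM {ctt_csem(right):.2f} counts, "
          f"SEE {rasch_see(right):.2f} logits")
```

Both statistics are built from the same right-times-wrong product; CSEM takes its square root and SEE takes the reciprocal of its square root, so one shrinks toward the extremes of the score scale while the other grows.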
Table 46

The Rasch model makes computer adaptive testing (CAT) possible. IMHO it does not justify a strong emphasis on items with a difficulty of 50%. Precision is also limited by the number of items on the test.  Unless the Rasch IRT partial credit model is used, where students report what they actually know, the results are limited to ranking a student by comparing the examinee’s responses to those from a control group that is never truly comparable to the examinee as a consequence of luck on test day (different date, preparation, testing environment and a host of other factors). The results continue to be the “best that psychometricians can do” in making test results “look right” rather than an assessment of what students know and can do (understand and value) as the basis for further learning and instruction.

Chart 101
* Free Source: Knowledge and Judgment and Partial Credit Model    (Knowledge Factor is not free)
Addendum: Billions of dollars and classroom hours have been wasted in a misguided attempt to improve institutionalized education in the United States of America using traditional forced-choice testing. Doing more of what does not work will not make it work. Doing more at lower levels of thinking will not produce higher levels of thinking results; instead, IMHO, it makes failure more certain (forced-choice at the bottom of Chart 101). Individually assessing and rewarding higher levels of thinking does produce positive results. Easy ways to do this have now existed for over 30 years! Two are now free.


Wednesday, June 10, 2015

CTT and IRT - Scores

[Chart and table numbers in this post continue from Multiple-Choice Reborn, May 2015. This summary seemed more appropriate in this post.]

Before digging further into the relationship between CTT and IRT, we need to get an overall perspective of educational assessment. When are test scores telling us something about students and when about test makers? How the test is administered is as important as what is on the test. The Rasch partial credit model can deliver the same knowledge and judgment information needed to guide student development as provided by Power Up Plus.

A perfect educational system has no need for an elaborate method of test item analysis. All students master assigned tasks. Their record is a check-off form. There is no variation within tasks to analyze.

Educational systems designed for failure (A, B, C, D, and F, rather than mastery) generate variation in response to test items from students with variation in preparation and native ability (nurture and nature). 

Further, there is a strongly held belief in institutionalized education that the “normal” distribution of grades must approximate the normal curve [of error]. Tests are then designed to generate the desired distribution (rather than let students report what they actually know and can do). Too many students must not get high or low grades. If they do, the data analysis is adjusted (two different results from the same set of answer sheets).

The last posts to Multiple-Choice Reborn make it very clear that CTT is a less complete analysis than IRT. Parts (CTT) cannot inform us about what is missing to make an analysis whole (IRT). Only the whole (IRT) can indicate what is missing (CTT). The Rasch IRT model may shed light on the missing parts not in CTT. The Rasch model seems to be very accommodating in making test results “look right” judging from its use in Arkansas and Texas to achieve an almost perfect annual rate of improvement and to “correct” or reset the Texas starting score for the rate of improvement.

Table 45
A mathematical model includes the fixed structure and the variable data set it supports or portrays. The fixed structure sets the limits in which the data may appear. My audit tool (Table 45) contains the data. Now I want to relate it to the fixed structures of CTT and IRT.

The CTT model starts with the observed raw scores (vertical right mark scale, Table 45a). Item difficulty is on the horizontal bottom scale. These values stored in the marginal cells are summed from the central cells containing right and wrong marks (Table 45a). Test reliability, test SEM and student CSEM are calculated from the tabled right mark data. This simple model starts with the right mark facts.

The Rasch model for scores turns right mark facts into natural logarithms: of the R/W ratio for student scores and of the W/R ratio for item right marks (Table 45b). [ln(ratio) = logit] Winsteps then places the mean of item wrong marks on the zero point of the score right mark scale. Now student ability = item difficulties at each measure location. [1 measure = 1 standard deviation on the logit scale]

The Rasch model for precision is based on probabilities generated from the two sets of marginal cells (score and difficulty, blue, Table 45b).  Starting with a generalized probability rather than the pattern of right and wrong marks makes IRT precision calculations different (more complete?) from CTT. The peak of the curve for items is arbitrarily set at the zero location by Winsteps (Chart 100). This also forces the variation to zero (perfect precision) at this location. [Precision will be treated in the next blog.]

I created a Rasch model for a test of 30 items to summarize the treatment of student scores (raw, measures and expected).

Chart 93
Chart 94
Chart 95
Chart 93 shows a normal distribution (BINOM.DIST) of raw scores for a 30 item test with an average score of 50% and of 80%. The companion normal (right count) distribution for item difficulty (Chart 94) from 30 students looks the same. This is the typical classroom display.
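The distributions described above can be regenerated without a spreadsheet. This sketch assumes the chart was built from the binomial probability of each raw score, which is what Excel's BINOM.DIST returns with its cumulative argument set to FALSE; the function name is hypothetical.

```python
from math import comb

def score_distribution(n_items, p_right):
    """Binomial probability of each raw score from 0 to n_items."""
    return [comb(n_items, k) * p_right ** k * (1 - p_right) ** (n_items - k)
            for k in range(n_items + 1)]

for mean_fraction in (0.5, 0.8):
    dist = score_distribution(30, mean_fraction)
    # Index of the largest probability is the most likely raw score.
    peak = max(range(len(dist)), key=lambda k: dist[k])
    print(f"average {mean_fraction:.0%}: most likely score {peak}, "
          f"probability {dist[peak]:.3f}")
```

The 50% curve is symmetric around 15 right marks; the 80% curve peaks at 24 and is squeezed against the top of the scale.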

The values in Chart 94 were then flipped horizontally. This normal (wrong count) distribution for item difficulty (Chart 95) prepares the item difficulty values to be combined with scores onto a single scale.

Chart 96
Chart 97
I created the perfect Rasch model curve for a 30 item test in two steps. The Rasch model for scores (solid black, Chart 96) equals the natural logarithm of the ratio of right/wrong [ln(R/W)] in Chart 93. Flipping the axes (scatter plot) produced the traditional appearing Rasch model Chart 97. This model is for any test of 30 items and for any number of students.   
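The first step, ln(R/W), takes only a few lines. This is a minimal sketch for the 30 item case; the function name is hypothetical, and scores of 0 or 30 are rejected because their logits are infinite.

```python
import math

N_ITEMS = 30

def score_to_logit(right, n=N_ITEMS):
    """Natural log of the right/wrong ratio for a raw score."""
    wrong = n - right
    if right <= 0 or wrong <= 0:
        raise ValueError("scores of 0 or n have no finite logit")
    return math.log(right / wrong)

for right in (3, 9, 15, 21, 24, 27):
    print(f"{right:2d} of {N_ITEMS} right -> {score_to_logit(right):+.2f} logits")
```

A score of 15 lands on zero, and scores an equal count above and below 15 land an equal distance on either side of it.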

Chart 98 shows the perfect Rasch model: the curve, and score and difficulty, for a test with an average score of 50%. The peak values for score and difficulty are at 15 items or 50% at zero measures. This of course never happens. The item difficulties generally have a spread of about twice that of student scores. (See Table 46 in Multiple-Choice Reborn, and the related charts for 21 items.)

Chart 99
Throughout this blog and Multiple-Choice Reborn, the maximum average test score that seems appropriate for the Rasch model has been near 80%, which agrees with comments from others. Chart 99 shows right mark score and wrong mark item values as they are input into the Rasch model. They balance on the zero point.

Chart 100
Next, Winsteps relocates the average test item value (red dashed) to the zero test score location (green dashed, Chart 100). Now item difficulty and student ability are equal at each and every location on the measures scale. I have reviewed several ways to do this for test items scored right or wrong: graphic, non-iterative PROX and iterative PROX.
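The relocation step alone can be sketched as a simple shift. This is only the centering idea, with hypothetical right counts for five items; PROX and Winsteps also rescale for the spread of scores and difficulties, which is not shown here.

```python
import math

N_STUDENTS = 30
# Hypothetical right-mark counts for five items (illustration only).
item_right_counts = [27, 24, 21, 15, 9]

# Item difficulty uses wrong/right, so harder items get larger logits.
raw_difficulty = [math.log((N_STUDENTS - r) / r) for r in item_right_counts]

# Shift every item by the mean so the average item sits at zero measures.
mean_difficulty = sum(raw_difficulty) / len(raw_difficulty)
centered = [d - mean_difficulty for d in raw_difficulty]

for right, measure in zip(item_right_counts, centered):
    print(f"{right:2d} right -> {measure:+.2f} measures")
```

The spacing between items is unchanged by the shift; only the zero point moves.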

In a perfect world the transforming line, IMHO, would be a straight line. Instead it is an S-shaped wave (a characteristic curve) that is the best psychometricians can do with the number system used. Both are used in Winsteps Table 20.1. Scores as measures are transformed into expected student scores (Winsteps Table 20.1). In a perfect world expected scores would equal raw scores; there would be no difference between CTT and IRT score results. [For practical purposes, the space between -1 measure and +1 measure can be considered a straight line; another reason for using items with difficulties of 50%.]
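How close the curve is to a straight line between -1 and +1 measure can be checked numerically. This sketch uses the standard Rasch probability for a single item with a difficulty of zero and compares it with the tangent line through the 50% point; both function names are hypothetical.

```python
import math

def expected_fraction(ability, difficulty=0.0):
    """Rasch probability of a right mark for one ability/difficulty pair."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def straight_line(ability):
    """Tangent to the Rasch curve at zero measures (slope 1/4)."""
    return 0.5 + ability / 4.0

for ability in (-3, -2, -1, 0, 1, 2, 3):
    curve = expected_fraction(ability)
    line = straight_line(ability)
    print(f"{ability:+d} measures: curve {curve:.3f}, line {line:.3f}, "
          f"gap {abs(curve - line):.3f}")
```

Within one measure of zero the gap stays under 0.02; by three measures it is close to 0.30, which is where the S shape takes over.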
