Most standardized tests include a small set of identical questions called common items. The idea is that the common items set a standard of performance. If the average performance on the common (C) items is the same when they are attached to Test A and to Test B, then Test A and Test B are equivalent tests. If the common items perform differently, then one test can be adjusted, equated, to fit the other test's frame of reference.

The 24-question by 24-student test was divided into two sets of nine unique questions (Test A and Test B) and one common set of six items that covered the range of difficulties. The estimated difficulty measure for each common item varied from test to test: A, B, C, and AB. The average measures of the six common items also varied from test to test. This variation may not occur with Winsteps when using hundreds of answer sheets.

A ratio near 1.0 between the standard deviations of the common-item measures on Test B and Test A indicates the two tests performed in a similar fashion and can be equated. Excel returned a Test B/Test A standard deviation ratio of 1.26/1.30, or 0.97, for the common items included in Test A and Test B. A perfect match would require not only a close match of standard deviations, but also of the other distribution characteristics: mean, mode, median, skew, and kurtosis.
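The ratio check above can be sketched in a few lines. The two standard deviations (1.30 and 1.26) come from the post; the individual common-item measures are not reported, so only the summary values are used here.

```python
# Ratio of common-item standard deviations, Test B over Test A,
# using the values reported in the post.
sd_test_a = 1.30  # SD of the six common-item measures on Test A
sd_test_b = 1.26  # SD of the same six items on Test B

ratio = sd_test_b / sd_test_a
print(round(ratio, 2))  # 0.97 -- close to 1.0, so the tests can be equated
```

A ratio well away from 1.0 would mean the common items spread out differently on the two tests, and a single additive correction would not be enough to align them.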

The difference in the location of the common items on Test A and Test B, each with 15 items, is related to Test A average measures (0.20) being estimated less difficult than Test B average measures (0.56). The related test scores were 80% for Test A and 84% for Test B. The easier the test, the higher the estimated matching item difficulty and student ability measures.

Test B measures are put into the frame of reference of Test A by subtracting the average of the Test B common measures from each Test B measure and then adding the average of the Test A common measures. The correction factor (- Test B average + Test A average) is -0.56 + 0.20, or -0.36. Adding -0.36 to each Test B measure puts it into the frame of reference of Test A. This is the same as sliding one Table 1.0 person-item map past the other in the [previous] blog, which produced a confirming value of -0.33.
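The equating step above amounts to one addition per measure. A minimal sketch, using the two common-item averages from the post; the list of Test B item measures is a hypothetical placeholder, since the individual values are not given here.

```python
# Common-item averages reported in the post.
test_a_common_avg = 0.20  # average measure of the six common items on Test A
test_b_common_avg = 0.56  # average measure of the same items on Test B

# Correction factor: -(Test B average) + (Test A average)
correction = round(-test_b_common_avg + test_a_common_avg, 2)
print(correction)  # -0.36

# Hypothetical Test B item measures, shifted into Test A's frame of reference.
test_b_measures = [1.10, 0.56, -0.25]
equated = [round(m + correction, 2) for m in test_b_measures]
print(equated)  # [0.74, 0.2, -0.61]
```

Note that a common item sitting exactly at the Test B average (0.56) lands at the Test A average (0.20) after the shift, which is the whole point of the adjustment.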

In either case, graphed or calculated, one collective group correction factor (which is most reliable near the mid-range of the scale) is applied to each individual item difficulty measure. This same practice of applying a group correction factor to individual statistics occurs in traditional item analysis, classical test theory (CTT), as on PUP Table 7, Test Performance Profile.
