Wednesday, February 23, 2011
The benchmark test is the standard for all future tests. It is constructed from calibrated questions and, generally, includes a set of common items. It has been administered at least once to verify that it satisfies statistical expectations. When scored by counting only right marks, those expectations cover only the ranking of each item and each student. The analysis says little about the test's validity for assessing what students trust they know, as a basis for future instruction and learning, or for predicting future job performance.
Calibrated questions are obtained by administering new questions to comparable groups of students. Field-testing presents a new set of questions to students without consequences. This is a common way to start the process of standardization. Operational testing embeds new questions in a test that has consequences, which yields more valid results. Winsteps estimates individual student ability and item difficulty measures. The measures are reported to provide sample-free item calibrations.
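Winsteps and Ministep estimate these measures with the dichotomous Rasch model, in which the probability of a right answer depends only on the difference between student ability and item difficulty, both on the logit scale. A minimal sketch in Python (the function name is mine, not Winsteps'):

```python
import math

def rasch_prob(ability: float, difficulty: float) -> float:
    """Dichotomous Rasch model: probability that a student of the given
    ability (logits) answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))

# When ability equals difficulty, the probability is exactly 0.5.
print(rasch_prob(1.0, 1.0))   # 0.5
# A more able student has a higher chance on the same item.
print(rasch_prob(2.0, 1.0))
```

This is why the logit scale is linear and sample-free: a one-logit gap between ability and difficulty implies the same probability of success wherever it occurs on the scale.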
Bookmarking is the practice of placing calibrated questions in a book in an order of increasing difficulty. Experts and judges examine each question to find the one a student must answer to pass the test, a very political process. The outcome has been highly variable (as it is impossible for those with full knowledge to guess what answers students with incomplete knowledge will mark). The average judged item difficulty of the selected questions sets the raw cut score for the test.
Classroom tests, unlike standardized tests, include everyone and everything. PUP Table 7 sorts questions into Mastery, Unfinished, and Discriminating (the type normally used in standardized tests). Even a classroom test can yield high test reliability (>0.90 KR20) if enough discriminating questions are included.
Reporters were concerned last year by the proximity of the random guessing score to the cut score. Ministep, implementing the Rasch model, does not recognize guessing. This gives students a handicap of 25% on 4-option questions. (Last year I customized PUP, as PUP7, to handle 7-option questions, a 14% handicap, for a faculty member with a cheating problem from answer copying.)
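The handicap figures follow directly from the number of answer options: blind guessing is expected to earn 1/k of the points on k-option questions. A quick check (the function name is illustrative):

```python
def guessing_handicap(options: int) -> float:
    """Expected score from blind random guessing on items
    with the given number of equally attractive options."""
    return 1.0 / options

print(f"{guessing_handicap(4):.0%}")  # 25% on 4-option questions
print(f"{guessing_handicap(7):.1%}")  # about 14.3% on 7-option questions
```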
Winsteps performs as advertised in estimating student and item measures, but it has no magic to correct errors in judgment made in creating the standardized test, managing drift, and setting cut scores. Time for testing places a limit on the number of questions on the benchmark test. The main goal of Winsteps is to help select the minimum number of questions that will yield acceptable test reliability.
Thursday, February 17, 2011
Most standardized tests include a small identical set of questions called common items. The idea is that the common items set a standard of performance. If the average performance of the common (C) items is the same when attached to Test A and to Test B, then Test A and Test B are equivalent tests. If the common items differ, then one test can be adjusted, equated, to fit the other test’s frame of reference.
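The adjustment amounts to adding a single constant: the difference between the common items' average measures as estimated within each test. A minimal sketch with made-up measures (the six values below are hypothetical, not the nursing school data):

```python
# Hypothetical common-item difficulty measures (logits),
# as estimated separately within Test A and Test B.
common_a = [-1.2, -0.5, 0.1, 0.4, 0.9, 1.5]
common_b = [-0.9, -0.2, 0.4, 0.7, 1.2, 1.8]

def mean(xs):
    return sum(xs) / len(xs)

# Shift needed to move Test B measures into Test A's frame of reference.
equating_constant = mean(common_a) - mean(common_b)

def equate_to_a(measure_b):
    """Place a Test B measure (student or item) on Test A's scale."""
    return measure_b + equating_constant

print(round(equating_constant, 2))  # -0.3 for these made-up values
```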
The 24 by 24 question-student test was divided into two sets of nine questions (Test A and Test B) and one common set of six items that covered the range of difficulties. The estimated difficulty measure for each common item varied from test to test: A, B, C, and AB. The average measures of the six common items also varied from test to test. This variation may not occur with Winsteps when using hundreds of answer sheets.
A Test B/Test A ratio near 1.0 for the standard deviations of the common item measures indicates the two tests performed in a similar fashion and can be equated. Excel returned a Test B/Test A standard deviation ratio (slope) of 1.26/1.30, or 0.97, for the common items included in Test A and Test B. A perfect match would require not only a close match of standard deviations, but also of the other distribution characteristics: mean, mode, median, skew and kurtosis.
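The check itself is a one-liner once the common item measures are in hand. A sketch with hypothetical measures, using the population standard deviation (dividing by n, which the Excel/Winsteps comparison in the earlier post suggests is what Winsteps reports):

```python
from statistics import pstdev

# Hypothetical common-item measures as estimated within each test.
common_in_a = [-1.3, -0.6, 0.2, 0.5, 1.0, 1.4]
common_in_b = [-1.1, -0.4, 0.3, 0.8, 1.1, 1.6]

# A ratio near 1.0 suggests the two tests can be equated.
ratio = pstdev(common_in_b) / pstdev(common_in_a)
print(round(ratio, 2))
```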
The difference in the location of the common items for Test A and Test B, each with 15 items, is related to Test A's average measure (0.20) being estimated as less difficult than Test B's (0.56). The corresponding test scores were 80% for Test A and 84% for Test B. The easier the test, the higher the estimated matching item difficulty and student ability measures.
In either case, graphed or calculated, one collective group correction factor (that is most reliable near the mid-range on the scale) is applied to each individual item difficulty measure. This same practice of applying a group correction factor to individual statistics occurs in traditional item analysis, classical test theory (CTT), as on PUP Table 7 Test Performance Profile.
Thursday, February 10, 2011
Another simple method of equating involves printing out Ministep Table 1.0 for each test and then sliding one alongside the other until they are in reasonable agreement. This post also introduces six equally spaced common items in preparation for common item equating. The 24 by 24 student/item nursing school test was divided into two sets of nine items and one set of six common items. This produced two 24-student by 15-item tests (A9C6 and B9C6).
The average measure for the six items selected as common items was zero; they were indeed uniformly spaced along the linear logit scale.
Winsteps Table 1.0 printed out with 12 divisions between each measure for both Test A and Test B. The spacing was identical for the two tests between -2 and +3 logits. The six common items can be identified by their answer sheet code, x2x, plus the item number: x2x 4. Winsteps Table 1.0 is a uniform linear playing field.
Without using the six common items, you must slide Test B vertically past Test A until “the overall hierarchy makes the most sense”. With the common items, you slide until the common items are in a best registration location. “The relative placement of the local origins (zero points) of the two maps is the equating constant.” (The student ability zero points on the two tests also come together. This makes sense as equivalent student ability and item difficulty have been plotted at the same points on this linear logit scale.)
In my judgment, the distance between the Test A and Test B item zero points is about 4 divisions x 1/12 logits or -0.33 logits.
Test A was more difficult than Test B. The constant -0.33 can now be added to the Table 13.1 measures in Test B: Test B 2.10 + (-0.33) = Test A 1.77. Equating reduces Test B student ability and item difficulty measures when they are placed within the Test A frame of reference.
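Applying the judged constant is then simple arithmetic; a sketch:

```python
# Equating constant judged by sliding the two Table 1.0 maps
# (about 4 divisions x 1/12 logit, or -0.33 logits).
EQUATING_CONSTANT = -0.33

def to_test_a_frame(measure_b: float) -> float:
    """Place a Test B measure in Test A's frame of reference."""
    return measure_b + EQUATING_CONSTANT

print(round(to_test_a_frame(2.10), 2))  # 1.77, as in the example above
```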
This post presents a visual view of what equating involves. A single constant can be added to all measures to equate two tests, as the measures are now on a linear logit scale. This constant is calculated when using common item equating rather than judged as above.
Wednesday, February 2, 2011
Concurrent or One-Step Equating
This is the simplest method of equating two tests. It puts all the data into one file. The Rasch model requires that the two sets have matching characteristics.
Two sets came from splitting the 24 by 24 student/item nursing school test into two 12-student by 24-item tests (A and B). Each was analyzed by Ministep.
Crossplotting values from Winsteps Table 13.1 Items verified that the two sets performed similarly, with item difficulties from Test B on the vertical axis and from Test A on the horizontal (trend line: y = 0.9861x + 0.0484). Also, the ratio of standard deviations (S.D.), or slope, was B (1.34)/A (1.28) = 1.05. Any value near one is acceptable.
Excel and Winsteps produced the same slope value. Excel produced a higher S.D. (1.37 instead of 1.34 on Group B) as Excel's sample standard deviation divides by n - 1, a correction for the small number of items.
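The discrepancy comes down to dividing by n versus n - 1. Python's statistics module exposes both, which makes it easy to reproduce with hypothetical measures:

```python
from statistics import pstdev, stdev

# Hypothetical item difficulty measures (logits) for one group.
measures = [-2.0, -1.1, -0.4, 0.3, 0.9, 1.6, 2.2]

# Population S.D. divides by n; sample S.D. divides by n - 1
# (the small-sample correction Excel's STDEV applies).
print(round(pstdev(measures), 2))
print(round(stdev(measures), 2))   # always the larger of the two
```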
The S.D. ratio near one is an indicator that the two tests are performing in a similar manner, not a determination that they are exactly alike. Other statistics complete a fuller view than the S.D. alone.
The values for median (the middle value) and mode (most frequent tally) are of little use with samples of only 12 students and 24 questions, illustrated on the Estimated Item Difficulty chart above, except to indicate that the data are skewed (mean, median and mode are not the same). Group B has almost no skew (0.05).
A kurtosis near zero (Excel reports excess kurtosis) indicates the sample fits the relative height of the normal curve. Group B is very flat (-1.31). Four of the five Group B plot points are about the same height on the Item Difficulty Distribution chart. Most striking on this chart is that when two small sets of very similar data (A and B) are combined, the result (C) takes on a much different appearance. A 24 by 24 student/item matrix (about 500 data points) is near the minimum requirement for both Winsteps and PUP. Part of the change in appearance is captured in maximum, minimum and range (see top chart). All of these statistics deal with the characteristics of group performance rather than individual student or item performance.
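Excel's SKEW and KURT formulas, presumably the source of the 0.05 and -1.31 figures, can be reproduced directly; KURT is an excess kurtosis, so a normal-height distribution scores near zero and a flat one goes negative. A sketch with a symmetric, flat, hypothetical data set:

```python
from statistics import mean, stdev

def excel_skew(xs):
    """Skewness as computed by Excel's SKEW (uses sample S.D.)."""
    n, m, s = len(xs), mean(xs), stdev(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / s) ** 3 for x in xs)

def excel_kurt(xs):
    """Excess kurtosis as computed by Excel's KURT (normal curve = 0)."""
    n, m, s = len(xs), mean(xs), stdev(xs)
    g = sum(((x - m) / s) ** 4 for x in xs)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * g
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

data = [-2, -1, 0, 1, 2]               # symmetric and flat
print(round(excel_skew(data), 3))      # 0.0: no skew in a symmetric set
print(round(excel_kurt(data), 3))      # -1.2: flatter than the normal curve
```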
These basic statistics need to be kept in mind as an overall perspective on more specific analyses. Winsteps captures individual estimated student ability and item difficulty measures. PUP captures what individual students trust they know and their ability to use what they know: quantity and quality (with Knowledge and Judgment Scoring). Once these numbers have been obtained, they are easily manipulated. Many different stories can be told from the same data, especially when students are not permitted to exercise their own judgment in reporting what they trust, on paper tests and with computer adaptive testing (CAT).