Wednesday, March 30, 2011
PCRM - Stability
Data stability has a different meaning for classroom tests and standardized tests. Standardized tests seek predictive statistics based on more than 300 answer sheets. Classroom tests seek descriptive statistics based on 20 to 50 answer sheets. Standardized tests need to find the fewest questions that will produce the desired test reliability. Classroom tests need to find a rank for grading (summative assessment) or to reveal what each student knows and needs to know to be successful in the current instructional environment (in a formative-assessment process).
If the instructional environment is functioning at lower levels of thinking, the test must be given shortly after training to expect the highest scores. If functioning at higher levels of thinking, meaningful test results must be returned to the student shortly after the test to develop the highest scoring performance. Both timing and level of thinking influence data stability.
A bi-weekly test in a general studies remedial biology course with 100 students was divided into four parts of roughly 25 answer sheets each, grouped by when the sheets were turned in: after 20, 30, 40, and 50 minutes (Parts 1, 2, 3, and 4).
The four Ministep bubble charts show Parts 1, 3, and 4 to be very similar. Part 2 has measures with larger bubbles, that is, lower reliability. When the 25-answer-sheet files were combined into 50-answer-sheet files, the bubbles generally shrank and reliability increased. Items 19 and 20, both 100% right, were edited into the Part 1, 2, and 1&2 charts because the Rasch model ignores all-right and all-wrong responses.
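One way to see why all-right and all-wrong items drop out of the estimation: a first approximation to an item's difficulty is the log-odds of a wrong answer, and that quantity has no finite value at 0% or 100% right. The sketch below is only an illustration of that arithmetic, not the estimation routine used by Ministep. The function name `item_logit` and the counts are hypothetical.

```python
from math import log

def item_logit(right, n):
    """First-approximation item difficulty in logits: ln(wrong / right).

    Returns None for extreme items (0 right or all right), because the
    log-odds is undefined there and no finite measure can be estimated.
    """
    wrong = n - right
    if right == 0 or wrong == 0:
        return None  # extreme score: cannot be placed on the logit scale
    return log(wrong / right)

# Hypothetical counts out of 25 answer sheets
print(item_logit(20, 25))  # an easy item gets a negative difficulty
print(item_logit(25, 25))  # a 100%-right item has no finite estimate
```

An item such as 19 or 20 above (100% right) therefore has to be placed on a chart by hand if it is to appear at all.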
Scatter plots between Part 1&2 and 3&4 show practical stability for classroom descriptive results for both item scores (percent right) and item measures with two exceptions. Items 12 and 23 measures are outliers. The first received only right and omit marks; the second only unexpected wrong and omit marks.
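The comparison behind such a scatter plot can be sketched numerically: compute each item's percent right in the two combined files, then correlate the two lists. This is a minimal sketch with made-up 0/1 marks (1 = right, 0 = wrong or omit); the data and the function names `percent_right` and `pearson` are assumptions for illustration, not the values or tools used for this post.

```python
from statistics import mean

def percent_right(sheets):
    """Percent right for each item across a list of 0/1 answer sheets."""
    n = len(sheets)
    n_items = len(sheets[0])
    return [100 * sum(sheet[j] for sheet in sheets) / n for j in range(n_items)]

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical marks: rows are answer sheets, columns are items
first_half = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [1, 0, 0]]
second_half = [[1, 1, 0], [1, 1, 0], [1, 0, 1], [0, 0, 0]]

p1 = percent_right(first_half)
p2 = percent_right(second_half)
print(p1, p2, round(pearson(p1, p2), 3))
```

A correlation near 1 with no points far from the trend line is what "practical stability" looks like in this kind of check; an outlier item shows up as a point well off the line.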
PUP descriptive statistics show that Part 2, the part with the larger bubbles in its chart, had the lowest item discrimination (0.28 pbr) of the four parts. This low item discrimination resulted in the lowest test reliability (0.47 alpha) of the four parts. The effect then carried over into Part 1&2 (0.29 pbr discrimination and 0.62 alpha test reliability).
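For readers who want to see how the two statistics relate, here is a minimal sketch of point-biserial discrimination (pbr) and alpha computed from a matrix of 0/1 marks. It assumes the item is included in the total score when computing pbr and uses population variances; PUP may make different choices. The marks and the function names are hypothetical.

```python
from statistics import mean, pvariance

def cronbach_alpha(scores):
    """Alpha for a students x items matrix of 0/1 marks.

    For right/wrong items this is the same quantity as KR-20.
    """
    k = len(scores[0])
    item_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

def point_biserial(scores, j):
    """Correlation between item j (0/1) and the total score (item included)."""
    item = [row[j] for row in scores]
    total = [sum(row) for row in scores]
    mi, mt = mean(item), mean(total)
    cov = mean([(x - mi) * (t - mt) for x, t in zip(item, total)])
    return cov / (pvariance(item) ** 0.5 * pvariance(total) ** 0.5)

# Hypothetical marks: rows are students, columns are items
marks = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]

alpha = cronbach_alpha(marks)
pbr_item0 = point_biserial(marks, 0)
print(round(alpha, 2), round(pbr_item0, 2))
```

When items discriminate poorly, the item variances make up a larger share of the total-score variance, so alpha falls, which is the pattern reported for Part 2.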
Winsteps item measure reliability (0.91) was identical for Part 1&2 and 3&4, even though the person measure reliability varied from 0.55 to 0.71. Here is evidence, in part, that “Rasch measures represent a person’s ability as independent of the specific items, and item difficulty as independent of specific samples within standard error estimates.” (Bond & Fox, 2007, page 280)