Data stability has a different meaning for classroom tests and standardized tests. Standardized tests seek predictive statistics based on more than 300 answer sheets; classroom tests seek descriptive statistics based on 20 to 50 answer sheets. Standardized tests need to find the fewest questions that will produce the desired test reliability. Classroom tests need to rank students for grading (summative assessment) or to reveal what each student knows, and still needs to know, to succeed in the current instructional environment (a formative-assessment process).
When the instructional environment functions at lower levels of thinking, the test must be given shortly after instruction to expect the highest scores. When it functions at higher levels of thinking, meaningful test results must be returned to students shortly after the test to develop the highest scoring performance. Both timing and level of thinking influence data stability.


PUP descriptive statistics show that the bubble chart for Part 2 had the lowest item discrimination (0.28 pbr) of the four parts. This low discrimination produced the lowest test reliability (0.47 alpha) of the four parts; the effect then carried over into Part 1&2 (0.29 pbr discrimination and 0.62 alpha test reliability).
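The two statistics above are standard classical-test-theory quantities: pbr is the point-biserial correlation of an item with the total score, and alpha is Cronbach's alpha. A minimal sketch of both, using only the standard library (the small 0/1 response matrix in the usage example is illustrative, not the PUP data):

```python
import statistics

def cronbach_alpha(scores):
    """Cronbach's alpha from a list of examinee response vectors (0/1)."""
    k = len(scores[0])                                  # number of items
    totals = [sum(row) for row in scores]               # each examinee's total score
    item_vars = [statistics.pvariance([row[i] for row in scores]) for i in range(k)]
    return (k / (k - 1)) * (1 - sum(item_vars) / statistics.pvariance(totals))

def point_biserial(scores, item):
    """Correlation of one 0/1 item with the total score (pbr discrimination)."""
    x = [row[item] for row in scores]
    totals = [sum(row) for row in scores]
    mx, mt = statistics.mean(x), statistics.mean(totals)
    cov = sum((a - mx) * (b - mt) for a, b in zip(x, totals)) / len(x)
    return cov / (statistics.pstdev(x) * statistics.pstdev(totals))

# Illustrative data: 4 examinees x 3 items
scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
alpha = cronbach_alpha(scores)        # → 0.75 for this toy matrix
pbr = point_biserial(scores, 1)
```

A low pbr on several items drags alpha down directly, because alpha rises when item variances covary with the total score; this is the mechanism behind Part 2's 0.28 pbr producing the 0.47 alpha.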
The Winsteps item measure reliability (0.91) was identical for Part 1&2 and Part 3&4, even though the person measure reliability varied from 0.55 to 0.71. Here is evidence, in part, that "Rasch measures represent a person's ability as independent of the specific items, and item difficulty as independent of specific samples within standard error estimates" (Bond & Fox, 2007, p. 280).
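The item-independence claim in the Bond & Fox quote follows from the form of the dichotomous Rasch model, where the probability of a correct answer depends only on the difference between person ability and item difficulty. A minimal sketch (the ability and difficulty values are made up for illustration):

```python
import math

def rasch_p(theta, b):
    """Dichotomous Rasch model: P(correct) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

# The log-odds gap between two persons is the same on every item,
# which is the sense in which ability estimates are item-independent.
theta_a, theta_b = 1.0, -0.5
for b in (-1.0, 0.0, 2.0):            # three items of different difficulty
    gap = logit(rasch_p(theta_a, b)) - logit(rasch_p(theta_b, b))
    # gap == theta_a - theta_b == 1.5 regardless of which item is used
```

Because item difficulty cancels out of every person comparison (and ability cancels out of every item comparison), item calibrations can remain stable (0.91 reliability in both runs) even while the precision of person measures varies with the sample.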