The benchmark test is the standard for all future tests. It is constructed from calibrated questions and generally includes a set of common items. It has been administered at least once to verify that it satisfies statistical expectations. When the test is scored by counting only right marks, those expectations cover only the ranking of each item and each student. The analysis says little about the validity of the test for assessing what students trust they know, whether as the basis for future instruction and learning or for future job performance.
Calibrated questions are obtained by administering new questions to comparable groups of students. Field-testing presents a new set of questions to students in a test that carries no consequences. This is a common way to start the process of standardization. Operational testing embeds new questions in a test that does have consequences, for more valid results. Winsteps estimates individual student ability measures and item difficulty measures. The measures are reported to provide sample-free item calibrations.
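For readers unfamiliar with what these measures mean, here is a minimal Python sketch of the dichotomous Rasch model, the model behind Winsteps. It is not Winsteps code and it estimates nothing. The function names and the sample difficulties are assumptions made up for illustration. It only shows how a student ability measure and an item difficulty measure, both in logits, combine into the probability of a right mark.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a right mark under the dichotomous Rasch model.

    Both arguments are measures in logits.
    """
    return 1.0 / (1.0 + math.exp(difficulty - ability))


def expected_raw_score(ability, difficulties):
    """Expected number of right marks for one student on a set of items."""
    return sum(rasch_probability(ability, d) for d in difficulties)


# Hypothetical item difficulties in logits, invented for this example.
item_difficulties = [-1.0, 0.0, 1.0]

# A student whose ability equals the item difficulty has an even chance.
print(rasch_probability(0.0, 0.0))  # 0.5
print(expected_raw_score(0.0, item_difficulties))
```

Only the gap between ability and difficulty enters the formula, which is one way to read the claim that the calibrations do not depend on the particular sample.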
Bookmarking is the practice of placing calibrated questions in a book in order of increasing difficulty. Experts and judges then work through the questions to find the one a student must answer correctly to pass the test. This is a very political process. The outcome has been highly variable, as it is impossible for those with full knowledge to guess what answers students with incomplete knowledge will mark. The average judged item difficulty of the selected questions sets the raw cut score for the test.
Classroom tests, unlike standardized tests, include everyone and everything. PUP Table 7 lists Mastery, Unfinished, and Discriminating questions; the Discriminating questions are the ones normally used in standardized tests. Even a classroom test can yield high test reliability (KR20 > 0.90) if enough discriminating questions are included.
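KR20 (Kuder-Richardson formula 20) can be computed directly from a table of right and wrong marks. The sketch below is a minimal Python version under stated assumptions: marks are coded 1 for right and 0 for wrong, the population variance of the total scores is used, and the small response table is invented for illustration. It is not the PUP or Winsteps implementation.

```python
from statistics import pvariance


def kr20(responses):
    """KR20 reliability for a table of 0/1 marks.

    responses: list of rows, one row per student, one 0/1 mark per item.
    """
    n_students = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    total_variance = pvariance(totals)  # population variance of raw scores
    if n_items < 2 or total_variance == 0:
        raise ValueError("KR20 needs at least 2 items and varying scores")

    # Sum of p * q over items, where p is the proportion of right marks.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_students
        pq_sum += p * (1.0 - p)

    return (n_items / (n_items - 1)) * (1.0 - pq_sum / total_variance)


# Hypothetical marks: 5 students by 4 items, invented for this example.
marks = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(kr20(marks), 2))  # 0.8
```

Items that everyone gets right or everyone gets wrong contribute nothing to the score variance, which is consistent with the point that reliability depends on including enough discriminating questions.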
Reporters were concerned last year by how close the cut score was to the random guessing score. Ministep, which implements the Rasch model, does not recognize guessing. This gives students a handicap of 25% on 4-option questions, the score expected from random guessing alone. (Last year I customized PUP, as PUP7, to handle 7-option questions, a 14% handicap, for a faculty member with a cheating problem from answer copying.)
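The handicap figures are simply the chance score from random guessing: one divided by the number of options. A small Python sketch of that arithmetic follows. The function names and the 40% cut score are hypothetical, chosen only to illustrate how close a cut score can sit to the chance score.

```python
def chance_score_percent(n_options):
    """Expected percent score from random guessing on n-option questions."""
    if n_options < 1:
        raise ValueError("a question needs at least one option")
    return 100.0 / n_options


def margin_above_chance(cut_score_percent, n_options):
    """Percentage points between a cut score and the chance score."""
    return cut_score_percent - chance_score_percent(n_options)


print(round(chance_score_percent(4)))  # 25
print(round(chance_score_percent(7)))  # 14

# Hypothetical cut score of 40% on 4-option questions.
print(margin_above_chance(40.0, 4))  # 15.0
```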
Winsteps performs as advertised in estimating student and item measures, but it has no magic to correct errors of judgment in creating the standardized test, in managing drift, or in setting cut scores. The time available for testing limits the number of questions on the benchmark test. The main goal of Winsteps is to help select the minimum number of questions that will yield acceptable test reliability.