Wednesday, March 30, 2011
Data stability has a different meaning for classroom tests and standardized tests. Standardized tests seek predictive statistics based on more than 300 answer sheets. Classroom tests seek descriptive statistics based on 20 to 50 answer sheets. Standardized tests need to find the fewest questions that will produce the desired test reliability. Classroom tests need to find a rank for grading (summative assessment) or to reveal what each student knows and needs to know to be successful in the current instructional environment (in a formative-assessment process).
If the instructional environment is functioning at lower levels of thinking, the test must be given shortly after training to expect the highest scores. If functioning at higher levels of thinking, meaningful test results must be returned to the student shortly after the test to develop the highest scoring performance. Both timing and level of thinking influence data stability.
A bi-weekly general studies remedial biology course test with 100 students was divided into four parts: roughly 25 answer sheets turned in after 20, 30, 40, and 50 minutes (Parts 1, 2, 3, and 4).
The four Ministep bubble charts show Parts 1, 3, and 4 to be very similar. Part 2 has measures with larger bubbles and lower reliability. When the 25-answer-sheet files were combined into 50-answer-sheet files, the bubbles, in general, shrank and reliability increased. Items 19 and 20, marked 100% right, were edited into the Part 1, 2, and 1&2 charts because the Rasch model ignores all-right and all-wrong responses.
Scatter plots between Part 1&2 and 3&4 show practical stability for classroom descriptive results for both item scores (percent right) and item measures with two exceptions. Items 12 and 23 measures are outliers. The first received only right and omit marks; the second only unexpected wrong and omit marks.
PUP descriptive statistics show that the bubble chart for Part 2 had the lowest item discrimination (0.28 pbr) of the four parts. This low item discrimination resulted in the lowest test reliability (0.47 alpha) for the four parts. This effect then carried over into Part 1&2 (0.29 pbr discrimination and 0.62 alpha test reliability).
Winsteps item measures reliability (0.91) was identical for Parts 1&2 and 3&4 even though the person measures reliability varied from 0.55 to 0.71. Here is the evidence, in part, that, “Rasch measures represent a person’s ability as independent of the specific items, and item difficulty as independent of specific samples within standard error estimates.” (Bond & Fox, 2007, page 280)
Wednesday, March 23, 2011
The bubble charts for the nursing school test, used in the first 20 posts, and for a general studies biology test are strikingly different. The nursing test had a cut score of 75%, was right count scored, and only three of the 24 students failed. The biology test had a cut score of 60%, was knowledge and judgment scored, and only one of the 24 students passed.
The active starting score on the nursing test was set at zero with one point for right and zero for omit and wrong. The active starting score for the biology test was set at 50% (knowledge, and the judgment to use that knowledge, had equal value) with one point for right, ½ point for omit, and zero for wrong. (Partial credit analysis used 2, 1, and 0.)
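The two scoring rules can be sketched as follows (a minimal illustration; the function names and mark encoding are my own, not PUP's):

```python
def right_count_score(marks):
    """Right count scoring (nursing test): 1 point for a right mark,
    0 for a wrong mark or an omit."""
    return sum(1 for m in marks if m == "right")

def knowledge_judgment_score(marks):
    """Knowledge and Judgment Scoring (biology test): 1 point for right,
    1/2 point for omit, 0 for wrong.  Knowledge and the judgment to use
    it have equal value."""
    points = {"right": 1.0, "omit": 0.5, "wrong": 0.0}
    return sum(points[m] for m in marks)
```

Under the second rule a student who omits every item earns half the possible points, which is why the active starting score is said to be set at 50%.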
The average scores were 84% for nursing and 50% for biology. Nursing students were preparing for a licensure test. Remedial course biology students only needed credit for the course. (It is customary for these freshman students to take this first test without studying to determine if attending lecture will suffice, a case of high school lag.)
The overfit students (1 and 8, for example) tended to omit (good judgment to not make a wrong mark) more than other students. The underfit students (17 and 24, for example) tended to not use good judgment (GJ) as often as other students.
Winsteps flagged four of the six omits on item 24 as unexpected. Rasch model analysis thus flags students who omitted when the odds favored a right answer over a wrong one: 4 of 576 responses, or less than 1% of the time. This rate is less than the average marking error on paper tests.
Only two students, with test scores of 56.3% and 64.6%, are predicted to pass the biology course on PUP Table 3a. Neither student earned scores of 70% or higher (cut score plus one letter grade when using right count scoring). Instead both students demonstrated their ability to use what they knew with quality scores of 100% and 73%.
Ministep estimates the person abilities of students 21000053 (21) and 25000059 (25) at 0.17 measures. PUP test scores for both are 56.3%. Student 21 may pass because of a quality score of 100%. Student 25 is predicted to fail because of a quality score of only 60%. The quality score was used to break ties in the annual NWMSU Science Olympiad.
PUP Table 3 EGMD makes a selection more interesting. This table of student judgment is unique to Knowledge and Judgment Scoring. It guides students along the path from passive pupil to self-motivated high achiever (average quality score of 90%, always above 80%). Students 21 and 25 differ markedly among the four categories of items. Student 21 unexpectedly marked a misconception right. Student 25 marked three of the four misconceptions wrong. Student 21 omitted all three discriminating items, but student 25 marked all three right. These descriptive data, at the individual level, can yield many stories with unknown reliability. Reliability comes with repeated performance.
Partial credit Rasch model analysis and Knowledge and Judgment Scoring together produce a more meaningful and easy to use classroom and proficiency report than either alone. Caveat: This small sample size was used for illustrative purposes.
Wednesday, March 16, 2011
A variety of ways has evolved to report standardized test results with the “Lake Wobegon” goal of all schools being above average or improving each year: percent right score, ranked score, scaled score, standardized scaled score, and percent improvement.
The percent right score is the number of right marks divided by the number of questions on the test. It cannot be manipulated. Ministep, however, calculates the number of right marks divided by the total number of marks when estimating student ability and item difficulty measures.
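The distinction matters whenever students omit items. A sketch (the function names are mine, not Ministep's):

```python
def percent_right(right_marks, questions_on_test):
    """Classic percent right score: right marks over questions on the test."""
    return right_marks / questions_on_test

def ministep_proportion(right_marks, total_marks):
    """The proportion Ministep works from when estimating measures: right
    marks over marks actually made, so omitted items are not counted."""
    return right_marks / total_marks
```

A student with 15 right marks on a 25-item test who omitted 5 items scores 60% by the first rule but 75% by the second.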
Ranked scores are obtained by sorting right mark scores.
Scaled scores are positive and arbitrary. In general, the central scaled score corresponds to zero logits. Scaling involves a multiplier and a constant. A logit scale running from -5 to +5 can be scaled by multiplying by 10 (-50 to +50) and then adding a constant of 50 (0 to 100). The multiplier spreads out the scale. The constant sets the value related to zero logits. Winsteps Table 20.1 prints a logit scale (default) and a percent (0 to 100 point) scale. Table 20.2 prints a 500 point scale.
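The scaling arithmetic can be written out in one line (a sketch; the defaults match the -5 to +5 example):

```python
def scale_logit(logit, multiplier=10.0, constant=50.0):
    """Linear rescaling of a logit measure.  The multiplier spreads out
    the scale; the constant sets the scaled value assigned to zero logits."""
    return multiplier * logit + constant
```

With the defaults, -5 logits maps to 0, zero logits to 50, and +5 logits to 100.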
Standardized scaled scores go one step farther in adjusting the result to fit a predefined range. The central scaled score may not correspond to zero logits. The scaling multiplier and constant must be published to obtain the original logit scale and the original raw scores.
Percent improvement compares scaled scores from one year to the next. When the scaled score and the percent right scores are not published, there is no way to know the basic test results.
Each “improvement” in reporting test results, over the past few years, makes the output from Winsteps less transparent. The Rasch model and traditional CTT analyses provide data for these methods of reporting but are not responsible for the end uses. Most disturbing is a recent report that, in general, there is no way to audit the judgments made in developing annual reported school values.
Wednesday, March 9, 2011
A good cut score prediction (the benchmark and operational tests yield similar scores) results from skill and luck. If the common set of items (1/4 to all of the questions) is not stable, the prediction fails.
The Rasch model prediction is based on a sequence of transformations starting with observed benchmark student raw scores. The average score on the nursing school test, for example, was 84%. The cut score that every student was to achieve was 75%. Three students did not achieve the cut score.
Rasch Estimated Measures, Chapter 11, is summarized on the Rasch Model Playing Field chart. The average student right answer score (#1) of 84% is transformed into an estimated student ability measure of +1.7 logits. The average item wrong answer score (#2) of 16% is transformed into -1.7 logits and then into an estimated item difficulty measure that matches the student ability measure at that location. The degree of adjustment needed (1.7 logits) to match the measures is less the closer the average student right score is to 50%. At 50% there is no adjustment (#3).
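The first step of that transformation is the log-odds (logit) function; a minimal sketch, assuming natural logarithms:

```python
import math

def score_to_logit(p):
    """Transform a proportion-right score into logits: ln(p / (1 - p)).
    0.84 maps to about +1.7, 0.16 to about -1.7, and 0.50 to exactly 0."""
    return math.log(p / (1 - p))
```

This also shows why no adjustment is needed at 50%: the log-odds of 0.50 is zero logits.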
The standard deviation of student scores and item difficulties needs to be in close agreement within each test (person 1.09 and item 1.30 logits on this test) and between tests. This requirement of the Rasch model, using Ministep, has been encountered several times in this blog: Item Discrimination, Person Item Bubble Chart, Standard Units, Perfect Rasch Model, One Step Equating, and Common Item Equating. A perfect match of person and item measures requires identical standard deviations.
The black lines on the charts represent the observed score ogive and the linear, score-from-measure static features of the Rasch model (Winsteps Table 20.1). They are not changed by test data expressed in standardized units: Standard Units and Perfect Rasch Model.
Rasch model measures predict success half of the time for students on items with matching abilities. But in practice, cut scores are for ALL students to achieve ALL of the time on a test with mixed abilities and difficulties. A group of students whose abilities matched item difficulties set for an expected score of 0.72 would be expected to fail the nursing test about half of the time. When the expected score of 0.84 was set against the cut score of 75%, all but three students passed.
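The half-the-time claim follows directly from the Rasch success probability (a sketch, not PUP or Winsteps code):

```python
import math

def p_success(ability, difficulty):
    """Rasch model: probability that a person of the given ability
    answers an item of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(difficulty - ability))
```

When ability exactly equals difficulty the probability is 0.5; only as person ability moves above item difficulty does the expected score rise toward the cut score.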
PUP classroom predictions of success assume a minimum preparation, on average, of one letter grade above the cut score. Higher quality (understanding) is more reliable than higher quantity (rote memory), in general.
Rasch model predictions of expected scores are more precise. They are based on the unique property that estimated measures are person free and item free. The desired passing rate can be obtained by selecting the correct range and mix of calibrated questions around the cut score (assuming teachers, instruction, students, learning, and attitude remain stable, which, one would hope, would not be the case).
Wednesday, March 2, 2011
A bad passing cut score prediction creates havoc for psychometricians, politicians, education officials, teachers, students, parents and taxpayers. It is an indication that for all the care used in selecting calibrated questions for the benchmark test, something went wrong. Psychometricians expect “something to go wrong” at some low rate of occurrence. Predictions based on probabilities are expected to fail occasionally. It is the same matter of luck that students count on to pass when cut scores are set very low.
Psychometricians must deal with two cases: Case 1, Test B, this year’s test, came in above the benchmark test, Test A. Too many students are passing. Or was it the other way? Case 2, Test A, this year’s test, came in below the benchmark test, Test B. Too many students are failing. In either case political credibility is lost and education officials discount the importance and meaningfulness of the assessment.
Case 1 occurred in the previous post on Common Item Equating, where Test B values were equated into the Test A frame of reference, a change of -0.36 measures. Let’s reverse the assumption for Case 2 and equate Test A values into the Test B frame of reference, a change of +0.36 measures. Both operations produce an over-correction with respect to the right answer (0.48) this audit obtained from Test AB. Over-correction is understandable since the correction is made either from a too-high value to a too-low value or vice versa. The best result (0.48) again comes from uniting all data into one benchmark analysis.
It is important to determine what is reality: the expected prediction or the observed results. A recent solution to this problem is to not make a prediction and/or to recalibrate the current year’s results. The objective is to safely make the results look right, a long-standing tradition in institutionalized education.
A cut score prediction is required in research work (the test for a significant difference between chance and the observed results demands a prediction). A cut score prediction is not needed in application. Research work deals with averages. Application deals with individual students. However, because NCLB tests are administered as forced-choice tests, the results have meaning only as the average ranking of how groups (class, school, district, and state) performed on the test.
Given the quality of forced-choice data (a rank), simple methods for setting cut scores are appropriate. Traditional failing cut scores have been any score between one and two standard deviations below the mean, or scores ranking in the lower 10% to 20%. This allows students to compete for passing. A student’s rank determines the grade and passing. This is a bad way to set cut scores if passing implies being prepared for future course work or the workplace.
A good cut score should separate those who will succeed from those who will fail without additional learning. Anything less is a form of social promotion. Passing the test becomes a rite of passage. The nursing school test cut score of 75% was based on long term experience that students ranking below the cut score tended to fail the NCLEX licensure test on their first attempt.