Wednesday, April 27, 2011
Experience during the past few years with multiple-choice tests scored by only counting right marks has made it clear that cheating occurs at all levels, from student to state house. The usual method for detecting this activity is to compare observations with probabilistic models. The downside of this approach is that the models are generally too simplistic to match the real world. Also, school administrators value “catching them in the act” (a very difficult thing to do) far more than “statistics” applied to individual students.
An alternative is to make use of the information content in each student answer string. Answer strings can be matched by collating, filtering and sorting. Presumptive cheating is then a marked departure from the class norm. Confirmed cheating usually requires additional information that is accepted by students and administrators.
The PUP copy detector shows a suspect pair on Part 1&2, students 11 and 29, with a standardized index (Z) value of 3, a marginal level of detection. This pairing shows a string of 14 identical marks followed by strings of 2 and 7 identical marks. This is presumptive cheating.
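PUP's actual copy-detection index is not spelled out here, but the string-matching idea can be sketched: scan two answer strings in parallel and collect the runs of identical marks. The function name and the answer strings below are hypothetical, not the actual fall 88 data; they are built only to reproduce the 14-2-7 pattern described above.

```python
def identical_runs(a, b):
    """Return the lengths of maximal runs of identical marks
    in two equal-length answer strings."""
    runs, length = [], 0
    for x, y in zip(a, b):
        if x == y:
            length += 1
        elif length:
            runs.append(length)
            length = 0
    if length:
        runs.append(length)
    return runs

# Hypothetical answer strings for two students (placeholders, not real data):
# identical for 14 marks, one difference, 2 identical, one difference, 7 identical.
s11 = "ABCDABCDABCDAB" + "X" + "CD" + "Y" + "ABCDABC"
s29 = "ABCDABCDABCDAB" + "Z" + "CD" + "W" + "ABCDABC"
print(identical_runs(s11, s29))  # [14, 2, 7]
```

Sorting the class by such run patterns is what makes a marked departure from the class norm stand out.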
The student counseling matrixes show identical strings within unfinished (-A@D@EE-) and within misconception and guessing (D@EE-A). None of the other fifty students marked in this fashion. Question 9 was flagged by Ministep as most unexpected right. Only two students, those with the lowest scores, shared this classification.
I would not call this a confirmed case of copying, as six of the seven identical pairs were non-critical; that is, the identical wrong marks were too common on the test. In my judgment, this pair did not fail the test for independent marking. Failure would require additional information. There is also no marked departure from the class norm.
A record of presumptive cheating is easy to keep on mark matrixes sorted by student ID and question number (PUP Table 3b) or by score and difficulty (PUP Table 3c). Answer sheets were coded with three spaces each for test number and seat number. Students filled in their three-digit student number. This information generally permitted confirming cheating without resorting to multiple test forms (a negative, time-wasting procedure). Relying on their written record (answer sheets), just as scientists do, modeled the ethics of science as these students explored and developed their ability and desire to make sense of biological literature for the rest of their lives. (On the Internet, it is even more important to have formed the habit of questioning and confirming the information encountered.)
The most successful classroom policy I used to manage copying was to state clearly that answer sheets would be checked for cheating to protect honest students. Any answer sheet that failed the check would receive a score of zero. I would help any student who wanted to protest this decision to student affairs (no student ever protested, which was, in itself, further confirmation of cheating). Two students were detected twice over a nine-year period. They readily admitted copying but were both unhappy with themselves over finding that their “foolproof” methods from other courses did not work here.
Wednesday, April 20, 2011
The Rasch model does not include guessing. This does not make it go away. Multiple-choice, by design, has a built-in average random-guessing score of one part in the number of printed answer options. Active scoring starts at 25% for 4-option questions scored by counting right marks. This scores and rewards the lowest levels of thinking. Active scoring starts at 50% for Knowledge and Judgment Scoring, where higher levels of thinking are assessed and rewarded. If a student elects to mark all questions, both methods of scoring, included in PUP, yield the same score. The two methods of scoring respond the same to guessing.
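The two starting points, and the claim that both methods agree when every question is marked, can be sketched. The half-credit-for-omits rule below is my assumption made only to reproduce the 50% starting score; PUP's actual Knowledge and Judgment weighting may differ.

```python
def right_count_score(marks, key):
    """Traditional scoring: fraction of items marked right."""
    return sum(m == k for m, k in zip(marks, key)) / len(key)

def kj_score(marks, key):
    """Knowledge and Judgment Scoring sketch (assumed weights):
    right marks earn full credit, omits (None) earn half credit,
    wrong marks earn nothing -- so omitting everything yields 50%."""
    right = sum(m == k for m, k in zip(marks, key) if m is not None)
    omitted = sum(m is None for m in marks)
    return (right + 0.5 * omitted) / len(key)

key = ["A", "B", "C", "D"]
print(right_count_score(["A", "B", "C", "C"], key))  # 0.75
print(kj_score(["A", "B", "C", "C"], key))           # 0.75 -- all marked: same score
print(kj_score([None, None, None, None], key))       # 0.5  -- the 50% starting score
```

Under these assumed weights, a student who marks every question gets identical scores from both methods, which matches the behavior described above.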
Knowledge and Judgment Scoring, however, gives students the responsibility to learn and to report what they trust they know and can do. It is one form of “student-centered” instruction. This is critical in developing high quality self-correcting students, and as a result, high scoring students.
Five guessing items were found on Part 1&2 and six on Part 3&4 of the biology fall 88 test. These are items that fewer students than the average score on the test elected to mark, yet fewer of those who marked were right than the portion marking. A few students believed they knew, but they did not know.
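One reading of that flagging rule can be sketched as a function. The function name and thresholds are my assumptions; the rule compares the item's marking rate to the test's mean score (as a proportion) and the right-mark rate among markers to the marking rate.

```python
def is_guessing_item(marked, right, n_students, mean_score):
    """Guessing flag (assumed reading of the rule): fewer students than the
    mean score elected to mark, and the right rate among markers fell
    below the marking rate -- students believed they knew but did not."""
    mark_rate = marked / n_students
    right_rate = right / marked if marked else 0.0
    return mark_rate < mean_score and right_rate < mark_rate

# Hypothetical item in a class of 50 with a mean score of 70%:
# 30 students marked (60% < 70%), only 12 of them right (40% < 60%).
print(is_guessing_item(marked=30, right=12, n_students=50, mean_score=0.70))  # True
```

An item that nearly everyone marks, or that markers answer well, would not be flagged.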
The four [unfinished] items on Part 3&4 were also among the six guessing items. Most of the Ministep “most unexpected right responses” (dark blue) occurred on these items on Part 1&2 and 3&4. Are they guessing (chance or good luck) or just marking error that also occurs among the other groups of items?
Assuming that the most unexpected responses involve carelessness, guessing, and marking error, these play a small part in determining a student’s score. The rate tends to increase as student performance decreases. Many unexpected wrong and right answers tend to occur in pairs; one cancels the effect of the other. Only consistent analysis is required to obtain comparable results.
If I interpret the above correctly, the partial credit Rasch model (PCRM) can ignore guessing in estimating person and item measures. However, a teacher or administrator cannot ignore the active starting score of a multiple-choice test in setting cut scores. A cut score set a few points above the range of random guessing is a bogus passing standard even if the test contains “difficult” items.
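The bogus-cut-score point can be illustrated with the binomial distribution of blind-guess scores. The test length and cut score below are hypothetical, chosen only to show the scale of the problem for 4-option items.

```python
from math import comb

def p_at_or_above(cut, n_items, p_chance):
    """Probability that a blind guesser reaches the cut score or better
    on n_items independent items, each right with probability p_chance."""
    return sum(comb(n_items, k) * p_chance**k * (1 - p_chance)**(n_items - k)
               for k in range(cut, n_items + 1))

# 50 four-option items: chance score is 12.5 right (25%).
# A cut set a few points above, at 15 right (30%):
print(p_at_or_above(15, 50, 0.25))  # roughly a quarter of pure guessers pass
```

When a sizable fraction of students who know nothing can clear the cut by luck alone, the cut is not a passing standard, no matter how difficult the items are said to be.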
Thursday, April 14, 2011
Knowledge and Judgment Scoring allows students the option of reporting what, in their own judgment, they know, can do, and find meaningful and useful. This generates four options instead of the usual two (right and wrong) obtained from multiple-choice items by the traditional count of right marks.
A misconception is a question that most students believe they know the right answer to, and mark, when in fact they do not know (more students than the average score on the test elected to mark, yet fewer of those who marked were right than the portion marking). Only one item was flagged as a misconception, item 6, on both halves of the biology fall 88 test, compared to four on an earlier test.
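The parenthetical rule can be sketched the same way as the guessing flag, with the marking inequality reversed. The function name is mine and this is one reading of the rule, not PUP's published algorithm.

```python
def is_misconception_item(marked, right, n_students, mean_score):
    """Misconception flag (assumed reading of the rule): more students than
    the mean score elected to mark, yet the right rate among markers fell
    below the marking rate -- most believed they knew, but did not."""
    mark_rate = marked / n_students
    right_rate = right / marked if marked else 0.0
    return mark_rate > mean_score and right_rate < mark_rate

# Hypothetical item in a class of 50 with a mean score of 60%:
# 45 students marked (90% > 60%), only 18 of them right (40% < 90%).
print(is_misconception_item(marked=45, right=18, n_students=50, mean_score=0.60))  # True
```

The difference from a guessing item is the confidence: here most of the class committed to an answer and was wrong.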
Top students who rushed through the test marked “A” as a “most unexpected wrong response” on item 6 on Part 1&2. Other students gave a mixed response on Part 1&2. Only two top students who took their time marked “A” on Part 3&4. Most of the remaining students marked “A” wrong on Part 3&4. This observation gives rise to several stories.
Did top students in a hurry pick the same answer as lower scoring students who took their time? Did students functioning at higher levels of thinking and taking their time reason out the correct answer? Is this just sampling error?
Misconceptions make for good class discussions. Lectures seem to have little effect on changing student misconceptions. One misconception question repeated in several bi-weekly exams stabilized with most students omitting it. The one thing they did know was that they did not understand something that would allow them to trust marking an answer to the question correctly (nor did they have in mind a meaningless right answer matching an option on the question, as the answer options were not always the same and were always randomized).
The combined report from Ministep and PUP opens a new window of inquiry into the behavior of students and test items. In my experience, these student-counseling matrixes provided a better insight into how a class and students were performing than reading “blue book” essays. Winsteps adds predictive measures to otherwise descriptive classroom data.
Thursday, April 7, 2011
Item discrimination identifies the items that separate a group of students who know and can do from a group that cannot. The Rasch model identifies the estimated measure at which students with an ability matching the item difficulty will make a right mark 50% of the time. Item discrimination is not a part of the partial credit Rasch model (PCRM); however, Winsteps and PUP both print out the point-biserial r (pbr), which estimates item discrimination.
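The pbr is the correlation between an item's 0/1 mark vector and students' total scores. A minimal computation on synthetic data (the function name and data are mine, not from either program's printout):

```python
from statistics import mean, pstdev

def point_biserial(item_marks, total_scores):
    """Point-biserial r for a 0/1 item: standardized gap between the mean
    total score of students who got the item right (1) and wrong (0)."""
    m1 = mean(s for m, s in zip(item_marks, total_scores) if m == 1)
    m0 = mean(s for m, s in zip(item_marks, total_scores) if m == 0)
    p = mean(item_marks)  # proportion who marked the item right
    return (m1 - m0) * (p * (1 - p)) ** 0.5 / pstdev(total_scores)

# Synthetic class of six: the item is marked right only by the top scorers,
# so it discriminates strongly.
print(point_biserial([1, 1, 1, 0, 0, 0], [6, 5, 4, 3, 2, 1]))
```

An item everyone answers right (or wrong) has no score variance to split and contributes nothing to discrimination.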
About 10 discriminating items are needed in a classroom test to produce a good range of scores with which to set grades. The two halves of the biology fall 88 test (Part 1&2 and 3&4) show 11 and 16 discriminating items in PUP Table 7. All 11 discriminating items in Part 1&2 are found among the 16 in Part 3&4 (average pbr of 0.29 and 0.33, and average alpha of 0.62 and 0.77). A test composed of 50 items with this discrimination ability is expected to have an alpha of 0.92, putting it into the reliability range of standardized tests. A practical standardized test uses fewer items with more discrimination ability.
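The 50-item projection is consistent with the Spearman-Brown prophecy formula applied to the 16 Part 3&4 discriminating items with alpha 0.77; that this is how the figure was obtained is my assumption.

```python
def spearman_brown(alpha, old_n, new_n):
    """Predicted reliability when a test of old_n items with reliability
    alpha is lengthened (with comparable items) to new_n items."""
    k = new_n / old_n
    return k * alpha / (1 + (k - 1) * alpha)

# 16 discriminating items at alpha 0.77, extended to a 50-item test:
print(round(spearman_brown(0.77, 16, 50), 2))  # 0.91, near the 0.92 cited
```

The formula assumes the added items discriminate as well as the originals, which is why a practical standardized test can get away with fewer, sharper items.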
Dropping down from averages of groups of items and students to individual items and students restricts the validity of PUP Table 3a printouts to descriptive statistics for each test. (The Rasch model printouts from Ministep for individual estimated person and item measures are valid predictions as well as descriptions.) What needs to be re-taught and what can students do to correct their errors?
A teacher can mix students who marked discriminating items correctly with students who did not know, or marked wrong, to sort out their errors. This is in contrast to an unfinished item, which signals a problem in instruction, learning, and/or assessment; here the teacher must take the lead. These are the only items I reviewed in a class that promoted the use of higher levels of thinking by way of Knowledge and Judgment Scoring.
End-of-course standardized tests scored at the lowest levels of thinking (only counting right marks) have only one valid use: ranking. There is no way for current students to benefit from the testing. New designs for 2011 will use “through-course” assessment. Even in low-level-of-thinking environments there is time for meaningful corrections at the individual student and classroom levels. One plan (August 2010) spaces parts of the test evenly through the course; the other spaces parts over the last 12 weeks of the course.
Neither plan replaces the good teaching practice of periodic assessment, detailed enough that students do not fall so far behind that they cannot catch up with the class. Self-correcting students find the student counseling matrixes helpful. Most of these biology students were functioning at or above an 80%-right high-quality score by the end of the semester.