Friday, May 11, 2018

How to evaluate multiple choice questions on your exam

Professor Matthew Bruckner asked about how to evaluate multiple choice questions, so I thought I'd share how I go about reading my analysis report for multiple choice exams. (It's also a much-needed move away from blogging about idiosyncratic preferences in legal education....) To do that, I'll offer a portion of a redacted analysis report, and how I use it.

[Figure: redacted excerpt of a multiple choice analysis report]

Whew. This was actually the first run of my multiple choice on one exam, for reasons I'll explain in a moment. It's an excerpt from an exam with more than 30 multiple choice questions. (If you have the opportunity to take advantage of these reports, do so!)

Let's start at the top. The higher the reliability coefficient (a figure between 0 and 1), the better your exam is at distinguishing among test-takers. It's really evaluating how consistently individuals performed across questions. As a rough rule of thumb, I aim for a figure above 0.5, but I don't get down if it's below that. It's worth noting this is only a rough way to estimate the quality of the exam, and the figure is much less valuable in an exam with a relatively small number of multiple choice questions or few students. A low figure may mean the exam is too easy (too many people got too many answers correct and it's not differentiating students), too hard (problem in reverse), or poorly drafted (students have to guess at ambiguous questions)--low figures are a red flag. It also might mean you simply have a group of students that are all of like ability, but I've found that's less likely the explanation.
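For readers who want to reproduce a figure like this themselves, here is a minimal sketch. It assumes the report's reliability coefficient is the Kuder-Richardson 20 (KR-20) statistic, which is common on item analysis reports but is not confirmed by the report itself. The function name `kr20` and the 0/1 score layout are illustrative choices.

```python
from statistics import pvariance


def kr20(scores):
    """KR-20 reliability for a list of students, each a list of 0/1 item scores.

    Assumption: the vendor's "reliability coefficient" is KR-20; other
    reports may use a different statistic.
    """
    n_students = len(scores)
    n_items = len(scores[0])
    totals = [sum(row) for row in scores]
    total_var = pvariance(totals)  # population variance of total scores
    if n_items < 2 or total_var == 0:
        raise ValueError("need at least two items and some spread in total scores")
    # Sum of p * (1 - p) over items, where p is the share answering correctly.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in scores) / n_students
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_var)


# Hypothetical data: 5 students, 4 questions (1 = correct, 0 = incorrect).
example_scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
example_reliability = kr20(example_scores)
```

With so few students and questions the number is only illustrative, which matches the caution above about small exams.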

Along the left are a few clues to help evaluate each question. The first is the total percentage who answered each question correctly; you can see figures ranging from 89.58% to 18.75%. A high percentage of correct answers can be okay if you want to test simple or straightforward concepts, but too many such questions and the test is too easy. The reverse holds for questions that few students answer correctly. Questions with a low correct-answer rate--say, below 40%--are a flag for me to see if they're simply difficult or if I did something wrong.
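The percent-correct column is simple to compute from scored answers. The sketch below assumes each student's results are stored as a list of 0/1 values, one per question. The function names are hypothetical, and the 40% default simply mirrors the rough cutoff mentioned above.

```python
def percent_correct(scores):
    """Percentage of students answering each question correctly."""
    n_students = len(scores)
    n_items = len(scores[0])
    return [
        100.0 * sum(row[j] for row in scores) / n_students
        for j in range(n_items)
    ]


def low_rate_questions(scores, threshold=40.0):
    """1-based question numbers whose percent correct falls below threshold."""
    return [
        j + 1
        for j, pct in enumerate(percent_correct(scores))
        if pct < threshold
    ]
```

A question that lands on this list is not necessarily defective. It is only a prompt to reread the question and the key.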

The next three columns--upper 27%, lower 27%, and point biserial--are ways of evaluating the quality of individual questions. If those with high scores (i.e., upper 27%) on the overall exam typically answered a given question at a high rate, and those with low scores (i.e., lower 27%) on the overall exam typically answered a given question at a low rate, it means that it's helping separate performance in the class.

Take Question 5: about 71% of students got it right, but 100% of the top-performing students got it right, compared to 46% of the bottom students. That relationship translates to the point biserial. A 0.58 biserial on Question 5 means there's no red flag with this question--it's separating performance among test-takers. It may even be a sign it's a good question. (These are measured on a -1 to 1 scale.)
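These three columns can be approximated as follows. This is a sketch under stated assumptions: the point biserial is computed against each student's total score including the question itself (some reports exclude it), the 27% groups are sized by rounding, and ties in total score are broken by input order. The function names are illustrative.

```python
from math import sqrt
from statistics import mean, pstdev


def group_rates(scores, item, fraction=0.27):
    """Percent correct on one question among the top and bottom scorers."""
    ranked = sorted(scores, key=sum, reverse=True)  # highest totals first
    k = max(1, round(fraction * len(ranked)))       # size of each group

    def rate(group):
        return 100.0 * sum(row[item] for row in group) / len(group)

    return rate(ranked[:k]), rate(ranked[-k:])


def point_biserial(scores, item):
    """Point-biserial correlation between one 0/1 question and total score.

    Returns None when it is undefined (everyone right, everyone wrong,
    or no spread in total scores).
    """
    totals = [sum(row) for row in scores]
    right = [t for row, t in zip(scores, totals) if row[item] == 1]
    wrong = [t for row, t in zip(scores, totals) if row[item] == 0]
    sd = pstdev(totals)
    if not right or not wrong or sd == 0:
        return None
    p = len(right) / len(totals)
    return (mean(right) - mean(wrong)) / sd * sqrt(p * (1 - p))
```

The point biserial here is the ordinary Pearson correlation between the 0/1 question score and the total score, which is why it runs from -1 to 1.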

Now, that also means that low biserials are a red flag--if the questions aren't separating the top from the bottom students, there may be a problem with the question.

If there's a low biserial but a high total percentage correct, then it just means the question is pretty easy and there's not much that the question is doing to separate students. Consider Question 15: nearly 90% of the class got it right, with 100% of the top and 100% of the bottom, giving it a dismal -0.05 biserial. But it's not too harmful--it just means that it's an easy question, maybe something to weed out in the future. (Then again, sometimes I like to include easy questions to test certain basic concepts.)

But if there's a low biserial and a low total percentage correct, I may have a problem--perhaps my question has a defect, or I bubbled in the wrong answer on my master Scantron. Sure enough, look at Question 11: 15% of the top students and 8% of the bottom students got it right, for a biserial of 0.13 and a total percentage correct of just 18.75%. That's a big signal for me to go check my work. Sure enough, I bubbled in the wrong answer on my Scantron sheet, so I could send it back for a redo.
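The triage logic in the last few paragraphs can be written down as a small rule. The cutoffs below (0.2 for a "low" biserial, 80% for an "easy" question) are hypothetical values chosen for illustration. Only the 40% figure comes from the discussion above, and none of them are fixed standards.

```python
def triage(pct_correct, biserial, low_biserial=0.2, low_pct=40.0, high_pct=80.0):
    """Rough first-pass label for one question (all cutoffs are assumptions)."""
    if biserial >= low_biserial:
        return "no red flag"
    if pct_correct < low_pct:
        # Low biserial and few correct answers: possible defect or wrong key.
        return "check the answer key and the question"
    if pct_correct >= high_pct:
        # Low biserial but nearly everyone right: just an easy question.
        return "easy question, little separation"
    return "weak separation, review the question"


# Figures quoted in the post (percentages for Questions 5 and 15 are approximate).
question_5 = triage(71.0, 0.58)
question_15 = triage(90.0, -0.05)
question_11 = triage(18.75, 0.13)
```

The labels are only a starting point. The actual decision still comes from rereading the question and the key.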

I feel much more confident in asserting that low biserials are a red flag, and high ones are the absence of a red flag. That's also a reason why I'd then correlate the essay answers with the multiple choice answers. With some concession that they're testing different things, we'd hope to see a strong relationship between the two elements of the exam.
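One way to check that relationship is a plain Pearson correlation between each student's multiple choice score and essay score. The sketch below is generic, and the sample scores are invented for illustration only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 long")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list has no spread")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for five students, listed in the same student order.
mc_scores = [20, 25, 30, 18, 27]
essay_scores = [55, 60, 72, 50, 70]
mc_essay_correlation = pearson(mc_scores, essay_scores)
```

Because the two parts test somewhat different skills, a moderate positive value is a reasonable expectation rather than a value near 1.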

In short, there's a wealth of information in these reports. They can help you troubleshoot problems on your exam. If you plan on reusing some questions, or have the opportunity to modify them, the results can help you improve them for future use.

Posted by Derek Muller on May 11, 2018 at 09:11 AM | Permalink


@MichaelRisch - We've got a new tool for Exam4 called ExamStats coming out this summer. It takes our multiple choice raw score data and gives you a nice array of analytics. I've contacted your exam administrator, and we'll get a preview copy to you via that route. Thanks in advance for any feedback you can provide.

Posted by: Greg Sarab / Extegrity | Jun 15, 2018 3:02:28 PM

Thank you.

Posted by: Matthew Bruckner | May 17, 2018 3:22:20 PM

Years ago, I was given the advice to create questions of differing levels of difficulty. Some easy questions to give students a chance to show that they learned *something* and should pass the course (though doing well only on these questions puts a student in the D/C- range). Some challenging questions to give students a chance to show that they have grasped a good bit of the course, getting them into the C/C+/B- range. Some tougher questions, to give students a chance to rack up scores putting them into the B/B+ range. A handful of very challenging questions so that the A-/A students can demonstrate their mastery of the material. So, for me, doing this analysis helps to make certain there aren't too many questions in any of the categories.

Posted by: James Edward Maule | May 15, 2018 6:25:04 PM

We include answers to our multiple choice questions in our exam software, and get these reports as a matter of course. Like Derek, I use the report to double-check the validity of my multiple choice questions. I also make my own assessment of the consistency between the multiple choice and essay portions of the exam. For the most part, there is a strong correlation. But there are definitely some students who excel at answering one type of question and not the other.

Posted by: Rebecca Bratspies | May 13, 2018 1:27:12 PM

These are all good suggestions, and I do something very similar with my MC questions. One addition I'll make is that the approach Derek lays out is extremely cumbersome if you intend to use a variety of metrics (e.g., % correct, or Biserial Correlation) to adjust your key iteratively (which I do routinely). The last thing one wants to do is to re-send a new key over to the Scantron tech each time one wants to play with adjusting which answer(s) get partial/full credit on a question. RATHER, it's preferable simply to ask for an Excel / CSV file with the students' raw answers, inserting your own key with correct answer(s) just above or below the list of answers. From there it's easy to manipulate basic Excel functions to score and re-score the exam as you fiddle with your key. In particular, the use of the =IF(..), =AVERAGE(..) and =CORREL(..) functions in Excel will allow you to reproduce most of the columns in Derek's OP.
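The same score-and-re-score workflow can be sketched outside a spreadsheet. The example below is an illustration in Python rather than Excel. It assumes a hypothetical CSV layout (a header row, a student ID in the first column, one answer letter per question after that), and it gives full credit for any accepted answer rather than partial credit.

```python
import csv
import io


def score_exam(raw_csv, key):
    """Score raw answers against a key.

    raw_csv: CSV text with a header row, then one row per student
             (assumed layout: student ID, then one answer per question).
    key:     list of sets, one per question, holding every accepted answer.
    Returns a dict mapping student ID to a list of 0/1 item scores.
    """
    reader = csv.reader(io.StringIO(raw_csv))
    next(reader)  # skip the header row
    results = {}
    for row in reader:
        if not row:
            continue  # ignore blank lines
        student, answers = row[0], row[1:]
        results[student] = [
            1 if answer.strip().upper() in accepted else 0
            for answer, accepted in zip(answers, key)
        ]
    return results


# Hypothetical raw answers for three students on three questions.
raw_answers = "id,Q1,Q2,Q3\ns1,A,B,C\ns2,A,C,C\ns3,B,B,D\n"

original_key = [{"A"}, {"B"}, {"C"}]
revised_key = [{"A"}, {"B", "C"}, {"C"}]  # also accept C on Question 2

first_pass = score_exam(raw_answers, original_key)
second_pass = score_exam(raw_answers, revised_key)
```

Changing the key and calling `score_exam` again plays the role of editing the key row in the spreadsheet. The raw answers never need to go back to the Scantron tech.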

(FYI I usually write exams with MC, Short-Answer, and Issue Spotter sections, and I have become a firm believer in this type of mixture of testing protocols; the correlation between parts is usually between about 0.25 and 0.65, depending on the year).

Posted by: Eric Talley | May 13, 2018 10:35:50 AM

I used multiple choice questions on a final exam this year for the first time and never again. If the material doesn't lend itself to an issue-spotting exam with unclear legal precedents, which was the case with my International Financial Regulation class, then I'm going back to papers and class presentations.

Posted by: Douglas Levene | May 12, 2018 2:35:22 AM

We get the same item analysis from our scantron vendor too...

Posted by: Aaron Tang | May 11, 2018 6:41:34 PM

I'm in Michael's boat. Maybe there's a service here that would compile a report like this, but I analyze my own exams in Excel. Does anyone know if there is a function that will do reliability coefficient, and upper 27/lower 27/point biserial? I can't really throw any questions out as long as they are valid, but I'd like to know which question forms not to re-use.

Posted by: Bruce Boyden | May 11, 2018 1:09:18 PM

We get this kind of analysis from the outside vendor who does our multiple choice grading off of the Scantrons. I have not used them in several years because I switched to a method in which I give a multiple choice quiz online through Blackboard after every unit. (8 quizzes in the upper level business class; 20 quizzes over the course of the full year of contracts). Blackboard offers you a similar analysis if you click on the quiz and look for "Item Analysis."

Posted by: Jeff Lipshaw | May 11, 2018 12:25:13 PM

What is giving you this report? We use Exam4, and I have to do this type of analysis by hand.

Posted by: Michael Risch | May 11, 2018 12:07:53 PM

Speaking as a junior faculty member who just used multiple choice questions for the second time as part of an exam, let me thank you for this very, very helpful breakdown of how you read and use your item analysis!!

Posted by: Aaron Tang | May 11, 2018 12:02:13 PM
