
## Thursday, January 15, 2015

### Fair Grading in a World of Curves? Concepts and an Algorithm...

**A Candidate Algorithm for Fair Curve-Fitting**

Posted by Paul Gowder on January 15, 2015 at 12:45 PM in Life of Law Schools, Teaching Law | Permalink

## Comments

Thanks for the compliment, but running code always beats vaporware. Real artists ship.

Posted by: James Grimmelmann | Jan 21, 2015 10:28:03 PM

You're welcome. And for the record, Grimmelmann's solution is better.

Posted by: dmp | Jan 21, 2015 3:45:12 PM

(Although I suppose, given the infuriatingness of getting datasets in and out of anything, R and CSV are the least infuriating combination.)

Posted by: Paul Gowder | Jan 20, 2015 8:23:51 PM

Thanks DMP---that's an interesting approach. I don't know k-means clustering, but based on a quick Wikipedia skim, it looks like a plausible approach---worth playing around with this code on some real grades.

(Although it pains me a little to hear R described as easy to import/export CSVs from. Every single time I've ever done stats, the hardest thing has been to get my data into R, regardless of original format. One time I even killed a whole project because I simply could not get someone's SPSS data into a form I could work with...)

Posted by: Paul Gowder | Jan 20, 2015 8:21:04 PM

It looks like there are different problems based on how the school sets up the curve. I had a go at the problem where you have a mean grade and a fixed number of buckets. A reasonable approach is to use k-means clustering to group the students into buckets, and then assign the buckets various grades until you've come up with a combination that gets you closest to the desired mean (i.e. one combo goes A, A-, B, B-, C; another goes A+, B, B-, C+, C; etc.). See the implementation below in the R language (which is very easy to import/export CSVs from).

```r
# User input numbers
numpartitions = 5
meangrade = 3.15
ranktogrademap = c(2, 2.3, 2.7, 3, 3.3, 3.7, 4, 4.3)

# Some constants
numpossiblegrades = length(ranktogrademap)
matrixofgradecombos = combn(1:numpossiblegrades, numpartitions)

# Create a fake data set of scores
numstudents = 100
samplegrades = rnorm(numstudents, 50, 20)

# Partition the data set into the number of buckets
x = kmeans(samplegrades, numpartitions)
y = rank(x$center)[x$cluster]  # force the groupings into rank order

# Loop through the combinations of grades to get the average grade for each combo
vectorofaverages = numeric(ncol(matrixofgradecombos))  # allocate the vector
for (i in 1:length(vectorofaverages)) {  # get mean grade of each combination
  vectorofaverages[i] = mean(ranktogrademap[matrixofgradecombos[, i][y]])
}

# Identify the combo that gets you closest to the desired mean
gradecomboindex = which.min(abs(vectorofaverages - meangrade))

# Create the table of grades
DF = data.frame(RAWSCORE = samplegrades, RANKGROUP = y,
                GRADE = ranktogrademap[matrixofgradecombos[, gradecomboindex][y]])

# Print the output
DF
```

Posted by: dmp | Jan 20, 2015 6:45:08 PM

Joey - I'd hate to be the student with the F- so the rest of the GPA works out...

To answer your question, different schools do it different ways. My school does the bucket method throughout, but I know others that do it as you describe. The benefit of the bucket method, especially in the first year, is that bimodal graders don't dominate who gets on to law review.

Posted by: Michael Risch | Jan 17, 2015 4:11:29 PM

This is perhaps an obvious question, but do most law schools really specify "curves"?

Mine just specifies a required mean grade -- which faculty and students refer to colloquially as "the curve" -- but this leaves a lot of flexibility. If the grade distribution looks like a sharp pointy curve around that point, a bimodal distribution, a flat line, whatever, it's fine as long as the grades average out to the required mean. (For first-year courses it's a little more prescriptive -- the school has some requirements about the % of A's, B's, etc., which is closer to requiring a "curve".)

I like the way my school does it (for non-first-year courses) because let's face it, sometimes a class really does have a bimodal or otherwise other-than-normal distribution and it would be somewhat odd, creating arbitrary winners and losers, to require a fit to a pre-determined curve.

Posted by: Joey | Jan 17, 2015 12:02:45 PM

I would recommend you read and consult "Effective Grading" 2d ed. by Barbara Walvoord & Virginia Anderson. Really a comprehensive guide to grading and evaluating students.

Posted by: Eugene Pekar | Jan 16, 2015 12:49:53 PM

Not directly on point, but if I were putting together a tool to be distributed, I would not try to support arbitrary CSV files. It's an ill-specified file format.

Posted by: brad | Jan 16, 2015 12:11:02 PM

What James said. This is how I do it, albeit by hand. I have a spreadsheet with the percentile sizes for each cutline, and it tallies the score needed to hit that cutline. Because there are ties, that rarely works out evenly (though in my experience it is always awfully close). From there, I can manually adjust the score breaks to match the cutlines as closely as possible, or decide whether I'm going to deviate because I just don't think the difference at that cutline is the difference between an A and a B. If you want an example, let me know.

As for your transformations, it's been a while since I did hard-core math, but my experience is that you want to keep most transformations exponential (log-based) or multiplicative (percentage of top score), etc., to bring numbers closer or farther. I don't think adding and subtracting is a great transformation because it doesn't actually change the shape of the curve. In that sense, even multiplication doesn't do much.
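(A minimal sketch of why additive---and even affine---transformations leave the shape of the distribution untouched; Python, with made-up scores for illustration only:)

```python
import statistics

raw = [42, 55, 61, 68, 74, 80, 93]  # hypothetical raw scores

def zscores(xs):
    # standardize: subtract the mean, divide by the standard deviation
    mu = statistics.mean(xs)
    sd = statistics.stdev(xs)
    return [round((x - mu) / sd, 6) for x in xs]

shifted = [x + 10 for x in raw]        # add a constant
scaled  = [1.5 * x + 10 for x in raw]  # full affine transform

# The standardized shape is identical in all three cases:
print(zscores(raw) == zscores(shifted) == zscores(scaled))  # → True
```

Any transformation of the form a*x + b relabels the axis without reshaping the curve; only nonlinear transforms (log, powers) actually pull scores closer together or push them apart.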

Posted by: Michael Risch | Jan 15, 2015 4:54:11 PM

While it's not been popular in law schools, traditionally the way to minimize error in grading has been to increase the number of things graded, so that any particular error is less likely to make a significant difference in the final grade. Unfortunately, law courses—especially first-year courses—have a long tradition of being graded solely on the basis of a single exam. Adding a midterm, a short brief, or some other assignment can help to minimize these errors. Plus, you have the added benefits that come from having more graded assignments: you can test on a greater variety of issues/skills, students are more likely to learn as they go along rather than cram at the end, etc.
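(A quick simulation of the error-averaging point; Python, with the true score and noise level invented purely for illustration. Averaging n independently-graded items shrinks the grading noise roughly as 1/sqrt(n):)

```python
import random
import statistics

random.seed(0)
TRUE_SCORE = 75.0  # the student's "real" level (hypothetical)
NOISE_SD = 8.0     # assumed grading error per assessment

def final_grade(n_assessments):
    # each graded item = true score + independent grading error
    marks = [TRUE_SCORE + random.gauss(0, NOISE_SD) for _ in range(n_assessments)]
    return statistics.mean(marks)

def spread(n_assessments, trials=5000):
    # how much the final grade varies across simulated re-gradings
    return statistics.stdev(final_grade(n_assessments) for _ in range(trials))

print(spread(1), spread(4))  # the 4-assessment spread is roughly half
```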

Posted by: Charles Paul Hoffman | Jan 15, 2015 4:50:00 PM

Aah, I see, yes, James, excellent point. Any thoughts on how one might figure out the algorithm to carry that out?

Posted by: Paul Gowder | Jan 15, 2015 4:36:56 PM

Paul, my formulation of the problem automatically preserves the ordering of students. That's why I posed the optimization problem as a search for the best cutlines. If the cutline between A- and B+ is at 59, then any student who got a 59 or above gets at least an A- and any student who got a 58 or below gets at most a B+. Students with different raw scores might get the same grade, but they will never get reversed grades. Someone who implemented an optimization algorithm as I describe wouldn't need a step corresponding to your #7.
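(A minimal sketch of why cutline-based assignment automatically preserves ordering; Python, cutlines and grade labels hypothetical:)

```python
import bisect

# hypothetical cutlines: a raw score at or above a cutline lands in the
# higher bucket -- e.g. the B+/A- cutline here is at 59
cutlines = [50, 59, 70, 82]
grades = ["B", "B+", "A-", "A", "A+"]  # buckets from lowest to highest

def grade(raw):
    # bisect_right puts a score equal to a cutline into the higher bucket
    return grades[bisect.bisect_right(cutlines, raw)]

# Because bisect is monotone, a higher raw score can never get a lower grade:
print([grade(s) for s in [45, 58, 59, 75, 90]])  # → ['B', 'B+', 'A-', 'A', 'A+']
```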

Posted by: James Grimmelmann | Jan 15, 2015 4:20:16 PM

Phil - a big problem (and the main justification as far as I can see for curves) is that absent a curve, different professors will end up with different means for their grades. This is highly problematic as there is no a priori reason to believe that the students assigned to Professor A are markedly different from the students assigned to Professor B. Indeed, assuming that (like at my school) curves are used mostly for first-year courses and students are randomly assigned to sections in their first year, your assumption would probably be the opposite - that the average ability of the students in Professor A's and Professor B's classes ought to be the same. If this is the case, then having different means for Professors A and B means that students in one Professor's class are getting better/worse grades than they would for exactly the same work in the other class. That's bad. Thus the curves are designed to ensure that a person who makes an A in Professor A's class would also make an A in Professor B's class. There is no "grade they earned" in an absolute sense. Rather, we try to make sure that they are getting a fair grade relative to other students.

Posted by: anon | Jan 15, 2015 3:56:30 PM

Thanks Former Editor and anon (both very helpful). Phil, the problem is that law schools often have rules requiring a curve, which cannot be changed by an individual faculty member. Now, your answer might be "then show up at the next faculty meeting demanding a change," but that's a much longer conversation. (In the kind of employment world students face, where grades really matter, it makes sense from the standpoint of fairness to ensure that every class is on a consistent curve.) In the world we're in, grades must be fit to curves.

Posted by: Paul Gowder | Jan 15, 2015 2:05:48 PM

Regarding literature, you might find James Chen, Self-Adjusting Weighted Averages in Standard Scoring, available at http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2397637, to be helpful.

Posted by: Former Editor | Jan 15, 2015 1:57:16 PM

I spotted your problem: "student performance never exactly matches those curves, and so some tweaking is required"

Tweaking is required to do what, exactly? Support the idea that your students are a statistically representative subset of what you think the general population looks like? The notion is preposterous. There is probably an error in your conception of the general population and your students are not a random sample.

Here's your solution: Develop an appropriate grading rubric and give the students the grade they earned. Stop.

Posted by: Phil | Jan 15, 2015 1:53:55 PM

I think another problem might be that you assume your raw scores are errorless (at least I don't see any treatment of error in your algorithm). There is likely error in your raw scores that may come from a number of different sources, particularly if you use essay tests rather than multiple choice. Even if you use multiple choice questions, it is unlikely that the raw scores you have perfectly match whatever we might think of as the students' understanding of the course material. Thus, if given a situation where you have two students who have different raw scores but whose error ranges overlap, I don't think you can be sure that the student with the higher raw score really is a better student than the student with the lower raw score. For this reason, I try to make sure that students whose grades are quite close receive the same grade. I believe there is a way to estimate the errors in the scores, although I don't formally do this when computing my own grades. Nevertheless, if I were going to write an algorithm for this, I would probably want to calculate error ranges and try (as much as possible) to make sure that students whose error ranges overlap receive the same grade.
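(A rough sketch of the overlapping-error-ranges idea; Python, with the margin and the scores invented for illustration. Note the greedy grouping chains: a long run of closely-spaced scores will all merge into one grade band, which may or may not be what you want:)

```python
ERROR_MARGIN = 2.0  # assumed half-width of the score error; purely illustrative

def group_by_margin(scores):
    # Walk the scores from highest to lowest, keeping consecutive scores
    # in the same group while each sits within ERROR_MARGIN of the previous.
    ordered = sorted(scores, reverse=True)
    groups, current = [], [ordered[0]]
    for s in ordered[1:]:
        if current[-1] - s <= ERROR_MARGIN:  # error ranges overlap: same grade
            current.append(s)
        else:
            groups.append(current)
            current = [s]
    groups.append(current)
    return groups

print(group_by_margin([91, 90, 84, 83.5, 70]))  # → [[91, 90], [84, 83.5], [70]]
```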

P.S. I am not formally trained in this stuff, so I could well be barking up the wrong tree :)

Posted by: anon | Jan 15, 2015 1:47:37 PM

Hmm... I like the correlation approach, but wonder if it would inappropriately allow for extreme changes in individual cases. For example, what if it turns out that the raw scores could be perfectly fit to the curve just by taking the highest-scoring student and moving her/him to the bottom? It seems like that would likely count as a very high correlation between raw and scaled scores, but really screw the one student....

Posted by: Paul Gowder | Jan 15, 2015 1:39:22 PM

This strikes me as an ad hoc way of tackling the problem. "Minimum number of piecewise linear transformations" is not a metric that tracks the goals of curving. It prioritizes simplicity over fairness. Given your view that a curving algorithm should minimize the loss of "information" in the students' raw scores, a better metric of algorithmic quality would measure the distortion introduced by fitting raw grades to the curve. The greater the correlation between the raw scores and the curved grades, plotted against each other on their respective intervals, the better the curve.

If that's the case, then this is an optimization problem: select the N-1 cutlines between the N grading buckets in a way that maximizes a measure of correlation between raw scores and curved grades. You could treat the size of the curve-mandated buckets as a constraint on the optimization, or you could create another metric to evaluate the degree to which an assignment deviates from ideal bucket sizes. If the latter, then you need a function that weights the two metrics -- distortion of raw scores and fidelity to ideal distribution -- to produce an overall quality metric. Then you look for an algorithm that optimizes this overall quality metric in a computationally feasible way. (If necessary, go back to earlier stages and select different metrics to make the optimization algorithm more computationally tractable.)
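(A toy brute-force version of this optimization; Python, with all scores, grade values, bucket sizes, and the weighting constant hypothetical. It scores every possible placement of the cutlines by correlation with the raw scores, minus a penalty for deviating from the ideal bucket sizes:)

```python
import itertools
import statistics

raw = [42, 48, 55, 57, 60, 63, 66, 71, 74, 80, 88, 95]  # sorted raw scores
grade_values = [2.7, 3.0, 3.3, 3.7]                     # 4 buckets, low to high
ideal_sizes = [3, 3, 3, 3]                              # curve-mandated sizes
LAMBDA = 0.05                                           # weight on bucket fidelity

def pearson(x, y):
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (len(x) * statistics.pstdev(x) * statistics.pstdev(y))

def quality(cuts):
    # cuts: indices into the sorted raw list where each new bucket begins
    bounds = [0, *cuts, len(raw)]
    sizes = [hi - lo for lo, hi in zip(bounds, bounds[1:])]
    curved = [g for g, n in zip(grade_values, sizes) for _ in range(n)]
    fit = pearson(raw, curved)                          # distortion metric
    penalty = sum((s - t) ** 2 for s, t in zip(sizes, ideal_sizes))
    return fit - LAMBDA * penalty                       # overall quality metric

# exhaustive search over all placements of the 3 cutlines
best = max(itertools.combinations(range(1, len(raw)), 3), key=quality)
print(best, round(quality(best), 3))
```

The exhaustive search is only feasible for small classes; for realistic class sizes you'd want dynamic programming or a greedy refinement instead, which is exactly the computational-tractability tradeoff described above.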

I wouldn't be surprised if someone has already done this.

Posted by: James Grimmelmann | Jan 15, 2015 1:28:30 PM