Monday, May 05, 2014
Student Teaching Evaluations
As classes wrap up, many of us are wondering how our teaching evaluations came out. How did we do? Not only are these evaluations deeply personal commentary on us as teachers, they can be important for promotion and tenure. And I’m here to suggest, along with many others, including William Arthur Wines and Terence J. Lau in “Observations on the Folly of Using Student Evaluations of College Teaching for Faculty Evaluation, Pay, and Retention Decisions and Its Implications for Academic Freedom,” that in most colleges, universities, and law schools, they shouldn’t be. Or at least not until we know a lot more about what we want to measure, how to ask the right questions to do that, and how to understand the information we get back.
I think getting information from our students about their experiences in the classroom is important, but at the risk of spoiling what will be a series of posts on this topic, I suggest that few of us as law professors can do this well without considerable assistance from people who do educational assessment for a living.
Here are a few things to get on the table.
First, there is a substantial literature on teaching evaluations. Thanks to Professor Deborah J. Merritt, who published a terrific article summarizing the data, we should all be aware that evaluations based on 30 seconds of teaching on the first day of class have been found to correlate closely with evaluations made after an entire semester. In other words, students make up their minds very quickly.
Second (and thanks to Prof. Merritt for this too; I assume everyone is reading her blog on law school reform, the Law School Café), women, women of color, and men of color get lower teaching evaluations than white men. Across the board. In every subject.
Third, teaching evaluations are a classic example of the dangers of not understanding some basics about statistics. I suggest reading the three-part blog post by Professor Philip Stark at Berkeley, “Do teaching evaluations measure teaching effectiveness?” Here are some of my favorite distortions:
1. We use averages (adding up all the scores and dividing by the number of students responding) when we should be using medians (numbers reflecting the middle of the scores) or modes (the most common score). Let’s say that on a scale of 1 to 7, two teachers both come in at a 5. But one gets all 5s, while the other gets a wide range of scores containing both 1s and 7s. Are these the same?
Also, by not looking at all the scores, we can miss important themes—like consistently low scores for availability or respect for students.
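To make the distortion concrete, here is a minimal Python sketch using invented score lists (not real evaluation data): two teachers whose ratings have exactly the same mean can have very different distributions, which the median, mode, and spread make visible while the average hides.

```python
# Invented example scores on a 1-7 scale; both lists average exactly 5.
from statistics import mean, median, mode, stdev

teacher_a = [5, 5, 5, 5, 5, 5, 5, 5]    # every student gave a 5
teacher_b = [1, 3, 5, 5, 5, 7, 7, 7]    # wide spread, same mean of 5

for name, scores in [("A", teacher_a), ("B", teacher_b)]:
    print(name, mean(scores), median(scores), mode(scores),
          round(stdev(scores), 2))
# Both means are 5.0, but teacher A's spread is 0 while teacher B's
# standard deviation is about 2.14 -- the "wide range" the average hides.
```

Reporting only the 5.0 makes these two teachers look identical; any of the other summaries immediately shows they are not.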
2. We think we know how to compare one faculty member’s scores with another’s—but we really don’t.
That’s because the scores are presented as numbers, but they’re really categories. Unlike a thermometer or a ruler, where we know that the numbers all have an equal amount of “space” between them, the teaching scores are an “ordinal categorical” variable. As Prof. Stark explains, “We could replace the numbers with descriptive words and no information would be lost: The ratings might as well be ‘not at all effective,’ ‘slightly effective,’ ‘somewhat effective,’ ‘moderately effective,’ ‘rather effective,’ ‘very effective,’ and ‘extremely effective.’”
As he asks, “does it make sense to take the average of ‘slightly effective’ and ‘very effective’ ratings given by two students?”
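Stark’s point can be shown in a few lines of Python. This is a hypothetical sketch using the label list from his quote above: averaging the underlying numbers produces a category that neither student actually chose, and quietly assumes the gaps between labels are all equal.

```python
# The seven ordinal labels from Stark's example, in rating order 1-7.
labels = ["not at all effective", "slightly effective", "somewhat effective",
          "moderately effective", "rather effective", "very effective",
          "extremely effective"]

# One student says "slightly effective" (2), another "very effective" (6).
ratings = [2, 6]
avg = sum(ratings) / len(ratings)          # 4.0

# The "average" maps back to "moderately effective" -- a verdict neither
# student gave, and one that is meaningful only if the distance from
# "slightly" to "somewhat" equals the distance from "very" to "extremely".
print(labels[int(avg) - 1])
```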
Also, without any information about the scores of the faculty as a whole, we can’t assign relative meaning to these numbers. So, if every faculty member teaching first-year courses has a score of 4.5 or above, then someone with a 4 is outside the mainstream. On the other hand, if the numbers cluster very tightly between 3.9 and 4.2, with 4 being the most common score, then it would be fair to say that someone getting a 4 is succeeding about as well as everyone else, at least in terms of achieving scores.
This problem (not knowing the scores of other faculty members in similar courses) becomes even worse when looking at the teaching evaluations of a faculty member at another institution.
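Here is a small sketch of that idea, using invented department-wide numbers (assumptions for illustration, not data from any real school, and `percentile_rank` is a helper defined here, not a standard evaluation tool): the identical raw score of 4.0 sits at the bottom of one hypothetical department and squarely in the middle of another, so the number alone tells us nothing.

```python
# Invented department score lists illustrating relative position.
def percentile_rank(score, cohort):
    """Fraction of cohort scores at or below the given score."""
    return sum(s <= score for s in cohort) / len(cohort)

dept_high = [4.5, 4.6, 4.7, 4.8, 4.9, 5.0]    # everyone at 4.5 or above
dept_tight = [3.9, 4.0, 4.0, 4.0, 4.1, 4.2]   # tight cluster around 4.0

print(percentile_rank(4.0, dept_high))    # 0.0: below everyone, an outlier
print(percentile_rank(4.0, dept_tight))   # ~0.67: right in the mainstream
```

The same 4.0 is an outlier in one context and unremarkable in the other, which is exactly why comparing raw scores across departments or institutions is so hazardous.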
There is a lot of information out there on how we can set goals for ourselves as communities of law teachers and how we can measure the results of those goals. And more on that tomorrow.
It's an interesting topic, and I can well believe that student evaluations as generally practiced are a terrible evaluation tool. That said, is the only solution to not evaluate teaching as part of the hiring, retention, pay and tenure process? It seems a strange outcome to have such a core part of the job have no role in employment decisions. If student evaluations are unscientific and statistically unreliable, how much more so are the vague impressions of (potential) colleagues based on -- at best -- some sort of personal narrative?
Posted by: brad | May 5, 2014 1:46:42 PM
Also the W&M link doesn't lead to the Wines & Lau article in a way that I could find. Here's a direct link:
Posted by: brad | May 5, 2014 1:50:22 PM
My anecdotal sense is that, for better or worse, teaching evaluations play little role in tenure and pretty much zero role in promotion. Do others have different experiences?
Posted by: Orin Kerr | May 5, 2014 2:06:09 PM
I am also an evaluation skeptic. But, as brad suggests, what are the alternatives? Peer evaluations are the obvious option, but they have a number of drawbacks.
As much as law profs don't like student evaluations, I can't imagine how they would react to any evaluative method that tries to measure outcomes.
Posted by: carissa | May 5, 2014 2:44:33 PM
As a now former student, these concerns come off as yet another excuse for hiring and tenure decisions to be divorced from teaching quality. It already seems like teaching students about the practice of law takes a back seat to hiring and retaining professors who write on topics of interest to existing faculty. That is the impression my fellow students on our school's hiring committee had.
I perceive law schools as centers for academic research and writing that fund themselves by extracting $100,000+ from each student who wants the piece of paper necessary to actually practice law. And divorcing the system even further from the concerns and interests of the students law schools purportedly exist to instruct only serves to reinforce my perception. But perhaps Professor Bard's follow-up post will change my mind.
Posted by: Andrew | May 5, 2014 3:58:46 PM
Like Orin, I've seen little evidence that student evaluations play a major role in P&T, although they play a cosmetic one. The exception to this would be, perhaps, those instances in which a set of student evaluations raises a significant red flag about someone's teaching; even then, they are likely to spark further action, not to be dispositive in themselves. (It's incidental, but I should add that the proper response to problems with a particular format of student evaluation is not to abandon interest in student evaluations of their professors, but to find better ways of getting that information, whether it plays a role in promotion and tenure or not.)
I would add that both of the schools where I've been on the tenure track used peer evaluations. Although it's true that peer evaluations have drawbacks, at least in the sense that everything has drawbacks, I have had no problem with them, either on the giving or the receiving end. To the contrary, they strike me as an excellent, if insufficient, practice.
Posted by: Paul Horwitz | May 5, 2014 4:17:16 PM
Did peer evaluation play a role in employment decisions? Were they ever outcome decisive?
Posted by: brad | May 5, 2014 4:31:30 PM
@Andrew, I think the goal in exploring these questions is not (or not only) to argue that evaluations shouldn't be considered, but that we need to revisit how we evaluate teaching, including what questions we put to students and how we interpret the numbers.
I served on my school's hiring committee this year and we took teaching evaluations (in this case, from teaching during fellowships) very seriously. Two candidates had phenomenal evaluations and that played a major role in how we ranked them as against others.
Posted by: calista | May 5, 2014 4:54:28 PM
Thanks for the comments (and Brad for the link). This is a big topic--and my goal is to start a conversation.
Regarding how (and how often) teaching evaluations are used--I think it's as we see here: different schools use them differently.
My concern is that we all understand their limits, especially when we aren't familiar with the scores of others in an institution. But more than that, that we appreciate we aren't getting as much information about what goes on in the classroom as we could if we used different methods.
Posted by: Jennifer Bard | May 5, 2014 6:20:35 PM
Brad, in answer to your question, yes to the first and hard to say on the second. As to both, remember that promotion and tenure are about development, not just decision points. Peer visits and evaluations certainly played a role in those decision points, but they played an even stronger role in faculty development: in providing a basis for members of the faculty development committee, mentors, deans, etc. to offer advice about strengths and weaknesses in a junior faculty member's teaching and how to improve the quality of classroom instruction. The goal along the way is to make the actual eventual decision easier. Whether one thinks law faculties in general are too soft on tenure, which I put to one side here, it is certainly the case that the goal shouldn't be to withhold help and wait for an up-or-down vote, but to monitor and mentor the junior faculty member throughout his or her development and improve that person's teaching along the way. At the schools at which I've been part of that process, from one side or the other, that definitely happens and the peer evaluations play an important role. The actual decision points involve a mix of scholarship, teaching, and service, so it's difficult to say whether the peer visits play a decisive role at that point. But they are certainly discussed.
Posted by: Paul Horwitz | May 6, 2014 8:37:30 AM
If law schools were serious about evaluating faculty, we'd be looking at output measurements rather than running semi-annual popularity contests. Personally, I favor some form of No Law Student Left Behind.
Posted by: Steve Bainbridge | May 6, 2014 6:42:20 PM
I'm most familiar with business schools. Cynically, I think teaching evaluations are measuring pretty accurately just what the administration wants to measure: student contentment. The theory fits the facts pretty well. There's very little effort made to use accurate measures of how much students learn, or to look at what the professor is actually teaching. The ratings ask students lots of questions about how happy the students are with grading, respect for their opinions, difficulty level, relevance, and so forth. Teachers are discouraged from being unorthodox in their teaching styles or innovative in their approaches, since students like uniformity and predictability. The evaluations impose a big penalty on any professor who makes his students feel by the end of the course that they're not world experts on the subject by, for example, exposing them to more material than they can fully master. There's no adjustment (in business schools) for easy grading. And student contentment is a big factor, and one more easily changed than others, in the magazine ratings that incentivize deans.
Posted by: Eric Rasmusen | May 7, 2014 11:19:42 AM
Another point: don't restrict your thinking to tenure cases. Non-tenure track faculty are increasingly important. Deans look at their student evaluations. In our business school, we've had memos warning that teachers shouldn't bring cookies or pizza to class on evaluation day, and I think that this was aimed at adjuncts desperate to keep their jobs.
Posted by: Eric Rasmusen | May 7, 2014 11:22:42 AM