Removing bias from student evaluations of faculty members (essay)

The Bias in Student Course Evaluations

We should work to reduce the harm of bias in student course evaluations, argues Joey Sprague.

Once again, as the school year was coming to a close, discussions about student evaluations and their inadequacy were frequent on social media. By now, many of us know about the research that shows that college students’ ratings of their professors are influenced by expectations associated with professors’ gender, class, race and age. Because these ratings influence hiring, promotions, raises and opportunities for awards, we cannot simply dismiss them; instead, we must deal with them head-on.

There are better approaches to assess teaching effectiveness than the typical student ratings, but that’s a topic for another day. In the meantime, I want to address how marginalized faculty members -- especially people of color and women -- can mitigate the damage to their ability to be effective until we achieve a better system. Short answer: let’s show our colleagues how to approach this like good social-science methodologists. We should:

Focus on the measures that have higher levels of reliability. A student rating instrument is a survey. A basic rule for constructing questions for a survey is to minimize the errors that we make in collecting the data. An important strategy for doing that is to ask people only those questions that they can answer. Most respondents will try to provide an answer even when they lack adequate information, and there is no way to tell what criteria they are using when they do.

For example, undergraduate students consistently give me high ratings when asked to assess my knowledge of the field. Great, but given that they do not have a clue, what can they be thinking of? Another question many student evaluation surveys include is one that asks students how much they have learned. Yet research shows that students are not good at assessing their own learning, at least in the short term.

Administrators tend to like the items that ask for a global assessment -- for example, “overall, this person is an effective teacher.” But these are the most likely to activate bias, because they leave it to the student to decide which of the many components of teaching, and teachers, are the most important to them. For example, research I conducted with Kelley Massoni shows that students value women who are more nurturing and men who are more amusing. Are these gendered expectations their fallback standards when called on to respond to such a global question? (Probably.)

Similarly, questions that ask students to rate qualities such as how “available” or “responsive” professors are do not allow us to know what standard they are using for comparison. Research on perception shows that the standards that people apply shift depending on the target’s gender and race. For example, one study found that students in an online class rated the instructor’s promptness in returning assignments lower when they thought their instructor was a woman (3.6 out of five) than when they thought that instructor was a man (4.4 out of five).

In general, the most reliable measures will focus on concrete behaviors and practices about which students have direct knowledge and provide guidance about a reasonable standard. For example, “The instructor returned graded assignments within two weeks of when you handed them in.”

Given such common problems with course evaluations, I recommend that instructors (to the extent that they can) only use items on these evaluations that students can accurately answer. Look with a critical eye at each question on your institution’s rating instrument and ask yourself whether students have the information to provide an accurate response. Which of the items offer a concrete behavior and time frame?

Chances are that you are going to be asked to report scores on global items. When you do so, you should also report more specific ratings. If you are getting less favorable ratings on items that are linked to cultural expectations for gender and race, call your colleagues’ attention to how students’ expectations vary with these categories. Address your colleagues as scholars who respect careful interpretations of the data, which should make them open to a shared analysis. “Notice how students rate me lower on how flexible I am than they do on how clear the expectations were. Could this be another demonstration of the findings that students resent women’s exercise of authority more than men’s?”

Apply sound statistical analysis. People who have taken a basic course in statistics have probably learned that if a distribution of scores includes a few extreme scores, the median (the middle value) will be a much more accurate indicator of the central tendency than the mean (the average of all of the scores). Extreme scores in student ratings can be the output of racism, sexism, homophobia and other social biases. Even if your institution only provides the means on items, you can usually come up with a decent estimate of the median of the distribution if you can see the frequencies. If you are asked to report the mean scores, report the median ones alongside them, pointing out any inconsistencies and noting how the vast majority of students agree on the more favorable score.
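To make the mean-versus-median point concrete, here is a minimal sketch in Python. It assumes your report shows how many students chose each score on an item. The counts below are invented for illustration, and `expand_counts` is a hypothetical helper, not part of any evaluation system.

```python
from statistics import mean, median

def expand_counts(counts):
    """Turn {score: number_of_students} into a sorted list of individual scores."""
    scores = []
    for score in sorted(counts):
        scores.extend([score] * counts[score])
    return scores

# Hypothetical item: 20 students, two of whom gave the lowest possible score.
counts = {1: 2, 2: 0, 3: 1, 4: 5, 5: 12}
scores = expand_counts(counts)

item_mean = mean(scores)      # pulled down by the two extreme scores
item_median = median(scores)  # the middle of the distribution
share_4_or_5 = sum(s >= 4 for s in scores) / len(scores)

print(f"mean={item_mean:.2f}  median={item_median}  rated 4 or 5: {share_4_or_5:.0%}")
```

With these made-up counts, two very low scores drag the mean down to 4.25 while the median stays at 5, and 85 percent of students chose a 4 or a 5. Reporting all three numbers side by side shows colleagues where most students actually agree.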

When institutions focus on central tendencies, they are missing important information about the distribution of student ratings: the degree and pattern of variation among students’ ratings of an instructor. That data can help establish that bias is at work.

Look at the distribution on each item. Is it bimodal -- that is, do the ratings cluster around two separate peaks rather than one? If so, other factors may be in play, such as different likelihoods of applying race or gender expectations, depending on whether students match the instructor demographically or not.
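One rough way to check an item for two peaks, again assuming you can see the frequency of each score, is sketched below in Python. The function `local_peaks`, the 20 percent cutoff and both sets of counts are illustrative assumptions. This is a quick screening heuristic, not a formal statistical test of bimodality.

```python
def local_peaks(counts, min_share=0.2):
    """Return scores that outnumber both neighboring scores and
    account for at least `min_share` of all responses."""
    scale = sorted(counts)
    total = sum(counts.values())
    peaks = []
    for i, score in enumerate(scale):
        here = counts[score]
        left = counts[scale[i - 1]] if i > 0 else 0
        right = counts[scale[i + 1]] if i < len(scale) - 1 else 0
        if here > left and here > right and here / total >= min_share:
            peaks.append(score)
    return peaks

# Hypothetical counts: one polarized item and one single-peaked item.
polarized = {1: 7, 2: 2, 3: 1, 4: 3, 5: 9}
single_peak = {1: 1, 2: 1, 3: 3, 4: 8, 5: 9}

print(local_peaks(polarized))    # peaks at both ends of the scale
print(local_peaks(single_peak))  # one peak at the top of the scale
```

An item that comes back with two peaks is worth a closer look, because the mean and the median both hide the fact that students split into two camps.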

Look for patterns across different items in the same course. Do you see any variation? If students are responding to the specific content of varying items, it seems reasonable that there will be some variation in their answers across items. Few of us do everything equally well in the classroom. Ratings of all fives or all threes (out of five) are the outcome of some other factor dominating the ratings. Racism, sexism, homophobia and/or ageism could be creating uniformity in the students’ judgments across items.
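If you only have a summary score for each item, a simple measure of spread across items can show how flat the profile is. The sketch below uses Python with invented item names and scores. The 0.3 cutoff is an arbitrary illustration, not an established threshold.

```python
from statistics import pstdev

# Hypothetical item means for one course, each on a 1-5 scale.
course_items = {
    "clear expectations": 3.0,
    "returned work within two weeks": 3.1,
    "open to questions": 3.0,
    "organized class sessions": 2.9,
}

def profile_spread(item_scores):
    """Return (range, population standard deviation) of scores across items."""
    values = list(item_scores.values())
    return max(values) - min(values), pstdev(values)

score_range, score_sd = profile_spread(course_items)

# Assumed cutoff for illustration only: treat a range under 0.3 as "flat".
is_flat = score_range < 0.3

print(f"range={score_range:.2f}  sd={score_sd:.2f}  flat profile: {is_flat}")
```

A nearly flat profile does not prove bias on its own, but it suggests that students were not responding to the specific content of each question.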

Compare ratings on the same item across your classes. Things like an instructor’s clarity, openness to questions and availability outside of class are likely to be consistent across every class that the person is teaching. If students’ ratings vary significantly on the same item from one class to another, one culprit could be the match between the content of the class and the social characteristics of the instructor. Do students rate professors of color as less fair when they teach about racial inequality compared to courses they teach in other areas? Do men get higher ratings for being effective in classes on gender than they do in their other classes? (Probably.)
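The same comparison can be laid out item by item across classes. This Python sketch assumes you have a mean for each item in each class. The course titles, item names and numbers are all hypothetical.

```python
# Hypothetical means for the same items in two courses taught by one instructor.
ratings = {
    "fair grading": {"Intro Methods": 4.5, "Race and Inequality": 3.6},
    "open to questions": {"Intro Methods": 4.4, "Race and Inequality": 4.3},
}

def gaps_across_classes(table):
    """Map each item to the gap between its highest and lowest class mean."""
    return {
        item: round(max(by_class.values()) - min(by_class.values()), 2)
        for item, by_class in table.items()
    }

gaps = gaps_across_classes(ratings)
largest_gap_item = max(gaps, key=gaps.get)

for item, gap in gaps.items():
    print(f"{item}: gap of {gap} between classes")
print("largest gap:", largest_gap_item)
```

In this invented example the instructor's openness to questions is rated about the same in both courses, while the fairness rating drops sharply in the course on racial inequality. That is the kind of pattern worth pointing out to colleagues.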

Researchers have found that two other circumstances can influence ratings: whether the class is required and the grade that students expect to receive. Students are likely to punish women instructors even more when they are unhappy about required classes or disappointed with the grades they have received. Given findings in other areas of social-science research, it is reasonable to expect that a faculty member’s gender is interacting with other markers of marginalization in influencing the degree of student retribution.

Controlling for such variables (that is, statistically accounting for their effects) would require access to the raw data, but few of us will be in that position. At the very least, in examining the data we do get, we should take into account whether the class is required.

We have been trained to be good researchers, and we often teach students how to do research well. It’s time to teach our colleagues. This is a matter of equity in higher education.