What's really going on with respect to bias and teaching evals?

Andrii Yalanskyi/iStock/Getty Images Plus

Many studies criticize student evaluations of teaching as biased or a poor measure of teaching effectiveness, or both. But none of these papers are as expansive as a new metastudy of more than 100 articles on these student evaluations, or SETs.

The new study’s breadth means its authors can cut through the sometimes contradictory research on SETs. And instead of looking at just measurement bias (how well SETs reflect good teaching, or don’t) or just equity bias (how SETs advantage certain groups of instructors over others, or don’t), the study contextualizes both.

Co-author Rebecca Kreitzer, assistant professor of public policy at the University of North Carolina at Chapel Hill, said Tuesday that “our conclusions are more nuanced than previous research, particularly on equity bias.” Indeed, where many studies have found evidence of gender bias against women in student evaluations, Kreitzer and co-author Jennie Sweet-Cushman, associate professor of political science at Chatham University, found that the equity bias effect is “conditional,” as “sometimes women and people of color do benefit.”

Yet the effect of gender varies “considerably across disciplines,” with women receiving lower scores in the natural and social sciences compared to the humanities, Kreitzer added. She and Sweet-Cushman also found “an affinity effect,” whereby women tend to prefer female instructors and men prefer male instructors.

Perhaps most important, Kreitzer said, she and Sweet-Cushman found conforming to prescribed gender roles has a more significant effect than gender itself. This is “deeply concerning because students prefer professors with masculine traits, yet penalize women for not conforming to stereotypes.”

All told, Kreitzer said, equity bias exists. But its effect is hard to pin down.

Just as important as its literature review, the new article makes numerous suggestions as to how administrators should use SETs to evaluate professors. Kreitzer and Sweet-Cushman also call out this corner of research for its relative lack of attention to issues of racial and intersectional identity bias, as most of the equity bias research is about gender.

This is in “no small part because of the underrepresentation of people of color among faculty,” Kreitzer said, pointing to larger issues of systemic bias within academe. “In quantitative analyses of SETs, there are often too few people of color to make reasonable inferences from the data.”

Measurement and Equity Bias

As for measurement bias, the study finds that evaluations are impacted by characteristics unrelated to actual instructor quality. Classes with lighter workloads or higher grading distributions do have better scores from students. Students also rate nonelective and quantitative courses lower. Evaluations for upper-level, discussion-based classes are higher than those for larger introductory courses.

Ratings vary across disciplines, with students rating natural science courses lowest and humanities highest. Oh, and bringing chocolate cookies to class actually results in higher ratings.

As for equity bias, the study finds that factors including an instructor’s gender, race, ethnicity, accent, sexual orientation or disability status affect student ratings. Compared to women, male instructors are perceived as more accurate in their teaching, more educated, less sexist, more enthusiastic, competent, organized, easier to understand and prompt in providing feedback, and they are less penalized for being tough graders, according to the study. In studies using identical online course designs with a hypothetical male or female instructor, students rate the male instructor more highly than the female one.

Key to understanding how gender impacts student ratings, the authors say, is that both male and female students expect women and men to conform to prescribed gender roles. Moreover, students seem to prefer professors with masculine traits and penalize women who don’t conform to feminine stereotypes. Women are rated highly when they exhibit traditionally feminine traits such as sensitivity, for instance. Men are rewarded for seeming intelligent. In one study, female instructors were rated 0.5 standard deviations lower than male instructors.

At the same time, the researchers found a “gender affinity” aspect in which students preferred professors of the same gender as themselves. They say that this affinity probably exists with respect to race, as well, but that there isn’t enough research on that.

To this point, there is much less research on equity bias in teaching evaluations for faculty members of color, the study says, in part because of their “severe underrepresentation” across academe. Black and Asian professors tend to be evaluated more poorly than their white peers, with Black men faring worst. Faculty members with accents and Asian last names are also penalized. Latinx women are judged more harshly than white women.

Some evidence suggests that LGBTQ professors fare worse than their peers in general, but there is almost no research on other intersectional identities, including disability and pregnancy, the researchers say.

Recommendations for Reform

As for recommendations for reform, Kreitzer and Sweet-Cushman say that SETs should be used to contextualize students’ experiences in the classroom, not evaluate teaching. Students arguably shouldn’t or can’t rate professors’ teaching, but they can provide useful feedback on students’ perceptions and experiences, the study says.

Kreitzer and Sweet-Cushman also urge administrators to proactively increase the validity of these assessments by working to raise traditionally low response rates, as low student response rates mean low representativeness of the student experience.

The authors also urge caution in interpreting SET results. These ratings were not designed to be used as a comparative metric across the faculty, or to judge faculty members against one another, the paper says, and instead should be used to compare a faculty member’s own teaching “trajectory” over time -- ideally, within a single course. Most teachers get relatively good scores. But because the distribution of faculty members’ reviews has a negative skew, the authors note, administrators should look at the median or modal response instead of the probably biased mean.
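To see why a negative skew matters, consider a minimal sketch in Python. The ratings below are invented for illustration and do not come from the study; the `summarize` helper is likewise hypothetical. It shows how a couple of very low scores can pull the mean below the median and the mode when most students rate a course highly.

```python
from statistics import mean, median, mode

# Hypothetical ratings on a 1-5 scale (made-up data, not from the study).
# Most students give high scores and a few give very low ones,
# which produces a negatively skewed distribution.
ratings = [5, 5, 5, 5, 4, 4, 4, 5, 4, 1, 1, 5]


def summarize(scores):
    """Return the mean, median and mode of a list of ratings."""
    return {
        "mean": round(mean(scores), 2),
        "median": median(scores),
        "mode": mode(scores),
    }


summary = summarize(ratings)
print(summary)
```

In this made-up case, the two ratings of 1 drag the mean down to 4, while the median (4.5) and the mode (5) stay closer to what a typical student reported.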

Crucially, the paper urges administrators to restrict or eliminate the use of qualitative or write-in comments, which have the “strongest evidence of equity bias.” Over and over again, women and faculty members of color have been shown to receive more negative comments about personality traits, appearance, mannerisms, competence and professionalism, according to the study. Instead of general comments, assessments should ask for student feedback on specific prompts.

Echoing many experts on what SETs can and can’t do, the paper says that administrators should never rely on these evaluations as a sole measure of teaching effectiveness. Alternatives and complementary assessments are peer evaluations of teaching, comprehensive teaching portfolios and reviews of course materials. While these options are potentially susceptible to bias, the authors note, they’re not vulnerable to the same kinds of systemic biases as SETs. Several “imperfect measures” are always better than using just one, Kreitzer and Sweet-Cushman say.

Going forward, Kreitzer and Sweet-Cushman urge more research on interventions to reduce equity bias. Current interventions include reducing the size of the rating scale and making students aware of their potential biases prior to completing the evaluations. At the same time, one study found this last intervention ineffective -- possibly due to what’s known as the “backlash effect” -- and so more work is needed.

“It is clear that teaching evaluations are poor metrics of student learning and are, at best, imperfect measures of instructor performance,” Kreitzer and Sweet-Cushman wrote. “SETs disproportionately penalize faculty who are already marginalized by their status as minority members of the discipline. Across the existing literature, using different data, measures, and methods, scholars in many disciplines have documented problems with student evaluations of teaching in ways that are abundantly relevant to faculty in all disciplines.”

Until “feasible, reliable and fair methods for evaluating teaching and learning are established,” they added, “more caution should be taken in the use of SETs in hiring, tenure, and promotion decisions and alternative assessments of teaching should be further utilized.”

Ultimately, Kreitzer told Inside Higher Ed, “We don’t believe that SETs are completely meaningless. However, they are rightfully demonized for failing to account for known sources of bias.”

The research Kreitzer and Sweet-Cushman reviewed was all published pre-COVID-19. Yet COVID-19 has, at least for now, dramatically altered how teaching happens and how it’s assessed. Asked about this, Kreitzer said that some instructors have used technology to teach in the past and some haven’t, and that some kinds of coursework transfer over to a remote format more easily than others. These differences “might not have been relevant on SETs before but will be overemphasized now,” she continued. “Younger faculty may have found the transition to online teaching to be easier than their more senior colleagues.”

There are also “gendered circumstances, particularly family and home life,” that could exacerbate students’ gender bias, Kreitzer said. It’s possible, for instance, “that instructors who have kids interrupting lectures or in the background are perceived by students to be less organized or dedicated to teaching.”

Students are also struggling with the “twin public health crises of the pandemic and mental health,” she added, which could shift their experiences and impressions of the class or instructor. It’s possible, then, that SETs are “systematically higher or lower than usual.” But the reality is that “we don’t know how large of an impact any of those things might have.”

Early in the pandemic, last spring, many institutions said that SETs would continue but that they would not be used in any punitive way against instructors, including in tenure and promotion decisions. Since then, some colleges and universities have not clarified whether such a policy will continue for the duration of the pandemic.

Kreitzer said it’s critical that administrators “be particularly cautious in how SETs during the pandemic are incorporated into personnel decisions.” They should not make direct comparisons between professors’ scores during the pandemic and before or after “because of the multitude of ways in which remote teaching during the pandemic is different,” she added.

Faculty and administrators may, however, “be able to glean some useful insight into how students experienced class from the open-ended response sections, such as what worked or didn’t in an online format.” Of course, Kreitzer said, the use of these kinds of qualitative comments, where equity bias is most apparent, “should be limited and cautious.”

Joshua Eyler, director of faculty development at the University of Mississippi, said campus conversations last spring about how to handle and count SETs during COVID-19 led the Faculty Senate to charge a task force with reviewing if and how Ole Miss needs to change how it does SETs in general.

Eyler is now chairing that task force. While it’s too early to talk about possible recommendations for long-term change, he said, “We have been systematically going through all the possible purposes of SETs and discussing how well our current form fulfills those purposes.” Equity is “at the forefront of all of our conversations -- as is the goal of minimizing bias to the utmost extent.”