Two new studies on gender bias in student evaluations of teaching look at the phenomenon from fresh—and troubling—angles. One study surveyed students at the beginning of the semester and after their first exam and found that female instructors faced more backlash for grades given than did male instructors. The other study examined how ageism relates to gender bias in student ratings, finding that older female instructors were rated lower than younger women. The second study was longitudinal, so students were rating the same women more poorly over time, even as these professors were gaining teaching experience.
Both studies suggest that as women become more “agentic,” demonstrating agency via stereotypically male-associated traits, they are punished for violating gender norms with lower student ratings.
Whitney Buser, associate director of academic programs in economics at the Georgia Institute of Technology, and co-author of the first study, told Inside Higher Ed that she and her colleagues “were unsure if we would find any bias at the beginning of the semester, but we did find a bit. We found that bias widened after receiving grades, making this the first study to our knowledge that confirms that gender bias is fueled by feedback. Our evidence seems to indicate that women receive more backlash for grades than male professors.”
Jennifer A. Chatman, Paul J. Cortese Distinguished Professor of Management and associate dean of academic affairs at the University of California, Berkeley, and co-author of the second article, said, “Our findings show that women are rated significantly lower as they age from younger to middle age, with their lowest teaching ratings emerging at age 47. Men do not experience this drop in ratings.”
That gender bias impacts student ratings of instruction is hardly news: much research to this effect already exists. Just a few examples: a 2014 paper found that students in online classes rated a female teaching assistant more highly when they thought she was a man and a male instructor lower when he assumed a female identity; a 2016 paper found that bias against female instructors was so strong that it impacted students’ perceptions of even seemingly objective measures, such as how quickly assignments are graded; and a 2021 metastudy of more than 100 papers on student evaluations found that while bias levels vary across disciplines, students seem to prefer professors with stereotypically masculine traits but penalize women for not conforming to female stereotypes.
The 2021 metastudy recommended that until “feasible, reliable and fair methods for evaluating teaching and learning are established,” more “caution should be taken in the use of SETs [student evaluations of teaching] in hiring, tenure, and promotion decisions and alternative assessments of teaching should be further utilized.” Some institutions have made progress on this front—most recently, for instance, West Virginia University is rewriting its tenure and promotion guidelines (in part) to urge “holistic” assessment of instructors instead of “over-reliance” on student evaluations.
Still, many institutions use student evaluations of teaching as key evidence in high-stakes personnel decisions, not just as collective feedback that can be used to improve one’s teaching over time. These colleges and universities generally argue that students’ experiences with a given instructor must be measured at some kind of scale, and that end-of-course evaluations remain the most practical method. There’s also no clear bias-proof alternative to student ratings. Peer evaluations, for instance, may also be problematic, since student bias is symptomatic of bias in greater society. And so the research on bias in student ratings continues, to help institutions at least manage this dynamic in faculty performance evaluations.
Angela Linse, executive director of the Schreyer Institute for Teaching Excellence and associate dean for teaching at Pennsylvania State University, who was not affiliated with either of the two new studies, said that “the bottom line is that getting rid of student ratings is not going to eliminate the bias.” And while survey instruments should be reviewed for particularly bias-prone questions, she said, “The instruments are not even close to being the biggest problem.”
Instead, the problem lies in how ratings data are interpreted, Linse said. “They need to be interpreted within the context of a biased society and biased higher ed, with an understanding of how those biases impact student ratings for those who are underrepresented in higher ed. This is an issue when white male faculty ratings are considered the quote-unquote norm and all other faculty are compared to those ratings. All too often interpreters of student ratings data do not know what bias might look like in student ratings, written feedback or peer review of teaching letters.”
The First Study
Buser’s study, published in Sex Roles, involved some 1,190 undergraduates (696 men and 494 women) enrolled in introductory-level economics courses taught by seven different faculty members—three men and four women—at five institutions: one state university, one large regional university and three private liberal arts colleges. The material, pace and assessment patterns for all of the courses were nearly identical, the paper says.
The authors created a seven-item survey, in which students were asked to rate the following on a zero-to-four scale (strongly disagree to strongly agree): whether they’d recommend the course, whether they’d recommend the instructor and whether the instructor was interesting, knowledgeable, challenging, approachable and caring. The authors say that the first three items are gender-neutral. “Knowledgeable” and “challenging” are male-associated traits related to agency, while “approachable” and “caring” are gendered terms related to communality.
In addition to hypothesizing that women would be assessed lower than men on all traits for being in a gender-incongruous role, the authors anticipated that students would rate women even worse on the second round of evaluations for acting on this gender incongruity by providing critical feedback: exam grades. The study controlled for factors including students’ expected grades in the class.
On the first survey, administered on the second day of class, researchers found significant gender differences for two of the three gender-neutral items (recommend instructor and interesting) and one of the two agentic items (challenging), but no differences between male and female instructors for the communal qualities. These differences did not remain, however, when controlling for other potentially explanatory factors. So students didn’t appear to be punishing female professors for violating gender norms at the outset of the course.
While results of the first survey defied the researchers’ expectations, results of the second survey did not. That is, Buser and her colleagues found evidence that women faced more backlash than men for providing critical feedback in the form of grades. There were significant gender differences for the three supposedly gender-neutral items and for both agentic items—but still not the communal items. And the observed differences remained when controlling for various factors.
“From a practical standpoint, this serves yet as one additional piece of evidence that should call into question the extensive reliance upon SETs in tenure and promotion decisions,” the paper says. Practical strategies include limiting the role SETs play in these decisions, rethinking the timing of SETs (perhaps holding them prior to final exams) and doing significance testing when comparing evaluation scores between faculty members.
As Buser and her colleagues wrote, “As researchers, we value and recognize the importance of significant findings. It is important to note that in the real-world application of these data, small, and at times, nonsignificant, differences in means are used to make decisions around selection, pay, promotion, and tenure decisions. Due to the tight distribution of course evaluation scores among faculty, any differences, though commonly small and often not statistically significant, are used to make consequential decisions.”
The Second Study
Chatman’s study, published in Organizational Behavior and Human Decision Processes, is part of a larger paper on gender and ageism in society. The experiment involving student evaluations of teaching drew on thousands of student ratings of 126 professors of business in an unnamed graduate program from 2003 to 2017. Researchers coded open-ended comments in a subset of the evaluations for agentic and communal traits and otherwise paid particular attention to a survey question that is used in the university’s faculty merit and promotion reviews: “Considering both the limitations and possibilities of the subject matter and course, how would you rate the overall teaching effectiveness of this professor?” (Students rate professors on a scale of one [not effective] to seven [extremely effective].)
The analysis controlled for whether the courses were qualitative or quantitative, the program, and professors’ tenure status, citation counts and childcare leaves, among other variables. In the end, women received more negative evaluations when they were middle-aged compared to when they were younger or older. There was a significant decline in women’s teaching evaluations from young adulthood to middle age, and a rebound from middle age to older adulthood. This contrasted with the pattern for men, whose student ratings increased from young adulthood to middle age. Women’s ratings bottomed out at age 47, on average.
For women, being middle-aged also was associated with higher perceived communal deficits (such as being less warm), and this corresponded with lower performance evaluations. For men, however, being middle-aged was associated with higher perceived agency, which is associated with higher teaching evaluations.
Among men and women, age was significantly associated with a higher likelihood of positive agency comments, such that middle-aged professors had a higher likelihood of receiving agency comments compared to young adult and older professors.
“In the teaching context, one in which knowledge and experience should be a benefit, performance would likely increase or, at the very least, remain relatively steady from young adulthood to middle age,” the paper says. “But this intuitive pattern emerges for men only, while women are viewed as performing worse in middle age, even accounting for parental status and research productivity, which points to deviation from gender prescriptions as the culprit.”
Joe Bandy, interim director of the Center for Teaching at Vanderbilt University, said that not much has changed with respect to student ratings in the decade since he wrote a cheeky Dear Ann Landers–style blog post about them for Vanderbilt.
“Unfortunately, despite some shifts, we see that traditional gender roles are quite stubborn due to the deeply rooted patriarchal foundations of many of our social institutions,” he said last week, calling the work of Buser, Chatman and their colleagues “not terribly surprising.”
Yet while it might be tempting to think that student ratings are “so problematic as to be useless,” he said, they have “some utility, and the position that students should have no voice in evaluating teaching would risk imposing different inequities.”
It makes sense, then, to think about interventions, Bandy said. Most importantly, these include evaluating teaching “in a more holistic way, incorporating more rigorous self- and peer-review processes, student focus groups, as well as assessments of student work and growth.”