Large study shows little difference between human and robot essay graders

Education technology has long since delivered on its promise of software that can grade most student work in lieu of instructors or teaching assistants. These days, debates about artificial intelligence in education are more likely to revolve around whether automatons can be relied upon to teach students new concepts.

Yet when it comes to English composition, the question of whether computer programs can reliably assess student work remains sticky. Sure, an automaton can figure out if a student has done a math or science problem correctly by reading symbols and ticking off a checklist, writing instructors say. But can a machine that cannot draw out meaning, and cares nothing for creativity or truth, really match the work of a human reader?

In the quantitative sense: yes, according to a study released Wednesday by researchers at the University of Akron. The study, funded by the William and Flora Hewlett Foundation, compared the software-generated ratings given to more than 22,000 short essays, written by junior high school students and high school sophomores, to the ratings given to the same essays by trained human readers.

The differences, across a number of different brands of automated essay scoring software (AES) and essay types, were minute. “The results demonstrated that over all, automated essay scoring was capable of producing scores similar to human scores for extended-response writing items,” the Akron researchers write, “with equal performance for both source-based and traditional writing genre.”

“In terms of being able to replicate the mean [ratings] and standard deviation of human readers, the automated scoring engines did remarkably well,” Mark D. Shermis, the dean of the college of education at Akron and the study’s lead author, said in an interview.

The Akron study asserts that it is the largest and most comprehensive investigation of its kind, although it is hardly the first. Smaller studies of specific automated-essay-scoring products have reported similarly high fidelity between human and machine scores on writing samples.

But independent reviews of AES products have been rare and occasionally critical. Les Perelman, director of the Writing Across the Curriculum program at the Massachusetts Institute of Technology, has crusaded against automated essay grading by writing and speaking widely of his own successful efforts to fool the Educational Testing Service’s e-Rater, which has been used to grade the GRE and the Collegiate Learning Assessment (CLA), into giving good scores to incoherent essays he carefully crafted to exploit its flaws.

In higher education, AES products are still used primarily to grade students’ writing on standardized tests and placement exams, and have not yet found their way into many composition classrooms, Perelman told Inside Higher Ed in an interview. But with demand for writing education rising amid a surge in enrollments among non-native English speakers, triumphant studies such as the Akron researchers’ might embolden some overenrolled, understaffed community colleges to consider deploying AES for their composition classes, he says.

That would be a mistake, Perelman says, pointing to a 2008 study by researchers in southern Texas. Those researchers compared machine scores to human ones on essays written by 107 students in a developmental writing course at South Texas College, a community college near the Mexico border that is 95 percent Hispanic. They found no significant correlation.

Shermis, the lead author of the Akron study, says thrift-minded administrators and politicians should not take his results as ammunition in a crusade to replace composition instructors with AES robots. Ideally, educators at all levels would use the software “as a supplement for overworked [instructors of] entry-level writing courses, where students are really learning fundamental writing skills and can use all the feedback they can get.”

The Akron education dean acknowledges that AES software has not yet been able to replicate human intuition when it comes to identifying creativity. But while fostering original, nuanced expression is a good goal for a creative writing instructor, many instructors might settle for an easier way to make sure their students know how to write direct, effective sentences and paragraphs.

“If you go to a business school or an engineering school, they’re not looking for creative writers,” Shermis says. “They’re looking for people who can communicate ideas. And that’s what the technology is best at” evaluating.
