Humans Fight Over Robo-Readers
Prominent writing instructor challenges a much-discussed study that found machines can grade student writing about as well as humans.
Les Perelman, the former director of writing at the Massachusetts Institute of Technology, known for taking on the SAT writing test, is challenging a well-publicized study that claims machines can grade writing exams about as well as humans.
During a presentation today to what is likely to be a friendly crowd at the Conference on College Composition and Communication convention in Las Vegas, Perelman plans to talk about his “critique” of the 2012 paper by Mark D. Shermis, the dean of the college of education at the University of Akron.
Shermis and co-author Ben Hamner, a data scientist, found automated essay scoring was capable of producing scores similar to human scores.
Perelman argues that a close examination of the paper's methodology and data shows "that such a claim is not supported by the data in the study." He contends the authors ran no statistical tests on the data, examined arbitrary variables, and essentially compared apples to oranges.
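Agreement between human and machine scores in studies like Shermis and Hamner's is typically summarized with a statistic such as quadratic weighted kappa, which penalizes large disagreements more heavily than small ones. Below is a minimal sketch of that statistic on hypothetical 1-6 essay scores; the function name and the data are illustrative, not taken from the study.

```python
# Quadratic weighted kappa (QWK), a common agreement statistic in
# essay-scoring research, computed here on hypothetical 1-6 scores.
import numpy as np

def quadratic_weighted_kappa(a, b, min_rating=1, max_rating=6):
    """Agreement between two raters; large disagreements count more."""
    a, b = np.asarray(a), np.asarray(b)
    n = max_rating - min_rating + 1
    # Observed joint distribution of the two raters' scores
    observed = np.zeros((n, n))
    for x, y in zip(a, b):
        observed[x - min_rating, y - min_rating] += 1
    observed /= len(a)
    # Expected joint distribution under chance (product of marginals)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0))
    # Quadratic disagreement weights: zero on the diagonal
    idx = np.arange(n)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n - 1) ** 2
    return 1 - (weights * observed).sum() / (weights * expected).sum()

human =   [4, 3, 5, 2, 4, 6, 3, 4]  # hypothetical human scores
machine = [4, 3, 4, 2, 5, 6, 3, 4]  # hypothetical machine scores
print(round(quadratic_weighted_kappa(human, machine), 3))  # → 0.908
```

A kappa of 1.0 means perfect agreement and 0 means chance-level agreement, so the dispute is partly over whether a high agreement number, on its own, supports the claim that machines grade "about as well" as humans.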
The new flashpoint in the machine grading dispute comes as the vast majority of states are planning to introduce new high-stakes tests for K-12 students with writing sections slated to be graded by machines.
Perelman thinks teachers will soon teach students to write to please robo-readers, which Perelman argues disproportionately give students credit for length and loquacious wording, even if they don't quite make sense. “The machine is rigged to try to get as close to the human scores as possible, but machines don’t understand meaning,” he said.
Shermis, who said he worked under machine grading pioneer Ellis Page, said Perelman’s criticisms mostly don't add up and accused Perelman of sniping from the sidelines.
“We’re at least doing some research and our critics seem to be a little bit short on that,” he said.
Shermis did concede one point: he did not do a regression analysis on the data in his study. He said that was part of the conditions he accepted in order to be able to test the essay grading software produced by a number of major vendors, including McGraw-Hill and Pearson. He said someone who knew what they were doing could take his work and do further analysis, though.
Shermis said his work is set to come out as a book chapter and that Assessing Writing, founded in 1994, is still reviewing his paper, which is in its third revision there; Perelman said he will submit his paper to the Journal of Writing Assessment, which was founded in 2003. (This paragraph has been corrected.)
Perelman, in particular, argues there is an iceberg ahead at the K-12 level, where two consortia of states are preparing to introduce entirely new high-stakes standardized exams to match the Common Core standards, which have swept the nation. The two consortia -- the Partnership for Assessment of Readiness for College and Careers and the Smarter Balanced Assessment Consortium -- are eyeing machine-graded essays as a way to reduce the time it takes to grade exams and to drive down costs to states.
Perelman wants a better vetting of the science, particularly Shermis’s paper, before that happens.
“Because of the widespread publicity surrounding this study and that its findings may be used by states and state consortia in implementing the Common Core State Standards Initiative, the authors should make the test dataset publicly available for analysis,” Perelman wrote in his paper.
Smarter Balanced has actually already scaled back its plans for grading writing with machines because artificial intelligence technology has not developed as quickly as it had once hoped.
In 2010, when it was starting to develop the new Common Core exams for its 24 member states, the group wanted to use machines to grade 100 percent of the writing.
“Our initial estimates were assuming we could do everything by machine, but we’ve changed that,” said Jacqueline King, a director at Smarter Balanced.
Now, 40 percent of the writing section, 40 percent of the written responses in the reading section and 25 percent of the written responses in the math section will be scored by humans.
“The technology hasn’t moved ahead as fast as we thought,” King said.