English teachers' group criticizes machine scoring

Does Not Compute

A new position statement from the National Council of Teachers of English says machine scoring of essays is easily "gamed" and can't grasp the same elements people can.

A new report from the National Council of Teachers of English criticizes the use of machine scoring for writing assessments.

"Machine Scoring Fails the Test," NCTE’s new position statement, argues that computers lack the capacity to accurately grade essays and other writing assignments. The council draws its conclusions from various pieces of scholarship on machine scoring, cited in full in the statement.

Computer grading of essays has become a hot-button issue amid a broader debate over the expanded role of technology in education, and whether the increased presence of technology comes at the cost of autonomy. This has been especially evident as new ideas such as massive open online courses (MOOCs) are tested and advocates seek to expand the number of students educated without significantly increasing spending on instruction.

“Writing is a highly complex ability developed over years of practice, across a wide range of tasks and contexts, and with copious, meaningful feedback,” the statement reads. “Students must have this kind of sustained experience to meet the demands of higher education, the needs of a 21st-century workforce, the challenges of civic participation, and the realization of full, meaningful lives.”

According to the NCTE, computer scoring fails to live up to this for myriad reasons. A major factor is a computer’s inability to recognize elements of writing such as clarity, humor or accuracy; thus, machine scoring “denies students the chance to have anything but limited features recognized in their writing” and “compels teachers to ignore what is most important in writing instruction in order to teach what is least important.”

In the absence of a human perspective, the report says, machine scoring uses “different, cruder methods,” such as the average length of words used or the length and number of sentences per paragraph. Narrow, overly objective criteria like these, according to the NCTE, also “reduc[e] the incentive for teachers to develop innovative and creative occasions for writing, even for assessment.”

“The very qualities that we associate most strongly with good writing are qualities that it’s extremely difficult for a computer to recognize,” Chris Anson, chair of NCTE’s Conference on College Composition and Communication, said in an interview. “The computers can’t make inferences. They can’t understand what they’re reading; they can only look for specific features they’ve been programmed to look at, and those are mostly surface kind of features.”

Furthermore, computer grading makes it easier for students to “game the system,” according to both the report and Anson. “For example if a computer is programmed to look for certain features of text … and if it awards higher scores to words that are more rare in its lexicon, and if students know this, then it’s fairly easy to drop in a few of those rare words to an essay and receive a higher score as a result,” said Anson, who also headed the task force that wrote the statement.

The University of Akron’s Mark Shermis, lead author of a 2012 study suggesting that computers can grade essays as effectively as human beings can, said via e-mail that the statement “fails to make the distinction between scoring used for summative assessment and that employed in the process for providing feedback in the instruction of writing.” Shermis said that the study, sponsored by the William and Flora Hewlett Foundation, “showed the capacity of the scoring software to meet, and sometimes exceed, the distributional and agreement metrics that are commonly used to evaluate human raters in a high-stakes testing environment.”

Shermis also took issue with the idea of computer grading being easy to fool. “[O]ne has to be a good writer to construct the ‘bad’ essay that gets a good score,” Shermis said. “A Ph.D. from MIT can do it, but a typical 8th grader cannot.”

Shermis asserted that the NCTE’s findings were driven by the group's agenda rather than hard evidence. “Taken together it would appear as if the NCTE statement is more political posturing than based on scientific merit,” he said. “Rather than be characterized as irrelevant, the proponents of the statement might consider a more careful outline of the current limitations of the technology and where the next wave of development and research should take place.”

The report suggests several alternative forms of writing assessment, such as teacher assessment teams, portfolio-based grading and assessments created with the involvement of individual districts and classrooms.