Can You Trust Automated Grading?
FAIRFAX, VA. -- If a computer can win at "Jeopardy," can one grade the essays of freshmen?
At George Mason University Saturday, at the Fourth International Conference on Writing Research, the Educational Testing Service presented evidence that a pilot test of automated grading of freshman writing placement tests at the New Jersey Institute of Technology showed that computer programs can be trusted with the job. The NJIT results represent the first "validity testing" -- in which a series of tests is conducted to make sure that the scoring is accurate -- that ETS has conducted of automated grading of college students' essays. Based on the positive results, ETS plans to sign up more colleges to grade placement tests in this way -- and is already doing so.
But a writing scholar at the Massachusetts Institute of Technology presented research questioning the ETS findings, and arguing that the testing service's formula for automated essay grading favors verbosity over originality. Further, the critique suggested that ETS was able to get good results only because it tested short-answer essays with limited time for students -- and an ETS official admitted that the testing service has not conducted any validity studies on longer-form writing produced with more time.
Chaitanya Ramineni, an ETS researcher, outlined the study of NJIT's use of the testing service's E-Rater to grade writing placement essays. NJIT has freshmen write answers to short essay prompts and uses four prompts, arranged in various configurations of two prompts per student, with 30 minutes to write.
The testing service compared the results of E-Rater evaluations of students' papers to human grading, and to students' scores on the SAT writing test and the essay portion of the SAT writing test (which is graded by humans). ETS found very high correlations between the E-Rater grades and the SAT grades and, generally, between the E-Rater grades and the human grades of the placement test.
In fact, Ramineni said, one of the problems that surfaced in the review was that some humans doing the evaluation were not scoring students' essays on some prompts in consistent ways, based on the rubric used by NJIT. While many writing instructors may not trust automated grading, she said, it is important to remember that "human scoring suffers from flaws."
Andrew Klobucar, assistant professor of humanities at NJIT, said that he has also noticed a key change in student behavior since the introduction of E-Rater. One of the constant complaints of writing instructors is that students won't revise. But at NJIT, Klobucar said, first-year students are willing to revise essays multiple times when they are reviewed through the automated system, and in fact have come to embrace revision if it does not involve turning in papers to live instructors.
Students appear to view handing in multiple versions of a draft to a human to be "corrective, even punitive," in ways that discourage them, he said. Their willingness to submit drafts to E-Rater is a huge advance, he said, given that "the construction and revision of drafts is essential" for the students to become better writers.
After the ETS and NJIT presentations encouraging the use of automated grading, Les Perelman came forward as, he said, "the loyal opposition" to the idea. Perelman, director of writing across the curriculum at MIT, has a wide following among writing instructors for his critiques of standardized writing tests -- even when graded by people.
He may be best known for his experiments psyching out the College Board by figuring out which words earn students high grades on the SAT essay, and then having students write horrific prose using those words, and earn high scores nonetheless.
Perelman did not dispute the possibility that automated essay grading may correlate highly with human grading in the NJIT experiment. The problem, he said, is that his research has demonstrated that there is a flaw in almost all standardized grading of short essays: In the short essay, short time limit format, scoring correlates strongly with essay length, so the person who gets the most words on paper generally does better -- regardless of writing quality, and regardless of human or computer grading.
In four separate studies of the SAT essay tests, Perelman explained, high correlations were found between length and score. Other writing tests -- with times of one hour instead of times of 25 minutes -- found that the correlation between length and score dropped by half. In more open-ended writing assignments, the correlation largely disappeared, he said.
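The kind of length-to-score correlation Perelman describes can be illustrated with a minimal sketch. The word counts and scores below are invented for illustration and are not data from the SAT studies or from NJIT; the helper name `pearson` is likewise an assumption of this example.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical essays: word count and the score each received (1-6 scale).
word_counts = [120, 180, 250, 310, 400]
scores = [2, 3, 3, 5, 6]

r = pearson(word_counts, scores)
print(f"correlation between length and score: {r:.2f}")
```

A value of r close to 1 would mean that longer essays almost always scored higher; a value near 0 would mean length told a reader little about the score.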
After reviewing these nine tests, he said that for any formula to work (grading by humans in short time periods, but especially grading by computer), the values that are rewarded are likely to be suspect.
Perelman then critiqued the qualities that go into the ETS formula for automated grading. For instance, many parts of the formula look at ratios -- the ratio of grammar error to total number of words, ratio of mechanics errors to word count, and so forth. Thus someone who writes lots of words, and keeps them simple (even to the point of nonsense), will do well.
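Perelman's point about ratios can be made concrete with a small sketch. This is not the E-Rater formula; the function `error_ratio`, the sample sentences, and the hand-supplied error count are all assumptions made for illustration.

```python
def error_ratio(error_count, text):
    """Errors per word: a hypothetical stand-in for a ratio-based feature."""
    words = text.split()
    if not words:
        raise ValueError("text contains no words")
    return error_count / len(words)

# One grammar error ("Their" for "They're") in a six-word essay.
short_essay = "Their going to the store tomorrow."

# The same essay padded with simple, error-free filler sentences.
filler = "It is a very good and nice day. " * 5
padded_essay = short_essay + " " + filler.strip()

short_ratio = error_ratio(1, short_essay)
padded_ratio = error_ratio(1, padded_essay)
print(f"short: {short_ratio:.3f}  padded: {padded_ratio:.3f}")
```

The error count stays at one in both cases, but the padded version has a much lower ratio simply because the denominator grew.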
ETS says its computer program tests "organization" in part by looking at the number of "discourse units" -- defined as having a thesis idea, a main statement, supporting sentences and so forth. But Perelman said that the reward in this measure of organization is for the number of units, not their quality. He said that under this rubric, discourse units could be shuffled into any order and would receive the same score -- based on quantity.
Other parts of the formula, he noted, punish creativity. For instance, the computer judges "topical analysis" by favoring "similarity of the essay's vocabulary to other previously scored essays in the top score category." "In other words, it is looking for trite, common vocabulary," Perelman said. "To use an SAT word, this is egregious." Word complexity is judged, among other things, by average word length, so, he suggested, students are rewarded for using "antidisestablishmentarianism," regardless of whether it really advances the essay. And the formula also explicitly rewards length of essay.
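The average-word-length measure Perelman mentions is easy to sketch. The function below is a hypothetical illustration of that one idea, not the ETS implementation, and the two sample sentences are invented.

```python
def average_word_length(text):
    """Mean number of characters per word, ignoring surrounding punctuation."""
    words = [w.strip('.,;:!?"\'()') for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    if not words:
        raise ValueError("text contains no words")
    return sum(len(w) for w in words) / len(words)

plain = "The church and the state should be kept apart."
inflated = ("Antidisestablishmentarianism notwithstanding, "
            "ecclesiastical institutions necessitate separation.")

plain_avg = average_word_length(plain)
inflated_avg = average_word_length(inflated)
print(f"plain: {plain_avg:.2f}  inflated: {inflated_avg:.2f}")
```

Under a measure like this, the second sentence looks far more "complex" than the first, whether or not it says anything more.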
Perelman went on to show how Lincoln would have received a poor grade on the Gettysburg Address, since it was short and to the point (except perhaps for starting with "four score"). And he showed how the ETS rubric directly contradicts most of George Orwell's legendary rules of writing.
For instance, he noted that Orwell instructed us to "never use a metaphor, simile, or other figure of speech which you are used to seeing in print," to "never use a long word where a short one will do" and that "if it is possible to cut a word out, always cut it out." ETS would take off points for following all of that good advice, he said.
Perelman ended his presentation by flashing an image that he said represented the danger of going to automated grading just because we can: Frankenstein.
Paul Deane, of the ETS Center for Assessment, Design and Scoring, responded to Perelman by saying that he agreed with him on the need for study of automated grading of longer essays and of writing produced over longer periods of time than 30 minutes. He also said that ETS has worked safeguards into its program so that if someone, for instance, used words like "antidisestablishmentarianism" repeatedly, the person would not be able to earn a high score with a trick.
Generally, he said that the existing research is sufficient to demonstrate the value of automated grading, provided that it is used in the right ways. The computer "won't tell you if someone has written a prize essay," he said. But it can tell you if someone has "knowledge of academic English" and whether someone has the "fundamental skills" needed -- enough information to use in placement decisions, along with other tools, as is the case at NJIT.
Automated grading evaluates "key parts of the writing construct," Deane said, even if it doesn't identify all of the writing skills or deficits of a given student.