An Ultra-Sophisticated Language AI Shows the Way Toward Better Writing Assessment

The Generative Pre-trained Transformer 3 (GPT-3), an AI language processing tool that can interpret and translate text,[1] is pretty amazing.

It’s so amazing that thinking about how it performed on an experiment to see if artificial intelligence could pass a college-level writing assessment helped me better articulate something I’ve known, but haven’t precisely expressed, about how we teach, assign and, most importantly, assess writing.

GPT-3 has all kinds of practical applications, including translation, rapid data processing from text commands and other things that are illustrated in this short demo.

Kind of like how IBM sought to demonstrate the power of its supercomputing applications by creating Deep Blue to play chess, one of the interesting challenges for GPT-3 is to see if it can successfully “write” in ways that pass for human.

It can. GPT-3 does quite well at producing writing that passes for human-generated. The AI processes and arranges natural language with far more felicity than any previous application.

The folks at EduRef.net took the challenge a step further, to see what kind of grades GPT-3 could get on college assignments designed by professors and graded by a panel, with GPT-3’s work anonymously mixed among efforts by human writers.

For three of the four subjects -- research methods, U.S. history and law -- the AI passed with a C or better grade. The top human scored better than the AI in each category, but in two categories (research methods, law), the AI beat one or more of the humans.

For the fourth subject -- creative writing -- the AI tanked, earning an F.

Where GPT-3 does well and where it does poorly point the way toward a better understanding of the kinds of written assessments we should want students to do, and how we should assess those assignments once they’re written. A good rule of thumb is that if the AI is competitive with the humans, we’re probably doing something wrong.

A good writing assessment should work in two dimensions:

It should allow the author to demonstrate existing knowledge.
The writing itself should lead to the creation of additional new knowledge made possible only through the work of a unique intelligence (a person).

In other words, the writing itself creates knowledge that the writer did not possess prior to writing the piece. The writing is an act of discovery for the author, which in turn delivers that discovery to the reader. Now, it's possible that the reader, particularly a highly informed expert one, may not be blown away by the insight, but it should be apparent that the student achieved a genuine insight through critical engagement with the material.

Please do not think this is a high bar. It isn’t if the writing experience is properly designed, the author is properly engaged and the tools of assessment are appropriate to our goals.

On the U.S. history essay assignment on “American exceptionalism,” GPT-3 earned a B, doing well with the part of the prompt that required it to demonstrate understanding of the concept. It employed an opening that restates the prompt and makes the scope of the answer clear before moving on to define “American exceptionalism.”

“In this paper, I will be addressing the issue of American exceptionalism. I will first define American exceptionalism and then give examples of what it silences. Finally, I will discuss the possibilities of creating productive communities and identities that defy a tradition of subjugation.”

GPT-3 struggles, however, with the interpretive part of the prompt, which requires the author to think critically and theorize around a series of propositions meant to spur thinking that hopefully leads the author toward some original discovery or interpretation. GPT-3 is sort of hopelessly banal here, arguing that the key moving forward is to have an “open dialogue about American exceptionalism. Open dialogue allows people from all walks of life to come together and discuss their differences without judgement or prejudice. This helps people become more tolerant of each other’s beliefs.”

This is the kind of conclusion where the human assessor sighs and awards a B and moves on to the next essay, knowing that this student pretty much paid attention in class and maybe picked up a couple of things, but also … sigh. That the student sounds like a somewhat stilted version of the kicker of a typical David Brooks column is sufficient proof that they deserve an above-average grade, but let’s ask ourselves, did this student truly learn anything as they were writing?

This effort does not fulfill the second part of my proposition. It is clear that while information was regurgitated, nothing was discovered.[2]

Had this been produced by a human student, would the assignment have been worth the time it took the student to write and the professor to read? Is this student capable of truly engaging with the deep questions around this topic, or have they done just enough to pass a rather uninspired threshold?

Couldn’t we do better?

The answer is yes.

In this case, the problem is not in the assignment, which is actually well designed in requiring the synthesis of ideas and concepts that should lead to the creation of new knowledge.

In this case the problem is in our well-trodden patterns of how we assess student work in the context of school. The response is grammatical, it demonstrates some familiarity with the course and it is not wrong in any significant way.

It is also devoid of any signs that a human being wrote it, which, unfortunately, does not distinguish it from the kinds of writing students are often asked to do in school contexts. That is rather distressing to consider, but let’s put it aside for the moment.

When confronted with this kind of work, what if we did something differently?

What if we replaced that … sigh … B with a “not complete, try again”?

Because honestly, isn’t that a more appropriate grade than the polite pat on the head that the B signals in this case?

The “try again” would of course be accompanied by feedback and examples that demonstrate the gap between what has been produced and what is desired, but if we are serious about students demonstrating genuine, meaningful learning, shouldn’t we make them go back to the well until they return with evidence of it?

The human grader response to the creative writing prompt shows a different approach to assessment. The assignment asks for a place-based narrative, a story that utilizes “plot, conflict, characterization, dialogue and imagery.”

The GPT-3 response is perfectly grammatical and sensible as it describes how amazing it is to experience the grounds of Yale University and its Gothic architecture, which is reminiscent of Hogwarts. It also slips in that it was accepted to Stanford in addition to Yale. If GPT-3 were a human student rather than an AI, this would’ve been their first failing grade ever.

The response also reads like it was done by a student who barely read the prompt and phoned in a personal essay. It has none of the elements that the assignment asks for and is rated as failing because it is nonresponsive to the goals of the assignment -- to produce a narrative that demonstrates those storytelling elements.

The creative writing assignment essentially has a single-criterion rubric that a response must satisfy to merit deeper consideration: Is it a story?

If story = NO, then grade = F.
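For the programmatically inclined, that gate can be sketched in a few lines of Python. This is only an illustration of the rule as I've described it, not the graders' actual procedure: the function name `grade_submission` and its parameters are hypothetical, and the judgment of whether something is a story still has to come from a human reader.

```python
from typing import Optional


def grade_submission(is_story: bool, deeper_grade: Optional[str] = None) -> str:
    """Apply the single-criterion gate before any finer-grained assessment.

    is_story     -- a human reader's yes/no judgment (assumed input).
    deeper_grade -- hypothetical grade from the fuller evaluation, if one was done.
    """
    # If story = NO, then grade = F, no matter how polished the prose is.
    if not is_story:
        return "F"
    # Only a response that clears the gate earns deeper consideration.
    if deeper_grade is None:
        return "merits deeper consideration"
    return deeper_grade


# A grammatical personal essay that is not a story still fails.
print(grade_submission(is_story=False, deeper_grade="A"))
```

The point of the ordering is that surface polish never gets a chance to compensate for a missing story.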

Many will recognize this approach as a form of contract grading, and it is far more rigorous than a traditional grading system that allows student somnambulism to pass through with passing grades for efforts that resemble writing but are the literal definition of going through the motions.

In my creative writing classes, I often used a similar one-point criterion for the detection of what I called “life” or “energy,” some spark that indicated a lively and creative intelligence was at work in the story.

Life or energy could be found in any one of those elements: plot, conflict, dialogue, characterization, imagery. Ideally, energy would be present in all elements, but as long as it was present in at least one, we knew we had something to work with, a spark that could become a flame, then a fire and, over time, even an inferno.

If we want students to demonstrate that spark, we have to create assessments that make that possible. More importantly, we need to hold them accountable to the criteria that are truly meaningful as evidence of knowledge creation. Allowing proficiency on surface-level criteria to count as passing when deeper engagement and knowledge are absent doesn't do students any favors.

As of yet, and I think probably unless and until AI can achieve actual sentience, the algorithm will not be able to produce writing that sparks.

The more important question for me, though, is why we’ve created a system that allows so much student writing to flow by that looks like it could’ve been written by this super AI.

We can do better than that.

[1] It’s more sophisticated than that sounds. I recommend the linked video to see how that concept applies in practical settings.

[2] If you’re at all curious, I strongly encourage you to read the whole thing.
