When humans write, they leave subtle signatures that hint at the prose’s fleshy, brainy origins. Their word and phrase choices are more varied than those selected by machines that write. Human writers also draw from short- and long-term memories that recall a range of lived experiences and inform personal writing styles. And unlike machines, people are susceptible to inserting minor typos, such as a misplaced comma or a misspelled word. Such attributes betray the text’s humanity.
For these reasons, AI-writing detection tools are often designed to “look” for human signatures hiding in prose. But signature hunting presents a conundrum for sleuths attempting to distinguish between human- and machine-written prose.
“If I’m a very intelligent AI and I want to bypass your detection, I could insert typos into my writing on purpose,” said Diyi Yang, assistant professor of computer science at Stanford University.
In this cat-and-mouse game, some computer scientists are working to make AI writers more humanlike, while others are working to improve detection tools. Academic fields make progress in this way. But some on the global artificial intelligence stage say this game’s outcome is a foregone conclusion.
“In the long run, it is almost sure that we will have AI systems that will produce text that is almost indistinguishable from human-written text,” Yoshua Bengio, the “godfather of AI” and recipient of the Turing Award, often referred to as the Nobel of computer science, told Inside Higher Ed in an email exchange. Bengio is a professor of computer science at the University of Montreal.
Nonetheless, the scientific community and higher ed have not abandoned AI-writing detection efforts—and Bengio views those efforts as worthwhile. Some are motivated to ferret out dishonesty in academic pursuits. Others seek to protect public discourse from malicious uses of text generators that could undermine democracies. (Educational technology company CEOs may have dollar signs in their eyes.) Still others are driven by philosophical questions concerning what makes prose human. Whatever the motivation, all must contend with one fact:
“It’s really hard to detect machine- or AI-generated text, especially with ChatGPT,” Yang said.
The ‘Burstiness’ of Human Prose
During the recent holiday break, Edward Tian, a senior at Princeton University, headed to a local coffee shop. There, he developed GPTZero, an app that seeks to detect whether a piece of writing was written by a human or ChatGPT—an AI-powered chatbot that interacts with users in a conversational way, including by answering questions, admitting its mistakes, challenging falsehoods and rejecting inappropriate requests. Tian’s effort took only a few days but was based on years of research.
His app relies on two writing attributes: “perplexity” and “burstiness.” Perplexity measures the degree to which ChatGPT is perplexed by the prose; a high perplexity score suggests that ChatGPT may not have produced the words. Burstiness is a big-picture indicator that plots perplexity over time.
“For a human, burstiness looks like it goes all over the place. It has sudden spikes and sudden bursts,” Tian said. “Versus for a computer or machine essay, that graph will look pretty boring, pretty constant over time.”
Tian and his professors hypothesize that the burstiness of human-written prose may be a consequence of human creativity and short-term memories. That is, humans have sudden bursts of creativity, sometimes followed by lulls. Meanwhile, machines with access to the internet’s information are somewhat “all-knowing” or “kind of constant,” Tian said.
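GPTZero’s actual scoring method is not public. As a rough illustration of the perplexity-and-burstiness idea, the sketch below trains a toy bigram model and treats the variance of per-sentence perplexity as “burstiness.” The function names, the bigram model and the add-one smoothing are all illustrative assumptions, not Tian’s implementation.

```python
import math
from collections import Counter

def train_bigram(corpus_tokens):
    # Count unigrams and bigrams; smoothing is applied at scoring time.
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab_size = len(unigrams)
    return unigrams, bigrams, vocab_size

def sentence_perplexity(tokens, unigrams, bigrams, vocab_size):
    # Perplexity: exponential of the average negative log-probability
    # per token. Higher means the model finds the text more surprising.
    if len(tokens) < 2:
        return float("inf")
    log_prob = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        # Add-one (Laplace) smoothing so unseen bigrams get nonzero probability.
        p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / (len(tokens) - 1))

def burstiness(sentences, model):
    # "Burstiness" here: how much per-sentence perplexity varies.
    # Human prose is hypothesized to spike and dip; machine prose stays flat.
    unigrams, bigrams, vocab_size = model
    scores = [sentence_perplexity(s.split(), unigrams, bigrams, vocab_size)
              for s in sentences]
    mean = sum(scores) / len(scores)
    variance = sum((x - mean) ** 2 for x in scores) / len(scores)
    return scores, variance
```

In this toy setup, a sentence full of bigrams the model has seen scores low perplexity, while an unfamiliar sentence scores high; a document mixing both would show the spiky perplexity-over-time graph Tian describes.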
Upon releasing GPTZero to the public on Jan. 2, Tian expected a few dozen people to test it. But the app went viral. Since its release, hundreds of thousands of people from most U.S. states and more than 30 countries have used the app.
“It’s been absolutely crazy,” Tian said, adding that several venture capitalists have reached out to discuss his app. “Generative AI and ChatGPT technology are brilliantly innovative. At the same time, it’s like opening Pandora’s box … We have to build in safeguards so that these technologies are adopted responsibly.”
Tian does not want teachers to use his app as an academic honesty enforcement tool. Rather, he is driven by a desire to understand what makes human prose unique.
“There is something implicitly beautiful in human writing,” said Tian, a fan of writers like John McPhee and Annie Dillard. “Computers are not coming up with anything original. They’re basically ingesting gigantic portions of the internet and regurgitating patterns.”
Detectors Without Penalties
Much like weather-forecasting tools, existing AI-writing detection tools deliver verdicts in probabilities. As such, even a high probability score is no guarantee that a given author was human or machine.
“The big concern is that an instructor would use the detector and then traumatize the student by accusing them, and it turns out to be a false positive,” Anna Mills, an English instructor at the College of Marin, said of the emergent technology.
But professors may introduce AI-writing detection tools to their students for reasons other than honor code enforcement. For example, Nestor Pereira, vice provost of academic and learning technologies at Miami Dade College, sees AI-writing detection tools as “a springboard for conversations with students.” That is, students who are tempted to use AI writing tools to misrepresent or replace their writing may reconsider in the presence of such tools, according to Pereira.
For that reason, Miami Dade uses a commercial software platform—one that provides students with line-by-line feedback on their writing and moderates student discussions—that has recently embedded AI-writing detection. Pereira has endorsed the product in a press release from the company, though he affirmed that neither he nor his institution received payment or gifts for the endorsement. He did, however, acknowledge that his endorsement has limits.
“We’re definitely worried about false positives,” Pereira told Inside Higher Ed. “I’m also worried about false negatives.”
Beyond discussions of academic integrity, faculty members are talking with students about the role of AI-writing detection tools in society. Some view such conversations as a necessity, especially since AI writing tools are expected to be widely available in many students’ postcollege jobs.
“These tools are not going to be perfect, but … if we’re not using them for gotcha purposes, they don’t have to be perfect,” Mills said. “We can use them as a tool for learning.” Professors can use the new technology to encourage students to engage in a range of productive ChatGPT activities, including thinking, questioning, debating, identifying shortcomings and experimenting.
Also, on a societal level, detection tools may aid efforts to protect public discourse from malicious uses of text generators, according to Mills. For example, social media platforms, which already use algorithms to make decisions about which content to boost, could use the tools to guard against bad actors. In such cases, probabilities may work well.
“We have to fight to preserve that humanity of communication,” Mills said.
A Long-Term Challenge
In an earlier era, a birth mother who anonymously placed a child with adoptive parents with the assistance of a reputable adoption agency may have felt confident that her parentage would never be revealed. All that changed when quick, accessible DNA testing from companies like 23andMe empowered adoptees to access information about their genetic legacy.
Though today’s AI-writing detection tools are imperfect at best, any writer hoping to pass an AI writer’s text off as their own could be outed in the future, when detection tools may improve.
“We need to get used to the idea that, if you use a text generator, you don’t get to keep that a secret,” Mills said. “People need to know when it’s this mechanical process that draws on all these other sources and incorporates bias that’s actually putting the words together that shaped the thinking.”
Tian’s GPTZero is not the first app for detecting AI writing, nor is it likely to be the last.
OpenAI—ChatGPT’s developer—considers detection efforts a “long-term challenge.” Its research on GPT-2-generated text indicates that its detection tool works approximately 95 percent of the time, which is “not high enough accuracy for standalone detection and needs to be paired with metadata-based approaches, human judgment, and public education to be more effective,” according to OpenAI. Detection accuracy also depends heavily on the sampling methods used to produce the training and test text, and on whether training covered a range of sampling techniques, according to the study.
After-the-fact detection is only one approach to the problem of distinguishing between human- and computer-written text. OpenAI is attempting to “watermark” ChatGPT text by embedding an “unnoticeable secret signal” indicating that the text was generated by ChatGPT. The signal would be discoverable only by those with the “key” to a cryptographic function—a mathematical technique for secure communication. The work is forthcoming, but some researchers and industry experts have already expressed doubts about watermarking’s potential, citing concerns that workarounds may be trivial.
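OpenAI has not published the details of its scheme. One approach described in the research literature uses the secret key to seed a pseudorandom “green list” of favored tokens at each step; the generator nudges its choices toward green tokens, and only a key-holder can recount how often the text landed on them. The sketch below is a minimal, hypothetical illustration of that idea, not OpenAI’s method.

```python
import hashlib
import random

def green_list(prev_token, key, vocab, fraction=0.5):
    # Seed a PRNG with the secret key plus the previous token, then
    # partition the vocabulary so a fixed fraction is "green" (favored).
    digest = hashlib.sha256((key + prev_token).encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    shuffled = sorted(vocab)
    rng.shuffle(shuffled)
    return set(shuffled[: int(len(shuffled) * fraction)])

def detect(tokens, key, vocab, fraction=0.5):
    # With the key, recompute each step's green list and measure how often
    # the text chose a green token. A rate well above `fraction` suggests
    # the text was generated with the watermark; without the key, the
    # partition looks random and the signal is invisible.
    hits = sum(cur in green_list(prev, key, vocab, fraction)
               for prev, cur in zip(tokens, tokens[1:]))
    return hits / (len(tokens) - 1)
```

A generator holding the key would sample mostly from each step’s green list, pushing the detected rate far above the baseline fraction; ordinary human text should hover near it. The cited worry about trivial workarounds applies here too: paraphrasing or reordering the tokens scrambles the green-list hits.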
Turnitin has announced that it has an AI-writing detection tool in development, which it has trained on “academic writing sourced from a comprehensive database, as opposed to solely publicly available content.” But some academics are wary of commercial products for AI detection.
“I don’t think [AI-writing detectors] should be behind a paywall,” Mills said.
Higher Ed Adapts (Again)
“Think about what we want to nurture,” said Joseph Helble, president of Lehigh University. “In the pre-internet and pre-generative-AI ages, it used to be about mastery of content. Now, students need to understand content, but it’s much more about mastery of the interpretation and utilization of the content.”
ChatGPT calls on higher ed to rethink how best to educate students, Helble said. He recounted the story of an engineering professor he knew years ago who assessed students by administering oral exams. The exams scaled with a student in real time, so every student was able to demonstrate something. Also, the professor adapted the questions while administering the test, which probed the limits of students’ knowledge and comprehension. At the time, Helble considered the approach “radical” and concedes that, even now, it would be challenging for professors to implement. “But the idea that [a student] is going to demonstrate ability on multiple dimensions by going off and writing a 30-page term paper—that part we have to completely rethink.”
Helble is not the only academic who floated the idea of replacing some writing assignments with oral exams. Artificial intelligence, it turns out, may help overcome potential time constraints in administering oral exams.
“The education system should adapt [to ChatGPT’s presence] by focusing more on understanding and creativity and using more expensive oral-based evaluations, like oral exams, or exams without permission to use technology,” Bengio said, adding that oral exams need not be done often. “When we get to that point where we can’t detect if a text is written by a machine or not, those machines should also be good enough to run the [oral] exams themselves, at least for the more frequent evaluations within a school term.”