aiessaydetector.ai

Pillar guide · 2026 edition

How AI detectors actually work.

Perplexity, burstiness, stylometry, classifiers, and why the same essay can score 87% on one tool and 12% on another.

Published 2026-02-11 · Updated 2026-04-18 · Editorial Team

1. What AI detectors are actually measuring.

An AI detector doesn't "know" whether a passage was written by a human. It measures statistical properties of the text and compares them to patterns it's seen from large language models. The two workhorse signals are perplexity (how surprised a language model is by the next word, averaged across the text) and burstiness (how much sentence-to-sentence variation there is in that perplexity).
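As a rough sketch of the perplexity signal, assume a scoring language model has already assigned a probability to each word that actually appears in the text. The probabilities below are invented for illustration, and `perplexity` is a hypothetical helper, not the API of any particular detector.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_neg_log_prob = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_neg_log_prob)

# Hypothetical probabilities the scoring model gave to each actual next token.
predictable_text = [0.5, 0.5, 0.5, 0.5]    # the model is rarely surprised
surprising_text = [0.5, 0.01, 0.5, 0.02]   # a few unexpected word choices

print(perplexity(predictable_text))
print(perplexity(surprising_text))
```

The second passage scores a much higher perplexity because two of its words were ones the model considered very unlikely. Low perplexity means "the model would have guessed this," which is the property detectors associate with machine output.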

Human writing tends to be bursty: a short punchy sentence, then a long meandering one, then a fragment. Large language models, fine-tuned to produce smooth, defensible prose, tend to flatten this variation. An essay whose ten sentences all land in a narrow perplexity band is a statistical outlier among human writing, so it reads as machine-like to a detector even if it was written entirely by a careful human.
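Burstiness has no single agreed definition. One simple stand-in, used here purely as an assumption, is the standard deviation of per-sentence perplexity. The per-sentence values below are invented, and `burstiness` is a hypothetical helper.

```python
import statistics

def burstiness(sentence_perplexities):
    """Spread of per-sentence perplexity (sample standard deviation).

    This is one illustrative definition; real tools may use others.
    """
    if len(sentence_perplexities) < 2:
        raise ValueError("need at least two sentences")
    return statistics.stdev(sentence_perplexities)

# Hypothetical per-sentence perplexities for two five-sentence passages.
varied_passage = [12.0, 85.0, 30.0, 140.0, 22.0]  # short, long, fragment...
flat_passage = [28.0, 31.0, 29.0, 30.0, 32.0]     # every sentence alike

print(burstiness(varied_passage))
print(burstiness(flat_passage))
```

The flat passage has a far smaller spread. Under this definition a careful, even-handed human writer and a language model can produce the same low value, which is exactly the ambiguity described above.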

On top of perplexity and burstiness, modern detectors add stylometric features (average sentence length, word-length distribution, punctuation rhythm, hedging-phrase frequency, passive-voice rate) and sometimes a second classifier trained end-to-end on labeled human vs. AI samples.
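A minimal sketch of what a few of these stylometric features might look like in code. The hedge-word list and feature names are assumptions made for illustration. Passive-voice rate is left out because it needs a real parser.

```python
import re

# Hypothetical hedging vocabulary; a real detector would use a far larger list.
HEDGE_WORDS = {"perhaps", "arguably", "generally", "might", "may", "possibly"}

def stylometric_features(text):
    """Return a few simple surface statistics for a passage of English text."""
    # Naive sentence split: break on whitespace that follows ., ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    words = re.findall(r"[A-Za-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "commas_per_sentence": text.count(",") / len(sentences),
        "hedge_rate": sum(w in HEDGE_WORDS for w in words) / len(words),
    }

sample = "Perhaps it rained. The match, arguably, might still go ahead."
features = stylometric_features(sample)
for name, value in features.items():
    print(f"{name}: {value:.2f}")
```

None of these numbers means anything on its own. A detector feeds many such features into a classifier, and the weighting it learns is one reason different tools disagree.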

2. Where detectors fail.

Three failure modes matter in a classroom:

  • Formal-register false positives. Non-native English writers (including ESL students) and writers trained in rigid academic conventions often produce text with the same low-burstiness, low-perplexity signature as an LLM. A careful, rule-following student writes "flat" prose. The detector sees flat prose. Flag.
  • Light-touch editing evasions. Someone who writes an AI first draft and then rewrites every fifth sentence by hand will usually pass. The burstiness of the hand-rewritten sentences is enough to break the signal.
  • Short passages. Below about 250 words, none of these signals are statistically stable. Detectors that claim high confidence on a paragraph are overreaching.

A responsible detector reports its score together with a confidence interval around that score. A detector that returns "99% AI-generated" on a 120-word paragraph is selling you a number it can't actually estimate.
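One way to see why short passages cannot support high confidence is to bootstrap the average of per-sentence scores. The sketch below assumes each sentence already has a hypothetical "AI-likeness" score between 0 and 1. The percentile bootstrap is just one illustrative way to get an interval, and the function name is made up.

```python
import random
import statistics

def bootstrap_interval(scores, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-sentence scores."""
    if len(scores) < 2:
        raise ValueError("need at least two scores")
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical per-sentence scores in [0, 1].
short_passage = [0.9, 0.4, 0.8, 0.3, 0.7]
long_essay = short_passage * 8  # same spread, eight times as many sentences

short_lo, short_hi = bootstrap_interval(short_passage)
long_lo, long_hi = bootstrap_interval(long_essay)
print(f"5 sentences:  {short_lo:.2f} to {short_hi:.2f}")
print(f"40 sentences: {long_lo:.2f} to {long_hi:.2f}")
```

Both passages have the same average score, but the five-sentence interval is much wider. Treating sentences as independent samples is itself an idealization, so real uncertainty on short text is, if anything, larger than this suggests.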

3. Why sentence-level evidence matters more than a single percentage.

An essay-level score, "this document is 68% AI-generated", is almost useless to a teacher. 68% of what? Which sentences? Which paragraphs? What am I supposed to say to the student when they ask me where I think they cheated?

Sentence-level highlighting (which we do, and which is closely related to what the research literature calls "token-level attribution") shifts the conversation. Instead of "your essay was flagged," the conversation becomes "these three passages look statistically similar to AI output; can you walk me through how you wrote them?" That's a teachable conversation. The essay-level number is a fight.
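To make the contrast concrete, here is a minimal sketch of turning per-sentence scores into something a teacher can point at. The scores, the 0.8 threshold, and the `flag_sentences` helper are all hypothetical; this is not a description of how any specific product highlights text.

```python
def flag_sentences(sentences, scores, threshold=0.8):
    """Return (position, sentence, score) for sentences at or above threshold."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    return [
        (position, sentence, score)
        for position, (sentence, score) in enumerate(zip(sentences, scores), start=1)
        if score >= threshold
    ]

essay = [
    "My grandmother kept bees behind the garage.",
    "Beekeeping plays a crucial role in sustaining global ecosystems.",
    "I got stung twice the first summer.",
    "In conclusion, pollinators are vital to agricultural productivity.",
]
scores = [0.15, 0.91, 0.10, 0.88]  # hypothetical per-sentence scores

flagged = flag_sentences(essay, scores)
for position, sentence, score in flagged:
    print(f"sentence {position} ({score:.2f}): {sentence}")
```

The average of these four scores is about 0.51, which says very little. The flagged list says which two sentences to ask about.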

4. What to tell students about detectors.

Three things are worth being honest about:

  1. Detectors make mistakes, especially on non-native English writers and on formal academic prose.
  2. A detector output is evidence, not a verdict. It prompts a conversation; it doesn't replace one.
  3. If you used AI to help you draft, the honest move is to disclose it. The dishonest move is to run the output through a "humanizer" and hope the detector doesn't catch you, which sometimes works, and sometimes lands the student in far worse trouble than disclosure would have.

5. Further reading.

  • Liang et al., "GPT detectors are biased against non-native English writers," Patterns 4(7), 2023, the foundational paper on the ESL false-positive problem.
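  • Sadasivan et al., "Can AI-generated text be reliably detected?," arXiv preprint, 2023, shows that paraphrasing attacks can sharply reduce the accuracy of a range of detectors.
  • Mitchell et al., "DetectGPT: Zero-shot machine-generated text detection using probability curvature," ICML 2023, a widely cited zero-shot method that detects machine-generated text from the curvature of a language model's log-probability function.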

Frequently asked questions

Can I show this article to my students?
Please do; it's written for that. Link them to /blog/how-ai-detectors-work directly.
Why do different detectors disagree so much?
They're trained on different corpora, use different base language models for perplexity scoring, and weight stylometric features differently. A 10–40 percentage-point disagreement between two reputable detectors on the same essay is normal.
Is there any detector that's right 100% of the time?
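No. Anyone claiming that number is selling a product, not science. The realistic ceiling for essay-level classification of real-world student writing is roughly 90–95% precision. Even at that ceiling, 5–10% of flagged essays are actually human-written, and recall (the share of AI-written essays that get caught) is lower than precision.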
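For readers who want the two terms pinned down, here is a small worked example. The counts are invented for illustration and do not come from any benchmark.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Precision: share of flagged essays that really are AI-written.
    Recall: share of AI-written essays that were flagged."""
    flagged = true_positives + false_positives
    actually_ai = true_positives + false_negatives
    if flagged == 0 or actually_ai == 0:
        raise ValueError("need at least one flagged and one AI-written essay")
    return true_positives / flagged, true_positives / actually_ai

# Hypothetical class set: 100 AI-written essays, of which 70 are flagged,
# plus 7 human-written essays that are flagged by mistake.
precision, recall = precision_recall(
    true_positives=70, false_positives=7, false_negatives=30
)
print(f"precision: {precision:.1%}")
print(f"recall:    {recall:.1%}")
```

In this made-up scenario the detector looks strong on precision, yet seven students were wrongly flagged and thirty AI-written essays went through unnoticed.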
