Glossary

Recall.

Of the AI-generated essays in a batch, how many did the detector catch?

Recall = true positives / (true positives + false negatives). A detector with 80% recall catches 80% of AI-generated essays in a test set.
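As a minimal sketch of the formula above, the calculation can be written in a few lines of Python. The function name and the counts in the example are illustrative assumptions, not figures from any real detector.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of AI-generated essays that the detector flagged."""
    total_ai = true_positives + false_negatives
    if total_ai == 0:
        # With no AI-generated essays in the set, recall has no meaning.
        raise ValueError("recall is undefined without any AI-generated essays")
    return true_positives / total_ai


# Hypothetical test set: 200 AI-generated essays, 160 flagged, 40 missed.
example_recall = recall(true_positives=160, false_negatives=40)
print(f"recall = {example_recall:.2f}")  # recall = 0.80
```

Note that human-written essays do not appear in the calculation at all: recall says nothing about false flags, which is why it has to be read alongside precision.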

There's a tradeoff between recall and precision: lowering the score threshold at which an essay is flagged catches more AI-generated essays (higher recall) but also flags more honest essays (lower precision). Every deployment decision is a precision-recall tradeoff, and it should be made explicitly.
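The tradeoff can be seen by sweeping the threshold over a small set of scored essays. The sketch below assumes a detector that emits a score between 0 and 1 and flags any essay at or above the threshold; the scores and labels are made up for illustration.

```python
def precision_recall_at(threshold, scores, labels):
    """Return (precision, recall) when essays scoring >= threshold are flagged.

    labels: True for AI-generated, False for human-written.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, is_ai in pairs if s >= threshold and is_ai)
    fp = sum(1 for s, is_ai in pairs if s >= threshold and not is_ai)
    fn = sum(1 for s, is_ai in pairs if s < threshold and is_ai)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Hypothetical detector scores for ten essays (five AI-generated, five human).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.55, 0.40, 0.30, 0.20, 0.10]
labels = [True, True, True, False, True, False, True, False, False, False]

for threshold in (0.75, 0.50, 0.25):
    p, r = precision_recall_at(threshold, scores, labels)
    print(f"threshold={threshold}: precision={p:.3f}, recall={r:.3f}")
```

On this toy data, moving the threshold from 0.75 down to 0.25 raises recall from 0.6 to 1.0 while precision falls from 1.0 to 0.625.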

Recall from the institutional side

An institution running detection at scale wants to catch a meaningful fraction of AI-assisted submissions. Recall is the number that quantifies "meaningful." If recall is 60%, four out of every ten AI submissions get through. If recall is 90%, only one in ten does. The institution's tolerance for missed cases sets the operating threshold, which sets the recall, which in turn trades off against precision.

The recall-precision tradeoff in practice

For classroom integrity, most institutions tune toward higher precision: fewer false flags, at the cost of more missed AI. For initial editorial screening at academic journals, the choice runs the other way: higher recall, with a heavier review burden accepted because the cost of missing AI is damage to the publishing record. Both are defensible; the choice should be deliberate and disclosed. See /for-institutions/compliance for our institutional configuration options.

How recall interacts with precision and F1 score

Recall operates in tension with precision, which measures the proportion of flagged content that is genuinely AI-generated. A detector optimized purely for recall will flag nearly every document, including human writing, to ensure it catches all AI cases. This produces high recall but catastrophically low precision. Conversely, a system tuned for extreme precision may flag only the most obvious AI outputs, missing sophisticated cases and yielding poor recall. The F1 score attempts to reconcile this tradeoff by computing the harmonic mean of precision and recall, penalizing models that neglect either metric. In institutional settings, administrators must decide which matters more: catching every case of AI use (high recall) or ensuring every accusation is justified (high precision).
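A short sketch of the harmonic mean shows why F1 penalizes a detector that neglects either metric. The precision and recall values below are hypothetical: the first call stands in for a flag-everything detector on a batch where one essay in ten is AI-generated.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Flag-everything detector: perfect recall, but precision equals the
# share of AI-generated essays in the batch (assumed here to be 10%).
print(f"{f1_score(0.10, 1.00):.2f}")  # 0.18

# Balanced detector with the same precision and recall.
print(f"{f1_score(0.80, 0.80):.2f}")  # 0.80
```

An arithmetic mean would score the flag-everything detector at 0.55; the harmonic mean pulls it down toward the weaker of the two numbers.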

The relationship between recall and threshold settings reveals practical constraints. Lowering the confidence threshold required to flag a document as AI-written increases recall but admits more false positives, degrading precision. Research from 2024 demonstrated that many commercial detectors default to thresholds producing recall below 0.6 on paraphrased outputs, even while maintaining precision above 0.9 on unmodified GPT-4 text. This asymmetry matters because students aware of detection systems often employ paraphrasing tools, synonym replacement, or iterative editing. Institutions relying on recall figures measured only on raw model outputs may severely overestimate their detection systems' effectiveness in adversarial classroom conditions.

Practical implications for institutions and educators

Educational institutions purchasing AI detection services often receive vendor-reported recall statistics without understanding their measurement conditions. A recall of 0.95 on a benchmark dataset of unedited ChatGPT essays does not translate to 0.95 recall on student submissions that blend AI drafts with human revision, citation insertion, or strategic paraphrasing. Procurement officers and academic integrity committees should demand recall figures stratified by evasion technique, model version, and subject domain. A detector with 0.9 recall on science writing may achieve only 0.5 recall on creative essays or code documentation, yet vendors rarely disclose domain-specific performance breakdowns.
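Stratified recall is straightforward to compute once each evaluated essay carries a label for the dimension of interest. The sketch below assumes a hypothetical record format of (group, is_ai, flagged) and uses made-up counts that mirror the 0.9 versus 0.5 example above; the same function would work for groups defined by evasion technique or model version.

```python
from collections import defaultdict


def recall_by_group(records):
    """Compute recall separately for each group.

    records: iterable of (group, is_ai, flagged) tuples.
    Human-written essays are skipped, since recall only counts AI-generated ones.
    """
    caught = defaultdict(int)
    total = defaultdict(int)
    for group, is_ai, flagged in records:
        if not is_ai:
            continue
        total[group] += 1
        if flagged:
            caught[group] += 1
    return {group: caught[group] / total[group] for group in total}


# Hypothetical evaluation set: 10 AI-generated essays per domain.
records = (
    [("science", True, True)] * 9
    + [("science", True, False)] * 1
    + [("creative", True, True)] * 5
    + [("creative", True, False)] * 5
    + [("science", False, False)] * 3  # human essays, ignored for recall
)

print(recall_by_group(records))  # {'science': 0.9, 'creative': 0.5}
```

Pooled across both domains, this detector would report a recall of 0.7 (14 of 20), a single figure that hides the much weaker performance on creative essays.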

Teachers interpreting individual detection reports must recognize that system-level recall does not predict case-level reliability. If a detector has 0.8 recall across thousands of essays, it misses approximately one in five AI-generated submissions, but educators cannot know which specific flagged or unflagged document falls into the error category. This uncertainty demands corroborating evidence before academic sanctions. Practical protocols now recommend using detection tools as screening mechanisms that trigger deeper investigation, including student interviews and draft history review, rather than as standalone adjudication instruments. The 2025 joint statement from educational assessment organizations explicitly cautioned against disciplinary action based solely on recall-limited detection systems.
