Glossary
Precision.
When the detector flags an essay, how often is the flag right?
Precision
Precision = true positives / (true positives + false positives). A detector with 95% precision means that 95% of flagged essays really were AI-generated; 5% were wrongly flagged humans.
Precision is the metric that matters most for the teacher side of a flag. For the student side, false-positive rate is the more intuitive metric. Both describe the same error, just normalized differently.
Precision from the teacher's side
When a detector flags an essay and the teacher opens the report, the question they're really asking is: "of the essays this tool has flagged for me, how many turn out to be real cases?" That's precision. A 95% precision detector wastes the teacher's time on one out of every twenty flags. A 70% precision detector wastes it on six out of twenty, and after enough wasted reviews, the teacher stops trusting the tool altogether.
Why precision and recall trade off
Move the threshold up: fewer essays get flagged, but the ones that are flagged are more likely to be real AI. Precision rises, recall falls. Move the threshold down: more flags, but more of them wrong. Every deployment makes this tradeoff explicitly or implicitly. Reputable detectors expose the threshold as a configurable so institutions can tune to their tolerance, and ours documents the recommended operating points on /methodology.
How Precision Interacts with Related Metrics
Precision exists in constant tension with recall, the complementary metric that measures what proportion of actual AI-generated text a detector successfully identifies. A detector optimized purely for precision may flag only the most obvious AI patterns, achieving 95% precision but missing 70% of AI content (30% recall). Conversely, a system tuned for high recall will cast a wider net, catching more AI text but also incorrectly flagging more human writing. The F1 score attempts to balance these trade-offs by computing the harmonic mean of precision and recall, though institutional needs often require prioritizing one metric over the other.
Accuracy, though superficially similar, measures overall correctness across both positive and negative predictions. In imbalanced datasets where human writing vastly outnumbers AI submissions, a detector could achieve 90% accuracy simply by labeling most documents as human, even with poor precision on the AI class. This explains why precision and recall offer more actionable insight for AI detection systems than raw accuracy figures. Practitioners evaluating detection tools should request class-specific precision and recall values rather than relying on aggregated accuracy percentages that obscure performance on the minority class.
Edge Cases and Known Limits
Precision metrics degrade substantially when detection systems encounter adversarial techniques or hybrid authorship. Students who lightly edit AI-generated text, paraphrase model output through secondary tools, or blend their own sentences with AI paragraphs create documents that fall into ambiguous boundary regions. A detector might classify such hybrid work as AI-generated based on statistical patterns, technically counting as a true positive if any AI assistance occurred, yet the precision metric fails to capture the degree of human contribution. This binary classification framework proves inadequate for real-world academic scenarios where partial AI use represents the more common violation.
Sample size constraints further complicate precision interpretation. A detector reporting 100% precision after evaluating only 20 flagged documents provides far less confidence than one maintaining 87% precision across 500 flagged cases. Small denominators in the precision formula amplify the impact of individual misclassifications and prevent reliable comparison between tools tested on different corpus sizes. Institutions should require vendors to report precision alongside the number of positive predictions and employ confidence intervals when sample sizes fall below statistical significance thresholds, typically around 100 instances per class for meaningful metric stability.