Glossary
Accuracy.
The share of predictions a detector gets right. On its own it hides bias, so pair it with precision and recall.
Accuracy is (true positives + true negatives) / all predictions. It is the single most-quoted detector metric and usually the least informative. A detector that classifies everything as "human" would score 90% accuracy on a dataset that's 90% human, while catching no AI-generated text at all.
When a vendor quotes a single accuracy number without class breakdowns, ask for precision and recall (or F1) on both classes. Those are the numbers that tell you how the detector behaves on the cases you actually care about.
A worked example
Imagine a benchmark of 1,000 essays where 200 are AI-generated and 800 are human. A detector that simply labels everything "human" achieves 80% accuracy without catching a single AI essay. A detector that labels everything "AI" achieves 20% accuracy. Both are useless, but accuracy alone won't tell you that. Class-stratified precision and recall will: the all-human classifier scores 0% recall on AI; the all-AI classifier catches every AI essay but scores only 20% precision on AI and 0% recall on human essays.
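The arithmetic behind this example can be checked with a short sketch. The `Confusion` class and its method names are illustrative assumptions, not part of any detector's API; "AI" is treated as the positive class.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Confusion:
    """Confusion-matrix counts for a binary detector ("AI" = positive class)."""
    tp: int  # AI essays labeled AI
    fp: int  # human essays labeled AI
    fn: int  # AI essays labeled human
    tn: int  # human essays labeled human

    def accuracy(self) -> float:
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)

    def precision_ai(self) -> Optional[float]:
        flagged = self.tp + self.fp
        # Precision is undefined when the detector never flags anything.
        return self.tp / flagged if flagged else None

    def recall_ai(self) -> Optional[float]:
        actual_ai = self.tp + self.fn
        return self.tp / actual_ai if actual_ai else None

    def recall_human(self) -> Optional[float]:
        actual_human = self.tn + self.fp
        return self.tn / actual_human if actual_human else None


# Benchmark from the example: 200 AI essays, 800 human essays.
all_human = Confusion(tp=0, fp=0, fn=200, tn=800)  # labels everything "human"
all_ai = Confusion(tp=200, fp=800, fn=0, tn=0)     # labels everything "AI"

for name, cm in [("all-human", all_human), ("all-AI", all_ai)]:
    print(name, cm.accuracy(), cm.precision_ai(), cm.recall_ai(), cm.recall_human())
```

The all-human classifier never flags anything, so its precision on AI is undefined rather than zero; its 0% recall on AI is what exposes it. The all-AI classifier is exposed by its 0% recall on human essays.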
How it shows up in detection
When you see "94% accuracy" in a detector ad, ask three questions: on what test set, with what class balance, and at what threshold. Different choices yield different accuracy numbers from the same model. Reputable detector benchmarks publish a confusion matrix or, at minimum, paired precision-recall figures. We publish ours quarterly on /transparency with stratified breakdowns by writer population.
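To see why the threshold question matters, here is a minimal sketch that scores one fixed set of outputs at three thresholds. The scores and labels are invented purely for illustration, and `accuracy_at` is a hypothetical helper, not a function from any detector.

```python
# Hypothetical detector outputs: (score for "AI", true label). Illustrative data only.
scored = [
    (0.95, "ai"), (0.85, "ai"), (0.70, "human"), (0.65, "ai"), (0.45, "human"),
    (0.40, "ai"), (0.30, "human"), (0.25, "human"), (0.15, "human"), (0.05, "human"),
]


def accuracy_at(threshold, items):
    """Accuracy when every score >= threshold is labeled "AI"."""
    correct = sum((score >= threshold) == (label == "ai") for score, label in items)
    return correct / len(items)


for threshold in (0.2, 0.5, 0.9):
    print(f"threshold={threshold}: accuracy={accuracy_at(threshold, scored):.0%}")
```

On this toy data the same model reports 60%, 80%, or 70% accuracy depending only on where the cut-off sits, which is why a quoted figure means little without the threshold and test set behind it.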
Where accuracy is most often misunderstood
The most common misconception about accuracy in AI detection arises from class imbalance. When human-written essays vastly outnumber AI-generated ones in a dataset (for example, a 95:5 ratio), a model that labels every essay as human-written achieves 95% accuracy while providing zero useful detection capability. This phenomenon misleads institutions into deploying systems that appear highly accurate on paper but fail to identify the minority class they were designed to catch.
Another frequent misunderstanding involves conflating accuracy with reliability across different writing contexts. A detector may achieve 92% accuracy on formal academic essays but drop to 68% on creative writing or technical documentation. Vendors sometimes report only the best-case scenario without specifying the corpus composition, leading educators to expect consistent performance across all assignment types. The metric treats all errors as equivalent, obscuring whether the system fails primarily on borderline cases or makes glaring mistakes on obviously human or AI-generated work.
How accuracy interacts with precision and recall
Accuracy must be interpreted alongside precision and recall to form a complete picture of detector performance. Precision answers whether flagged essays are truly AI-generated, while recall measures what proportion of actual AI essays the system catches. A detector optimized solely for accuracy might achieve 88% by confidently identifying obvious cases while ignoring ambiguous submissions, resulting in high precision but low recall. Conversely, flagging any essay with repetitive phrasing could boost recall but devastate precision, even as accuracy remains superficially acceptable due to the true negative count.
The F1 score, the harmonic mean of precision and recall, often reveals weaknesses that accuracy masks. In academic integrity workflows, the cost asymmetry between false positives (wrongly accusing a student) and false negatives (missing AI use) means institutions should prioritize different metrics depending on their tolerance for each error type. A system with 85% accuracy might combine 78% precision with 62% recall (an F1 of roughly 0.69), making it unsuitable for high-stakes decisions despite an apparently solid accuracy figure. Evaluating all three metrics together exposes whether a detector achieves balance or sacrifices one dimension for another.
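The F1 figure for the 78% precision, 62% recall case can be computed directly. The sketch below also includes the standard F-beta generalization, which this entry does not name but which is one common way to encode the cost asymmetry: beta below 1 weights precision more heavily, beta above 1 weights recall more heavily. The function name `f_beta` is an illustrative choice.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives F1, the harmonic mean of precision and recall."""
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0  # convention when both precision and recall are zero
    return (1 + beta**2) * precision * recall / denominator


precision, recall = 0.78, 0.62  # figures from the example above

f1 = f_beta(precision, recall)             # about 0.69
f_half = f_beta(precision, recall, 0.5)    # leans toward precision (fewer false accusations)
f_two = f_beta(precision, recall, 2.0)     # leans toward recall (fewer missed AI essays)

print(f"F1={f1:.3f}  F0.5={f_half:.3f}  F2={f_two:.3f}")
```

An institution most worried about wrongly accusing students might compare detectors on F0.5, while one most worried about missed AI use might prefer F2; which weighting is appropriate is a policy decision, not something the metric decides.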