Glossary
F1 score.
The harmonic mean of precision and recall, a single number that balances both.
F1 = 2 × (precision × recall) / (precision + recall). It's what you report when you want one number that punishes both false positives and false negatives.
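As a minimal sketch of the formula above, the following Python computes precision, recall, and F1 from confusion-matrix counts. The function name and the example counts are illustrative assumptions, not figures from any benchmark, and the sketch assumes at least one true positive so that no denominator is zero.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, F1) for the positive ("AI") class."""
    precision = tp / (tp + fp)  # share of flagged essays that really are AI
    recall = tp / (tp + fn)     # share of AI essays that were flagged
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 AI essays caught, 10 human essays flagged, 20 AI essays missed.
precision, recall, f1 = precision_recall_f1(tp=80, fp=10, fn=20)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
# precision=0.889 recall=0.800 F1=0.842
```

The same value falls out of the count form 2·TP / (2·TP + FP + FN), which makes it easier to see that true negatives never enter the score.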
AI-detection benchmarks usually report F1 per class (F1 for "AI" predictions, F1 for "human" predictions), and the macro-average across classes. The macro-F1 on a balanced test set is the cleanest single-number summary.
When F1 is the right summary
F1 is the right number to optimize when false positives and false negatives have similar costs and you need one number to compare detectors. In AI detection for classroom use, the costs are not symmetric: a false positive against a student is more damaging than a false negative that lets AI-written work through. F1 is therefore necessary but not sufficient. Pair F1 with precision (which is what teachers actually feel) and the false-positive rate (which is what students feel).
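To illustrate why F1 needs company, here is a small sketch that reports F1 alongside precision and the false-positive rate for two hypothetical detectors. The detector names and confusion-matrix counts are invented for illustration, and `summarize` is an assumed helper, not part of any library.

```python
def summarize(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Report F1 together with the two numbers teachers and students feel."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "f1": 2 * precision * recall / (precision + recall),
        "precision": precision,
        "false_positive_rate": fp / (fp + tn),  # share of human essays flagged
    }


# Hypothetical results on 100 AI essays and 100 human essays.
detector_a = summarize(tp=90, fp=30, fn=10, tn=70)
detector_b = summarize(tp=70, fp=5, fn=30, tn=95)

for name, stats in (("A", detector_a), ("B", detector_b)):
    print(name, {key: round(value, 3) for key, value in stats.items()})
```

Under these assumed counts the two detectors land within about 0.02 of each other on F1, yet detector A flags 30 percent of human essays while detector B flags 5 percent.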
Macro vs. micro F1
Macro-F1 averages the per-class F1 scores with equal weight; micro-F1 pools true positives, false positives, and false negatives across classes before computing the score, so frequent classes dominate. For balanced AI-vs-human test sets the two are similar. For imbalanced sets (most real-world classroom data is 90%+ human), macro-F1 is the more honest summary because micro-F1 collapses toward the human-class score and hides AI-detection performance.
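The gap between the two averages is easiest to see with numbers. The sketch below assumes a hypothetical test set of 900 human and 100 AI essays and a detector that catches half of the AI essays while flagging 45 human ones; all counts and names are made up for illustration.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 in count form: 2*TP / (2*TP + FP + FN)."""
    return 2 * tp / (2 * tp + fp + fn)


# Per-class counts. A false positive for one class is a false negative for the other.
counts = {
    "ai": {"tp": 50, "fp": 45, "fn": 50},
    "human": {"tp": 855, "fp": 50, "fn": 45},
}

per_class = {label: f1_from_counts(**c) for label, c in counts.items()}

# Macro: plain average of the per-class scores.
macro_f1 = sum(per_class.values()) / len(per_class)

# Micro: pool the counts across classes first, then compute one score.
pooled = {key: sum(c[key] for c in counts.values()) for key in ("tp", "fp", "fn")}
micro_f1 = f1_from_counts(**pooled)

print({label: round(score, 3) for label, score in per_class.items()})
print(f"macro-F1={macro_f1:.3f} micro-F1={micro_f1:.3f}")
```

With these assumed counts, micro-F1 comes out at 0.905 even though the AI-class F1 is only about 0.51; macro-F1, at roughly 0.73, keeps the weak AI-class score visible.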
Where F1 score is most often misunderstood
A common misconception treats F1 score as a measure of overall accuracy. The standard binary F1 is computed for the positive class only and ignores true negatives. In AI essay detection, this means F1 quantifies how well a system identifies AI-generated text: wrongly flagged human essays lower it through precision, but correctly labeled human essays never enter the calculation. A detector could flag every third human essay as AI-generated yet still achieve a respectable F1 score if it also catches most actual AI content. Users who assume a 0.85 F1 score guarantees 85 percent correctness across all predictions will be surprised when false positives remain high.
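A quick calculation makes the point concrete. The sketch below assumes a balanced, hypothetical test set of 100 AI and 100 human essays in which the detector catches 90 AI essays and flags 33 human ones (roughly every third); the counts are invented for illustration.

```python
# Hypothetical confusion-matrix counts for the positive ("AI") class.
tp, fp, fn, tn = 90, 33, 10, 67

f1 = 2 * tp / (2 * tp + fp + fn)            # true negatives do not appear here
accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all essays labeled correctly
human_flag_rate = fp / (fp + tn)            # share of human essays flagged as AI

print(f"F1={f1:.3f} accuracy={accuracy:.3f} human essays flagged={human_flag_rate:.0%}")
```

Under these assumptions F1 is about 0.81 while a third of human writers are wrongly flagged, and the F1 value does not match the overall accuracy of 0.785.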
Another source of confusion arises when comparing F1 scores across datasets with different class distributions. An F1 of 0.80 on a balanced test set (50 percent AI, 50 percent human) carries different real-world implications than the same score on a dataset with 10 percent AI content. Precision, and therefore F1, shifts with prevalence even when the detector itself is unchanged, so practitioners evaluating detection tools must verify that benchmark datasets reflect the actual ratio of AI to human submissions they encounter. Relying on F1 alone without examining precision and recall separately can mask whether a system errs toward over-flagging or under-flagging content.
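The effect of class distribution can be sketched by holding a detector's behavior fixed and changing only the share of AI essays in the test set. The recall and false-positive rate below (0.9 and 0.1) are assumed values chosen for illustration, and `f1_at_prevalence` is a hypothetical helper.

```python
def f1_at_prevalence(prevalence: float, recall: float, fpr: float) -> float:
    """Expected F1 for a detector with fixed recall and false-positive rate.

    Counts are expressed as fractions of the whole test set.
    """
    tp = prevalence * recall
    fn = prevalence * (1 - recall)
    fp = (1 - prevalence) * fpr
    return 2 * tp / (2 * tp + fp + fn)


for prevalence in (0.5, 0.1):
    score = f1_at_prevalence(prevalence, recall=0.9, fpr=0.1)
    print(f"AI share={prevalence:.0%} F1={score:.3f}")
```

The same assumed detector scores 0.90 on the balanced set and about 0.64 when only 10 percent of essays are AI-written, because the false positives drawn from the larger human pool pull precision down.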
Edge cases and known limits
F1 score becomes undefined when its denominator is zero. Precision is undefined if a classifier produces no positive predictions at all, recall is undefined if the test set contains no positive examples, and the formula reduces to 0/0 when there are no true positives, false positives, or false negatives. In AI essay detection, this edge case can surface during cross-validation folds where random splits yield subsets with only human-written samples. Automated evaluation pipelines must handle division-by-zero errors gracefully, typically by returning a score of zero or skipping that fold. Some research teams exclude folds with fewer than five positive instances to prevent artificially volatile metrics.
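One way to handle the zero-denominator case is sketched below. The `safe_f1` helper, its `zero_division` parameter, and the fold counts are assumptions made for illustration; the two policies shown correspond to returning zero and skipping the fold.

```python
from typing import Optional


def safe_f1(tp: int, fp: int, fn: int,
            zero_division: Optional[float] = 0.0) -> Optional[float]:
    """F1 in count form, returning `zero_division` when the score is undefined."""
    denominator = 2 * tp + fp + fn
    if denominator == 0:
        # No positive examples and no positive predictions in this fold.
        return zero_division
    return 2 * tp / denominator


# Hypothetical (tp, fp, fn) counts per fold; the second fold has only human essays
# and the detector flagged none of them.
folds = [(40, 5, 10), (0, 0, 0), (35, 10, 15)]

# Policy 1: score the undefined fold as zero.
scores_zero = [safe_f1(*fold) for fold in folds]
mean_zero = sum(scores_zero) / len(scores_zero)

# Policy 2: skip the undefined fold.
scores_skip = [s for fold in folds
               if (s := safe_f1(*fold, zero_division=None)) is not None]
mean_skip = sum(scores_skip) / len(scores_skip)

print(f"mean with zeros={mean_zero:.3f} mean with skips={mean_skip:.3f}")
```

The two policies give noticeably different averages on the same folds, so whichever one a pipeline uses should be stated next to the reported score.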
The harmonic mean structure of F1 also means the metric penalizes large disparities between precision and recall more severely than arithmetic averaging would. A detector with 0.95 precision and 0.50 recall yields an F1 of only 0.66, even though the arithmetic mean of those components is 0.725. Even so, F1 weights precision and recall equally, which makes it ill-suited for applications where one error type costs far more than the other. Academic integrity offices that face severe consequences for false accusations may prefer to monitor precision independently rather than accept the equal weighting implicit in F1, while institutions focused on catching every instance of AI misuse might track recall as the primary metric.
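The comparison between the two means can be checked with Python's standard `statistics` module, using the precision and recall values from the paragraph above. This is only a sketch of the arithmetic, not an evaluation routine.

```python
from statistics import harmonic_mean, mean

precision, recall = 0.95, 0.50

# F1 is by definition the harmonic mean of precision and recall.
f1 = harmonic_mean([precision, recall])
arithmetic = mean([precision, recall])

print(f"F1 (harmonic mean) = {f1:.3f}")
print(f"arithmetic mean    = {arithmetic:.3f}")
```

The harmonic mean sits closer to the smaller of the two inputs, which is why the low recall drags F1 to roughly 0.66 while the arithmetic mean stays at 0.725.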