Open benchmarks
We report numbers on public benchmarks (HC3-academic, GPTZero-bench-2024) alongside our internal set so third parties can verify.
Transparency
The honest version. Our model, our benchmark datasets, our stratified false-positive rates, and the cases where detection should not be used. Updated every quarter.
The category of AI detection has earned a reputation for opaque vendors making precision claims they can't substantiate. We aim to be the opposite: this page is the substantiation.
The detector takes a passage of text and returns a probability that it was generated by a large language model. Internally it combines three signals:
The three outputs are combined by a learned gating layer. The gating weights are audited quarterly against the benchmarks below.
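As a rough illustration of what a gating layer does, the sketch below blends three per-signal probabilities with softmax-normalised weights. It is a minimal sketch under stated assumptions, not the production model: the names `softmax` and `gated_probability` are hypothetical, the gate scores are passed in as fixed numbers rather than produced by a trained network, and the three signals are left unnamed.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_probability(signal_probs, gate_logits):
    """Blend per-signal probabilities using softmax gate weights.

    signal_probs: one probability in [0, 1] per signal (hypothetical inputs).
    gate_logits:  one unnormalised gate score per signal. In a learned gate
                  these would come from a trained layer; here they are
                  supplied directly, which is an assumption of this sketch.
    """
    if len(signal_probs) != len(gate_logits):
        raise ValueError("one gate logit is needed per signal")
    weights = softmax(gate_logits)
    # A convex combination of probabilities is itself a probability.
    return sum(w * p for w, p in zip(weights, signal_probs))


if __name__ == "__main__":
    # Equal gate scores reduce to a plain average of the three signals.
    print(gated_probability([0.9, 0.5, 0.1], [0.0, 0.0, 0.0]))
```

Because the weights are non-negative and sum to one, the combined score always stays between the lowest and highest individual signal, which is what makes a periodic audit of the weights meaningful.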
We publish performance on the following benchmarks. Each is a paired AI/human test set. We do not train on any of them; they are held out.
On the combined benchmark, at the default threshold:
Our headline FPR of 1.8% is the average across all populations in the benchmark. Real-world FPR varies by writer. We report stratified FPR because this is the number that matters to students in known risk groups.
These numbers are higher than our headline. That is the point of publishing them.
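For readers who want to reproduce a stratified false-positive rate on their own labelled data, the idea can be sketched as follows. This is an illustrative sketch only: the function name `stratified_fpr`, the record layout, and the `0.5` threshold are assumptions (this page does not state the default threshold's value), and the group labels in the example are placeholders.

```python
from collections import defaultdict


def stratified_fpr(records, threshold=0.5):
    """Return {group: false-positive rate}, computed on human-written text.

    records:   iterable of (group, is_ai, score) tuples, where is_ai is the
               ground-truth label and score is the detector's probability.
    threshold: score at or above which a passage is flagged as AI. The 0.5
               default is an assumption of this sketch.
    """
    flagged = defaultdict(int)
    humans = defaultdict(int)
    for group, is_ai, score in records:
        if is_ai:
            continue  # FPR is defined over human-written samples only
        humans[group] += 1
        if score >= threshold:
            flagged[group] += 1
    return {group: flagged[group] / humans[group] for group in humans}


if __name__ == "__main__":
    sample = [
        ("group_a", False, 0.91),  # human text wrongly flagged
        ("group_a", False, 0.12),
        ("group_b", False, 0.20),
        ("group_b", True, 0.99),   # AI text, ignored for FPR
    ]
    print(stratified_fpr(sample))
```

A pooled rate over all human samples can hide a much higher rate in a small group, which is why the per-group figure is computed separately rather than derived from the average.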
User submissions are never used to train the detector. Training data is licensed (from data partners and public-benchmark datasets) and disjoint from user-submitted text. At the institutional tier, we will sign a contractual prohibition on training with customer data.
This page is reviewed every quarter. Performance numbers are regenerated against the held-out benchmarks and the stratified FPR is recomputed. Major model releases (GPT-5 class, Claude 5 class, Gemini 3 class) trigger an out-of-cycle update.
Institutional customers: the full methodology pack is shared under NDA during procurement. Researchers: we respond to good-faith research requests; email hello@aiessaydetector.ai with the research question and intended use.
Known risk groups (non-native English, short passages, autistic writers) get their own reported false-positive rate.
Performance regenerated against held-out benchmarks every quarter. Major model releases trigger out-of-cycle updates.
| Benchmark | Accuracy | F1 (macro) | AUC | FPR |
|---|---|---|---|---|
| HC3-academic | 93.8% | 0.931 | 0.969 | 2.1% |
| GPTZero-bench-2024 | 92.4% | 0.918 | 0.958 | 2.4% |
| aiessaydetector-internal | 95.6% | 0.952 | 0.982 | 1.2% |
| multilingual-100 | 89.1% | 0.884 | 0.941 | 3.9% |
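
The Accuracy, F1 (macro), and FPR columns in a table like this one can be recomputed from ground-truth labels and thresholded predictions. The sketch below shows one conventional way to do so, assuming binary labels where `1` means AI-generated and `0` means human-written; the function names are hypothetical, and AUC is left out because it needs raw scores rather than thresholded predictions.

```python
def confusion_counts(y_true, y_pred):
    """Count (tp, fp, tn, fn) with AI-generated (1) as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth and pred:
            tp += 1
        elif not truth and pred:
            fp += 1
        elif not truth and not pred:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def f1_score(tp, fp, fn):
    """F1 for one class; returns 0.0 when the class never appears."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def summarize(y_true, y_pred):
    """Return accuracy, macro-averaged F1, and false-positive rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    if total == 0:
        raise ValueError("no samples to evaluate")
    f1_ai = f1_score(tp, fp, fn)
    # Treating human as the positive class swaps the roles of the counts.
    f1_human = f1_score(tn, fn, fp)
    return {
        "accuracy": (tp + tn) / total,
        "f1_macro": (f1_ai + f1_human) / 2,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
    }


if __name__ == "__main__":
    labels = [1, 1, 1, 0, 0, 0, 0, 0]
    preds = [1, 1, 0, 0, 0, 0, 0, 1]
    print(summarize(labels, preds))
```

Macro F1 averages the per-class scores with equal weight, so it penalises a detector that does well on one class at the expense of the other, even when the test set is imbalanced.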