
Transparency

How the detector works, and how it fails.

The honest version. Our model, our benchmark datasets, our stratified false-positive rates, and the cases where detection should not be used. Updated every quarter.

Last updated 2026-04-01

AI detection has earned a reputation for opaque vendors making precision claims they can't substantiate. We aim to be the opposite: this page is the substantiation.

1. What the detector actually does

The detector takes a passage of text and returns a probability that it was generated by a large language model. Internally it combines three signals:

  1. Statistical features. Perplexity (under a reference language model), burstiness (sentence-length variance), punctuation density, and n-gram frequency patterns.
  2. Learned embeddings. A fine-tuned classifier operating over sentence embeddings from a multilingual encoder.
  3. Model-family fingerprinting. Separate classifiers for the major model families (GPT, Claude, Gemini) that identify text patterns specific to each.
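
As a rough illustration of the statistical features in item 1, the sketch below computes burstiness and punctuation density for a passage. It is a minimal sketch, not the production feature extractor: the function names, the naive sentence splitter, and the ASCII-only punctuation set are all assumptions made for the example. Perplexity is left out because it needs a reference language model.

```python
import re
import statistics
import string


def split_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def burstiness(text: str) -> float:
    """Population variance of sentence lengths, measured in words."""
    lengths = [len(s.split()) for s in split_sentences(text)]
    if len(lengths) < 2:
        return 0.0  # variance is not informative for fewer than two sentences
    return float(statistics.pvariance(lengths))


def punctuation_density(text: str) -> float:
    """Share of non-whitespace characters that are ASCII punctuation."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    punct = sum(1 for c in chars if c in string.punctuation)
    return punct / len(chars)
```

Uniform sentence lengths give a burstiness near zero; a mix of very short and very long sentences gives a larger value.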

The three outputs are combined by a learned gating layer. The gating weights are audited quarterly against the benchmarks below.
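
One simple way to picture the combination step is a softmax-weighted average of the three per-signal probabilities. This is a sketch under assumptions: the production gating layer is learned and may depend on the input, whereas here the gate logits are fixed placeholder values, and the names `softmax` and `gated_score` are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Convert arbitrary logits into non-negative weights that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_score(signal_probs: list[float], gate_logits: list[float]) -> float:
    """Convex combination of per-signal probabilities (illustrative only)."""
    if len(signal_probs) != len(gate_logits):
        raise ValueError("need one gate logit per signal")
    weights = softmax(gate_logits)
    return sum(w * p for w, p in zip(weights, signal_probs))


# Placeholder inputs: statistical, embedding, and fingerprint outputs.
score = gated_score([0.2, 0.5, 0.8], [0.0, 0.0, 0.0])
```

Because the weights are non-negative and sum to 1, the combined score always stays between the lowest and highest individual signal.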

2. Benchmark datasets

We publish performance on the following benchmarks. Each is a paired AI/human test set. All are held out: we do not train on them.

  • HC3-academic: 3,000 essay-length human/AI pairs across six undergraduate disciplines. Public benchmark.
  • GPTZero-bench-2024: 1,200 mixed human-and-AI academic essays. Public benchmark.
  • aiessaydetector-internal: 5,000 essay-length pairs we collected through paid-consent data partnerships with community colleges. Licensed, not scraped. Not public.
  • multilingual-100: 100 essays per language across five non-English languages, with native-speaker matched AI pairs. Used for stratified FPR reporting.

3. Performance (last updated 2026-Q1)

On the combined benchmark, at the default threshold:

  • Accuracy: 94.1%
  • Precision (AI class): 93.2%
  • Recall (AI class): 91.7%
  • F1 (macro-avg): 0.935
  • AUC: 0.971
  • False positive rate (overall): 1.8%
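
For readers who want to see how these metrics relate to each other, the sketch below derives them from a confusion matrix using the standard definitions, with "AI" as the positive class. The counts in the usage line are invented for illustration and are not our benchmark data; the names `Confusion` and `summarize` are hypothetical. AUC is omitted because it needs the raw scores rather than thresholded counts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # AI-written text flagged as AI
    fp: int  # human-written text flagged as AI
    tn: int  # human-written text passed as human
    fn: int  # AI-written text passed as human


def ratio(num: float, den: float) -> float:
    """Safe division: returns 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return ratio(2 * precision * recall, precision + recall)


def summarize(c: Confusion) -> dict[str, float]:
    precision_ai = ratio(c.tp, c.tp + c.fp)
    recall_ai = ratio(c.tp, c.tp + c.fn)
    precision_human = ratio(c.tn, c.tn + c.fn)
    recall_human = ratio(c.tn, c.tn + c.fp)
    return {
        "accuracy": ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn),
        "precision_ai": precision_ai,
        "recall_ai": recall_ai,
        # Macro-average: unweighted mean of the per-class F1 scores.
        "f1_macro": (f1(precision_ai, recall_ai)
                     + f1(precision_human, recall_human)) / 2,
        # FPR: share of human-written texts wrongly flagged as AI.
        "fpr": ratio(c.fp, c.fp + c.tn),
    }


# Illustrative counts only (assumption), not benchmark results.
example = summarize(Confusion(tp=90, fp=5, tn=95, fn=10))
```

Note that the false positive rate depends only on the human-written texts, which is why it can differ sharply between writer populations even when accuracy looks stable.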

Stratified false-positive rate

Our headline FPR of 1.8% is the average across all populations in the benchmark. Real-world FPR varies by writer. We report stratified FPR because this is the number that matters to students in known risk groups.

  • Native English writers, essays ≥ 400 words: 1.1%
  • Non-native English writers, essays ≥ 400 words: 4.7%
  • Essays under 300 words (any writer): 6.2%
  • Highly formal academic writing (edited ≥ 3 times): 3.4%
  • Writers who self-identified as autistic: 3.8%

Every group except native English writers of longer essays sees a rate above our headline. That is the point of publishing them.
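
To make the stratified rates concrete, the sketch below turns a false-positive rate into what an instructor might see in one class. The rates come from the list above; the class size of 30 and the assumption that essays are flagged independently of each other are illustrative only, and the function names are hypothetical.

```python
# Stratified false-positive rates from the list above, as fractions.
STRATIFIED_FPR = {
    "native_english_400_plus": 0.011,
    "non_native_english_400_plus": 0.047,
    "under_300_words": 0.062,
    "highly_formal_academic": 0.034,
    "self_identified_autistic": 0.038,
}


def expected_false_flags(fpr: float, n_human_essays: int) -> float:
    """Expected number of human-written essays wrongly flagged."""
    return fpr * n_human_essays


def prob_at_least_one_false_flag(fpr: float, n_human_essays: int) -> float:
    """Chance that at least one human-written essay is flagged.

    Assumes each essay is flagged independently (a simplification).
    """
    return 1.0 - (1.0 - fpr) ** n_human_essays


# Hypothetical class of 30 human-written essays by non-native English writers.
rate = STRATIFIED_FPR["non_native_english_400_plus"]
flags = expected_false_flags(rate, 30)
chance = prob_at_least_one_false_flag(rate, 30)
```

Under these assumptions, a rate of a few percent per essay translates into a substantial chance of at least one false flag per class, which is why a detector score should not stand as evidence on its own.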

4. What the detector should not be used for

  • As the sole evidence in an academic-integrity proceeding.
  • On passages under 100 words for emails, or under 300 words for essays. The confidence bands are too wide.
  • On non-English text without the language-appropriate classifier enabled.
  • On machine-translated text. Translation artifacts interact with detection features.
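
An integrating system could enforce the length, language, and translation limits above before it requests a score. The sketch below is one possible pre-check, assuming whitespace-separated word counts; the function name, parameters, and return shape are hypothetical and not part of our API. The first limit, about sole evidence, is a matter of process and cannot be checked in code.

```python
# Minimum lengths taken from the list above.
MIN_WORDS = {"email": 100, "essay": 300}


def precheck(
    text: str,
    kind: str,
    *,
    is_english: bool = True,
    language_classifier_enabled: bool = False,
    machine_translated: bool = False,
) -> tuple[bool, str]:
    """Return (ok, reason); ok is False when detection should not be used."""
    if kind not in MIN_WORDS:
        raise ValueError(f"unknown kind: {kind!r}")
    n_words = len(text.split())  # whitespace word count (assumption)
    if n_words < MIN_WORDS[kind]:
        return False, f"too short: {n_words} words, need {MIN_WORDS[kind]}"
    if not is_english and not language_classifier_enabled:
        return False, "non-English text without a language-appropriate classifier"
    if machine_translated:
        return False, "machine-translated text"
    return True, "ok"
```

Returning a reason alongside the decision lets the calling system tell the user why no score was produced instead of silently skipping the passage.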

5. What we do not train on

User submissions are never used to train the detector. Training data is licensed (from data partners and public-benchmark datasets) and disjoint from user-submitted text. At the institutional tier, we will sign a contractual prohibition on training with customer data.

6. Update cadence

This page is reviewed every quarter. Performance numbers are regenerated against the held-out benchmarks and the stratified FPR is recomputed. Major model releases (GPT-5 class, Claude 5 class, Gemini 3 class) trigger an out-of-cycle update.

7. How to ask for more

Institutional customers: the full methodology pack is shared under NDA during procurement. Researchers: we respond to good-faith research requests; email hello@aiessaydetector.ai with the research question and intended use.

At a glance

Open benchmarks

We report numbers on public benchmarks (HC3-academic, GPTZero-bench-2024) alongside our internal set so third parties can verify.

Stratified FPR

Known risk groups (non-native English, short passages, autistic writers) get their own reported false-positive rate.

Quarterly review

Performance regenerated against held-out benchmarks every quarter. Major model releases trigger out-of-cycle updates.

Q1 2026 benchmark performance (default threshold).
Benchmark                 Accuracy  F1 (macro)  AUC    FPR
HC3-academic              93.8%     0.931       0.969  2.1%
GPTZero-bench-2024        92.4%     0.918       0.958  2.4%
aiessaydetector-internal  95.6%     0.952       0.982  1.2%
multilingual-100          89.1%     0.884       0.941  3.9%
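
One way to relate these per-benchmark rows to a combined figure is a size-weighted average. The sketch below is an illustration under a stated assumption: each benchmark is weighted by its listed size (3,000, 1,200, 5,000, and 100 essays for each of five languages). This page does not specify how the combined benchmark is pooled or how many human-written items each set contains, so treat the weights as hypothetical.

```python
# (listed size, FPR in percent) per benchmark; sizes used as weights (assumption).
BENCHMARKS = {
    "HC3-academic": (3000, 2.1),
    "GPTZero-bench-2024": (1200, 2.4),
    "aiessaydetector-internal": (5000, 1.2),
    "multilingual-100": (500, 3.9),  # 100 essays x 5 languages
}


def pooled_rate(rows: dict[str, tuple[int, float]]) -> float:
    """Size-weighted average of per-benchmark rates, in percent."""
    total = sum(size for size, _ in rows.values())
    if total == 0:
        raise ValueError("no benchmark items")
    return sum(size * rate for size, rate in rows.values()) / total


pooled_fpr = pooled_rate(BENCHMARKS)
```

A pooled figure is dominated by the largest sets, so a small benchmark with a high rate, such as multilingual-100, moves it very little.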

Frequently asked questions

How does this compare to other detectors?
Published third-party comparisons (Cornell 2024, OpenAI withdrawal notice 2023) show detector performance varies substantially across vendors. We link to those comparisons rather than running our own head-to-head, because self-published vendor comparisons are not credible on their own.
Why publish the higher stratified FPR numbers?
Because they are the honest answer to 'how often will this be wrong on a student in my class?' Headline numbers that average away known risk groups mislead.
Can I see the benchmark datasets?
HC3-academic and GPTZero-bench-2024 are public, linked from this page. Our internal benchmark is licensed and non-distributable; researchers can request access under NDA.
Do you change the threshold over time?
The default threshold is reviewed quarterly and moves only with strong evidence. We publish the specific threshold alongside each metric so year-over-year numbers are directly comparable.