Benchmark refresh
Monthly. Last refresh: 2026-04-01. See /stats for the numbers.
Open methodology
This page covers benchmark composition, model architecture, confidence intervals, and how we measure and mitigate bias against ESL (English as a second language) writers. If anything here is unclear, email us.
Every submission runs through three stages: (a) a perplexity-and-burstiness scorer using a small open-weight reference model, (b) a stylometric feature extractor that computes ~30 features per sentence (length distribution, passive-voice rate, hedge-word frequency, punctuation density, transition density, etc.), and (c) a transformer-based classifier that combines those inputs into a per-sentence probability that the sentence was generated by a large language model.
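As a rough illustration of stage (b), the sketch below computes three of the per-sentence features named above. The hedge-word list, the function name `sentence_features`, and the exact feature definitions are assumptions made for this example, not the production extractor.

```python
import string

# Hypothetical hedge-word list, for illustration only.
HEDGE_WORDS = {"perhaps", "possibly", "might", "may", "could", "seems", "arguably", "likely"}


def sentence_features(sentence: str) -> dict[str, float]:
    """Compute a few illustrative stylometric features for one sentence."""
    # Strip surrounding punctuation from each whitespace-separated token.
    words = [tok.strip(string.punctuation).lower() for tok in sentence.split()]
    words = [w for w in words if w]
    n_words = len(words)
    n_punct = sum(1 for ch in sentence if ch in string.punctuation)
    n_hedges = sum(1 for w in words if w in HEDGE_WORDS)
    return {
        "word_count": float(n_words),
        # Punctuation characters per character of the sentence.
        "punctuation_density": n_punct / max(len(sentence), 1),
        # Share of words that appear in the hedge-word list.
        "hedge_word_rate": n_hedges / max(n_words, 1),
    }


features = sentence_features("This might work, perhaps.")
```

In a full pipeline, a vector like this would be one of the inputs to the stage (c) classifier, alongside the perplexity and burstiness scores.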
Per-sentence probabilities are aggregated to an essay-level score using a trimmed-mean aggregator that downweights the top and bottom 10% of sentences, because those tails are most likely to be either trivially short human fragments or quoted AI samples. The result is a more stable essay-level number than a raw average.
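A minimal sketch of the aggregation step, assuming a standard trimmed mean that drops the lowest and highest 10% of per-sentence probabilities outright. The text above says the tails are downweighted, so the production weighting may differ; the function name `trimmed_mean` is hypothetical.

```python
import math


def trimmed_mean(probs: list[float], trim: float = 0.10) -> float:
    """Average per-sentence probabilities after dropping both tails."""
    if not probs:
        raise ValueError("need at least one sentence probability")
    if not 0.0 <= trim < 0.5:
        raise ValueError("trim must be in [0, 0.5)")
    ordered = sorted(probs)
    # Number of sentences to drop from each end (assumption: rounded down).
    k = math.floor(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)


# One outlier at each end does not move the essay-level score.
essay_score = trimmed_mean([0.0, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 1.0])
```

With fewer than ten sentences, `k` is zero in this sketch and the result falls back to a plain average.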
The classifier is trained on ~180,000 labeled passages, split roughly 55% human-written (balanced across native English, ESL, and formal academic registers) and 45% AI-generated (balanced across GPT-4, Claude 3.5/4, Gemini 2, Llama 3/4, Mistral, and DeepSeek, at temperatures 0.3–1.2). Human samples come from public essay corpora, licensed academic datasets, and Creative-Commons-licensed student writing (with consent).
No customer-submitted text is used for training. Ever. This is a hard policy: we would rather have a weaker classifier than break that commitment.
We run a monthly benchmark on a held-out corpus of 3,042 essays not used in training. Precision and recall are computed at the default threshold (p > 0.60). Per-language, per-register, and native/ESL breakouts are computed separately because a rolled-up "accuracy" number hides the signal that matters.
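For reference, precision and recall at the default threshold can be computed as in the sketch below. The function name and the input layout (one essay-level score and one ground-truth label per essay) are assumptions for illustration.

```python
def precision_recall(
    scores: list[float], is_ai: list[bool], threshold: float = 0.60
) -> tuple[float, float]:
    """Precision and recall for the 'AI-generated' class at a strict threshold."""
    tp = fp = fn = 0
    for score, label in zip(scores, is_ai):
        flagged = score > threshold  # strict inequality, matching p > 0.60
        if flagged and label:
            tp += 1
        elif flagged and not label:
            fp += 1
        elif not flagged and label:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# A score of exactly 0.60 is not flagged.
precision, recall = precision_recall(
    [0.9, 0.7, 0.6, 0.2], [True, False, True, False]
)
```

Running the same function on each language, register, or native/ESL subset gives the per-group breakouts.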
Confidence intervals are 95% bootstrap intervals over 1,000 resamples of the benchmark. We refuse to return a confident score on any passage under 250 words. Below that length, the intervals are wide enough to be meaningless.
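The sketch below shows one common way to build such an interval, a percentile bootstrap over per-essay outcomes, together with a simple length gate. The per-essay 0/1 outcome format, the fixed seed, whitespace word counting, and the names `bootstrap_ci` and `score_or_insufficient` are all assumptions for illustration.

```python
import random
from typing import Callable, Union

MIN_WORDS = 250


def bootstrap_ci(
    outcomes: list[int], n_resamples: int = 1000, alpha: float = 0.05, seed: int = 0
) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-essay 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample the benchmark with replacement, same size as the original.
        sample = rng.choices(outcomes, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = max(0, round((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, round((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


def score_or_insufficient(
    text: str, score_fn: Callable[[str], float]
) -> Union[float, str]:
    """Return a score only when the passage meets the minimum length."""
    if len(text.split()) < MIN_WORDS:
        return "signal insufficient"
    return score_fn(text)


# 90 correct and 10 incorrect essay-level decisions.
low, high = bootstrap_ci([1] * 90 + [0] * 10)
```

The same resampling loop works for precision, recall, or any other benchmark statistic if the per-resample mean is swapped for that metric.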
We track the ratio of ESL false-positive rate to native-speaker false-positive rate monthly. Our goal is to drive that ratio below 1.5x (currently 2.3x). We use two primary mitigations: (a) stratified training with oversampling of ESL human samples, and (b) a post-hoc calibration layer that adjusts the classifier threshold based on detected stylistic register.
Full per-month numbers are on /stats.
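The ESL-to-native ratio described above can be computed as in this sketch. It assumes each list holds essay-level scores for human-written essays only, so every flagged essay is a false positive; the function names are hypothetical.

```python
def false_positive_rate(human_scores: list[float], threshold: float = 0.60) -> float:
    """Share of human-written essays flagged as AI at the given threshold."""
    if not human_scores:
        raise ValueError("need at least one human-written essay")
    return sum(1 for s in human_scores if s > threshold) / len(human_scores)


def esl_to_native_fp_ratio(
    esl_scores: list[float], native_scores: list[float], threshold: float = 0.60
) -> float:
    """Ratio of the ESL false-positive rate to the native-speaker rate."""
    native_fpr = false_positive_rate(native_scores, threshold)
    if native_fpr == 0.0:
        # The ratio is undefined when no native-speaker essay is flagged.
        raise ValueError("native false-positive rate is zero")
    return false_positive_rate(esl_scores, threshold) / native_fpr


# 2 of 4 ESL essays flagged versus 1 of 4 native essays.
ratio = esl_to_native_fp_ratio([0.7, 0.7, 0.1, 0.1], [0.7, 0.1, 0.1, 0.1])
```

A ratio of 1.0 would mean ESL and native writers are falsely flagged at the same rate.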
We retrain monthly on the most recent month of LLM output samples, including output from the newest GPT, Claude, Gemini, and other releases. Each retrain is validated against the benchmark corpus before being promoted to production. If a retrain degrades any of (a) overall precision, (b) overall recall, or (c) the ESL-to-native false-positive (FP) ratio, it's rejected.
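The promotion rule above amounts to a three-way comparison against the current production model. The sketch below is one way to express it; the `BenchmarkResult` type, the `should_promote` name, and the choice to treat any decrease (with no tolerance band) as a degradation are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    precision: float
    recall: float
    esl_fp_ratio: float  # ESL-to-native false-positive ratio; lower is better


def should_promote(candidate: BenchmarkResult, production: BenchmarkResult) -> bool:
    """Reject the retrain if any of the three tracked metrics gets worse."""
    return (
        candidate.precision >= production.precision
        and candidate.recall >= production.recall
        and candidate.esl_fp_ratio <= production.esl_fp_ratio
    )


production = BenchmarkResult(precision=0.95, recall=0.90, esl_fp_ratio=2.3)
# Better precision and recall, but a worse fairness ratio: rejected.
candidate = BenchmarkResult(precision=0.96, recall=0.91, esl_fp_ratio=2.5)
promote = should_promote(candidate, production)
```

The metric values in the usage example are made up for illustration and are not published results.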
ESL-to-native false-positive ratio target: below 1.5x. Currently 2.3x. We publish the gap monthly whether it's improving or not.
Sentence-level evidence: always on. Essay-level scores without sentence evidence are a bad pattern.
Minimum length: 250 words. Below that we return 'signal insufficient', not a number.
Supported languages: English (US/UK), Spanish, French, German, Portuguese. Per-language accuracy published on /stats.
Monthly retrain. Rollbacks allowed if any of precision, recall, or ESL fairness degrades.
Audited performance