Open methodology

How the detector works, in detail.

Q: Can I see the benchmark corpus?

The methodology and statistical composition are public. The raw essays are not: they include licensed samples and student contributions shared with consent, neither of which we can redistribute.

Q: Why don't you publish head-to-head benchmarks against Turnitin?

We don't have a Turnitin license that permits using it on our benchmark corpus. Comparison-page claims about Turnitin are sourced from public benchmarks and documented behavior, not head-to-head runs on our corpus.

Q: What happens if the model regresses?

We roll back to the previous version. Release notes for every production change are published at /trust-center.

Benchmark composition, model architecture, confidence intervals, and how we measure and mitigate ESL bias. If anything here is unclear, email us.

Last updated 2026-04-01

1. What the classifier does.

Every submission runs through three stages: (a) a perplexity-and-burstiness scorer using a small open-weight reference model, (b) a stylometric feature extractor that computes ~30 features per sentence (length distribution, passive-voice rate, hedge-word frequency, punctuation density, transition density, etc.), and (c) a transformer-based classifier that combines those inputs into a per-sentence probability that the sentence was generated by a large language model.
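As a rough illustration of stage (b), the sketch below computes three per-sentence features of the kind listed above. It is not the production extractor: the function name, the hedge-word list, and the exact feature definitions are assumptions made for the example, and the real feature set of roughly 30 features is not reproduced here.

```python
import re

# Hypothetical hedge-word list, chosen only for this example.
HEDGE_WORDS = {"may", "might", "perhaps", "possibly", "seems", "likely"}

# Punctuation marks counted toward punctuation density (an assumption).
PUNCTUATION = set(",;:.!?")


def sentence_features(sentence: str) -> dict:
    """Compute three illustrative stylometric features for one sentence."""
    words = re.findall(r"[A-Za-z']+", sentence.lower())
    n_words = len(words)
    n_punct = sum(1 for ch in sentence if ch in PUNCTUATION)
    n_hedges = sum(1 for w in words if w in HEDGE_WORDS)
    return {
        "length_words": n_words,
        # Punctuation marks per character of the sentence.
        "punctuation_density": n_punct / max(len(sentence), 1),
        # Share of words that are hedge words.
        "hedge_rate": n_hedges / max(n_words, 1),
    }


features = sentence_features("This might work, perhaps.")
print(features["length_words"], features["hedge_rate"])
```

In a pipeline like the one described, a feature dictionary of this shape would be computed for every sentence and passed to stage (c) together with the perplexity and burstiness scores.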

The three-stage pipeline. Stages 1 and 2 produce features that feed stage 3, which outputs a per-sentence probability. The aggregator turns those into the essay-level number you see.

Per-sentence probabilities are aggregated to an essay-level score using a trimmed-mean aggregator that downweights the highest- and lowest-scoring 10% of sentences, because those tails are most likely to be either trivially short human fragments or quoted AI samples. The result is a more stable essay-level number than a raw average.
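The aggregation step can be sketched as a classic trimmed mean. This is a minimal version under one stated assumption: "downweighting" is implemented here as dropping the top and bottom 10% outright (weight zero), which may differ from the production weighting. The function name is hypothetical.

```python
def trimmed_mean(probs, trim=0.10):
    """Average per-sentence probabilities after discarding both tails.

    Assumption: the tails are dropped entirely rather than given a
    reduced but non-zero weight.
    """
    if not probs:
        raise ValueError("need at least one sentence probability")
    ordered = sorted(probs)
    k = int(len(ordered) * trim)  # sentences to drop from each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)


# One very low and one very high sentence among eight ordinary ones.
probs = [0.0, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 0.2, 1.0]
print(sum(probs) / len(probs))  # raw mean, about 0.26
print(trimmed_mean(probs))      # trimmed mean, about 0.20
```

The single 1.0 sentence (for example, a quoted AI sample) pulls the raw mean up, while the trimmed mean ignores it.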

2. What training data we used.

The classifier is trained on ~180,000 labeled passages, split roughly 55% human-written (balanced across native English, ESL, and formal academic registers) and 45% AI-generated (balanced across GPT-4, Claude 3.5/4, Gemini 2, Llama 3/4, Mistral, and DeepSeek, at temperatures 0.3–1.2). Human samples come from public essay corpora, licensed academic datasets, and Creative-Commons-licensed student writing (with consent).

Composition of the 180k-passage training set. Sub-breakouts show approximate percentages within each class.

No customer-submitted text is used for training. Ever. This is a hard policy: we would rather have a weaker classifier than break that commitment.

3. How we measure accuracy.

We run a monthly benchmark on a held-out corpus of 3,042 essays not used in training. Precision and recall are computed at the default threshold (p > 0.60). Per-language, per-register, and native/ESL breakouts are computed separately because a rolled-up "accuracy" number hides the signal that matters.

The operating point sits where precision is high and recall is meaningful. Lower thresholds catch more AI but flag more false positives. Higher thresholds protect students at the cost of recall.
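The threshold trade-off can be made concrete with a small sketch. The scores and labels below are made up for illustration and are not benchmark data; the function name is hypothetical, and only the default threshold of 0.60 comes from the text above.

```python
def precision_recall(scores, labels, threshold=0.60):
    """Precision and recall when essays with score > threshold are flagged.

    labels: 1 for AI-generated, 0 for human-written.
    """
    flagged = [s > threshold for s in scores]
    tp = sum(1 for f, y in zip(flagged, labels) if f and y)
    fp = sum(1 for f, y in zip(flagged, labels) if f and not y)
    fn = sum(1 for f, y in zip(flagged, labels) if not f and y)
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    recall = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, recall


# Illustrative values only.
scores = [0.9, 0.7, 0.65, 0.4, 0.3]
labels = [1, 1, 0, 1, 0]

print(precision_recall(scores, labels, threshold=0.60))  # precision 2/3, recall 2/3
print(precision_recall(scores, labels, threshold=0.80))  # precision 1.0, recall 1/3
```

Raising the threshold from 0.60 to 0.80 removes the one false positive in this toy set, but it also misses an AI-written essay that the lower threshold caught.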

Confidence intervals are 95% bootstrap intervals over 1,000 resamples of the benchmark. We refuse to return a confident score on any passage under 250 words; below that length, the intervals are too wide for the score to be meaningful.
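A percentile bootstrap of this kind can be sketched as follows. The sketch assumes the resampled unit is the essay and that the statistic is the mean of per-essay 0/1 outcomes (for example, "flag was correct"); the function name, the fixed seed, and the percentile method are choices made for the example, not a description of the production code.

```python
import random


def bootstrap_ci(outcomes, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of 0/1 outcomes."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(outcomes)
    means = []
    for _ in range(n_resamples):
        # Resample the benchmark with replacement, same size as the original.
        sample = rng.choices(outcomes, k=n)
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Illustrative data: 90 correct flags out of 100.
outcomes = [1] * 90 + [0] * 10
low, high = bootstrap_ci(outcomes)
print(f"95% interval for the rate: [{low:.3f}, {high:.3f}]")
```

The same procedure on a much smaller sample produces a much wider interval, which is the reasoning behind withholding a confident score on short passages.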

4. How we measure and mitigate ESL fairness.

We track the ratio of ESL false-positive rate to native-speaker false-positive rate monthly. Our goal is to drive that ratio below 1.5× (currently 2.3×). We use two primary mitigations: (a) stratified training with oversampling of ESL human samples, and (b) a post-hoc calibration layer that adjusts the classifier threshold based on detected stylistic register.
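The tracked ratio is simple to compute once false-positive rates are available per group. The sketch below shows the arithmetic; the function names are hypothetical, and the rates in the usage example are illustrative values picked to give a ratio of 2.3, not published figures.

```python
def false_positive_rate(flags, is_ai):
    """Share of human-written essays that were flagged as AI."""
    human_flags = [f for f, ai in zip(flags, is_ai) if not ai]
    if not human_flags:
        raise ValueError("no human-written essays in this group")
    return sum(human_flags) / len(human_flags)


def fpr_ratio(esl_fpr, native_fpr):
    """Ratio of ESL FPR to native-speaker FPR; 1.0 would mean parity."""
    if native_fpr == 0:
        raise ValueError("native-speaker FPR is zero; ratio is undefined")
    return esl_fpr / native_fpr


# Toy group: four human essays (one wrongly flagged) and one AI essay.
flags = [True, False, False, False, True]
is_ai = [False, False, False, False, True]
print(false_positive_rate(flags, is_ai))  # 0.25

# Illustrative rates only.
print(round(fpr_ratio(0.046, 0.020), 2))  # 2.3
```

A ratio is used rather than a difference so that the gap stays comparable as the overall false-positive rate changes from month to month.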

Trailing 12-month ratio of ESL FPR to native-speaker FPR. We've cut it from 3.1× to 2.3× over the past year. Target is below 1.5×.

Full per-month numbers on /stats.

5. How the model gets updated.

We retrain monthly on the most recent month of LLM output samples, including output from the newest GPT, Claude, Gemini, and other releases. Each retrain is validated against the benchmark corpus before being promoted to production. If a retrain degrades any of (a) overall precision, (b) overall recall, or (c) the ESL-to-native FP ratio, it's rejected.
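The rejection rule can be sketched as a three-way comparison between the candidate and the current production model. The class and function names are hypothetical, the metric values in the usage example are made up, and the sketch assumes that a tie on a metric does not count as a degradation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkResult:
    precision: float            # higher is better
    recall: float               # higher is better
    esl_native_fp_ratio: float  # lower is better


def passes_gates(candidate: BenchmarkResult, current: BenchmarkResult) -> bool:
    """True only if the candidate degrades none of the three metrics."""
    return (
        candidate.precision >= current.precision
        and candidate.recall >= current.recall
        and candidate.esl_native_fp_ratio <= current.esl_native_fp_ratio
    )


# Illustrative values only.
current = BenchmarkResult(precision=0.91, recall=0.80, esl_native_fp_ratio=2.3)
better = BenchmarkResult(precision=0.92, recall=0.81, esl_native_fp_ratio=2.2)
unfair = BenchmarkResult(precision=0.93, recall=0.82, esl_native_fp_ratio=2.5)

print(passes_gates(better, current))  # True
print(passes_gates(unfair, current))  # False: the ESL ratio got worse
```

Checking each metric separately means a gain in precision cannot offset a regression in ESL fairness.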

The monthly retrain cycle. A new model ships only if it holds or improves on the previous version at each of three independent gates.

6. What we don't do.

We don't claim "99%+ accuracy." No detector on the market meets that bar honestly.
We don't return a confident score on passages under 250 words.
We don't use customer text to train models.
We don't sell detector results to third parties.

At a glance

Benchmark refresh

Monthly. Last refresh: 2026-04-01. See /stats for the numbers.

ESL-native gap target

< 1.5×. Currently 2.3×. We publish the gap monthly whether it's improving or not.

Sentence-level attribution

Always on. Essay-level scores without sentence evidence are a bad pattern.

Minimum passage length

250 words. Below that we return 'signal insufficient', not a number.

Language support

English (US/UK), Spanish, French, German, Portuguese. Per-language accuracy published on /stats.

Model update cadence

Monthly retrain. Rollbacks allowed if any of precision, recall, or ESL fairness degrades.

Audited performance

What the held-out corpus actually tells us.

0.94

AUC

On 18,000-essay held-out academic corpus, Q1 2026.

0.91

Precision @ 0.5

Fewer than 1 false positive per 11 flagged essays.

<3%

ESL FPR

Audited on 12,000 non-native essays, parity vs L1.

30 days

Retrain cadence

Monthly model refresh, with rollback gating.

Frequently asked questions

Can I see the benchmark corpus?