Glossary
Training data
The dataset a detector learned from. Determines what it generalizes to, and where it fails.
A detector's training data is the paired AI-and-human corpus it saw during fine-tuning. Two detectors can share the same architecture and differ dramatically on real-world performance because their training sets differ.
We train on licensed human corpora plus AI-generated text sampled across GPT, Claude, and Gemini families. We do not use user submissions. See /transparency for the dataset documentation.
What good training data looks like
Three properties matter. Breadth: paired samples across all major LLM families, not just one. Diversity: academic, journalistic, technical, and casual writing styles in the human samples, so the human class isn't implicitly defined as "academic prose." Recency: model output from 2024 looks different from output in 2026, and detectors trained on stale samples drift.
What we use and what we don't
Our training data is a licensed academic-essay corpus paired with AI samples generated across GPT, Claude, Gemini, and Llama families and refreshed quarterly. We do not use customer submissions to train the detector. That is a contractual commitment in the institutional Data Processing Agreement and a default for all users. The dataset documentation is on /transparency.
Where this concept is most often misunderstood
A common misconception holds that training data is a fixed library of texts that AI models memorize and recombine. In reality, large language models encode statistical patterns in billions of parameters during training and do not keep the original texts as a retrievable archive. The model largely retains relationships between tokens (word fragments) rather than storing passages verbatim. This distinction matters because educators sometimes assume that detected AI writing means the tool copied from its training corpus, when the model actually generated novel sequences from learned probability distributions.
Another frequent misunderstanding involves the cutoff date of training data. Users often believe models have zero knowledge beyond their training window, but this oversimplifies how pattern recognition works. A model trained on data through September 2023 still applies linguistic structures and reasoning frameworks to prompts about later events, though it lacks factual grounding for those topics. Detection systems account for this by analyzing writing patterns rather than checking whether content references post-cutoff information, since stylistic signatures stay largely consistent regardless of how recent the subject matter is.
How training data interacts with related concepts
Training data quality directly influences perplexity scores, a core metric in AI detection. Models assign lower perplexity (higher predictability) to text that closely resembles patterns in their training corpus. A model trained on millions of scholarly articles produces grammatically smooth academic prose with vocabulary distributions matching that domain. Detection tools exploit this relationship by measuring whether submitted text exhibits the statistical regularity characteristic of output from models trained on formal writing datasets, versus the higher perplexity typical of human-authored work with idiosyncratic phrasing.
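To make the perplexity relationship concrete, here is a minimal sketch, not a description of any particular detector's implementation. It assumes you already have per-token natural-log probabilities from some scoring model; the function name `perplexity` and the probability values are hypothetical and chosen only for illustration. Perplexity is the exponential of the negative mean log-probability per token, so more predictable text scores lower.

```python
import math


def perplexity(token_logprobs):
    """Return exp(-mean log-probability) for a sequence of natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical per-token probabilities (illustrative values, not real model output).
predictable_text = [math.log(p) for p in (0.5, 0.5, 0.5, 0.5)]
surprising_text = [math.log(p) for p in (0.1, 0.1, 0.1, 0.1)]

# The predictable sequence yields the lower perplexity of the two.
print(perplexity(predictable_text) < perplexity(surprising_text))
```

In this sketch a constant token probability p gives a perplexity of 1/p, which is why the sequence the scoring model finds easier to predict comes out lower.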
The composition of training data also determines the burstiness patterns that detectors analyze. Because training corpora aggregate text from sources that favor consistent sentence structures (Wikipedia articles, published papers, web content optimized for readability), models learn to generate sentences of uniform length and syntactic complexity. Human writers trained in the same academic conventions may produce similarly low-burst text, creating overlap that complicates detection. This interaction explains why detection accuracy varies across disciplines: in fields with highly standardized writing conventions, expert human prose more closely matches the distributions models learned from, narrowing the statistical gap detectors rely upon.
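Burstiness has no single agreed definition. The sketch below assumes one simple proxy, the coefficient of variation of sentence length in words, and uses a naive punctuation-based sentence splitter. The names `sentence_lengths` and `burstiness` are hypothetical, and a production detector would likely use a more robust tokenizer and additional features.

```python
import re
import statistics


def sentence_lengths(text):
    """Split on ., !, ? (a naive assumption) and return the word count of each sentence."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def burstiness(text):
    """Coefficient of variation of sentence length: population std dev divided by mean."""
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        raise ValueError("need at least two sentences to measure variation")
    return statistics.pstdev(lengths) / statistics.mean(lengths)


uniform = "The cat sat down. The dog ran off. The bird flew away."
varied = "No. The committee met for three hours and reached no decision at all."

# Uniform sentence lengths give zero variation; mixed lengths give a higher score.
print(burstiness(uniform) < burstiness(varied))
```

Under this proxy, a human expert writing in a tightly standardized style can score as low as the uniform example, which is the overlap described above.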