Glossary
Classifier.
The machine-learning model that takes text features (perplexity, burstiness, embeddings) and returns an AI-likelihood score.
Classifier
In AI detection, the classifier is the component that makes the actual prediction. It takes input features such as perplexity, burstiness, n-gram patterns, and learned sentence embeddings, and outputs a probability that the passage is AI-generated.
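As a rough illustration of the features-in, probability-out shape described above, here is a minimal sketch in Python. The feature names, weights, and bias are made up for the example (a real detector would learn them from labeled data), and the logistic function stands in for whatever model is actually used.

```python
import math

# Hypothetical weights and bias, chosen by hand for illustration only.
WEIGHTS = {"perplexity": -0.08, "burstiness": -1.2, "ngram_repeat_rate": 3.0}
BIAS = 2.0


def ai_likelihood(features: dict) -> float:
    """Map a dict of numeric features to a probability in (0, 1)."""
    z = BIAS + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))  # logistic (sigmoid) function


# Two invented feature vectors: one predictable and uniform, one varied.
uniform_text = ai_likelihood(
    {"perplexity": 12.0, "burstiness": 0.2, "ngram_repeat_rate": 0.30}
)
varied_text = ai_likelihood(
    {"perplexity": 60.0, "burstiness": 0.9, "ngram_repeat_rate": 0.05}
)
print(f"uniform: {uniform_text:.2f}, varied: {varied_text:.2f}")
```

With these illustrative weights, lower perplexity and lower burstiness push the score toward 1. The direction and size of each weight is an assumption of the sketch, not a property of any particular detector.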
The classifier's training data determines what it generalizes to. A classifier trained only on GPT-3.5 output will underperform on Claude or Gemini. Cross-family performance is the hardest generalization problem in the field and is why we run model-family-specific detectors alongside the generic one.
Three classifier flavors used in detection
Most production AI detectors use one of three architectures. Logistic regression over hand-engineered features (perplexity, burstiness, n-gram counts) is fast and interpretable but weak on cross-model generalization. Fine-tuned transformer classifiers (RoBERTa, DeBERTa, or similar) are strong within-distribution but sometimes brittle out-of-distribution. Hybrid stacks combine a transformer for embedding-level features with a meta-classifier for the final decision. We run a hybrid stack because it lets us update model-family-specific submodules without retraining the whole pipeline.
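The modularity argument for a hybrid stack can be sketched as follows. This is not our production code: `HybridStack`, `replace_submodule`, the submodule names, and the constant-score placeholder detectors are all hypothetical, and the meta-classifier is reduced to a weighted logistic combination.

```python
import math
from typing import Callable, Dict

Detector = Callable[[str], float]  # text -> score in [0, 1]


class HybridStack:
    """Toy meta-classifier over named submodule detectors (illustrative only)."""

    def __init__(self, submodules: Dict[str, Detector],
                 weights: Dict[str, float], bias: float = 0.0) -> None:
        self.submodules = dict(submodules)
        self.weights = dict(weights)
        self.bias = bias

    def replace_submodule(self, name: str, detector: Detector) -> None:
        """Swap one submodule; the others and the weights stay untouched."""
        if name not in self.submodules:
            raise KeyError(name)
        self.submodules[name] = detector

    def score(self, text: str) -> float:
        """Combine submodule scores into one probability with a logistic function."""
        z = self.bias + sum(
            self.weights[name] * detector(text)
            for name, detector in self.submodules.items()
        )
        return 1.0 / (1.0 + math.exp(-z))


# Placeholder detectors that return constants; real ones would run models.
stack = HybridStack(
    submodules={"generic": lambda text: 0.6, "family_a": lambda text: 0.2},
    weights={"generic": 2.0, "family_a": 2.0},
    bias=-2.0,
)
before = stack.score("example passage")
stack.replace_submodule("family_a", lambda text: 0.9)  # updated submodule only
after = stack.score("example passage")
```

In practice the meta-classifier may still need recalibration after a submodule changes, since its score distribution can shift. The sketch only shows that the swap itself does not touch the other components.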
What a good classifier evaluation includes
A good evaluation covers cross-family generalization (train on one LLM, test on another), confidence calibration (do passages scored 0.85 turn out to be AI-generated about 85% of the time?), and false-positive rates stratified by writer population. Our methodology is documented on /methodology.
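Two of these checks are simple enough to sketch in a few lines of Python. The functions below are illustrative rather than a description of our methodology: the calibration check uses expected calibration error with equal-width bins (one common choice among several), and the group names, the 0.5 threshold, and all data are invented.

```python
from collections import defaultdict


def expected_calibration_error(scores, labels, n_bins=10):
    """Weighted mean gap between average score and observed AI rate per bin."""
    bins = defaultdict(list)
    for score, label in zip(scores, labels):
        index = min(int(score * n_bins), n_bins - 1)  # score 1.0 goes in last bin
        bins[index].append((score, label))
    total = len(scores)
    ece = 0.0
    for items in bins.values():
        mean_score = sum(s for s, _ in items) / len(items)
        ai_rate = sum(y for _, y in items) / len(items)
        ece += (len(items) / total) * abs(mean_score - ai_rate)
    return ece


def false_positive_rate_by_group(scores, labels, groups, threshold=0.5):
    """For each group, the share of human-written texts (label 0) flagged as AI."""
    flagged = defaultdict(int)
    humans = defaultdict(int)
    for score, label, group in zip(scores, labels, groups):
        if label == 0:
            humans[group] += 1
            if score >= threshold:
                flagged[group] += 1
    return {group: flagged[group] / humans[group] for group in humans}


# Invented data: 20 passages all scored 0.85, of which 17 are AI-generated.
ece = expected_calibration_error([0.85] * 20, [1] * 17 + [0] * 3)

# Invented data: four human-written passages from two writer populations.
fpr = false_positive_rate_by_group(
    scores=[0.9, 0.2, 0.1, 0.1],
    labels=[0, 0, 0, 0],
    groups=["group_a", "group_a", "group_b", "group_b"],
)
```

In the first example the scores are well calibrated by construction, so the error is close to zero. In the second, one of the two `group_a` passages is wrongly flagged and neither `group_b` passage is, which is the kind of gap a single aggregate false-positive rate would hide.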
Where this concept is most often misunderstood
The most common misunderstanding about classifiers in AI detection systems is the belief that they operate as binary decision makers with perfect certainty. In reality, classifiers produce probability distributions across potential labels. A classifier might assign a 0.73 probability that a text is AI-generated and 0.27 that it is human-written, but this nuance often disappears when platforms display only a single categorical label. Users frequently interpret an "AI-generated" label as absolute certainty rather than the statistical best guess it represents, leading to overconfidence in individual predictions.
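One way to keep that nuance visible is to show the probability next to the label instead of the label alone. The helper below is a hypothetical sketch. The function name, the wording of the output, and the 0.5 threshold are assumptions rather than a description of any platform's interface.

```python
def describe_prediction(p_ai: float, threshold: float = 0.5) -> str:
    """Return a label that keeps the underlying probabilities visible."""
    if not 0.0 <= p_ai <= 1.0:
        raise ValueError("p_ai must be between 0 and 1")
    label = "AI-generated" if p_ai >= threshold else "human-written"
    return f"{label} (estimated P(AI) = {p_ai:.2f}, P(human) = {1 - p_ai:.2f})"


print(describe_prediction(0.73))
# prints: AI-generated (estimated P(AI) = 0.73, P(human) = 0.27)
```

The label on its own would read the same for a score of 0.51 and a score of 0.99. Carrying the number along makes the difference between a marginal call and a strong one visible to the reader.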
Another persistent confusion centers on the relationship between training data and classification boundaries. Many assume that classifiers learn explicit rules or feature checklists that define AI versus human writing. Instead, modern classifiers identify complex, high-dimensional patterns in their training corpus that correlate with labels. This means a classifier trained primarily on GPT-3.5 outputs may perform poorly on GPT-4 or Claude-generated text, not because it lacks rules for those models, but because the statistical distributions differ in ways the original training set did not capture. The classifier has learned associations, not definitions.
Edge cases and known limits
Classifiers face significant challenges when processing texts that fall outside their training distribution. Heavily edited AI outputs, where a human has rewritten 40 percent of the generated text, often confuse classifiers because the result contains a hybrid statistical signature. Similarly, non-native English speakers who use formal, template-driven writing structures may trigger false positives, as their natural linguistic patterns can resemble the regularized output of language models. Domain-specific technical writing, legal documents, and scientific abstracts present additional difficulties because their constrained vocabularies and standardized formats reduce the discriminative features available to the classifier.
Texts shorter than 100 words are another well-documented limitation. With fewer tokens to analyze, classifiers have less statistical evidence for their predictions, leading to higher error rates and lower confidence scores. Multilingual texts and code-switched writing further complicate classification, as most detectors train predominantly on monolingual English corpora. Adversarial techniques, including strategic synonym replacement and sentence restructuring designed specifically to evade detection, can degrade classifier performance by 30 to 50 percent according to 2024 research from Stanford and MIT, though such evasion typically requires deliberate effort rather than occurring naturally in student work.
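A simple way to respect the short-text limit is to decline to score passages below a minimum length. The sketch below assumes a whitespace word count and an abstain-by-returning-`None` policy. Both are illustrative choices, and `score_or_abstain` and the placeholder scorer are hypothetical names.

```python
from typing import Callable, Optional

MIN_WORDS = 100  # cutoff taken from the short-text limit discussed above


def score_or_abstain(text: str, scorer: Callable[[str], float],
                     min_words: int = MIN_WORDS) -> Optional[float]:
    """Return the scorer's probability, or None when the text is too short."""
    if len(text.split()) < min_words:  # whitespace word count, a rough proxy
        return None
    return scorer(text)


def placeholder_scorer(text: str) -> float:
    """Stand-in for a real classifier; always returns the same score."""
    return 0.5


short_result = score_or_abstain("Too short to judge.", placeholder_scorer)
long_result = score_or_abstain("word " * 150, placeholder_scorer)
```

Here `short_result` is `None` and `long_result` is the scorer's output. Whether to abstain, warn, or score with a wider uncertainty band is a product decision that the sketch does not settle.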