Glossary

Sentence-level scoring.

Reporting an AI-likelihood for each sentence, not just one score for the whole essay.

Sentence-level scoring surfaces which specific sentences look AI-generated, rather than collapsing the whole essay to a single number. It's especially important for hybrid drafts, where the overall score is pulled toward the middle by mixing AI and human sentences.

Sentence-level scoring is an account feature on aiessaydetector.ai. It sits behind a free account because it doubles the inference cost and benefits from rate-limiting.

Why sentence-level changes the conversation

An essay-level score of 73% AI-likelihood tells you the paper might be AI-generated. A sentence-level heatmap tells you which six sentences in the paper drove the score, which paragraphs are clean, and where to focus a follow-up conversation. The first is a verdict-shaped object that students can dispute and teachers can't easily defend. The second is evidence: the kind that makes an academic-integrity conversation productive instead of adversarial.

Pricing and rate-limit context

Sentence-level scoring requires per-sentence inference, which is roughly 5-10x the compute of essay-level scoring. We gate it behind a free account, which raises the per-day limit from 5 to 20, primarily for rate-limit reasons. Institutional plans get unlimited sentence-level scoring by default. The output format used on every detector subpage is sentence-level by default; the essay-level number is the summary.

How sentence-level scoring interacts with related metrics

Sentence-level scoring operates as a granular component within broader document-level detection frameworks. While document-level classifiers assign a single probability score to an entire text, sentence-level systems generate per-sentence probabilities that are then aggregated using various strategies (mean, median, maximum, or weighted voting). This relationship creates analytical tension: a document may receive a low overall AI probability score despite containing clusters of high-scoring sentences, or conversely, a few human-written sentences can lower the aggregate score of an otherwise machine-generated essay. Researchers have documented cases where threshold-based aggregation methods misclassify documents containing mixed authorship, particularly when human edits are confined to introductory or concluding sentences.
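The aggregation strategies named above can be sketched in a few lines of Python. This is a minimal illustration, not the method used by any particular detector: the function name `aggregate`, the strategy labels, and the example scores are all hypothetical, and "weighted voting" is approximated here by a plain weighted average.

```python
from statistics import mean, median


def aggregate(scores, strategy="mean", weights=None):
    """Collapse per-sentence AI-likelihoods (0..1) into one document score.

    Hypothetical helper for illustration only.
    """
    if not scores:
        raise ValueError("need at least one sentence score")
    if strategy == "mean":
        return mean(scores)
    if strategy == "median":
        return median(scores)
    if strategy == "max":
        return max(scores)
    if strategy == "weighted":
        # Assumption: weights are supplied by the caller, e.g. sentence length.
        if weights is None or len(weights) != len(scores):
            raise ValueError("weights must match scores in length")
        total = sum(weights)
        if total <= 0:
            raise ValueError("weights must sum to a positive number")
        return sum(s * w for s, w in zip(scores, weights)) / total
    raise ValueError(f"unknown strategy: {strategy}")


# Made-up hybrid draft: a cluster of three high-scoring sentences
# surrounded by seven low-scoring ones.
hybrid = [0.1, 0.1, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1, 0.1]

for name in ("mean", "median", "max"):
    print(name, round(aggregate(hybrid, name), 2))
```

On this made-up draft the mean comes out near 0.34 and the median near 0.1, while the maximum is 0.9. The same sentences therefore look mostly human or strongly AI-generated depending only on the aggregation choice, which is the tension described above.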

The interaction between sentence-level scores and perplexity metrics reveals additional complexity. Perplexity measures how surprised a language model is by a given text, with lower perplexity often indicating AI generation. However, sentence-level classifiers trained on stylometric features may flag formulaic human writing (such as legal boilerplate or technical documentation) as AI-generated due to low lexical diversity, even when perplexity remains high. This divergence becomes pronounced in domain-specific writing, where specialized terminology constrains sentence structure. Institutions employing multiple detection methods must therefore reconcile conflicting signals when sentence-level scores and perplexity-based assessments produce opposite conclusions about the same passage.
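For reference, perplexity is conventionally computed as the exponential of the average negative log-probability a language model assigns to each token. The sketch below assumes the per-token natural-log probabilities are already available from some model. The function names and both thresholds are hypothetical placeholders, not values used by any real detector.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


def signals_disagree(sentence_score, ppl, score_threshold=0.5, ppl_threshold=20.0):
    """Return True when the two signals point in opposite directions.

    Both thresholds are illustrative assumptions only.
    """
    classifier_says_ai = sentence_score >= score_threshold
    perplexity_says_ai = ppl <= ppl_threshold  # low perplexity suggests AI
    return classifier_says_ai != perplexity_says_ai


# A model that gives every token probability 0.25 has perplexity 4.
print(round(perplexity([math.log(0.25)] * 4), 6))

# Formulaic human boilerplate: classifier score is high, perplexity is high.
print(signals_disagree(sentence_score=0.8, ppl=60.0))
```

A passage for which `signals_disagree` returns True is the kind of case an institution would need to review by hand rather than resolve automatically.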

Edge cases and known limits

Sentence-level scoring encounters systematic difficulties with short sentences, where insufficient lexical and syntactic evidence exists for reliable classification. Sentences under five words often receive near-random probability scores because statistical classifiers lack adequate feature vectors, while transformer-based models have too little context to work with. Research from 2024 demonstrated that single-clause sentences below seven tokens exhibit classification accuracy barely exceeding chance (53 percent), regardless of training corpus size. This limitation becomes critical when analyzing bullet points, numbered lists, or fragmented notes common in student outlines and brainstorming documents.
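One way to act on this limitation is to withhold scores for very short sentences instead of reporting a near-random number. The sketch below is an assumption-laden illustration: `mask_short_sentences` is a hypothetical helper, the cutoff of seven borrows the token figure quoted above, and whitespace splitting stands in for a real tokenizer, which would count tokens differently.

```python
MIN_TOKENS = 7  # assumption: borrowed from the seven-token figure in the text


def mask_short_sentences(sentences, scores, min_tokens=MIN_TOKENS):
    """Pair each sentence with its score, or None if it is too short to trust.

    Whitespace splitting is a rough stand-in for a model tokenizer.
    """
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    masked = []
    for sentence, score in zip(sentences, scores):
        n_tokens = len(sentence.split())
        masked.append((sentence, score if n_tokens >= min_tokens else None))
    return masked


# Made-up outline fragments and scores for illustration.
outline = [
    "Intro: define the problem",
    "The second paragraph compares three competing explanations for the decline in yields.",
]
result = mask_short_sentences(outline, [0.61, 0.82])
for sentence, score in result:
    print(score, "|", sentence)
```

Here the four-word outline fragment is reported as None rather than 0.61, while the longer sentence keeps its score.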

Another documented edge case involves code-switching and multilingual sentences, where detection models trained predominantly on monolingual English corpora misclassify intra-sentential language mixing as AI-generated. A 2025 study found that Spanish-English code-switched academic writing received a false positive rate of 68 percent when evaluated by sentence-level classifiers, as the models interpreted syntactic boundary violations and lexical borrowing as indicators of machine generation. Similarly, sentences containing mathematical notation, chemical formulas, or extensive citation formatting produce unreliable scores because tokenization schemes fragment symbolic elements inconsistently. Practitioners working in STEM disciplines or with multilingual student populations must account for these structural blind spots when interpreting sentence-level detection outputs.
