What signals do AI detection tools check most heavily?

The hapax legomena ratio (the proportion of words that appear exactly once in the text) is the most predictive single signal, with 2025 research by Kovalevskii reporting 98.5% classification accuracy from it alone. Secondary signals include sentence length variation, the frequency of specific transition phrases, em dash density, and passive voice rate. AI Sentinel groups these checks into four tiers and reports which specific patterns are triggering flags.

Does writing content with AI and then editing it help with detection scores?

Editing AI-generated content improves detection scores in proportion to how much you change. Light editing that fixes typos and adjusts a few sentences leaves most AI patterns intact. Substantive editing that replaces generic language with specific claims, breaks up long sentences of uniform length, removes filler transitions, and adds concrete examples from direct knowledge produces content that scores significantly more human. The work required for a substantial improvement is often comparable to writing from scratch.

Is a zero AI score possible for content that used AI tools in the writing process?

Yes. The score measures statistical patterns in the final text, not the process used to produce it. Content that used AI for research or initial drafting but was substantially rewritten with specific language, varied structure, and domain expertise can score in the human range consistently.

What is the Shopify em dash finding and why does it matter?

Shopify engineering documented that em dash density above 2 per 1,000 words appeared as a consistent pattern in AI-generated content across their publisher network, contributing to their 2025 deindexing decision for that content. Em dashes themselves are not a problem. Elevated density is the signal, because AI models use em dashes as a syntactic crutch in ways that produce statistically abnormal frequency. Keeping em dash usage to approximately 1 per 1,000 words avoids this specific flag.
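As a rough illustration of the arithmetic, the density figure can be computed by counting em dashes and scaling to 1,000 words. This is a minimal sketch, not the method Shopify or any detection tool is known to use: the function name `em_dash_density` and the word-counting rule (runs of letters, digits, and apostrophes) are assumptions made for the example.

```python
import re

EM_DASH = "\u2014"  # the em dash character
THRESHOLD_PER_1000 = 2.0  # the density level quoted in the text above


def em_dash_density(text: str) -> float:
    """Return the number of em dashes per 1,000 words.

    Words are counted as runs of letters, digits, and apostrophes.
    That counting rule is an assumption for this sketch.
    """
    words = re.findall(r"[A-Za-z0-9']+", text)
    if not words:
        return 0.0
    return 1000 * text.count(EM_DASH) / len(words)


# 1,000 words containing 3 em dashes gives a density of 3.0.
sample = "word " * 996 + "one\u2014two\u2014three\u2014end"
density = em_dash_density(sample)
print(density, density > THRESHOLD_PER_1000)
```

On a short article the count is small, so a single extra em dash can move the figure noticeably. A 500-word post with two em dashes already works out to 4 per 1,000 words.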

Does AI Sentinel work on content from any source or only DotTheta content?

AI Sentinel works on any text you paste into it or any URL you point it at. It has no knowledge of where the content originated. You can check competitor pages, client content, your own articles, or any public URL. The sitemap scanning feature checks all pages on a domain simultaneously, which is useful for auditing large content libraries or monitoring a site for drift toward AI patterns over time.

How to Use AI Content Checks Without Making Writing Sound Robotic

⏱ 8 min read · May 7, 2026

💡 Quick Answer

AI detection tools check lexical diversity, sentence length variation, overused transition phrases, and grammar patterns like em dash density. Writing with specific language, varied sentence lengths, and concrete examples naturally avoids these patterns.

AI detection tools do not read for originality or ideas. They look for statistical patterns in language that appear more frequently in AI-generated text than in text written by humans. Understanding exactly which patterns they check makes it straightforward to produce content that scores as human without stripping out anything useful or distinctive. The goal is not to game detection tools. The goal is to write more naturally, and the patterns that detection tools flag are generally the same patterns that make content feel formulaic and low-value to human readers.

What Specific Patterns Do AI Detection Tools Actually Look For?

AI detection tools typically check four categories of signals. Lexical diversity signals measure how many unique words appear relative to total word count, and how often rare words appear. AI-generated text tends toward a narrower vocabulary than human writing, clustering around the words that are statistically most common for a given topic rather than using the specific, precise language a domain expert would naturally choose.
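One common way to put a number on "unique words relative to total word count" is the type-token ratio. The sketch below is illustrative only: the function name `type_token_ratio` and the simple lowercase tokeniser are assumptions, and no particular detection tool is claimed to compute it this way.

```python
import re


def type_token_ratio(text: str) -> float:
    """Return distinct words divided by total words (0.0 for empty text)."""
    # Lowercase so that "The" and "the" count as the same word.
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Six words, five of them distinct ("the" repeats).
print(type_token_ratio("The cat sat on the mat"))
```

The ratio drops as a text gets longer, because common words keep repeating, so it is only meaningful when comparing passages of similar length.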

Statistical regularity signals measure whether sentence lengths vary naturally or follow a suspiciously consistent rhythm. Human writers vary sentence length organically, mixing short punchy sentences with longer compound structures based on what the content requires. AI tends to produce sentences of similar length in clusters, creating a prose rhythm that trained tools can detect.
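Sentence length variation is often summarised as a coefficient of variation: the standard deviation of sentence lengths divided by their mean. The following is a minimal sketch under simplifying assumptions. Sentences are split on `.`, `!`, and `?`, which mishandles abbreviations and decimals, and the name `sentence_length_cv` is made up for the example.

```python
import re
import statistics


def sentence_length_cv(text: str) -> float:
    """Return the coefficient of variation of sentence lengths in words.

    Returns 0.0 when there are fewer than two sentences.
    """
    # Naive splitter: treats every run of ., ! or ? as a sentence boundary.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    if len(lengths) < 2:
        return 0.0
    return statistics.pstdev(lengths) / statistics.mean(lengths)


uniform = "One two three. Four five six. Seven eight nine."
varied = "Stop. This sentence has a good deal more words in it than the first."

print(sentence_length_cv(uniform))  # identical lengths, so no variation
print(sentence_length_cv(varied))   # one short and one long sentence
```

A value near zero means every sentence is about the same length. Higher values indicate the mix of short and long sentences described above.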

Vocabulary pattern signals look for specific phrases that appear at elevated rates in AI-generated text. These include hedge phrases like “it is important to note that” and “it is worth mentioning”, transition phrases like “furthermore” and “additionally”, and closing patterns like “in conclusion” and “to summarise”. These phrases are not wrong in isolation, but appearing at high frequency in a single document is a detectable signal.
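A phrase check of this kind can be approximated with a simple counter. The sketch below uses only the six phrases named in the paragraph above. The list name `FLAGGED_PHRASES` and the function `flagged_phrase_counts` are hypothetical, and real tools presumably use much longer lists and per-length thresholds.

```python
import re

# Only the phrases quoted in the text above; real phrase lists are longer.
FLAGGED_PHRASES = [
    "it is important to note that",
    "it is worth mentioning",
    "furthermore",
    "additionally",
    "in conclusion",
    "to summarise",
]


def flagged_phrase_counts(text: str) -> dict[str, int]:
    """Return how often each flagged phrase occurs, omitting zero counts."""
    lowered = text.lower()
    counts: dict[str, int] = {}
    for phrase in FLAGGED_PHRASES:
        # Word boundaries stop "additionally" matching inside longer words.
        pattern = r"\b" + re.escape(phrase) + r"\b"
        hits = len(re.findall(pattern, lowered))
        if hits:
            counts[phrase] = hits
    return counts


sentence = (
    "Furthermore, it is important to note that keyword research is essential."
)
print(flagged_phrase_counts(sentence))
```

Dividing each count by the document's word count turns it into a rate, which matters because a single "furthermore" in 3,000 words is unremarkable while three in 300 words is not.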

Grammar pattern signals check for elevated passive voice rates, frequent em dash usage as a substitute for varied sentence structure, and unnaturally uniform paragraph length. Shopify engineering documented in 2025 that their internal content review identified em dash density above 2 per 1,000 words as a consistent marker of AI-generated content across their publisher network.

How Does AI Sentinel Check These Signals?

AI Sentinel runs four tiers of analysis. Tier 1 checks the hapax legomena ratio, the percentage of words that appear exactly once in your text, which is a strong predictor of whether writing is genuinely varied or statistically averaged. Human writing typically has a hapax ratio between 55 and 75 percent. AI-generated text typically falls between 35 and 50 percent. Tier 1 signals carry the most weight in the overall score, with 2025 research by Kovalevskii finding 98.5% classification accuracy from this signal alone.
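To make the hapax ratio concrete, here is a minimal sketch of how it might be computed. The phrase "percentage of words that appear exactly once" can be read two ways, as a share of all words or as a share of distinct words, and which one AI Sentinel or the cited research uses is not stated here. The hypothetical function `hapax_ratios` therefore returns both so the difference is visible.

```python
import re
from collections import Counter


def hapax_ratios(text: str) -> tuple[float, float]:
    """Return (hapaxes / total words, hapaxes / distinct words).

    A hapax is a word that occurs exactly once in the text.
    Which denominator a given tool uses is an open assumption.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    if not tokens:
        return 0.0, 0.0
    counts = Counter(tokens)
    hapaxes = sum(1 for n in counts.values() if n == 1)
    return hapaxes / len(tokens), hapaxes / len(counts)


# "the" appears twice; "cat", "sat", "on", "mat" appear once each.
per_token, per_type = hapax_ratios("The cat sat on the mat")
print(per_token, per_type)
```

Both figures shift with text length, so a ratio measured on a 200-word excerpt should not be compared directly with one measured on a 2,000-word article.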

Tier 2 checks statistical regularity including sentence length coefficient of variation, which measures how much your sentence lengths vary. Tier 3 checks vocabulary against over 200 specific phrases associated with AI output and provides replacement suggestions for each. Tier 4 checks grammar patterns including em dash density and passive voice rate.

The tool also offers sitemap scanning, which checks all pages on a domain simultaneously. This is particularly useful for agencies reviewing client sites or content managers auditing large content libraries for cross-URL vocabulary consistency, which was the pattern identified in the Shopify deindexing incident.
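For readers who want to run their own checks across a site, the first step of any sitemap-based audit is getting the list of page URLs. How AI Sentinel does this internally is not described, so the following is only a sketch of that first step using Python's standard library. The function name `urls_from_sitemap` and the example URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol for <urlset> documents.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_from_sitemap(xml_text: str) -> list[str]:
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Placeholder sitemap content for illustration.
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/blog/post-1</loc></url>
</urlset>"""

for url in urls_from_sitemap(sample_sitemap):
    print(url)
```

Each URL's text could then be fetched and scored page by page. Sitemap index files, which list other sitemaps rather than pages, would need an extra pass that this sketch does not cover.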

What Practical Changes Improve Scores Without Damaging Content Quality?

Replace filler transition phrases with specific connective logic. Instead of “Furthermore, it is important to note that keyword research is essential”, write “Keyword research determines whether content can rank before you spend a day writing it.” The second version carries the same meaning, removes two flagged phrases, and is more useful to the reader.

Vary sentence length deliberately. After writing a long complex sentence, follow it with a short one. This is good writing practice independent of AI detection. It improves readability, creates natural emphasis, and happens to produce the kind of statistical variation that human text exhibits.

Use specific, concrete language instead of qualified generalities. Instead of “This approach can potentially improve various aspects of your content performance”, write “This approach typically increases click-through rate from featured snippets by narrowing the answer to a single extractable sentence.” Specific language requires domain knowledge, which is something AI models average out of their outputs.

When AI Sentinel flags a phrase, the replacement suggestions in the Vocabulary tab are worth reviewing carefully. They are not rewrites. They are alternative framings of the same idea that avoid the pattern being flagged. Using them alongside your own judgment produces text that reads naturally and scores cleanly.

Does Passing AI Detection Mean Your Content Is High Quality?

No, and this is an important distinction. AI detection tools identify statistical patterns associated with AI generation. They do not evaluate accuracy, helpfulness, depth of expertise, or originality of thought. It is entirely possible to write content that passes every AI detection check and is still low-value, thin, and unhelpful.

The correct relationship between AI detection and content quality is that the practices which produce genuinely useful content, including specific language drawn from direct experience, varied and purposeful sentence structure, concrete examples, and precise claims, also happen to produce content that scores as human. Detection scores are a proxy for naturalness. Naturalness is a proxy for quality. But the causal chain runs from quality to naturalness to detection scores, not the other way around.

Use AI Sentinel as a diagnostic tool to identify the specific patterns in your writing or your team’s writing that have drifted toward AI-like averages, then improve those patterns with better writing practices. The goal is content that is genuinely more useful and more authoritative, with a clean detection score as a byproduct.

📌 Key Facts

Kovalevskii 2025 research: hapax legomena ratio has 98.5% classification accuracy for AI vs human text
Human hapax ratio: 55-75%. AI hapax ratio: 35-50%
Shopify 2025: em dash density above 2 per 1,000 words flagged as consistent AI content marker

Written by

DotTheta

DotTheta publishes practical guides about SEO, AEO, GEO, AI search visibility, keyword research, content optimization, and website growth for creators, marketers, agencies, and fast-moving builders.