Automated Content Gap Analysis: Python, Semantic Clustering, and GPT to Find Keyword Opportunities at Scale

If your content gap analysis still ends with a sorted-by-volume spreadsheet, you’re doing the bottleneck wrong. The export isn’t the problem. The problem is what happens after it: a human being manually grouping 1,500 keyword variants into topics, guessing which clusters are worth pursuing, and handing a content team a list that’s already three weeks old.

There’s a better way, and it doesn’t require a data science team. A four-stage Python pipeline — keyword data via API, set-difference gap identification, BERT-based semantic clustering, and GPT strategy output — converts that multi-day triage into something that runs while you make coffee. The output isn’t a keyword list. It’s a formatted strategy report with semantically coherent topic clusters, ready to hand to a content team or drop directly into a client deliverable.

This is what automated content gap analysis actually looks like when it’s built to produce revenue-focused SEO output, not just a prettier spreadsheet.

Why Manual Content Gap Analysis Has a Structural Ceiling

The workflow most teams follow has a hard scale limit. Pull keyword gap data from Ahrefs or Semrush, sort by volume, pick the terms that look winnable, group them by hand. Against two competitors and 200 keywords, that’s manageable. Against four competitors and a domain ranking for 15,000 terms, the raw gap dataset runs to several thousand rows. At that point, human triage becomes the actual constraint — and human triage at scale is where subjective bias creeps in. High-volume terms get picked because they look impressive. Mid-volume clusters with genuine opportunity get overlooked because they’re buried in row 800.

Two other failure modes compound this. Standard keyword gap tools — Semrush’s Keyword Gap, Ahrefs’ Content Gap — return keyword-level data without semantic grouping. A team looking at 2,000 gap keywords sees individual terms. They don’t see that “beginner fishing equipment,” “starter fishing pole recommendations,” and “best first fishing rod” are the same content opportunity, expressed three different ways by three different searchers. Grouping those manually takes hours. A sentence transformer model does it in under 90 seconds.

The second failure mode is how most teams handle prioritization. Raw keyword volume is a lagging signal. High-volume terms are saturated by definition; that’s why they’re high volume. The genuinely winnable opportunities tend to cluster in mid-volume ranges where the target domain has zero footprint but competitors rank consistently in positions 1–20. Filtering by position range before running gap analysis — keeping only competitor keywords where the target ranks outside the top 20 or doesn’t appear at all — produces a working dataset rather than a vanity list.

A third: a 800-keyword gap list is not a content plan. Grouped into 15 semantic clusters with plain-language descriptions and prioritized by commercial relevance, it is one.

The Four-Stage Pipeline

The architecture uses Semrush’s API for keyword data, BERT sentence transformers for semantic embeddings, K-means with automatic elbow detection for clustering, and GPT for labeling and strategy generation. Each stage has a specific decision point that determines output quality downstream.

Stage 1: Pull and Filter the Keyword Data

The pipeline opens two API calls to Semrush: the target domain’s current organic footprint and the organic footprints of up to four competitor domains. Position filtering is applied at this stage, not later.

The configuration that matters: competitors are queried for keywords in positions 1–20. The target domain is queried for keywords in positions 21–100 or unranked. This intersection — competitor presence, target domain absence — produces the gap dataset. Pulling without position filtering floods the output with terms the target domain already ranks competitively for, which contaminates everything downstream.

A keyword difficulty cap of 70 (on Semrush’s 0–100 scale) is applied by default. Gap keywords above this threshold get dropped. The reasoning is simple: if the domain doesn’t have the authority to realistically reach page one within 6–12 months, including those terms inflates the opportunity count without adding actionable output. The difficulty threshold is configurable — a domain with DR 75 has different realistic targets than a domain with DR 35.

Stage 2: Identify True Content Gaps

Gap identification runs on set difference: keywords ranked by at least one competitor that the target domain doesn’t rank for within the configured position threshold. Both conditions must be true for a keyword to appear in the gap set.

This is also where deduplication and minimum volume filtering can be applied, depending on how clean the Semrush export is. A mid-sized domain competing in a single niche against four rivals typically generates 500–3,000 gap keywords after filtering. That’s the dataset passed to the clustering stage.

Stage 3: Semantic Clustering with BERT and K-Means

This is where the pipeline diverges from what any off-the-shelf SEO tool currently produces.

Instead of grouping keywords by shared root words or n-gram overlap — the method used by most keyword clustering tools — a BERT-based sentence transformer generates a semantic embedding for each gap keyword. Specifically, the all-MiniLM-L6-v2 model from HuggingFace produces a 384-dimensional vector representing each keyword’s meaning in context. “Best beginner fishing rod” and “starter fishing pole recommendations” share no words. Their embedding vectors sit close to each other in 384-dimensional space because their meaning is similar. That proximity is what K-means groups on.

The non-trivial decision is cluster count. Hard-coding a fixed number produces clusters that are either too broad or too granular. Automatic elbow-point detection via the kneed library solves this: the algorithm plots inertia against cluster count and identifies the inflection point where adding more clusters returns diminishing reductions in within-cluster variance. The pipeline determines the correct number of topic clusters from the data itself.

The alignment with how Google evaluates content relevance is not coincidental. BERT-based embeddings are the same underlying mechanism behind Google’s semantic search infrastructure. Clusters built on embedding similarity map more accurately to Google’s topical groupings than clusters built on lexical overlap. This matters for topical authority: a content plan built on embedding clusters produces articles that Google already treats as related, which accelerates the compounding organic equity that comes from comprehensive topical coverage.

Stage 4: GPT Labels and Strategy Output

Two separate LLM calls handle the final stage.

GPT-3.5-turbo processes each cluster’s keyword list and generates a plain-language description of the topic theme. This is a labeling task: turning a list of 40 semantically similar keywords into a readable cluster name like “Entry-level fishing gear for beginners” or “Saltwater trolling rod comparisons.” These labels make the cluster data usable by anyone on the team without requiring them to read the underlying keyword list.

GPT-4 then receives the complete labeled cluster set and generates the full SEO strategy report. The output includes roughly 25 content recommendations, each tied to a specific cluster, with rationale covering why that cluster represents a rankable opportunity given the competitive gap data. The report format is structured for direct delivery to a content team or client — no reformatting, no additional commentary layer required.

What the Pipeline Produces

Three deliverables come out of a completed run.

The first is a CSV containing all gap keywords, their cluster IDs, and cluster labels. This is the reference file for brief-level content planning — the SEO team’s working document.

The second is the cluster summary: a structured list of topic themes, each with keyword count, representative terms, and the AI-generated description. A mid-sized domain running against four competitors at keyword difficulty ≤ 60 typically produces 12–25 clusters. A fishing gear domain at those settings might surface clusters covering fly fishing for beginners, saltwater trolling comparisons, fishing subscription boxes, and a dozen other topic areas that manual triage would take days to identify and organize.

The third is the strategy report. GPT-4 synthesizes the cluster data into a consultant-style document with prioritized content opportunities, estimated traffic potential by cluster, and specific article recommendations. This is the deliverable that gets handed to the client or the content director.

When to Build a Custom Pipeline vs. Using Off-the-Shelf Tools

Off-the-shelf gap tools are the right choice when the analysis covers a single domain against one or two competitors, the gap keyword dataset is small enough for manual review, and the output doesn’t need to match a specific client reporting format.

A custom Python pipeline is worth the setup investment when any of these apply: the analysis needs to run across multiple client domains in a batch, the gap dataset regularly exceeds 1,000 terms and manual grouping is the bottleneck, the team needs cluster-level output rather than keyword-level lists, or the strategy report format needs to match a client template without post-processing.

The architecture is modular. Swapping Semrush for the Ahrefs API requires changing the data collection function — the clustering and strategy layers don’t care about the data source. Adjusting the difficulty cap or position filters requires changing two configuration variables. The pipeline runs in under 10 minutes end to end on a standard laptop.

Frequently Asked Questions

Q: How many competitors should I include? Two to four competitor domains produces a gap dataset that’s both thorough and manageable. Including more than four risks flooding the gap set with high-difficulty terms from dominant sites that aren’t realistic targets for the subject domain. The pipeline’s difficulty filter handles some of this, but cleaner competitor selection at the input stage produces more useful output.

Q: How often should the pipeline be re-run? Quarterly covers most use cases. Competitors publish new content, search demand shifts, and gap datasets that were current six months ago will have drifted. Monthly re-runs are justified for fast-moving niches or clients with active competitor monitoring programs. The automation removes the time cost, making quarterly practical where manual analysis typically ran once a year.

Q: What’s the actual difference between BERT clustering and standard keyword grouping? Standard keyword grouping uses string matching — shared words, shared root terms, n-gram overlap. “Best fishing rod for beginners” and “starter fishing pole recommendations” share nothing, so string matching puts them in different groups. BERT generates a vector representing each keyword’s meaning; K-means then groups vectors that are geometrically close. Those two keywords land in the same cluster because their meaning is close, not their text. This produces clusters that map to search intent rather than surface-level keyword patterns.

Q: Can the pipeline handle non-English keywords? HuggingFace offers multilingual sentence transformer variants — paraphrase-multilingual-MiniLM-L12-v2 covers 50+ languages and requires one configuration change to swap in. The Semrush API supports keyword data across 140+ countries and regional databases. The full pipeline adapts to non-English SEO programs without architectural changes.

Q: Does this pipeline replace Ahrefs or Semrush? No — it uses Semrush as the data source. Ahrefs and Semrush remain the authoritative sources for organic ranking data. What the pipeline adds is a semantic analysis and AI strategy layer on top of that data: cluster-level output and a formatted strategy report that neither platform generates natively.

Run It

The open-source notebook implementing this pipeline is on GitHub. Run it against your domain and competitor set, adjust the difficulty and position filters for your competitive landscape, and treat the cluster output as the foundation for your next content quarter.

If you’d rather have someone else build and run it — or build a content strategy on top of the output — that’s what we do.

About the author

SEO Strategist with 16 years of experience