Manually grouping 1,500 keywords by search intent takes an experienced SEO specialist up to three days. A Python pipeline using sentence-transformers, scikit-learn’s agglomerative clustering, and GPT completes the same task in under 30 minutes — and the cluster descriptions it outputs are ready to paste directly into a content brief.
This guide documents a production-grade keyword clustering workflow built on entirely free embedding models from HuggingFace, removing the cost barrier that makes intent-based clustering inaccessible at scale. The pipeline covers every step from a raw Ahrefs CSV export to a labeled cluster map: GPT-predicted intent per keyword, all-MiniLM-L6-v2 semantic embeddings, 200-cluster agglomerative grouping, and GPT-generated cluster summaries that describe what each group of searchers actually wants.
Keyword clustering is now a prerequisite for competitive content strategy. Google does not rank isolated pages — Google rewards sites that demonstrate topical authority across semantically related queries. A single well-clustered content hub can rank for dozens of related search terms where a siloed page targets only one. Sites that shift from individual keyword targeting to tightly grouped topic clusters see 25% faster ranking improvements on average, according to aggregated performance data from AI-powered clustering platforms.
The workflow documented here scales to any keyword dataset exported from Ahrefs, Semrush, or Search Console, and runs entirely in Google Colab using a free runtime. The full notebook is open-source at the bro.ee Marketing Automations repository on GitHub.
Why Manual Keyword Clustering Doesn’t Scale
Grouping 1,000 keywords manually takes an SEO specialist up to three days and produces inconsistent categories, because human judgment on search intent drifts across sessions. For datasets of 5,000+ keywords, typical for e-commerce sites, SaaS platforms, and content-heavy publishers, manual keyword clustering is not operationally viable.
Spreadsheet-based grouping by head term matches surface-level word similarity rather than semantic intent. Keywords like “content audit template” and “how to audit your blog posts” share only one word, “audit”, yet they express the same search intent and should target the same page. Head-term matching that keys on “content” or “blog” misses this connection.
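To make the word-overlap problem concrete, here is a minimal sketch that measures how many distinct words the two example keywords share. The `jaccard_overlap` helper is a hypothetical name used for illustration, not part of the notebook.

```python
def jaccard_overlap(a: str, b: str) -> float:
    """Fraction of distinct words that two keywords have in common."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


kw_a = "content audit template"
kw_b = "how to audit your blog posts"
overlap = jaccard_overlap(kw_a, kw_b)
print(f"{overlap:.3f}")  # 0.125: only "audit" is shared out of 8 distinct words
```

A score this low would keep the two keywords in separate spreadsheet groups, even though they describe the same task.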
Machine learning-based keyword clustering solves both the scale and the accuracy problem. Automated clustering using semantic embeddings groups keywords by the meaning of their search intent — not by surface-level word overlap — producing groups that map directly to content architecture decisions: one page per cluster, one primary keyword per page.
The Search Engine Journal guide to automating search intent clustering confirms that combining intent prediction with semantic embeddings outperforms SERP-overlap-only approaches for datasets where many keywords share the same ranking URL.
The Pipeline Architecture: HuggingFace + Agglomerative Clustering + GPT
The open-source notebook processes a real Ahrefs dataset of 1,584 keywords in five sequential stages. Each stage adds a layer of semantic understanding that the next stage builds on.
Step 1 — Export Keyword Data from Ahrefs
The pipeline takes a standard Ahrefs CSV export as input, containing each keyword alongside its ranking URL, search volume, keyword difficulty, and active SERP features. No custom preprocessing is required beyond the raw export — the notebook ingests the file with pandas and validates the schema before any API calls are made, avoiding wasted GPT tokens on malformed rows.
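The ingest-and-validate step might look like the sketch below. The column names in `REQUIRED_COLUMNS` are assumptions for illustration; check them against the headers in your own export before running.

```python
import io

import pandas as pd

# Hypothetical column names; adjust to match the headers in your export.
REQUIRED_COLUMNS = ["Keyword", "URL", "Volume", "Difficulty"]


def load_keywords(source) -> pd.DataFrame:
    """Read the export and fail fast if the schema is not what we expect."""
    df = pd.read_csv(source)
    missing = [c for c in REQUIRED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"Export is missing columns: {missing}")
    # Drop rows without a keyword so no API call is spent on them.
    df = df.dropna(subset=["Keyword"])
    df = df.assign(Keyword=df["Keyword"].astype(str).str.strip())
    df = df[df["Keyword"] != ""]
    return df.drop_duplicates(subset=["Keyword", "URL"]).reset_index(drop=True)


# In-memory sample standing in for a real export file.
sample = io.StringIO(
    "Keyword,URL,Volume,Difficulty\n"
    "tangential content,https://example.com/a,90,12\n"
    ",https://example.com/b,10,5\n"
    "tangential content,https://example.com/a,90,12\n"
)
keywords = load_keywords(sample)
print(len(keywords))  # 1: the empty row and the duplicate row are dropped
```

Failing on a missing column before any GPT call is what keeps malformed exports from costing tokens.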
Step 2 — Use GPT to Predict Search Intent Per Keyword
For each keyword-URL pair, the pipeline sends a structured prompt to the OpenAI API asking GPT-3 to infer the specific intent behind the search — not the broad category (informational/transactional/navigational), but the precise job the searcher is trying to accomplish. Given the keyword “tangential content” and its ranking URL, GPT-3 returns a statement such as: “find information about how to use tangential content in content marketing strategy.” This granular intent description is what makes the embedding step semantically meaningful — two keywords expressing the same searcher goal will produce similar intent descriptions, and similar intent descriptions will embed near each other in vector space.
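A rough sketch of the prompt construction follows. The prompt wording is an assumption, not the notebook's actual prompt. The `complete` argument is a hypothetical stand-in for whichever OpenAI client call you use, so the example runs offline with a stub.

```python
from typing import Callable

# Illustrative prompt wording; the notebook's own prompt may differ.
PROMPT_TEMPLATE = (
    "Keyword: {keyword}\n"
    "Ranking URL: {url}\n"
    "In one sentence, describe the specific task the searcher wants to "
    "accomplish. Do not answer with a broad category such as "
    "informational or transactional.\n"
    "Intent:"
)


def build_intent_prompt(keyword: str, url: str) -> str:
    return PROMPT_TEMPLATE.format(keyword=keyword.strip(), url=url.strip())


def predict_intent(keyword: str, url: str, complete: Callable[[str], str]) -> str:
    """`complete` is any function that sends a prompt to an LLM and returns its text."""
    return complete(build_intent_prompt(keyword, url)).strip()


# Stub standing in for the real API call so the sketch runs without a key.
def fake_complete(prompt: str) -> str:
    return " find information about how to use tangential content in content marketing strategy "


intent = predict_intent(
    "tangential content", "https://example.com/tangential-content", fake_complete
)
print(intent)
```

Keeping the prompt builder separate from the API call makes it easy to swap the model later without touching the rest of the pipeline.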
The notebook then tokenizes each GPT response using GPT2TokenizerFast to measure response length and filter outliers before the embedding step.
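The length filter can be sketched with a pluggable token counter. The whitespace counter and the `max_tokens` cutoff of 60 are assumptions for illustration. With the transformers library installed, you could instead load `GPT2TokenizerFast.from_pretrained("gpt2")` and pass `lambda t: len(tokenizer.encode(t))` as the counter.

```python
from typing import Callable, List, Sequence, Tuple


def whitespace_count(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


def filter_by_token_length(
    texts: Sequence[str],
    count_tokens: Callable[[str], int] = whitespace_count,
    max_tokens: int = 60,  # assumed cutoff; tune it on your own responses
) -> Tuple[List[int], List[int]]:
    """Return (kept, dropped) row indices; empty and overlong responses are dropped."""
    kept, dropped = [], []
    for i, text in enumerate(texts):
        n = count_tokens(text)
        (kept if 0 < n <= max_tokens else dropped).append(i)
    return kept, dropped


responses = ["find a content audit template", "", "word " * 200]
kept, dropped = filter_by_token_length(responses)
print(kept, dropped)  # [0] [1, 2]
```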
Step 3 — Generate Semantic Embeddings with all-MiniLM-L6-v2
The GPT-predicted intent descriptions are passed to the sentence-transformers library, which generates dense 384-dimensional vector embeddings using the all-MiniLM-L6-v2 model from HuggingFace. This 22-million-parameter model is optimized for semantic similarity tasks: two intent descriptions that express the same underlying search goal produce a high cosine similarity score regardless of whether they share any surface-level vocabulary.
all-MiniLM-L6-v2 runs locally inside Google Colab at zero per-token cost — no API key, no rate limits, and no incremental cost per keyword embedded.
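A minimal sketch of the embedding step, assuming the sentence-transformers package is installed. The function names are illustrative. The similarity helper is demonstrated on toy 3-dimensional vectors so it runs without downloading the model.

```python
import numpy as np


def embed_intents(texts, model_name: str = "all-MiniLM-L6-v2") -> np.ndarray:
    """Encode intent descriptions into unit-length vectors (downloads the model on first use)."""
    # Imported here so the similarity helper below works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    return model.encode(texts, normalize_embeddings=True, show_progress_bar=False)


def cosine_similarity_matrix(vectors: np.ndarray) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `vectors`."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / np.clip(norms, 1e-12, None)
    return unit @ unit.T


# Toy 3-d vectors standing in for real 384-d embeddings.
toy = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
sims = cosine_similarity_matrix(toy)
print(sims.round(2))
```

In the toy matrix, the first two rows point in the same direction and score 1.0 against each other, while the third is orthogonal to both and scores 0.0.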
Step 4 — Apply Agglomerative Clustering
Scikit-learn’s AgglomerativeClustering algorithm groups the 1,584 intent embeddings into 200 clusters by hierarchically merging the most similar vectors from the bottom up, without requiring any predefined cluster shapes. Each keyword in the original CSV receives a cluster ID. Keywords that share a cluster ID address the same underlying search intent and are candidates for coverage on a single page.
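The clustering call can be sketched as follows. The article does not state which linkage or distance metric the notebook uses, so the choice here is an assumption: scikit-learn's default Ward linkage on L2-normalized vectors, where Euclidean distance tracks cosine distance. The data is synthetic.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize


def cluster_embeddings(embeddings: np.ndarray, n_clusters: int) -> np.ndarray:
    """Assign a cluster ID to every row of `embeddings`."""
    unit = normalize(embeddings)  # unit length, so Euclidean distance follows cosine distance
    model = AgglomerativeClustering(n_clusters=n_clusters)  # default Ward linkage
    return model.fit_predict(unit)


rng = np.random.default_rng(0)
# Synthetic stand-in: three tight groups of 5 vectors around orthogonal directions.
centers = np.eye(3, 8)
toy = np.vstack([c + 0.01 * rng.normal(size=(5, 8)) for c in centers])
labels = cluster_embeddings(toy, n_clusters=3)
print(labels)
```

For the dataset described here, the call would use `n_clusters=200`, and the returned labels would be written back to the keyword table as the cluster ID column.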
The 200-cluster count was calibrated for a dataset of this size. For datasets under 500 keywords, 50–100 clusters typically produces groups that remain focused enough for a content brief. For datasets over 5,000 keywords, 300–500 clusters prevents distinct intents from collapsing into over-broad groups.
Step 5 — Generate Cluster Descriptions with GPT
For each of the 200 clusters, the pipeline passes the full list of intent descriptions in that cluster back to GPT-3, which returns a single human-readable sentence summarizing the shared theme. The final output is an enhanced CSV where every row includes the original keyword, its search volume and difficulty, its cluster ID, and the natural-language cluster description — a file ready to import directly into any content planning tool.
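One way to sketch the summary step is to group the intent descriptions by cluster ID, build one prompt per cluster, and map the returned sentence back onto every row. The column names, the prompt wording, and the `complete` callable are assumptions for illustration. A stub replaces the real GPT call.

```python
from typing import Callable

import pandas as pd


def describe_clusters(
    df: pd.DataFrame,
    complete: Callable[[str], str],
    intent_col: str = "intent",
    cluster_col: str = "cluster_id",
) -> pd.DataFrame:
    """Add a `cluster_description` column with one summary sentence per cluster."""
    descriptions = {}
    for cluster_id, group in df.groupby(cluster_col):
        intents = "\n".join(f"- {text}" for text in group[intent_col])
        prompt = (
            "These search intents were grouped together:\n"
            f"{intents}\n"
            "Summarize their shared theme in one sentence."
        )
        descriptions[cluster_id] = complete(prompt).strip()
    return df.assign(cluster_description=df[cluster_col].map(descriptions))


# Stub standing in for the real API call.
def fake_complete(prompt: str) -> str:
    n = sum(line.startswith("- ") for line in prompt.splitlines())
    return f"Shared theme across {n} intents."


rows = pd.DataFrame(
    {
        "keyword": ["content audit template", "audit blog posts", "keyword tools"],
        "intent": [
            "find a content audit template",
            "learn how to audit blog posts",
            "compare keyword research tools",
        ],
        "cluster_id": [0, 0, 1],
    }
)
labeled = describe_clusters(rows, fake_complete)
print(labeled[["keyword", "cluster_id", "cluster_description"]])
```

Calling `labeled.to_csv(...)` would then produce the enhanced CSV described above.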
Why Agglomerative Clustering Outperforms K-Means for Keyword Data
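K-means clustering requires the number of clusters to be specified in advance and implicitly favors compact, roughly spherical clusters of similar size. That assumption breaks down for keyword intent data, which forms irregular, variable-density groups in high-dimensional embedding space. This pipeline also fixes its cluster count at 200, so the practical difference lies in the shape assumption rather than in the count.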
Agglomerative clustering makes no shape assumptions. The algorithm builds a cluster hierarchy from the data itself, allowing intent groups with tight conceptual scope (3–5 keywords) and intent groups with broader scope (30+ keywords) to both receive appropriate cluster boundaries rather than being forced into uniform sizes.
DBSCAN, which automatically identifies noise points, performs well for datasets containing many branded or navigational queries with no content opportunity, but requires manual tuning of the epsilon and min_samples parameters — adding calibration complexity that agglomerative clustering avoids.
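For comparison, a small sketch of DBSCAN on synthetic vectors shows both the noise label and the two parameters that need tuning. The `eps` and `min_samples` values are arbitrary choices that suit this toy data, not recommendations for real keyword embeddings.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

rng = np.random.default_rng(0)
# Two tight synthetic groups plus one isolated point that belongs to neither.
group_a = np.array([1.0, 0.0, 0.0]) + 0.01 * rng.normal(size=(5, 3))
group_b = np.array([0.0, 1.0, 0.0]) + 0.01 * rng.normal(size=(5, 3))
outlier = np.array([[0.0, 0.0, 1.0]])
toy = normalize(np.vstack([group_a, group_b, outlier]))

# eps: neighborhood radius; min_samples: points needed to form a dense region.
labels = DBSCAN(eps=0.2, min_samples=3).fit_predict(toy)
print(labels)  # the isolated point gets the noise label -1
```

On real embeddings, a slightly different `eps` can move many keywords into or out of the noise bucket, which is the calibration burden described above.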
Agglomerative clustering over sentence embeddings, compared by cosine similarity, offers the best balance of accuracy and operational simplicity for SEO keyword datasets in the 500–10,000 keyword range.
Why Free HuggingFace Embeddings Are Sufficient for Keyword Clustering
OpenAI’s text-embedding-ada-002 model costs $0.0001 per 1,000 tokens. Embedding 1,584 intent descriptions averaging 20 tokens each comes to about 31,680 tokens, or approximately $0.003 for a single run, which is negligible in isolation. For workflows that re-embed updated keyword exports monthly, or that process 50,000+ keywords across multiple client sites, the cost grows linearly with volume and run frequency and becomes a recurring, metered expense to track.
all-MiniLM-L6-v2 eliminates this cost entirely. The Massive Text Embedding Benchmark (MTEB) placed all-MiniLM-L6-v2 within 5–7% of text-embedding-ada-002 on semantic similarity task performance. For keyword clustering — where clusters are reviewed by an SEO practitioner before any content decision is made — a 5–7% similarity gap is operationally irrelevant.
The free embedding model also eliminates rate limit interruptions, which matter when processing datasets over 10,000 rows in batches. The OpenAI embedding API applies per-minute rate limits that can stall a clustering job mid-run at scale.
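The per-run cost figure quoted in this section can be checked with a few lines of arithmetic. The function name is illustrative. The price and the 20-token average are the values stated in the text.

```python
PRICE_PER_1K_TOKENS = 0.0001  # USD, the text-embedding-ada-002 rate quoted in this section


def embedding_cost(
    n_texts: int, avg_tokens: float, price_per_1k: float = PRICE_PER_1K_TOKENS
) -> float:
    """Estimated API cost in USD for embedding `n_texts` descriptions once."""
    return n_texts * avg_tokens / 1000 * price_per_1k


single_run = embedding_cost(1_584, 20)
print(f"${single_run:.4f}")  # $0.0032
```

The same function can be called with a larger keyword count, or multiplied by the number of runs per year, to estimate recurring spend for a given workload.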
Turning Cluster Output into a Content Architecture
The CSV output from this pipeline maps directly onto a content architecture decision framework. Each cluster represents a distinct search intent, which determines one of three actions:
Create a new page: The cluster covers real search demand (combined keyword search volume above 100 monthly searches) with no existing page addressing it. Each cluster description becomes the brief title for the new page.
Consolidate existing pages: Multiple existing pages target keywords from the same cluster, creating keyword cannibalization. The weaker pages should be merged into or redirected to the primary page targeting that cluster.
Update an existing page: A current page ranks for some keywords in the cluster but misses others. Expanding the page’s content coverage to address the full cluster intent strengthens topical authority for all keywords in the group.
A 200-cluster output for 1,584 keywords typically surfaces 30–60 net-new content opportunities, 15–25 cannibalization cases to resolve, and 40–80 update candidates — a complete content roadmap generated in a single pipeline run.
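The three-way decision above can be sketched as a small rule applied per cluster. The column names, the use of an empty `url` to mean "no page of ours ranks for this keyword", and the extra "skip" and "keep" labels are assumptions added for illustration.

```python
import pandas as pd

MIN_CLUSTER_VOLUME = 100  # combined monthly searches, the threshold stated above


def recommend_action(group: pd.DataFrame) -> str:
    """Pick an action for one cluster from its keywords' volumes and ranking URLs."""
    urls = group["url"].dropna().unique()
    if len(urls) == 0:
        # No existing page: worth a new page only if there is real demand.
        return "create" if group["volume"].sum() > MIN_CLUSTER_VOLUME else "skip"
    if len(urls) > 1:
        # Several pages compete for the same intent.
        return "consolidate"
    # One page exists; update it if some keywords in the cluster are not covered.
    return "update" if group["url"].isna().any() else "keep"


df = pd.DataFrame(
    {
        "cluster_id": [0, 0, 1, 1, 2, 2, 3, 3],
        "volume": [90, 40, 200, 50, 70, 30, 10, 20],
        "url": [None, None, "/a", "/b", "/c", None, "/d", "/d"],
    }
)
actions = {int(cid): recommend_action(group) for cid, group in df.groupby("cluster_id")}
print(actions)  # {0: 'create', 1: 'consolidate', 2: 'update', 3: 'keep'}
```

The output is a starting point for review rather than a final decision, since the rule only looks at volume and URL coverage.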
Keyword clustering executed this way builds the crawl-efficient information architecture that topical authority requires: every page resolves to a clear, singular intent; internal links connect semantically related clusters; and Google’s crawlers encounter a site where no two URLs compete for the same searcher. Google has evaluated content at the passage level since the Passage Ranking update in 2021, so well-scoped cluster pages are more likely to earn passage-level visibility for long-tail queries within their cluster.
Frequently Asked Questions
What is keyword clustering in SEO? Keyword clustering in SEO is the process of grouping keywords that share the same search intent so they can be targeted on a single page. Google evaluates topical coverage rather than individual keyword matches, so keyword clustering ensures each page addresses a complete intent rather than a single isolated query — allowing one page to rank for dozens of semantically related search terms simultaneously.
How many clusters should I create for my keyword dataset? For datasets of 500–2,000 keywords, 100–200 clusters typically produces groups of 5–15 keywords with coherent intent — focused enough to inform a content brief, broad enough to aggregate meaningful search volume. For datasets over 5,000 keywords, use 300–500 clusters to preserve granularity. Treat the cluster count as a tunable parameter: review a random sample of 20 clusters after the first run and increase the count if clusters feel too broad.
Can I run this pipeline without an OpenAI API key? The HuggingFace embedding step requires no API key and runs entirely locally in Google Colab. The GPT steps — intent prediction per keyword and cluster description generation — require an OpenAI API key. Both GPT steps can be replaced with open-source models (Mistral 7B, Llama 3) served through the HuggingFace Inference API, eliminating the OpenAI dependency entirely at some cost to description quality.
What is the difference between keyword clustering and topical clustering? Keyword clustering groups individual keywords by shared search intent to determine which keywords should target the same page. Topical clustering (topic clustering) groups pages into pillar-and-spoke architectures where one pillar page covers a broad subject and multiple supporting pages cover subtopics, each linking back to the pillar. Keyword clustering informs which keywords belong on which page; topical clustering informs how pages relate to each other structurally. Both are required for a complete content architecture.
Is agglomerative clustering better than K-means for SEO keyword data? Agglomerative clustering outperforms K-means for SEO keyword intent data because agglomerative clustering makes no assumptions about cluster shape, does not require a predefined number of clusters when using a distance threshold, and handles the irregular, variable-density structure of intent embedding space more accurately than K-means. K-means performs well for balanced datasets with spherical clusters — a distribution that keyword intent data rarely satisfies.
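The distance-threshold variant mentioned in the last answer can be sketched like this. The threshold value of 1.0 is an arbitrary choice that suits the synthetic data below. On real embeddings it has to be tuned, much as the cluster count does.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize


def cluster_by_threshold(embeddings: np.ndarray, distance_threshold: float):
    """Cut the hierarchy at a distance instead of at a fixed number of clusters."""
    unit = normalize(embeddings)
    model = AgglomerativeClustering(n_clusters=None, distance_threshold=distance_threshold)
    labels = model.fit_predict(unit)
    return labels, model.n_clusters_


rng = np.random.default_rng(0)
# Synthetic stand-in: three tight groups of 5 vectors around orthogonal directions.
toy = np.vstack([c + 0.01 * rng.normal(size=(5, 8)) for c in np.eye(3, 8)])
labels, n_found = cluster_by_threshold(toy, distance_threshold=1.0)
print(n_found)  # 3
```

Here the number of clusters is read from the fitted model instead of being passed in.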
Ready to Cluster Your Keyword Dataset?
The notebook behind this workflow is open-source and runs in Google Colab without any local setup. Fork it from the bro.ee Marketing Automations repository on GitHub, connect your Ahrefs CSV export and OpenAI API key, and run the cells in sequence. The output CSV is ready to import into any content planning or project management tool.
For the next step, map your cluster output to a pillar-and-spoke content architecture — assigning each cluster to a position in your topical hierarchy, setting internal linking rules between related clusters, and prioritizing build order based on combined search volume and current ranking gap.







