Manually grouping 1,500 keywords by search intent takes an experienced SEO specialist up to three days. A Python pipeline using sentence-transformers, scikit-learn’s agglomerative clustering, and GPT completes the same task in under 30 minutes — and the cluster descriptions it outputs are ready to paste directly into a content brief.
This guide documents a production-grade keyword clustering workflow built on free embedding models from HuggingFace, removing the cost barrier that makes intent-based clustering inaccessible at scale. The pipeline runs five sequential stages: GPT-predicted intent per keyword, all-MiniLM-L6-v2 semantic embeddings, agglomerative grouping into 200 clusters, and GPT-generated summaries that describe what each cluster of searchers actually wants. Input is a raw Ahrefs CSV export. Output is a labeled cluster map ready to drive content architecture decisions.
- Sale!

SEO Content Audit
Original price was: 1999,00 €.1799,00 €Current price is: 1799,00 €. Select options - Sale!

Search Rankings and Traffic Losses Audit
Original price was: 3500,00 €.2999,00 €Current price is: 2999,00 €. Select options - Sale!

Full-Scale Professional SEO Audit
Original price was: 5299,00 €.4999,00 €Current price is: 4999,00 €. Select options
The full notebook is open-source at the bro.ee Marketing Automations repository on GitHub.
Why Manual Keyword Clustering Doesn’t Scale
An SEO specialist grouping 1,000 keywords manually requires up to three days of work and produces inconsistent categories because human judgment on search intent drifts across sessions. For datasets of 5,000+ keywords — typical for e-commerce sites, SaaS platforms, or content-heavy publishers — manual keyword clustering isn’t operationally viable.
Spreadsheet-based grouping by head term matches surface-level word similarity rather than semantic intent. “Content audit template” and “how to audit your blog posts” share no overlapping words but express the same search intent and should target the same page. Head-term matching misses this connection every time.
Machine learning-based keyword clustering solves both the scale and accuracy problems simultaneously. Automated clustering using semantic embeddings groups keywords by the meaning of their search intent — not by surface-level word overlap — producing groups that map directly to content architecture decisions: one page per cluster, one primary keyword per page.
One caveat worth stating directly: embedding-based clustering groups keywords by linguistic similarity, not by what Google actually ranks. For established niches with dense SERP data, SERP-overlap clustering — where you check which URLs appear across multiple queries — will produce groups that more accurately reflect Google’s intent interpretation. The pipeline documented here runs without SERP API access, which makes it viable at scale and at zero recurring cost. But if you’re planning architecture decisions for a competitive niche with 10,000+ keywords, validate a sample of the clusters against real SERP overlap before committing to the full content roadmap.
The Pipeline Architecture: HuggingFace + Agglomerative Clustering + GPT
The open-source notebook processes a real Ahrefs dataset of 1,584 keywords in five sequential stages. Each stage adds a layer of semantic understanding that the next stage builds on.
Step 1 — Export Keyword Data from Ahrefs
The pipeline takes a standard Ahrefs CSV export as input, containing each keyword alongside its ranking URL, search volume, keyword difficulty, and active SERP features. No custom preprocessing is required beyond the raw export — the notebook ingests the file with pandas and validates the schema before any API calls are made, avoiding wasted GPT tokens on malformed rows.
Step 2 — Use GPT to Predict Search Intent Per Keyword
For each keyword-URL pair, the pipeline sends a structured prompt to the OpenAI API asking GPT to infer the specific intent behind the search — not the broad category (informational/transactional/navigational), but the precise job the searcher is trying to accomplish. Given the keyword “tangential content” and its ranking URL, GPT returns a statement such as: “find information about how to use tangential content in content marketing strategy.”
This granular intent description is what makes the embedding step semantically meaningful. Two keywords expressing the same searcher goal will produce similar intent descriptions, and similar intent descriptions embed near each other in vector space. The pipeline then tokenizes each GPT response using GPT2TokenizerFast to measure response length and filter outliers before the embedding step.
Step 3 — Generate Semantic Embeddings with all-MiniLM-L6-v2
The GPT-predicted intent descriptions are passed to the sentence-transformers library, which generates dense 384-dimensional vector embeddings using the all-MiniLM-L6-v2 model from HuggingFace. This 22-million-parameter model is optimized for semantic similarity tasks: two intent descriptions that express the same underlying search goal produce a high cosine similarity score regardless of whether they share any surface-level vocabulary.
all-MiniLM-L6-v2 runs locally inside Google Colab at zero per-token cost — no API key, no rate limits, and no incremental cost per keyword embedded.
Step 4 — Apply Agglomerative Clustering
Scikit-learn’s AgglomerativeClustering algorithm groups the 1,584 intent embeddings into 200 clusters by hierarchically merging the most similar vectors from the bottom up, without requiring any predefined cluster shapes. Each keyword in the original CSV receives a cluster ID. Keywords sharing a cluster ID address the same underlying search intent and are candidates for coverage on a single page.
The 200-cluster count was calibrated for a dataset of this size. For datasets under 500 keywords, 50–100 clusters typically produces groups focused enough for a content brief. For datasets over 5,000 keywords, 300–500 clusters prevents distinct intents from collapsing into over-broad groups.
One performance note from the sentence-transformers documentation: agglomerative clustering becomes slow above a few thousand sentences. For datasets over 10,000 keywords, the fast_clustering implementation in the same library handles 50,000 sentences in under five seconds and is worth switching to at that scale.
Step 5 — Generate Cluster Descriptions with GPT
For each of the 200 clusters, the pipeline passes the full list of intent descriptions back to GPT, which returns a single human-readable sentence summarizing the shared theme. The final output is an enhanced CSV where every row includes the original keyword, its search volume and difficulty, its cluster ID, and the natural-language cluster description — a file ready to import directly into any content planning tool.
Why Agglomerative Clustering Outperforms K-Means for Keyword Data
K-means clustering requires specifying the number of clusters in advance and assumes spherical cluster shapes — an assumption that breaks down for keyword intent data, which forms irregular, variable-density groups in high-dimensional embedding space.
Agglomerative clustering makes no shape assumptions. The algorithm builds a cluster hierarchy from the data itself, allowing intent groups with tight conceptual scope (3–5 keywords) and intent groups with broader scope (30+ keywords) to both receive appropriate cluster boundaries rather than being forced into uniform sizes.
DBSCAN, which automatically identifies noise points, performs well for datasets containing many branded or navigational queries with no content opportunity, but requires manual tuning of the epsilon and min_samples parameters — adding calibration complexity that agglomerative clustering avoids. For SEO keyword datasets in the 500–10,000 keyword range, agglomerative clustering with cosine-similar sentence embeddings offers the best balance of accuracy and operational simplicity.
If your dataset skews toward very large (50,000+ keywords) or contains a high proportion of branded noise, HDBSCAN is worth evaluating. It handles variable-density clusters and noise points more gracefully than either K-means or standard agglomerative approaches, at the cost of longer setup time.
Why Free HuggingFace Embeddings Are Sufficient for Keyword Clustering
OpenAI’s text-embedding-ada-002 costs $0.0001 per 1,000 tokens. Embedding 1,584 intent descriptions averaging 20 tokens each costs approximately $0.003 for a single run — negligible in isolation. But for workflows that re-embed updated keyword exports monthly, or that process 50,000+ keywords across multiple client sites, embedding API costs compound into a real budget line.
all-MiniLM-L6-v2 eliminates this cost entirely. The Massive Text Embedding Benchmark (MTEB) placed all-MiniLM-L6-v2 within 5–7% of text-embedding-ada-002 on semantic similarity task performance. For keyword clustering — where clusters are reviewed by an SEO practitioner before any content decision is made — a 5–7% similarity gap is operationally irrelevant. A human review step catches the edge cases.
Showing 1–3 of 5 resultsSorted by popularity
- Sale!

White Label SEO Audit
Original price was: 5299,00 €.4999,00 €Current price is: 4999,00 €. Select options - Sale!

SEO Content Audit
Original price was: 1999,00 €.1799,00 €Current price is: 1799,00 €. Select options - Sale!

Search Rankings and Traffic Losses Audit
Original price was: 3500,00 €.2999,00 €Current price is: 2999,00 €. Select options
The free embedding model also eliminates rate limit interruptions, which matter when processing datasets over 10,000 rows in batches. The OpenAI embedding API applies per-minute rate limits that can stall a clustering job mid-run at scale.
If you want to eliminate the OpenAI dependency entirely — including the GPT intent prediction and cluster description steps — open-source alternatives like Mistral 7B or Llama 3, served through the HuggingFace Inference API, work as drop-in replacements at some cost to description quality. The embedding step has no OpenAI dependency regardless.
Turning Cluster Output into a Content Architecture
The CSV output from this pipeline maps directly onto three content architecture decisions. Each cluster represents a distinct search intent, which determines one of three actions.
Create a new page when the cluster covers real search demand (combined keyword search volume above 100 monthly searches) with no existing page addressing it. The cluster description becomes the brief title for the new page.
Consolidate existing pages when multiple existing pages target keywords from the same cluster, creating keyword cannibalization. The weaker pages should be merged into or redirected to the primary page targeting that cluster.
Update an existing page when a current page ranks for some keywords in the cluster but misses others. Expanding the page’s content coverage to address the full cluster intent strengthens topical authority for all keywords in the group.
A 200-cluster output for 1,584 keywords typically surfaces 30–60 net-new content opportunities, 15–25 cannibalization cases to resolve, and 40–80 update candidates — a complete content roadmap generated in a single pipeline run.
Keyword clustering executed this way produces an information architecture designed for crawl efficiency: every page resolves to a clear, singular intent; internal links connect semantically related clusters; and Google’s crawlers encounter a site where no two URLs compete for the same searcher. Google has evaluated content at the passage level since the Passage Ranking update in 2021, meaning well-scoped cluster pages are more likely to earn passage-level visibility for long-tail queries within their cluster.
The next step after building this map: assign each cluster to a position in your topical hierarchy, set internal linking rules between related clusters, and prioritize build order based on combined search volume against your current ranking gap. That sequencing is where the cluster map becomes an SEO campaign with a measurable content ROI model behind it.
Frequently Asked Questions
What is keyword clustering in SEO? Keyword clustering in SEO is the process of grouping keywords that share the same search intent so they can be targeted on a single page. Google evaluates topical coverage rather than individual keyword matches, so keyword clustering ensures each page addresses a complete intent rather than a single isolated query — allowing one page to rank for dozens of semantically related search terms simultaneously.
How many clusters should I create for my keyword dataset? For datasets of 500–2,000 keywords, 100–200 clusters typically produces groups of 5–15 keywords with coherent intent — focused enough to inform a content brief, broad enough to aggregate meaningful search volume. For datasets over 5,000 keywords, use 300–500 clusters to preserve granularity. Treat the cluster count as a tunable parameter: review a random sample of 20 clusters after the first run and increase the count if clusters feel too broad or merge them if they’re too narrow to support a standalone page.
Can I run this pipeline without an OpenAI API key? The HuggingFace embedding step requires no API key and runs entirely locally in Google Colab. The GPT steps — intent prediction per keyword and cluster description generation — require an OpenAI API key. Both GPT steps can be replaced with open-source models (Mistral 7B, Llama 3) served through the HuggingFace Inference API, eliminating the OpenAI dependency entirely at some cost to description quality.
What is the difference between keyword clustering and topical clustering? Keyword clustering groups individual keywords by shared search intent to determine which keywords should target the same page. Topical clustering (topic clustering) groups pages into pillar-and-spoke architectures where one pillar page covers a broad subject and multiple supporting pages cover subtopics, each linking back to the pillar. Keyword clustering informs which keywords belong on which page. Topical clustering informs how pages relate to each other structurally. Both are required for a complete content architecture.
Is embedding-based clustering as accurate as SERP-based clustering? No, and the gap matters in competitive niches. SERP-based clustering groups keywords by which URLs Google actually ranks for them together — a direct read of Google’s intent interpretation. Embedding-based clustering groups by linguistic similarity, which is a proxy for that signal. For new niches or large datasets where SERP API costs are prohibitive, embedding-based clustering is the practical choice. For high-competition niches where content architecture decisions carry significant revenue implications, validate the cluster output against SERP overlap data before acting on it.
Ready to Cluster Your Keyword Dataset?
The notebook behind this workflow is open-source and runs in Google Colab without any local setup. Fork it from the bro.ee Marketing Automations repository on GitHub, connect your Ahrefs CSV export and OpenAI API key, and run the cells in sequence. The output CSV is ready to import into any content planning or project management tool.
- Sale!

SEO Content Audit
Original price was: 1999,00 €.1799,00 €Current price is: 1799,00 €. Select options - Sale!

Search Rankings and Traffic Losses Audit
Original price was: 3500,00 €.2999,00 €Current price is: 2999,00 €. Select options - Sale!

Full-Scale Professional SEO Audit
Original price was: 5299,00 €.4999,00 €Current price is: 4999,00 €. Select options
If you want a senior SEO practitioner to translate the cluster map into a sequenced content campaign with revenue projections — not just a list of pages to build — get in touch with SEOBRO.Agency.







