Deleting 1,400 lines of keyword clustering code

SEOMAX needed keyword clustering. We spent two weeks building it in-house, deleted every line, and shipped a 40-line wrapper around the vendor endpoint instead. Here is what that path looked like.
The problem
Keyword clustering groups thousands of query variations into semantic buckets. "Schafkopf Regeln Android", "Schafkopf Rufspiel", and "bavarian card game rules" all belong in the same cluster. A ranking dashboard that treats each variation as a distinct row is unreadable by row 400. A dashboard that clusters them down to 30 or 40 actionable buckets is usable by a one-person studio on a Monday morning.
We needed this working before SEOMAX could ship to its first internal user. Phase 2 of the build was blocked on clusters being present.
What we tried first
We wrote a classic unsupervised pipeline in Python. The scripts lived in seomax/clustering/ and the entry point was cluster_keywords(keywords: list[str]) -> dict[str, list[str]]. Under the hood:
```python
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

def cluster_keywords(keywords: list[str]) -> dict[str, list[str]]:
    # Unigrams and bigrams; drop terms that appear in fewer than two keywords.
    tfidf = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    X = tfidf.fit_transform(keywords)
    # Density-based clustering over cosine distance between TF-IDF vectors.
    db = DBSCAN(eps=0.35, metric="cosine", min_samples=3).fit(X)
    return group_by_label(keywords, db.labels_)  # in-house helper: label -> keywords
```
TF-IDF vectorization, cosine distance, DBSCAN. We added a custom similarity scorer on top that tried to boost German morphology: Schafkopf, Schafkopfregeln, and Schafkopfs should score close even though a naive tokenizer splits them differently. Another 200 lines for the German stemming hack. Another 180 lines for a configurable boost table for brand keywords.
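The stemming hack is gone, but the idea behind it is easy to show. Character n-gram overlap makes German compounds score close without any tokenizer or stemmer; the sketch below is illustrative (char_ngrams and morph_similarity are names invented for this post, not the 200 lines we deleted):

```python
def char_ngrams(word: str, n: int = 3) -> set[str]:
    """Character trigrams of a lowercased word: 'schafkopf' -> {'sch', 'cha', ...}."""
    w = word.lower()
    return {w[i:i + n] for i in range(len(w) - n + 1)}

def morph_similarity(a: str, b: str) -> float:
    """Jaccard overlap of trigram sets; compounds share most of their grams."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)

# morph_similarity("Schafkopf", "Schafkopfregeln") ~= 0.54
# morph_similarity("Schafkopf", "Rufspiel")        == 0.0
```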
Every single experiment produced clusters we could argue about internally but could not hand to a real operator. The eps parameter had to move every time the keyword mix changed. DBSCAN would either collapse everything into one huge bucket or shatter the list into 400 singletons. We pushed min_samples up, then down, then back up. The stemmer introduced false positives. The boost table introduced false negatives.
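To make that instability concrete, here is the kind of diagnostic loop we kept rerunning. It sweeps eps over a small range and prints cluster counts; sweep_eps is a hypothetical name, but the vectorizer and DBSCAN settings are the ones from the snippet above:

```python
from collections import Counter

from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

def sweep_eps(keywords: list[str]) -> None:
    X = TfidfVectorizer(ngram_range=(1, 2), min_df=2).fit_transform(keywords)
    for eps in (0.15, 0.25, 0.35, 0.45, 0.55):
        labels = DBSCAN(eps=eps, metric="cosine", min_samples=3).fit(X).labels_
        counts = Counter(labels)
        noise = counts.pop(-1, 0)  # DBSCAN labels noise points as -1
        print(f"eps={eps:.2f}  clusters={len(counts)}  "
              f"noise={noise}  largest={max(counts.values(), default=0)}")
```

On our keyword mix, one end of that table was a single giant bucket and the other was mostly noise, with no stable middle.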
After nine working days we had 1,400 lines of clustering code and a result that was worse than what we could get from the vendor for free.
What we switched to
DataForSEO exposes a keyword-clustering endpoint at /v3/keyword_research/keyword_clustering/live. POST a list of keywords, back comes a JSON structure with cluster ids and assigned keywords. We wrote a thin wrapper:
```python
def cluster_keywords(keywords: list[str]) -> dict[str, list[str]]:
    # `client` is our authenticated DataForSEO HTTP client.
    resp = client.post(
        "/v3/keyword_research/keyword_clustering/live",
        json=[{"keywords": keywords, "language_code": "en"}],
    )
    return _group_by_cluster_id(resp.json()["tasks"][0]["result"])
```
Forty lines including error handling. The vendor clusters were visibly better on the first run against our real internal keyword set.
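Because the signature matches the old pipeline, the call site did not change when we swapped implementations. A usage sketch, reusing the example keywords from the top of the post (the printed cluster keys are whatever ids the endpoint assigns):

```python
clusters = cluster_keywords([
    "schafkopf regeln android",
    "schafkopf rufspiel",
    "bavarian card game rules",
])
for cluster_id, members in clusters.items():
    print(cluster_id, "->", members)
```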
The numbers
- Custom pipeline: 1,400 lines of Python, nine days of real calendar time
- Vendor wrapper: 40 lines, 90 minutes of work
- F-score on our internal 180-keyword evaluation set: 0.41 custom versus 0.67 vendor (see the scoring sketch after this list)
- DataForSEO cost for clustering across all three internal projects: 3.20 USD per month
- Total LOC deleted on the switch: 1,420 (we also deleted the boost-table config file and the dedicated test suite)
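For the F-score line above: one standard way to score a clustering against a hand-labeled set is pairwise precision and recall over keyword pairs that land in the same bucket. A minimal sketch of that definition (illustrative; not our evaluation script verbatim):

```python
from itertools import combinations

def same_cluster_pairs(clusters: dict[str, list[str]]) -> set[frozenset[str]]:
    """Every unordered pair of keywords that share a cluster."""
    return {frozenset(p) for kws in clusters.values() for p in combinations(kws, 2)}

def pairwise_f1(predicted: dict[str, list[str]], gold: dict[str, list[str]]) -> float:
    pred, true = same_cluster_pairs(predicted), same_cluster_pairs(gold)
    if not pred or not true:
        return 0.0
    precision = len(pred & true) / len(pred)
    recall = len(pred & true) / len(true)
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```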
The vendor endpoint also gave us language detection and cluster naming for free. Our custom pipeline did neither.
What we would do differently
We benchmarked the vendor output too late. A two-hour spike at the start, running the same 180-keyword evaluation set through DataForSEO first, would have told us in one afternoon that the vendor was ahead by 60 percent. We would have saved nine calendar days.
The deeper mistake was ideological. We told ourselves that an indie-SEO platform had to be vendor-independent to be credible. This is wrong in one specific way: for commodity ML tasks like clustering, the vendor is better because the vendor trained on a dataset we cannot match. Vendor independence is the right stance for rank tracking, where you want historical continuity even if a vendor disappears. Vendor independence is not the right stance for problems that commodity ML already solves.
The rule we wrote down after the deletion: benchmark vendor output in an afternoon before any in-house ML project exceeds a week of work. That rule has paid for itself twice already on smaller decisions inside SEOMAX, both of which we would have otherwise spent a week on before getting to the same conclusion.