1M-Row Fuzzy Matching Benchmark (2026): Similarity API vs RapidFuzz, TheFuzz, Levenshtein
Similarity API — 1M rows
7 min
~422 seconds. Runs while you get coffee.
RapidFuzz — 1M rows (estimated)
36 hrs
~130,000 seconds. That's a day and a half of batch compute.
Most fuzzy matching pipelines start with RapidFuzz or TheFuzz. Both are solid libraries — for small data. The problem is O(N²) scaling: each time you double your dataset, compute time quadruples. What runs in 13 seconds at 10K rows takes 36 hours at 1M.
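To see why, count the pairs. A quick back-of-the-envelope script (illustrative only, not part of the benchmark itself):

# Pairwise comparisons grow quadratically: N * (N - 1) / 2
for n in [10_000, 100_000, 1_000_000]:
    pairs = n * (n - 1) // 2
    print(f"{n:>9,} rows -> {pairs:>15,} comparisons")
#    10,000 rows ->      49,995,000 comparisons
#   100,000 rows ->   4,999,950,000 comparisons
# 1,000,000 rows -> 499,999,500,000 comparisons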
This benchmark measures exactly where that wall hits, and what happens when you route the same workload through Similarity API instead.
Benchmark Setup
Environment
All tests ran in a standard Google Colab CPU environment: 2 vCPUs, ~13 GB RAM, Python 3.x. No GPUs, no special hardware. It's the kind of modest environment many data engineers actually use for pipeline prototyping.
Synthetic data
We generated names from a curated base list (people, companies) and applied realistic noise: character insertions, adjacent swaps, and random replacements — producing variants like Micsrosoft Corpp, Aplpe Inc., and Charlle Brown. Each string has a ground-truth label so we could sanity-check accuracy.
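The exact generator isn't reproduced in this post, but a minimal sketch of that kind of noise function (the function name and alphabet below are our own) looks like this:

import random

def add_noise(s: str, rng: random.Random) -> str:
    """Apply one random edit: insertion, adjacent swap, or replacement."""
    chars = list(s)
    i = rng.randrange(len(chars))
    op = rng.choice(["insert", "swap", "replace"])
    if op == "insert":
        chars.insert(i, rng.choice("abcdefghijklmnopqrstuvwxyz"))
    elif op == "swap" and i < len(chars) - 1:
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    else:
        chars[i] = rng.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

rng = random.Random(42)
print(add_noise("Microsoft Corp", rng))  # e.g. "Micropsoft Corp"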
Dataset sizes tested
10,000 · 100,000 · 1,000,000 strings. Local libraries are measured at 10K (RapidFuzz also at 100K) and extrapolated at larger sizes using O(N²) scaling.
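The extrapolation itself is simple arithmetic: under O(N²), runtime scales with the square of the size ratio. Using RapidFuzz's measured 13.0 s at 10K rows from the results table below:

# O(N^2) extrapolation: time scales with the square of the size ratio
measured_10k = 13.0  # RapidFuzz at 10K rows (seconds, measured)
est_1m = measured_10k * (1_000_000 / 10_000) ** 2
print(f"{est_1m:,.0f} s")  # 130,000 s -- roughly 36 hours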
How Each Tool Was Used
Similarity API
A single /dedupe POST call. No preprocessing pipelines, no threading config, no memory management — just strings in, clusters out:
Similarity API call
# Full deduplication of 1M strings in ~7 minutes
import requests

response = requests.post(
    "https://api.similarity-api.com/dedupe",
    headers={"X-API-Key": "your-api-key"},
    json={
        "data": strings,  # your list of strings
        "config": {
            "similarity_threshold": 0.85,
            "remove_punctuation": False,
            "to_lowercase": False,
            "use_token_sort": False,
            "output_format": "index_pairs",
        },
    },
)
clusters = response.json()

Preprocessing (lowercasing, punctuation removal, token sorting) is a config toggle, not code you write and maintain. Changing the threshold and re-running takes seconds.
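The response schema isn't documented in this post; assuming output_format: "index_pairs" returns a flat list of [i, j] pairs of matched strings, here is one way to fold those pairs into clusters. This is a standard union-find sketch of ours, not Similarity API code:

from collections import defaultdict

# Assumption: `pairs` is a list of [i, j] index pairs from the response
pairs = response.json()

parent = list(range(len(strings)))

def find(x: int) -> int:
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

for i, j in pairs:
    parent[find(i)] = find(j)  # merge the two clusters

clusters = defaultdict(list)
for idx, s in enumerate(strings):
    clusters[find(idx)].append(s)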
RapidFuzz
We used RapidFuzz's optimized cdist function with all CPU workers, which is the fastest possible local setup:
RapidFuzz
from rapidfuzz import process, fuzz

# Best-case RapidFuzz: vectorized C++, all CPU cores
similarity_matrix = process.cdist(
    strings, strings,
    scorer=fuzz.ratio,
    workers=-1,
)

TheFuzz & python-Levenshtein
Used through standard Python loops — these libraries don't offer bulk vectorized operations, so this is their realistic usage pattern.
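For reference, that loop pattern looks like this. The threshold of 85 mirrors the 0.85 used elsewhere; TheFuzz's ratio() returns scores on a 0–100 scale, while python-Levenshtein's ratio() returns 0–1:

from thefuzz import fuzz  # python-Levenshtein follows the same loop pattern

# Naive O(N^2) pairwise loop -- no bulk vectorized API available
matches = []
for i in range(len(strings)):
    for j in range(i + 1, len(strings)):
        if fuzz.ratio(strings[i], strings[j]) >= 85:  # 0-100 scale
            matches.append((i, j))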
Results
| Library | 10K Records | 100K Records | 1M Records |
|---|---|---|---|
| Similarity API | 0.8 s | 58.8 s | 421.8 s (~7 min) |
| RapidFuzz | 13.0 s | 1,301.8 s (~22 min) | 130,180 s (~36 hrs) est. |
| python-Levenshtein | 46.8 s | 4,685 s (~1.3 hrs) est. | 468,490 s (~130 hrs) est. |
| TheFuzz | 39.4 s | 3,938 s (~1.1 hrs) est. | 393,820 s (~109 hrs) est. |
[Chart: Performance at 10K and 100K Rows]
[Chart: Performance Across All Dataset Sizes (10K, 100K, 1M Rows)]
Why Similarity API Is Faster
Local fuzzy matching is fundamentally O(N²): every string is compared to every other string. At 1M rows that's 10¹² comparisons. Even with RapidFuzz's C++ engine and vectorized CPU operations, you can't escape the math.
Similarity API uses an internal algorithm that adapts its strategy based on input size and structure: blocking, indexing, and parallelization are applied dynamically so you're never doing more comparisons than necessary. The implementation details are proprietary, but the effect is measurable: scaling that stays far below quadratic in practice, not just in theory.
Additionally — and this matters for real workflows — changing matching behavior (threshold, tokenization, preprocessing) is a config change, not a code change. You tune and re-run in seconds instead of modifying preprocessing pipelines and waiting for another full pass.
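A sketch of what that tuning loop can look like, reusing the same /dedupe endpoint and payload shown above (the response is assumed to be a list of index pairs):

import requests

# Hypothetical tuning loop: same payload as before, different thresholds
for threshold in (0.80, 0.85, 0.90):
    resp = requests.post(
        "https://api.similarity-api.com/dedupe",
        headers={"X-API-Key": "your-api-key"},
        json={"data": strings,
              "config": {"similarity_threshold": threshold,
                         "output_format": "index_pairs"}},
    )
    # Count how many matched pairs each threshold yields
    print(threshold, len(resp.json()))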
Accuracy
Speed is irrelevant if the matches are wrong. We ran a sanity check on a 2,000-string labeled subset: all tools achieved high precision at a threshold of 0.85 — everything returned was actually a duplicate. The number of unique entities post-deduplication was also close to ground truth across all tools.
The takeaway: this isn't a speed-vs-accuracy tradeoff. You can tune Similarity API's threshold exactly as you would with RapidFuzz, and the accuracy at equivalent thresholds is comparable.
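For the record, here's the shape of that sanity check. The variable names labels and matched_pairs are ours, standing in for the ground-truth labels and each tool's output:

# labels[i] = ground-truth entity ID for strings[i] (hypothetical names)
true_positives = sum(1 for i, j in matched_pairs if labels[i] == labels[j])
precision = true_positives / len(matched_pairs) if matched_pairs else 0.0
print(f"Precision at threshold 0.85: {precision:.3f}")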
Get started in 5 minutes
1. Sign up for a free API key. No credit card required; the first 100K rows are free.
2. Install the requests library if you haven't: pip install requests
3. Copy the code snippet above and swap in your API key and strings. You'll have results before your next coffee.
4. Full reference: API documentation · pricing
- Free tier: 100,000 rows
- Pay-as-you-go: after the free tier
- Tier plans: available for recurring use
- Infrastructure: zero. No servers to manage.
Reproduce This Benchmark
All benchmark code is public; run it yourself and verify the numbers.
You'll need a free API key to run the Similarity API portion of the benchmark. Sign up here.
When Should You Switch?
Not every project needs a hosted API. Here's the honest breakdown:
| Scale | Recommendation |
|---|---|
| < 50K rows, one-off task | RapidFuzz is fine. It's fast enough. |
| 50K–200K rows, or recurring runs | ⚠ This is where iteration gets painful. Similarity API saves hours per run. |
| 200K+ rows | Local libraries become batch jobs. Similarity API stays interactive. |
| Production pipeline (ETL, nightly dedup) | An API call beats managing infrastructure. Zero servers, zero memory tuning. |
| Air-gapped / no external HTTP allowed | RapidFuzz. A local-only constraint means a local-only tool. |
If you're at 30K–50K rows today and your dataset is growing, this is worth setting up now. Once you're hitting a 20-minute dedup job, you'll be glad you didn't wait.