1M-Row Fuzzy Matching Benchmark (2025): Similarity API vs RapidFuzz, TheFuzz, Levenshtein

November 19, 2025 · 5 min read · By Similarity API Team

TL;DR

When should you stop using local fuzzy matching?

  • < 50k rows → local libraries are usually fine
  • 50k–200k rows → slow iteration, painful tuning
  • ~1M rows → local approaches become impractical

In this benchmark:

  • Similarity API deduplicates 1,000,000 strings in ~7 minutes
  • RapidFuzz, TheFuzz, and python-Levenshtein are estimated to take tens to hundreds of hours
  • That's roughly 300×–1,000× faster compute at 1M rows (excluding implementation and maintenance time)

Want to see this on your own data?

Paste sample strings or upload a CSV (up to 100k rows free, no setup).

Why This Benchmark Matters

Fuzzy string matching is at the core of common data tasks—cleaning CRM data, merging product catalogs, reconciling records, or doing fuzzy joins inside ETL pipelines. Yet most developers still rely on local Python libraries that work great at 1k–10k records but don't scale when you hit real-world volumes.

This benchmark compares:

  • Similarity API (cloud-native, adaptive matching engine)
  • RapidFuzz (fast, modern C++/Python library)
  • TheFuzz (FuzzyWuzzy fork)
  • python-Levenshtein (core edit-distance implementation)

We test them at 10k, 100k, and 1M strings.

Data & Benchmark Setup

Environment

Tests ran in a standard Google Colab CPU environment:

  • 2 vCPUs
  • ~13GB RAM
  • Python 3.x

Timings represent warm runs. The first API call has a small cold-start penalty, but subsequent calls match production steady-state behavior.

Synthetic Data

We generate names from a curated base list (people, companies, etc.) and apply realistic typos:

  • Insertions / deletions
  • Adjacent swaps
  • Random character replacements

This produces realistic noisy variants such as:

  • Micsrosoft Corpp
  • Aplpe Inc.
  • Charlle Brown

Each string gets a label based on its base name so we can run a quick accuracy sanity check.
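
For reference, here is a minimal sketch of this kind of noise injection. The helper and base list are illustrative, not our exact generator:

import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def add_typo(s: str) -> str:
    """Apply one random edit: insertion, deletion, adjacent swap, or replacement."""
    if len(s) < 3:
        return s
    i = random.randrange(1, len(s) - 1)
    op = random.choice(["insert", "delete", "swap", "replace"])
    c = random.choice(ALPHABET)
    if op == "insert":
        return s[:i] + c + s[i:]
    if op == "delete":
        return s[:i] + s[i + 1:]
    if op == "swap":
        return s[:i] + s[i + 1] + s[i] + s[i + 2:]
    return s[:i] + c + s[i + 1:]

base_names = ["Microsoft Corp", "Apple Inc.", "Charlie Brown"]
# Each noisy variant keeps its base name as a label for the accuracy check.
dataset = [(add_typo(name), name) for name in base_names for _ in range(5)]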

Dataset Sizes

Benchmarks run at:

  • 10,000 strings
  • 100,000 strings
  • 1,000,000 strings

Local libraries are measured directly at 10k rows (RapidFuzz also at 100k); the remaining larger sizes are estimated via O(N²) scaling.
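
The extrapolation itself is simple arithmetic: all-pairs comparison cost grows with the square of the row count, so 10× the rows means roughly 100× the runtime. A minimal sketch:

def extrapolate_quadratic(measured_seconds: float, measured_n: int, target_n: int) -> float:
    """Scale a measured O(N^2) runtime to a larger dataset size."""
    return measured_seconds * (target_n / measured_n) ** 2

# RapidFuzz: ~1,301.8 s measured at 100k rows -> ~130,180 s estimated at 1M rows
print(extrapolate_quadratic(1301.8, 100_000, 1_000_000))  # 130180.0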

How Each Tool Was Used

Similarity API

A simple /dedupe call with configurable preprocessing:

POST https://api.similarity-api.com/dedupe
{
  "data": [...strings...],
  "config": {
    "similarity_threshold": 0.85,
    "remove_punctuation": false,
    "to_lowercase": false,
    "use_token_sort": false,
    "output_format": "index_pairs"
  }
}

Changing matching behavior is just a matter of toggling config options—no preprocessing code or custom pipelines.
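
For context, here is a minimal Python sketch of that call. It assumes a Bearer-token Authorization header; check the API docs for the exact auth scheme:

import requests

API_KEY = "your-api-key"  # placeholder; sign up to get a free key

response = requests.post(
    "https://api.similarity-api.com/dedupe",
    headers={"Authorization": f"Bearer {API_KEY}"},  # assumed auth scheme
    json={
        "data": ["Microsoft Corp", "Micsrosoft Corpp", "Aplpe Inc.", "Apple Inc."],
        "config": {
            "similarity_threshold": 0.85,
            "remove_punctuation": False,
            "to_lowercase": False,
            "use_token_sort": False,
            "output_format": "index_pairs",
        },
    },
    timeout=300,
)
response.raise_for_status()
print(response.json())  # e.g. pairs of indices judged to be duplicates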

RapidFuzz

We use RapidFuzz's optimized C++ engine:

from rapidfuzz import fuzz, process

scores = process.cdist(strings, strings, scorer=fuzz.ratio, workers=-1)
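
Turning that score matrix into duplicate pairs takes a few extra lines; a sketch, with 85 as an illustrative threshold:

import numpy as np

# scores is the N x N matrix returned by process.cdist above.
# The upper triangle (k=1) keeps each pair once and skips self-matches.
i, j = np.nonzero(np.triu(scores >= 85, k=1))
pairs = list(zip(i.tolist(), j.tolist()))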

TheFuzz & python-Levenshtein

Both are used through naive Python loops, since neither offers a bulk vectorized similarity matrix.
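
A sketch of the kind of loop involved (TheFuzz shown; python-Levenshtein's Levenshtein.ratio works the same way, on a 0–1 scale):

from thefuzz import fuzz

def naive_pairs(strings, threshold=85):
    """O(N^2) pairwise comparison, scored pair by pair in pure Python."""
    matches = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if fuzz.ratio(strings[i], strings[j]) >= threshold:
                matches.append((i, j))
    return matches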

Quick Accuracy Sanity Check

Using a 2,000-string subset with known duplicate labels, we ran a lightweight sanity check:

All tools achieved very high precision at a reasonably strict threshold (everything they returned was actually a duplicate).

We also checked that the number of unique entities after deduplication is close to the ground truth (using Similarity API's deduped_indices output format and clustering the pairs from the local libraries).
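
Clustering matched pairs into entities is a small union-find exercise; a minimal sketch:

def cluster_pairs(n: int, pairs):
    """Union-find: merge indices connected by match pairs into clusters."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for idx in range(n):
        clusters.setdefault(find(idx), []).append(idx)
    return list(clusters.values())

# Unique-entity count to compare against ground truth:
# len(cluster_pairs(len(strings), pairs))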

Results

Our benchmarks revealed significant performance differences between the tools, particularly as dataset sizes increased:

Library            | 10K Records | 100K Records     | 1M Records
Similarity API     | 0.8 s       | 58.8 s           | 421.8 s
RapidFuzz          | 13.0 s      | 1,301.8 s        | 130,180.0 s (est.)
python-Levenshtein | 46.8 s      | 4,684.9 s (est.) | 468,490.0 s (est.)
TheFuzz            | 39.4 s      | 3,938.2 s (est.) | 393,820.0 s (est.)

[Chart: Performance at 10K and 100K rows]

[Chart: Performance across all dataset sizes (10K, 100K, 1M rows)]

A few things stand out:

  • At 10k rows, Similarity API is already an order of magnitude faster than the fastest local library.
  • By 100k rows, local libraries are effectively in "batch job" territory, while Similarity API is still something you can run interactively.
  • At 1M rows, Similarity API finishes in about 7 minutes, while naive estimates for the local libraries are in the tens to hundreds of hours.

If you're cleaning real-world datasets or running dedupe inside production pipelines, these differences are the line between "runs during a coffee break" and "needs an overnight batch job plus a lot of custom infrastructure."

Stop building fuzzy matching pipelines

If your datasets are already in the 100k+ range, local libraries will keep slowing you down — even before accuracy becomes a problem.

Why Similarity API Wins

1. Adaptive proprietary algorithm

Similarity API uses an internal algorithm that adapts its strategy depending on input size and structure—indexing, parallelization, and optimized data layouts—so you get top-tier fuzzy matching without designing complex systems.

2. Preprocessing as configuration, not code

Lowercasing, punctuation removal, token sorting—just toggle a boolean instead of writing preprocessing pipelines.

3. Zero infrastructure

No servers, threading, batch jobs, or memory concerns. You pass strings; the API scales.

4. Transparent pricing with a generous free tier

Process 100,000 rows for free. Pay-as-you-go and tier plans available.

Try It Yourself

Run the full benchmark yourself in the Google Colab notebook (interactive) or by cloning the GitHub repository (local development).

Google Colab Notebook

Run the benchmark instantly in your browser. No setup required—just click and execute the code cells.

Open in Colab

GitHub Repository

Clone the full source code and run the benchmark locally. Customize and extend the tests to your needs.

View on GitHub

You'll need to sign up to get a free API key to run the benchmarks.

FAQ