1M-Row Fuzzy Matching Benchmark (2025): Similarity API vs RapidFuzz, TheFuzz, Levenshtein

November 19, 2025 · 5 min read · By Similarity API Team

TL;DR

When should you stop using local fuzzy matching?

  • < 50k rows → local libraries are usually fine
  • 50k–200k rows → slow iteration, painful tuning
  • ~1M rows → local approaches become impractical

In this benchmark:

  • Similarity API deduplicates 1,000,000 strings in ~7 minutes
  • RapidFuzz, TheFuzz, and python-Levenshtein are estimated to take tens to hundreds of hours
  • That's roughly 300×–1,000× faster compute at 1M rows (excluding implementation and maintenance time)

Want to see this on your own data?

Paste sample strings or upload a CSV (up to 100k rows free, no setup).

Why This Benchmark Matters

Fuzzy string matching is at the core of common data tasks—cleaning CRM data, merging product catalogs, reconciling records, or doing fuzzy joins inside ETL pipelines. Yet most developers still rely on local Python libraries that work great at 1k–10k records but don't scale when you hit real-world volumes.

This benchmark compares:

  • Similarity API (cloud-native, adaptive matching engine)
  • RapidFuzz (fast, modern C++/Python library)
  • TheFuzz (FuzzyWuzzy fork)
  • python-Levenshtein (core edit-distance implementation)

We test them at 10k, 100k, and 1M strings.

Data & Benchmark Setup

Environment

Tests ran in a standard Google Colab CPU environment:

  • 2 vCPUs
  • ~13GB RAM
  • Python 3.x

Timings represent warm runs. The first API call has a small cold-start penalty, but subsequent calls match production steady-state behavior.

Synthetic Data

We generate names from a curated base list (people, companies, etc.) and apply realistic typos:

  • Insertions / deletions
  • Adjacent swaps
  • Random character replacements

This produces realistic noisy variants such as:

  • Micsrosoft Corpp
  • Aplpe Inc.
  • Charlle Brown

Each string gets a label based on its base name so we can run a quick accuracy sanity check.
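
For reference, here is a minimal sketch of this kind of noise injection. The helper and base list are illustrative, not our exact generator:

import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"

def add_typo(s: str) -> str:
    """Apply one random edit: insertion, deletion, adjacent swap, or replacement."""
    if len(s) < 3:
        return s
    i = random.randrange(1, len(s) - 1)
    op = random.choice(["insert", "delete", "swap", "replace"])
    c = random.choice(ALPHABET)
    if op == "insert":
        return s[:i] + c + s[i:]
    if op == "delete":
        return s[:i] + s[i + 1:]
    if op == "swap":
        return s[:i] + s[i + 1] + s[i] + s[i + 2:]
    return s[:i] + c + s[i + 1:]

base_names = ["Microsoft Corp", "Apple Inc.", "Charlie Brown"]
# Each noisy variant keeps its base name as a label for the accuracy check.
dataset = [(add_typo(name), name) for name in base_names for _ in range(5)]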

Dataset Sizes

Benchmarks run at:

  • 10,000 strings
  • 100,000 strings
  • 1,000,000 strings

Local libraries are measured directly at 10k rows (RapidFuzz also at 100k); the remaining larger sizes are estimated via O(N²) scaling.
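
The extrapolation itself is simple arithmetic: all-pairs comparison cost grows with the square of the row count, so 10× the rows means roughly 100× the runtime. A minimal sketch:

def extrapolate_quadratic(measured_seconds: float, measured_n: int, target_n: int) -> float:
    """Scale a measured O(N^2) runtime to a larger dataset size."""
    return measured_seconds * (target_n / measured_n) ** 2

# RapidFuzz: ~1,301.8 s measured at 100k rows -> ~130,180 s estimated at 1M rows
print(extrapolate_quadratic(1301.8, 100_000, 1_000_000))  # 130180.0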

How Each Tool Was Used

Similarity API

A simple /dedupe call with configurable preprocessing:

POST https://api.similarity-api.com/dedupe
{
  "data": [...strings...],
  "config": {
    "similarity_threshold": 0.85,
    "remove_punctuation": false,
    "to_lowercase": false,
    "use_token_sort": false,
    "output_format": "index_pairs"
  }
}

Changing matching behavior is just a matter of toggling config options—no preprocessing code or custom pipelines.
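
For context, here is a minimal Python sketch of that call. It assumes a Bearer-token Authorization header; check the API docs for the exact auth scheme:

import requests

API_KEY = "your-api-key"  # placeholder; sign up to get a free key

response = requests.post(
    "https://api.similarity-api.com/dedupe",
    headers={"Authorization": f"Bearer {API_KEY}"},  # assumed auth scheme
    json={
        "data": ["Microsoft Corp", "Micsrosoft Corpp", "Aplpe Inc.", "Apple Inc."],
        "config": {
            "similarity_threshold": 0.85,
            "remove_punctuation": False,
            "to_lowercase": False,
            "use_token_sort": False,
            "output_format": "index_pairs",
        },
    },
    timeout=300,
)
response.raise_for_status()
print(response.json())  # e.g. pairs of indices judged to be duplicates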

RapidFuzz

We use RapidFuzz's optimized C++ engine:

from rapidfuzz import fuzz, process

scores = process.cdist(strings, strings, scorer=fuzz.ratio, workers=-1)
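
Turning that score matrix into duplicate pairs takes a few extra lines; a sketch, with 85 as an illustrative threshold:

import numpy as np

# scores is the N x N matrix returned by process.cdist above.
# The upper triangle (k=1) keeps each pair once and skips self-matches.
i, j = np.nonzero(np.triu(scores >= 85, k=1))
pairs = list(zip(i.tolist(), j.tolist()))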

TheFuzz & python-Levenshtein

Both are used through naive Python loops, since neither offers a bulk vectorized similarity matrix.
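
A sketch of the kind of loop involved (TheFuzz shown; python-Levenshtein's Levenshtein.ratio works the same way, on a 0–1 scale):

from thefuzz import fuzz

def naive_pairs(strings, threshold=85):
    """O(N^2) pairwise comparison, scored pair by pair in pure Python."""
    matches = []
    for i in range(len(strings)):
        for j in range(i + 1, len(strings)):
            if fuzz.ratio(strings[i], strings[j]) >= threshold:
                matches.append((i, j))
    return matches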

Quick Accuracy Sanity Check

Using a 2,000-string subset with known duplicate labels, we ran a lightweight sanity check:

All tools achieved very high precision at a reasonably strict threshold (everything they returned was actually a duplicate).

We also checked that the number of unique entities after deduplication is close to the ground truth (using Similarity API's deduped_indices output format and clustering the pairs from the local libraries).
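
Clustering matched pairs into entities is a small union-find exercise; a minimal sketch:

def cluster_pairs(n: int, pairs):
    """Union-find: merge indices connected by match pairs into clusters."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for idx in range(n):
        clusters.setdefault(find(idx), []).append(idx)
    return list(clusters.values())

# Unique-entity count to compare against ground truth:
# len(cluster_pairs(len(strings), pairs))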

Results

Our benchmarks revealed significant performance differences between the tools, particularly as dataset sizes increased:

Library            | 10K Records | 100K Records     | 1M Records
Similarity API     | 0.8 s       | 58.8 s           | 421.8 s
RapidFuzz          | 13.0 s      | 1,301.8 s        | 130,180.0 s (est.)
python-Levenshtein | 46.8 s      | 4,684.9 s (est.) | 468,490.0 s (est.)
TheFuzz            | 39.4 s      | 3,938.2 s (est.) | 393,820.0 s (est.)

[Chart: Performance at 10K and 100K rows]

[Chart: Performance across all dataset sizes (10K, 100K, 1M rows)]

A few things stand out:

  • At 10k rows, Similarity API is already an order of magnitude faster than the fastest local library.
  • By 100k rows, local libraries are effectively in "batch job" territory, while Similarity API is still something you can run interactively.
  • At 1M rows, Similarity API finishes in about 7 minutes, while naive estimates for the local libraries are in the tens to hundreds of hours.

If you're cleaning real-world datasets or running dedupe inside production pipelines, these differences are the line between "runs during a coffee break" and "needs an overnight batch job plus a lot of custom infrastructure."

Stop building fuzzy matching pipelines

If your datasets are already in the 100k+ range, local libraries will keep slowing you down — even before accuracy becomes a problem.

Why Similarity API Wins

1. Adaptive proprietary algorithm

Similarity API uses an internal algorithm that adapts its strategy depending on input size and structure—indexing, parallelization, and optimized data layouts—so you get top-tier fuzzy matching without designing complex systems.

2. Preprocessing as configuration, not code

Lowercasing, punctuation removal, token sorting—just toggle a boolean instead of writing preprocessing pipelines.

3. Zero infrastructure

No servers, threading, batch jobs, or memory concerns. You pass strings; the API scales.

4. Transparent pricing with a generous free tier

Process 100,000 rows for free. Pay-as-you-go and tier plans available.

Try It Yourself

Run the full benchmark yourself in the Google Colab notebook (interactive) or by cloning the GitHub repository (local development).

Google Colab Notebook

Run the benchmark instantly in your browser. No setup required—just click and execute the code cells.

Open in Colab

GitHub Repository

Clone the full source code and run the benchmark locally. Customize and extend the tests to your needs.

View on GitHub

You'll need to sign up to get a free API key to run the benchmarks.

FAQ