How to fuzzy‑match 1M rows with dbt in under 10 minutes (2026 guide)

March 2026 · 7 min read · By Similarity API Team

Duplicate records rarely look like a priority at first — until they start breaking reporting, outreach, or reconciliation workflows.

From slightly different versions of "Acme Inc" in a CRM to inconsistent supplier names across systems or messy post‑merger datasets, fuzzy matching becomes essential whenever identical strings are no longer a reliable signal of the same real‑world entity.

The scaling wall: why warehouse‑native fuzzy matching breaks at scale

Fuzzy matching looks simple on a 1,000‑row sample. But at real scale, the math changes. A naive all‑to‑all comparison grows at O(N²). Once you hit 100k+ rows, comparison space explodes, and warehouse‑native approaches become slow, expensive, or brittle.
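A quick back-of-the-envelope calculation shows why. The number of pairwise comparisons in a naive all-to-all match is C(n, 2), which grows quadratically:

```python
import math

# Number of comparisons a naive all-to-all fuzzy match performs: C(n, 2)
def naive_comparisons(n: int) -> int:
    return math.comb(n, 2)

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} rows -> {naive_comparisons(n):>15,} comparisons")
# 1,000 rows is ~500k comparisons; 1,000,000 rows is ~500 billion
```

At 1M rows, even a microsecond per comparison puts you at days of single-threaded compute, which is why blocking and candidate generation become unavoidable.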

In practice, teams usually try a sequence of approaches before realizing the real complexity. They might start with warehouse similarity functions, hit performance limits, then move to Python or notebook experiments — only to discover new bottlenecks around memory usage, blocking strategy design, and data cleanup. At that point, what looked like a simple dedupe task starts turning into a permanent matching pipeline:

  • Blocking and candidate generation logic
  • String normalization and suffix cleanup
  • Threshold tuning and evaluation loops
  • Parallelization and memory management

What started as a quick cleanup task quietly turns into ongoing engineering overhead.
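As an illustration of that overhead, here is a minimal sketch of the kind of bespoke normalization logic teams end up writing and maintaining by hand. The suffix list and cleanup rules are illustrative only, not from any particular library:

```python
import re

# Hypothetical legal-suffix list; real lists grow continuously as new
# variants ("S.A.", "GmbH", "Pty Ltd", ...) show up in the data.
LEGAL_SUFFIXES = {"inc", "incorporated", "llc", "ltd", "limited", "corp", "co"}

def normalize(name: str) -> str:
    # Lowercase, strip punctuation, drop legal suffixes
    cleaned = re.sub(r"[^\w\s]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)

print(normalize("Acme, Inc."))         # -> "acme"
print(normalize("ACME Incorporated"))  # -> "acme"
```

Every rule here is a decision someone has to own, test, and revisit as the dataset evolves.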

The solution: call a production fuzzy‑matching engine from dbt

Similarity API is a hosted infrastructure service designed for high‑performance deduplication and reconciliation.

Instead of building and maintaining your own matching pipeline, you send the relevant strings to a dedicated matching engine optimized for noisy real‑world data and large‑scale workloads — then load the results back into your warehouse as a normal dbt model output.

The technical edge: adaptive preprocessing at scale

In real workflows, fuzzy matching quality is determined as much by data preparation strategy as by the similarity metric itself.

Local implementations often require teams to design custom normalization rules, suffix cleaning logic, token ordering heuristics, and blocking strategies — each of which must be tuned as datasets evolve.

Similarity API embeds these steps directly into the matching engine:

  • Dataset‑aware normalization: preprocessing adapts dynamically to string length, token density, and noise patterns
  • Scale‑optimized cleaning pipeline: preprocessing runs as part of the distributed matching flow, preventing cleanup stages from becoming bottlenecks at 1M+ rows
  • Configuration instead of custom code: matching behaviour is controlled through parameters such as similarity_threshold, use_token_sort, and remove_punctuation, rather than bespoke scripts

This architecture allows teams to focus on match review and downstream data actions rather than maintaining fragile preprocessing pipelines.
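As a sketch, the configuration surface looks like this. The parameter names are the ones listed above (and reused in the dbt model later in this guide); the values are illustrative:

```python
# Matching behaviour is controlled by configuration, not custom scripts.
config = {
    "similarity_threshold": 0.65,    # minimum score for a pair to be returned
    "remove_punctuation": True,      # strip punctuation before comparison
    "to_lowercase": True,            # case-insensitive matching
    "use_token_sort": False,         # set True to ignore word order
    "output_format": "index_pairs",  # or string pairs / clusters / deduped lists
}
print(config)
```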

Why fuzzy matching belongs in your dbt layer

This guide is designed around a dbt Python model workflow.

dbt is a strong execution surface for fuzzy matching because you can:

  • pull source data from your warehouse using dbt refs or sources
  • call the matching API inside a repeatable transformation workflow
  • materialize match results back into warehouse tables
  • keep dedupe logic close to the rest of your analytics engineering stack

In practice, this means you can move from one‑off cleanup to a reusable model that runs as part of your broader data pipeline.

Before running this model, you will need a Similarity API production token.

You can generate one from the Similarity API dashboard. The token is passed as a standard Bearer authorization header in the request.
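Concretely, the header looks like this (the token value shown is a placeholder; the endpoint is the one used in the model below):

```python
# Build the standard Bearer authorization header for the API request
def build_headers(token: str) -> dict:
    return {"Authorization": f"Bearer {token}"}

API_URL = "https://api.similarity-api.com/dedupe"

headers = build_headers("sk_live_example_token")  # placeholder token
print(headers["Authorization"])  # -> "Bearer sk_live_example_token"
```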

What you actually get back

Example input

["Acme Inc", "ACME Incorporated", "Beta LLC", "Beta Limited"]

Example output (index_pairs)

[
    [0, 1, 0.94],
    [2, 3, 0.91]
]

Each result represents two rows that likely refer to the same real-world entity, along with a similarity score.

By default, the API returns index pairs, which you can join back to the staged input rows for review, clustering, or merge workflows.

Output format is configurable — you can instead return string pairs, clustered groups of duplicates, or fully deduplicated record lists depending on your cleanup strategy.
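Joining index pairs back to the staged input is a one-liner per column. Using the example input and output above:

```python
import pandas as pd

# Example input and index_pairs response from the section above
strings = ["Acme Inc", "ACME Incorporated", "Beta LLC", "Beta Limited"]
index_pairs = [[0, 1, 0.94], [2, 3, 0.91]]

# Map each index back to the original string for review
pairs = pd.DataFrame(index_pairs, columns=["idx_1", "idx_2", "score"])
pairs["name_1"] = pairs["idx_1"].map(lambda i: strings[i])
pairs["name_2"] = pairs["idx_2"].map(lambda i: strings[i])
print(pairs)
```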

The following dbt Python model:

  • reads company names from an upstream dbt model
  • sends them to the Similarity API
  • returns duplicate index pairs joined back to the original strings

import os
import requests
import pandas as pd


def model(dbt, session):
    dbt.config(materialized="table")

    api_key = os.environ["SIMILARITY_API_KEY"]
    api_url = "https://api.similarity-api.com/dedupe"

    source_df = dbt.ref("stg_companies").to_pandas()
    source_df = source_df.reset_index(drop=True)

    # Note: indices in the API response refer to positions in this list
    # (which excludes null names), not to original warehouse row numbers.
    strings = (
        source_df["company_name"]
        .dropna()
        .astype(str)
        .tolist()
    )

    print(f"Loaded {len(strings):,} rows from dbt ref('stg_companies')")

    payload = {
        "data": strings,
        "config": {
            "similarity_threshold": 0.65,
            "remove_punctuation": True,
            "to_lowercase": True,
            "use_token_sort": False,
            "output_format": "index_pairs",
        },
    }

    response = requests.post(
        api_url,
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=3600,
    )
    response.raise_for_status()

    results = response.json().get("response_data", [])
    print(f"Workflow complete: found {len(results):,} duplicate pairs")

    if not results:
        return pd.DataFrame(
            columns=[
                "idx_1",
                "idx_2",
                "score",
                "company_name_1",
                "company_name_2",
            ]
        )

    dedupe_df = pd.DataFrame(results, columns=["idx_1", "idx_2", "score"])

    dedupe_df["company_name_1"] = dedupe_df["idx_1"].map(lambda i: strings[i])
    dedupe_df["company_name_2"] = dedupe_df["idx_2"].map(lambda i: strings[i])

    return dedupe_df

Expose SIMILARITY_API_KEY in the environment that invokes dbt (the model reads it via os.environ), and the resulting table can then feed review models, merge workflows, or downstream entity clustering.
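As a sketch, assuming the model file is saved as dedupe_companies.py (a hypothetical name), the key can be exported in the shell or orchestrator environment before running dbt:

```shell
# Expose the token to the dbt runtime environment (placeholder value)
export SIMILARITY_API_KEY="sk_live_..."

# Run only the dedupe model (model name is illustrative)
dbt run --select dedupe_companies
```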

The honest "under 10‑minute" claim

Here is how the timing works in practice:

  • ~7 minutes: benchmarked processing time for a 1M‑row dataset in Similarity API. This varies with string length and duplicate density.
  • ~2 minutes: drop the model into your dbt project, set the API key, and run it

No blocking strategy design. No distributed compute tuning. No regex cleanup scripts.

From prototype to production

The advantage of dbt is that this does not have to stay a one‑off experiment.

Once the model works, you can schedule it as part of your normal transformation workflow and build downstream logic on top of the output table:

  • review likely duplicate pairs
  • cluster entities before enrichment
  • feed survivorship / merge logic
  • monitor duplicate volume over time
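For the clustering step, one common approach (a sketch, not part of the API) is to collapse the returned pairs into connected components with union-find:

```python
# Collapse duplicate pairs into entity clusters via union-find.
# Input format matches the index_pairs output: [idx_1, idx_2, score].
def cluster_pairs(pairs):
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b, _score in pairs:
        parent[find(a)] = find(b)  # union the two components

    clusters = {}
    for node in list(parent):
        clusters.setdefault(find(node), set()).add(node)
    return [sorted(c) for c in clusters.values()]

# Pairs (0,1), (2,3), (1,4) collapse into clusters {0,1,4} and {2,3}
print(cluster_pairs([[0, 1, 0.94], [2, 3, 0.91], [1, 4, 0.88]]))
```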

Because the interface is standard HTTP, the matching engine becomes a reusable data‑quality component inside the same dbt workflow your team already maintains.

Final word

At large scale, fuzzy matching stops being a string‑similarity problem and becomes an infrastructure problem.

Similarity API is built for teams that prefer to spend engineering time on analytics and product logic — not on maintaining custom deduplication pipelines.

Instead of weeks of pipeline work, you can run one dbt model and move straight to reviewing and acting on clean data.

Stop building matching infrastructure. Start acting on clean entities.