How to fuzzy‑match 1M rows from BigQuery in under 10 minutes (2026 guide)

March 2026 · 6 min read · By Similarity API Team

Duplicate records rarely look like a priority at first — until they start breaking reporting, outreach, or reconciliation workflows.

From slightly different versions of "Acme Inc" in a CRM to inconsistent supplier names across systems or messy post‑merger datasets, fuzzy matching becomes essential whenever identical strings are no longer a reliable signal of the same real‑world entity.

The scaling wall: why warehouse‑native fuzzy matching breaks at scale

Fuzzy matching looks simple on a 1,000‑row sample. But at real scale, the math changes. A naive all‑to‑all comparison grows at O(N²). Once you hit 100k+ rows, comparison space explodes, and local scripts or warehouse‑native approaches become slow, expensive, or brittle.
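To make the quadratic growth concrete, here is a quick back‑of‑the‑envelope calculation of the number of unique pairs, N·(N−1)/2, at a few dataset sizes:

```python
# Number of unique pairwise comparisons in a naive all-to-all match
def pair_count(n: int) -> int:
    return n * (n - 1) // 2

for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9,} rows -> {pair_count(n):>18,} comparisons")

# 1,000 rows is ~500k comparisons; 1,000,000 rows is ~500 billion,
# which is why local scripts stall long before warehouse scale.
```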

In practice, teams usually try a sequence of approaches before the real complexity becomes clear. They might start with warehouse similarity functions (such as edit distance or token similarity), hit performance limits, then switch to a quick Python script — only to discover new bottlenecks around memory usage, blocking strategy design, and data cleanup. At that point, what looked like a simple task starts turning into a permanent matching pipeline:

  • Blocking and candidate generation logic
  • String normalization and suffix cleanup
  • Threshold tuning and evaluation loops
  • Parallelization and memory management

What started as a quick dedupe task quietly turns into ongoing engineering overhead.
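As a minimal sketch of the normalization and suffix cleanup such a pipeline accumulates (the suffix list and rules below are illustrative, not what any particular engine uses internally):

```python
import re

# Illustrative legal-suffix list; real pipelines grow far longer lists
SUFFIXES = {"inc", "incorporated", "llc", "limited", "ltd", "corp"}

def normalize(name: str) -> str:
    s = name.lower()
    s = re.sub(r"[^\w\s]", " ", s)  # strip punctuation
    tokens = [t for t in s.split() if t not in SUFFIXES]
    return " ".join(sorted(tokens))  # token sort makes word order irrelevant

print(normalize("Acme Inc"))            # -> "acme"
print(normalize("ACME, Incorporated"))  # -> "acme"
```

Every rule here is a maintenance commitment: the suffix list, the punctuation regex, and the token‑sort decision all need revisiting as new data arrives.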

The solution: call a production fuzzy‑matching engine

Similarity API is a hosted infrastructure service designed for high‑performance deduplication and reconciliation.

Instead of building and maintaining your own matching pipeline, you send the dataset to a dedicated matching engine optimized for noisy real‑world data and large‑scale workloads.

The technical edge: adaptive preprocessing at scale

In real workflows, fuzzy matching quality is determined as much by data preparation strategy as by the similarity metric itself.

Local implementations often require teams to design custom normalization rules, suffix cleaning logic, token ordering heuristics, and blocking strategies — each of which must be tuned as datasets evolve.

Similarity API embeds these steps directly into the matching engine:

  • Dataset‑aware normalization: preprocessing adapts dynamically to string length, token density, and noise patterns
  • Scale‑optimized cleaning pipeline: preprocessing runs as part of the distributed matching flow, preventing cleanup stages from becoming bottlenecks at 1M+ rows
  • Configuration instead of custom code: matching behaviour is controlled through parameters such as similarity_threshold, use_token_sort, and remove_punctuation, rather than bespoke scripts

This architecture allows teams to focus on match review and downstream data actions rather than maintaining fragile preprocessing pipelines.

The BigQuery Notebook

This guide is designed to run inside a BigQuery notebook environment (Colab Enterprise integrated into BigQuery).

These notebooks let you:

  • Query production tables directly from BigQuery
  • Run Python data workflows without provisioning infrastructure
  • Call external APIs for heavy compute tasks
  • Write results back into BigQuery tables

In practice, this makes them an ideal surface for large‑scale fuzzy matching workflows: data stays in the warehouse, while compute‑intensive matching runs in a scalable external service.

Before running the notebook cell, you will need a Similarity API production token.

You can generate one from the Similarity API dashboard. The token is passed as a standard Bearer authorization header in the request.

What you actually get back

Example input

["Acme Inc", "ACME Incorporated", "Beta LLC", "Beta Limited"]

Example output (index_pairs)

[
    [0, 1, 0.94],
    [2, 3, 0.91]
]

Each result represents two rows that likely refer to the same real-world entity, along with a similarity score.

By default, the API returns index pairs, which you can quickly join back to your BigQuery table for review or merge workflows.
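For instance, joining the sample output above back to the input list is a simple index lookup (assuming index_pairs entries are [i, j, score] triples, as shown in the example):

```python
strings = ["Acme Inc", "ACME Incorporated", "Beta LLC", "Beta Limited"]
index_pairs = [[0, 1, 0.94], [2, 3, 0.91]]

# Resolve each index pair back to the original strings for review
matches = [(strings[i], strings[j], score) for i, j, score in index_pairs]

for a, b, score in matches:
    print(f"{a!r} <-> {b!r}  (score={score})")
# 'Acme Inc' <-> 'ACME Incorporated'  (score=0.94)
# 'Beta LLC' <-> 'Beta Limited'  (score=0.91)
```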

Output format is configurable — you can instead return string pairs, clustered groups of duplicates, or fully deduplicated record lists depending on your cleanup strategy.
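If you stay with the default index_pairs output, you can also derive duplicate clusters yourself with a small union‑find pass; this is a client‑side sketch, independent of the API's built‑in clustered format:

```python
from collections import defaultdict

def cluster_pairs(pairs, n):
    """Group duplicate index pairs into connected clusters via union-find."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j, _score in pairs:
        parent[find(i)] = find(j)

    groups = defaultdict(list)
    for idx in range(n):
        groups[find(idx)].append(idx)
    # Keep only groups with at least one duplicate
    return [g for g in groups.values() if len(g) > 1]

print(cluster_pairs([[0, 1, 0.94], [2, 3, 0.91]], n=4))  # -> [[0, 1], [2, 3]]
```

Transitive links are handled automatically: if pairs (0, 1) and (1, 4) both appear, rows 0, 1, and 4 land in the same cluster.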

The following code snippet:

  • Reads a dataset directly from BigQuery
  • Sends company names to the Similarity API
  • Returns duplicate index pairs

from google.cloud import bigquery
import requests
import pandas as pd

# ---- CONFIG ----
PROJECT_ID = "YOUR_PROJECT_ID"
DATASET = "YOUR_DATASET"
TABLE = "YOUR_TABLE"
COLUMN = "company_name"

API_KEY = "YOUR_PRODUCTION_KEY"
API_URL = "https://api.similarity-api.com/dedupe"

# ---- LOAD DATA FROM BIGQUERY ----
client = bigquery.Client(project=PROJECT_ID)

query = f"""
SELECT {COLUMN}
FROM `{PROJECT_ID}.{DATASET}.{TABLE}`
WHERE {COLUMN} IS NOT NULL
"""

strings = (
    client.query(query)
    .result()
    .to_dataframe()[COLUMN]
    .astype(str)
    .tolist()
)

print(f"Loaded {len(strings):,} rows from BigQuery")

# ---- CALL SIMILARITY API ----
payload = {
    "data": strings,
    "config": {
        "similarity_threshold": 0.65,
        "remove_punctuation": True,
        "to_lowercase": True,
        "use_token_sort": False,
        "output_format": "index_pairs",
    },
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=3600,
)

response.raise_for_status()

results = response.json().get("response_data", [])

print(f"Workflow complete: found {len(results):,} duplicate pairs")

# ---- OPTIONAL: SAVE RESULTS BACK TO BIGQUERY ----
if results:
    dup_df = pd.DataFrame(results, columns=["idx_1", "idx_2", "score"])

    table_id = f"{PROJECT_ID}.{DATASET}.dedupe_results"

    job = client.load_table_from_dataframe(dup_df, table_id)
    job.result()

    print(f"Saved results to {table_id}")

The honest "under 10‑minute" claim

Here is how the timing works in practice:

  • ~2 minutes: copy‑paste the notebook cell, run the query, and start the job
  • ~7 minutes: benchmarked processing time for a 1M‑row dataset in Similarity API (actual time varies with string length and configuration)

No blocking strategy design. No distributed compute tuning. No regex cleanup scripts.

From prototype to production

Notebooks are ideal for validating matching quality and running one‑off reconciliation jobs.

In production, the same API call pattern can be embedded into:

  • scheduled BigQuery workflows
  • Airflow or Prefect pipelines
  • backend data services
  • low‑code automation tools

Because the interface is standard HTTP, the matching engine becomes a reusable data‑quality component across your stack.
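As a sketch of that reuse, the notebook call can be wrapped in a small helper that any pipeline task imports. The endpoint and payload shape mirror the snippet above; the function name and the injectable session (useful for retries and testing) are our own conventions, not part of the API:

```python
import requests

API_URL = "https://api.similarity-api.com/dedupe"

def find_duplicates(strings, api_key, threshold=0.65, session=None):
    """Send strings to the matching engine and return duplicate index pairs."""
    http = session or requests  # inject a session for pipelines or tests
    payload = {
        "data": strings,
        "config": {
            "similarity_threshold": threshold,
            "output_format": "index_pairs",
        },
    }
    resp = http.post(
        API_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        json=payload,
        timeout=3600,
    )
    resp.raise_for_status()
    return resp.json().get("response_data", [])
```

An Airflow or Prefect task then reduces to one call: `find_duplicates(names, api_key=SECRET)`.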

Final word

At large scale, fuzzy matching stops being a string‑similarity problem and becomes an infrastructure problem.

Similarity API is built for teams that prefer to spend engineering time on analytics and product logic — not on maintaining custom deduplication pipelines.

Instead of weeks of pipeline work, you can run a notebook cell and move straight to reviewing and acting on clean data.