Our Blog
Insights, tutorials, and updates about text similarity matching and our API.
1M-Row Fuzzy Matching Benchmark (2026): Similarity API vs RapidFuzz, TheFuzz, Levenshtein
Similarity API runs 1,000× faster than TheFuzz at 1M records: a head-to-head benchmark against RapidFuzz, TheFuzz, and python-Levenshtein.
Fuzzy-match millions of rows in Databricks (2026)
A step-by-step notebook workflow: export, match via Similarity API, and land results back into Delta.
Fuzzy-match a million rows in under 10 minutes
A practical walkthrough showing how to deduplicate a million rows of real-world data in under 10 minutes using Similarity API.
How to match a 1M-row dataset to a canonical reference in under 10 minutes (2026 guide)
Learn how to match a 1M-row dataset to a canonical reference in under 10 minutes. Avoid brute-force similarity joins, brittle scripts, and custom candidate-generation pipelines with a scalable reconciliation API.
How to Reconcile Leads Against Contacts in Salesforce at Scale
Learn how Salesforce teams reconcile leads against existing contacts to prevent duplicate pipeline, improve routing accuracy, and maintain clean CRM reporting at scale.
How to Match Two Lists with Fuzzy Logic: Merging a Trade Show Export with Your CRM (No Code)
VLOOKUP misses contacts that exist under a different name or email. Here's how to fuzzy match two lists — trade show exports, enriched leads, CRM exports — to find who's already there before you create duplicates.
How to Deduplicate Your Contact List Before Importing to HubSpot
HubSpot only deduplicates on email address — which means it misses most real-world duplicates. Here's what to clean before you hit import, and how to do it without code.
How to Find & Merge Duplicate Company Names in a Spreadsheet or CSV
Excel's Remove Duplicates misses most company name duplicates. Here's why — and how to actually find and merge records when names are spelled differently.
Why It Rarely Makes Sense to Build Fuzzy Matching Yourself in 2026
The hard part isn't scoring string similarity — it's the full pipeline around it. Here's why most teams are better off not building it.
How Similarity API Works
Most teams don't struggle because they lack a similarity function. They struggle because fuzzy matching in production quickly becomes a pipeline.
How Similarity API Mimics the Ideal Fuzzy-Matching Pipeline Engineers Would Build
Experienced engineers converge on similar architectures for large-scale fuzzy matching. Similarity API reflects that convergence.
Why Similarity API Is Not Hard to Tune
Fuzzy matching systems often become hard to tune because of preprocessing, blocking, and threshold design. Learn why sensible defaults and practical controls matter more.
Why Fuzzy Matching at Scale Stops Being a Library Problem
Fuzzy matching libraries solve similarity scoring but not large-scale matching workflows. Learn why matching at scale becomes a system design challenge.
Using Similarity API Across Your Stack
Standardizing fuzzy-matching behaviour across tools and workflows helps teams maintain consistent deduplication and reconciliation outcomes at scale.
From One-Off Dedupe Task to Core Data Capability
Fuzzy matching often begins as a one-off deduplication task but quickly becomes a recurring need. Unifying matching logic into a consistent capability helps improve data quality and operational efficiency.
How to fuzzy-match 1M rows from BigQuery in under 10 minutes (2026 guide)
Learn how to fuzzy-match 1 million rows directly from a BigQuery notebook in under 10 minutes. Avoid cross-join explosions and custom blocking pipelines with a scalable deduplication API.
How to fuzzy-match 1M rows with dbt in under 10 minutes (2026 guide)
Learn how to fuzzy-match 1 million rows with dbt in under 10 minutes. Avoid brittle Python scripts, warehouse-native limits, and custom blocking pipelines with a scalable deduplication API.
How to fuzzy-match 1M rows in an Airflow pipeline in under 10 minutes (2026 guide)
Learn how to fuzzy-match 1 million rows inside an Airflow data pipeline in under 10 minutes. Replace brittle batch scripts and warehouse cross-joins with a scalable deduplication API step.
Fuzzy Matching at Scale: What Changes as Data Grows
A practical guide to how fuzzy matching changes as datasets grow from small cleanups to production-scale pipelines.