How to Match Salesforce Leads to Existing Contacts at Scale

Duplicate people records are almost inevitable in modern Salesforce environments. Identity data flows in from forms, enrichment tools, outbound prospecting, partner systems, event imports, product signups, and manual entry. Even with well‑run processes, slight variations in names, emails, titles, and company formats accumulate over time — especially as multiple systems feed the same CRM.

At scale, teams eventually need a way to answer very practical questions:

Which of the new leads we are importing already exist as contacts?
Which account owner should this inbound lead actually belong to?
How do we clean identity data across the CRM before a migration or reporting reset?

This is where lead‑to‑contact reconciliation workflows typically emerge.

Why teams run this workflow

The motivation is operational.

reporting accuracy — duplicate identities fragment attribution and pipeline analytics
routing correctness — new leads often need to inherit ownership from existing accounts
import risk reduction — bulk uploads can create thousands of duplicates without pre‑checks
automation enablement — teams surface similar contacts, block conversions, or auto‑assign ownership

Over time this becomes a recurring RevOps capability rather than a one‑time cleanup task.

What this looks like in practice

Common patterns include:

Pre‑import identity checks

export contacts
reconcile new leads against the contact base
review high‑confidence matches
merge or update before import

Scheduled identity cleanup jobs

compare recently created leads to contacts
write suggested match IDs or similarity scores to custom fields
create review queues for RevOps

Automation‑driven identity resolution

Apex triggers enqueue an asynchronous callout to an HTTP reconciliation endpoint when leads are inserted (triggers cannot make synchronous callouts)
Salesforce Flows surface candidate matches for SDR review
nightly jobs reassign leads to existing account owners

At this stage, similarity matching becomes part of operational CRM infrastructure.

Exact vs similarity matching in CRM reconciliation

Traditional CRM deduplication relies on exact matching — typically email equality or strict rule logic. This works well when identifiers are clean and consistent.

In real GTM environments, identity signals drift:

people use multiple emails
company names are formatted differently
titles and suffixes vary
records are created by different systems and teams

This is where similarity‑based matching becomes necessary. Instead of asking "are these fields identical?" the workflow asks "are these records likely to represent the same real‑world person?"

Exact matching remains useful as a first filter. Similarity matching extends coverage to ambiguous cases that exact rules cannot resolve at scale.
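As a rough illustration of exact matching as a first filter, the sketch below splits leads into those resolved by case-insensitive email equality and those left over for similarity matching. The function name and the dict-shaped records with an "email" key are assumptions made for this example, not part of any Salesforce or Similarity API interface.

```python
def split_by_exact_email(leads, contacts):
    """Return (exact_matches, unresolved_leads) using case-insensitive email equality."""
    # Index contacts by normalized email. If two contacts share an email,
    # the later one wins here; a real job would flag that as a duplicate too.
    contacts_by_email = {
        c["email"].strip().lower(): c for c in contacts if c.get("email")
    }
    exact, unresolved = [], []
    for lead in leads:
        key = (lead.get("email") or "").strip().lower()
        if key and key in contacts_by_email:
            exact.append((lead, contacts_by_email[key]))
        else:
            unresolved.append(lead)  # candidates for similarity matching
    return exact, unresolved
```

Only the unresolved leads then need the more expensive similarity comparison.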

How reconciliation pipelines usually work

Conceptually, identity matching pipelines involve:

preprocessing — normalize casing, punctuation, token order, company suffixes
similarity calculation — compare identity strings
filtering — keep matches above a confidence threshold
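On a small dataset, the three stages above can be sketched with nothing but the Python standard library. This is an illustrative sketch, not a description of how any particular service scores records: difflib.SequenceMatcher stands in for the similarity function, and the suffix list, function names, and 0.82 default threshold are assumptions.

```python
import re
from difflib import SequenceMatcher

# Illustrative suffix list; a real one would be longer and tuned to your data.
COMPANY_SUFFIXES = {"inc", "corp", "llc", "ltd", "co"}

def preprocess(text):
    """Lowercase, strip punctuation, drop company suffixes, sort tokens."""
    tokens = re.sub(r"[^a-z0-9]+", " ", text.lower()).split()
    return " ".join(sorted(t for t in tokens if t not in COMPANY_SUFFIXES))

def similarity(a, b):
    """Character-level similarity in [0, 1] (stand-in scorer)."""
    return SequenceMatcher(None, a, b).ratio()

def match(leads, contacts, threshold=0.82):
    """Return (lead_index, contact_index, score) rows at or above the threshold."""
    clean_contacts = [preprocess(c) for c in contacts]
    rows = []
    for i, lead in enumerate(leads):
        clean_lead = preprocess(lead)
        # Every lead is compared with every contact: fine for small data only.
        for j, clean_contact in enumerate(clean_contacts):
            score = similarity(clean_lead, clean_contact)
            if score >= threshold:
                rows.append((i, j, round(score, 2)))
    # Group by lead, best score first.
    return sorted(rows, key=lambda r: (r[0], -r[2]))

rows = match(["Doe, Jane - Acme Inc"], ["Jane Doe Acme", "Mark Lee North IO"])
```

The nested loop makes the cost proportional to the number of leads times the number of contacts, which is the part that stops working as the contact base grows.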

This logic is straightforward on small datasets. It becomes harder when:

CRM datasets grow into hundreds of thousands of records
imports and enrichment create continuous identity drift
reconciliation must run frequently or automatically

This is typically where teams move from ad‑hoc scripts to more scalable approaches.
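One common way to keep a self-built pipeline tractable at this size is blocking: only compare records that share a cheap key. The sketch below blocks on email domain, which is an assumption chosen for illustration; the function names and dict-shaped records are hypothetical as well.

```python
from collections import defaultdict

def email_domain(record):
    """Blocking key: the lowercased domain of the record's email, or ''."""
    email = (record.get("email") or "").lower()
    return email.rsplit("@", 1)[-1] if "@" in email else ""

def candidate_pairs(leads, contacts, key=email_domain):
    """Yield (lead_index, contact_index) pairs that share a blocking key."""
    blocks = defaultdict(list)
    for j, contact in enumerate(contacts):
        k = key(contact)
        if k:  # records without a usable key are left out of every block
            blocks[k].append(j)
    for i, lead in enumerate(leads):
        for j in blocks.get(key(lead), []):
            yield i, j
```

Blocking trades recall for speed: a lead that signed up with a personal email address will never be compared with its work-email contact, so teams often run several passes with different keys.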

Replacing the pipeline with a single reconciliation call

Build it yourself

design and algorithm selection
preprocessing and normalization
blocking strategy (for scale)
scoring and threshold tuning
filtering and candidate ranking
output formatting

The result is a pipeline to build, test, and maintain.

Call Similarity API

one API call
one integration
scales automatically
no maintenance
works from any HTTP environment

In practice, this entire comparison workflow can be replaced with one API request:

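import os
import requests

# Assumption: the API key is stored in an environment variable of this name.
API_KEY = os.environ["SIMILARITY_API_KEY"]

# lead_match_strings and contact_match_strings are lists of identity strings,
# one per Lead and one per Contact, built before this call.
payload = {
    "data_a": lead_match_strings,
    "data_b": contact_match_strings,
    "config": {
        "similarity_threshold": 0.82,
        "top_n": 3,
        "to_lowercase": True,
        "remove_punctuation": True,
        "use_token_sort": True,
        "output_format": "flat_table"
    }
}

response = requests.post(
    "https://api.similarity-api.com/reconcile",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=60,
)
response.raise_for_status()  # fail loudly on HTTP errors
res = response.json()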

The key design choice is defining the identity string — typically a combination of first name, last name, email, company or account name, and title.
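A minimal sketch of building those identity strings from exported records follows. It assumes each record is a dict keyed by Salesforce field names, and that the contact export has flattened the account name into a hypothetical "AccountName" column; the helper name and the " | " separator (borrowed from the example output in this article) are illustrative choices.

```python
LEAD_FIELDS = ("FirstName", "LastName", "Email", "Company")
# "AccountName" is a hypothetical flattened column for the contact's account.
CONTACT_FIELDS = ("FirstName", "LastName", "Email", "AccountName")

def identity_string(record, fields, sep=" | "):
    """Join the non-empty values of the chosen fields into one match string."""
    parts = [str(record.get(f) or "").strip() for f in fields]
    return sep.join(p for p in parts if p)

lead = {"FirstName": "Jane", "LastName": "Doe",
        "Email": "jane@acme.com", "Company": "Acme Inc"}
lead_string = identity_string(lead, LEAD_FIELDS)
```

Keeping the field order identical on both sides matters more than which fields are chosen, since the two strings are compared as a whole.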

Example output

With flat_table, results are returned as row‑level matches keyed by dataset indexes.

index_a	text_a	index_b	text_b	score	matched
0	Jane | Doe | jane@acme.com | Acme Inc	1542	Jane | Doe | j.doe@acme.com | Acme	0.93	TRUE
0	Jane | Doe | jane@acme.com | Acme Inc	9811	Janet | Doe | janet@acme.com | Acme Corp	0.84	TRUE
1	Mark | Lee | mark@north.io | North IO	2207	Marc | Lee | mlee@north.io | North.io	0.81	FALSE

Other output formats are available. This one is commonly used because it makes it easy to:

join results back to Salesforce Lead and Contact IDs
inspect candidate matches in review queues or notebooks
feed downstream merge, routing, or automation workflows
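Joining back to Salesforce IDs can be as simple as keeping two ID lists in the same order as the strings that were sent. The sketch below assumes the rows have already been parsed into dicts with the column names shown in the table above; the actual response envelope is not documented here, so treat that shape, the function name, and the ID lists as assumptions.

```python
def attach_salesforce_ids(rows, lead_ids, contact_ids):
    """Map index_a / index_b back to record IDs, keeping matched rows only."""
    joined = []
    for row in rows:
        if not row["matched"]:
            continue  # below-threshold candidates are dropped
        joined.append({
            "lead_id": lead_ids[row["index_a"]],
            "contact_id": contact_ids[row["index_b"]],
            "score": row["score"],
        })
    return joined

# lead_ids[i] must be the ID of the lead whose string was sent at position i;
# the same holds for contact_ids. The IDs below are placeholders.
rows = [
    {"index_a": 0, "index_b": 1, "score": 0.93, "matched": True},
    {"index_a": 1, "index_b": 0, "score": 0.81, "matched": False},
]
pairs = attach_salesforce_ids(rows, ["LEAD_0", "LEAD_1"], ["CONTACT_0", "CONTACT_1"])
```

The resulting lead and contact ID pairs are what a review queue, merge job, or owner-reassignment step would consume.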

Ultimately, lead‑contact reconciliation is not just about deduplicating records. It is about establishing a scalable way to interpret identity similarity across revenue systems — whether the workflow runs from a notebook, an ETL job, an Apex callout, or any HTTP‑based automation layer.

Try it on your own CRM data

Upload a CSV of leads and contacts — up to 100k rows free, no setup needed.