Similarity API Documentation
Similarity API is a fast, scalable fuzzy string matching service. It matches records across datasets or deduplicates records within a single dataset, and it is built for real-world text where spelling, formatting, and structure vary.
Instead of wiring together libraries and custom pipelines, you send data and receive structured, scored matches. You control how strict or permissive matching should be with a small set of parameters; the system does the heavy lifting.
Why engineers use Similarity API
- Scales from a handful of records to millions without special infrastructure
- Tunable matching behavior via clear parameters (thresholds, preprocessing, top_n)
- Simple REST interface with copy-paste code samples
Two core operations
Reconcile (A → B)
Match records in one dataset against another to link related entries that aren't identical. This is the familiar "compare strings" problem: you can send two strings to compare, or millions of records across two datasets, and retrieve the best matches through the same operation.
Example: matching new supplier names from an import to canonical entries in your product catalog when formats differ.
Dedupe (A ↔ A)
Identify and group overlapping records within a single dataset so you can review or merge them with confidence scores.
Example: finding multiple customer signups that represent the same person or organization under slightly different names.
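Both operations take their input strings plus an optional config object with largely the same options. Parameters like similarity_threshold, the preprocessing flags (remove_punctuation, to_lowercase, use_token_sort), and top_n determine how strict matching is, how text is standardized, and how many candidates are returned per item.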
Authentication
All requests must include your API key as a Bearer token in the HTTP header:
Authorization: Bearer <YOUR_API_KEY>
Your key is available in the dashboard after signup.
Token security & lifecycle
- Keep your token secret. Never commit it to repos or expose it in client-side code.
- Rotate keys from your dashboard if you suspect exposure.
- Tokens remain valid until you rotate or revoke them (we may revoke for abuse or billing issues).
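The header format and the "keep it out of source code" advice above can be combined in a small helper. This is a minimal sketch: authHeaders is a hypothetical name, and it assumes the key lives in the SIMILARITY_API_KEY environment variable used by the samples on this page.

```javascript
// Hypothetical helper (not part of the API): builds request headers from a key
// that is read from the environment instead of being hardcoded.
function authHeaders(apiKey = process.env.SIMILARITY_API_KEY) {
  if (typeof apiKey !== 'string' || apiKey.trim() === '') {
    // Fail early rather than sending an unauthenticated request
    throw new Error('SIMILARITY_API_KEY is not set');
  }
  return {
    Authorization: `Bearer ${apiKey.trim()}`,
    'Content-Type': 'application/json',
  };
}
```

Pass the returned object as the headers option of whichever HTTP client you use.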
Reconcile
The Reconcile endpoint matches strings from one dataset to records in another. It's ideal for situations where you want to link new data to an existing reference source or fuzzily merge two datasets without relying on exact keys.
At its core, you can compare just two strings — or scale the same logic to millions of records. The API handles the matching efficiently so you can focus on how to use the results, not building the underlying matching logic.
POST /reconcile
Request
curl -sS -X POST https://api.similarity-api.com/reconcile \
  -H "Authorization: Bearer $SIMILARITY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data_a": ["Microsoft", "appLE"],
    "data_b": ["Apple Inc.", "Microsoft"],
    "config": {
      "similarity_threshold": 0.75,
      "top_n": 1,
      "remove_punctuation": true,
      "to_lowercase": true,
      "use_token_sort": true,
      "output_format": "flat_table"
    }
  }'
Response (flat_table format)
{
  "status": "success",
  "response_data": [
    { "index_a": 0, "text_a": "Microsoft", "matched": true, "index_b": 1, "text_b": "Microsoft", "score": 1.0, "threshold": 0.75 },
    { "index_a": 1, "text_a": "appLE", "matched": true, "index_b": 0, "text_b": "Apple Inc.", "score": 0.7906, "threshold": 0.75 }
  ]
}
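Once a flat_table response arrives, a common first step is to separate rows that matched from rows that did not. The sketch below assumes that items without a match still appear as rows with "matched": false; partitionMatches is a hypothetical helper name.

```javascript
// Hypothetical helper: splits flat_table rows by their "matched" flag.
// Assumption: unmatched data_a items are returned with matched === false.
function partitionMatches(rows) {
  const matched = [];
  const unmatched = [];
  for (const row of rows) {
    (row.matched ? matched : unmatched).push(row);
  }
  return { matched, unmatched };
}
```

Call it with the response_data array of the response body; the unmatched list is a natural input for manual review.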
Parameters
data_a
Required. Array of strings representing the list to reconcile (e.g., company names from a spreadsheet that need to be matched).
data_b
Required. Array of strings representing the reference list to match against (e.g., master database or CRM system records).
config
Optional. Configuration object containing the following parameters:
similarity_threshold
Float value between 0 and 1. Controls how similar strings must be to be considered matches. Default is 0.85. Lower thresholds will find more matches but may include false positives.
top_n
Integer value. Maximum number of matches to return per data_a item. Default is 1.
remove_punctuation
Boolean. When true, removes all punctuation before comparison. Useful for company names where punctuation may vary. Default is false.
to_lowercase
Boolean. Converts all text to lowercase before comparison, making matching case-insensitive. Default is false.
use_token_sort
Boolean. When true, sorts tokens within each string before comparison, improving matches for strings with different word orders. Default is false.
output_format
Enum specifying the response format:
flat_table (recommended)
Returns one row per data_a item with match details. Most convenient for data analysis.
{
  "status": "success",
  "response_data": [
    { "index_a": 0, "text_a": "Microsoft", "matched": true, "index_b": 1, "text_b": "Microsoft", "score": 1.0, "threshold": 0.75 },
    { "index_a": 1, "text_a": "appLE", "matched": true, "index_b": 0, "text_b": "Apple Inc.", "score": 0.7906, "threshold": 0.75 }
  ]
}
string_pairs
Returns per-row top-N matches with text strings included: [text_a, text_b, score].
{ "response_data": [ ["Microsoft", "Microsoft", 1.00], ["appLE", "Apple Inc.", 0.79] ], "status": "success" }
index_pairs
Returns per-row top-N matches with indices only (more compact): [index_a, index_b, score].
{ "response_data": [ [0, 1, 1.00], [1, 0, 0.79] ], "status": "success" }
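The value ranges documented for the config parameters can be checked client-side before a request is sent. This is a sketch only: validateReconcileConfig is a hypothetical helper, and the rule that top_n must be at least 1 is an assumption (the parameter description only says it is an integer).

```javascript
const OUTPUT_FORMATS = ['flat_table', 'string_pairs', 'index_pairs'];
const BOOLEAN_FLAGS = ['remove_punctuation', 'to_lowercase', 'use_token_sort'];

// Hypothetical helper: returns a list of problems found in a Reconcile config.
// An empty list means the config passed every check below.
function validateReconcileConfig(config = {}) {
  const errors = [];
  const { similarity_threshold: t, top_n: n, output_format: f } = config;

  if (t !== undefined && !(typeof t === 'number' && t >= 0 && t <= 1)) {
    errors.push('similarity_threshold must be a number between 0 and 1');
  }
  // Assumption: top_n below 1 is not meaningful
  if (n !== undefined && !(Number.isInteger(n) && n >= 1)) {
    errors.push('top_n must be an integer >= 1');
  }
  for (const flag of BOOLEAN_FLAGS) {
    if (config[flag] !== undefined && typeof config[flag] !== 'boolean') {
      errors.push(`${flag} must be a boolean`);
    }
  }
  if (f !== undefined && !OUTPUT_FORMATS.includes(f)) {
    errors.push(`output_format must be one of: ${OUTPUT_FORMATS.join(', ')}`);
  }
  return errors;
}
```

Omitted keys are left alone so that the server-side defaults apply.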
Code Samples
const axios = require('axios');

async function reconcile() {
  const url = 'https://api.similarity-api.com/reconcile';
  const token = process.env.SIMILARITY_API_KEY; // read the key from the environment

  const payload = {
    data_a: ['Microsoft', 'appLE'],
    data_b: ['Apple Inc.', 'Microsoft'],
    config: {
      similarity_threshold: 0.75,
      top_n: 1,
      remove_punctuation: true,
      to_lowercase: true,
      use_token_sort: true,
      output_format: 'flat_table'
    }
  };

  try {
    const res = await axios.post(url, payload, {
      headers: {
        Authorization: `Bearer ${token}`,
        'Content-Type': 'application/json'
      }
    });
    console.log('Reconcile results:', res.data);
  } catch (err) {
    // Prefer the API's error body when the server responded
    console.error('Error:', err.response?.data || err.message);
  }
}

reconcile();
Common Use-cases
String comparison
Compare two strings by sending one-element arrays in data_a and data_b. Useful for quick checks, custom workflows, or manual review.
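A two-string comparison is just a Reconcile payload with one-element arrays. The sketch below builds that payload and reads the score back from a flat_table response; buildComparePayload and readScore are hypothetical names, and the response shape is taken from the flat_table example above.

```javascript
// Hypothetical helper: wraps two strings in one-element arrays.
// Extra config keys (for example to_lowercase) can be passed through.
function buildComparePayload(a, b, config = {}) {
  return {
    data_a: [a],
    data_b: [b],
    config: { top_n: 1, output_format: 'flat_table', ...config },
  };
}

// Hypothetical helper: returns the score of the first row, or null when
// the response has no row or the row carries no numeric score.
function readScore(responseBody) {
  const row = responseBody?.response_data?.[0];
  return row && typeof row.score === 'number' ? row.score : null;
}
```

POST the payload to the Reconcile endpoint and pass the parsed JSON body to readScore.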
Top-N matching
Return the best N fuzzy matches for each input string by adjusting the top_n parameter. This is ideal for candidate generation or surfacing multiple potential matches for downstream review.
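For candidate review it helps to group the returned pairs by input item. This sketch assumes the index_pairs output format and assumes that, with top_n greater than 1, several rows can share the same index_a; groupCandidates is a hypothetical helper.

```javascript
// Hypothetical helper: groups [index_a, index_b, score] rows by index_a and
// sorts each group so the strongest candidate comes first.
function groupCandidates(indexPairs) {
  const byA = new Map();
  for (const [indexA, indexB, score] of indexPairs) {
    if (!byA.has(indexA)) byA.set(indexA, []);
    byA.get(indexA).push({ indexB, score });
  }
  for (const candidates of byA.values()) {
    candidates.sort((x, y) => y.score - x.score);
  }
  return byA;
}
```

The resulting Map can be iterated to render one review card per data_a item.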
Has-a-match checks
Identify whether each input string meets a minimum similarity threshold with any reference record by combining top_n: 1 with a chosen similarity_threshold. The response clearly indicates whether a match was found and at what score.
CRM & reference matching
Reconcile new records (e.g. leads) against an existing database to link accounts, catch duplicates, or unify messy naming.
Fuzzy merges during imports
Join two datasets that don't share clean IDs, without manually engineering a fuzzy join. The API scales to millions of records while maintaining accuracy.
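A fuzzy join can be assembled client-side from an index_pairs response. The sketch below assumes data_a and data_b were built from the two record arrays in the same order (for example recordsA.map(r => r.name)), so the returned indices point back at the original records; fuzzyLeftJoin is a hypothetical helper.

```javascript
// Hypothetical helper: left join of recordsA onto recordsB using
// [index_a, index_b, score] rows. Unmatched left records get right: null.
function fuzzyLeftJoin(recordsA, recordsB, indexPairs) {
  const best = new Map(); // index_a -> { b, score } with the highest score
  for (const [a, b, score] of indexPairs) {
    const current = best.get(a);
    if (!current || score > current.score) best.set(a, { b, score });
  }
  return recordsA.map((left, i) => {
    const hit = best.get(i);
    return {
      left,
      right: hit ? recordsB[hit.b] : null,
      score: hit ? hit.score : null,
    };
  });
}
```

Keeping unmatched rows in the output makes it easy to see which imports still need manual linking.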
Dedupe
The Dedupe endpoint finds near-duplicate or overlapping records within a single dataset. It's designed for cleaning data imports, merging signups, or detecting repeated entries when exact matching isn't possible.
This operation uses the same fuzzy matching engine as Reconcile, but instead of comparing two separate datasets, it looks within one dataset, identifying groups or pairs of records that are similar above your chosen threshold.
POST /dedupe
Request
curl -X POST https://api.similarity-api.com/dedupe \
  -H "Authorization: Bearer $SIMILARITY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      "Apple Inc",
      "Apple Inc.",
      "Apple Incorporated",
      "Microsoft Corporation",
      "Microsoft Corp"
    ],
    "config": {
      "similarity_threshold": 0.75,
      "remove_punctuation": false,
      "to_lowercase": false,
      "use_token_sort": false,
      "output_format": "index_pairs"
    }
  }'
Response (index_pairs format)
{ "response_data": [ [0, 1, 0.94], [0, 2, 0.86], [3, 4, 0.79] ], "status": "success" }
Parameters
data
Required. Array of strings to check for duplicates.
config
Optional. Configuration object containing the following parameters:
similarity_threshold
Float value between 0 and 1. Controls how similar strings must be to be considered duplicates. Default is 0.65. Lower thresholds will catch more potential duplicates but may include false positives, while higher thresholds are more precise but might miss some variations.
remove_punctuation
Boolean. When true, removes all punctuation before comparing strings. This is particularly useful for company names and addresses where punctuation may vary (e.g., "Inc" vs "Inc."). Default is false.
to_lowercase
Boolean. Converts all text to lowercase before comparison, making the matching process case-insensitive. Recommended for most general use cases, especially when capitalization isn't meaningful. Default is false.
use_token_sort
Boolean. When true, sorts tokens within each string before comparison, improving matches for strings with different word orders (same behavior as in Reconcile).
output_format
Enum specifying the response format:
index_pairs
Returns pairs of indices for two strings that have a similarity score above the provided threshold, along with their specific similarity score. This output format has the shortest computation time.
{ "response_data": [ [0, 1, 0.94], [0, 2, 0.86], [3, 4, 0.79] ], "status": "success" }
string_pairs
Returns pairs of strings that have a similarity score above the provided threshold, along with their specific similarity score. This format is more human-readable.
{ "response_data": [ ["Apple Inc", "Apple Inc.", 0.94], ["Apple Inc", "Apple Incorporated", 0.86], ["Microsoft Corporation", "Microsoft Corp", 0.79] ], "status": "success" }
deduped_indices
Returns only one index from each group of similar strings, effectively deduplicating the input list by returning representative indices.
{ "response_data": [0, 3], "status": "success" }
deduped_strings
Returns the deduplicated strings directly, providing a clean list with duplicates removed based on the similarity threshold.
{ "response_data": ["Apple Inc", "Microsoft Corporation"], "status": "success" }
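When your records carry more than the string that was deduplicated (IDs, emails, timestamps), the deduped_indices format lets you filter the full records yourself. A minimal sketch, assuming the data array was built from the records in the same order; keepRepresentatives is a hypothetical helper.

```javascript
// Hypothetical helper: keeps only the records whose positions appear in
// the deduped_indices response.
function keepRepresentatives(records, dedupedIndices) {
  const keep = new Set(dedupedIndices);
  return records.filter((_, i) => keep.has(i));
}
```

With the example above, passing [0, 3] keeps the first "Apple" record and the first "Microsoft" record along with all of their other fields.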
Code Samples
const axios = require('axios');

(async () => {
  const res = await axios.post(
    'https://api.similarity-api.com/dedupe',
    {
      data: [
        'Apple Inc',
        'Apple Inc.',
        'Apple Incorporated',
        'Microsoft Corporation',
        'Microsoft Corp'
      ],
      config: {
        similarity_threshold: 0.75,
        remove_punctuation: false,
        to_lowercase: false,
        use_token_sort: false,
        output_format: 'index_pairs'
      }
    },
    // axios serializes the object as JSON and sets Content-Type automatically
    { headers: { Authorization: `Bearer ${process.env.SIMILARITY_API_KEY}` } }
  );
  console.log(res.data);
})();
Common Use-cases
Strict production merges
Set a high threshold (e.g., 0.95) with top_n: 1 to surface only very strong pairs you can safely auto-merge.
Reviewer queues
Lower the threshold (e.g., 0.80) and set top_n > 1 to provide multiple candidates per record for human validation.
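The strict-merge and reviewer-queue workflows can also be served from a single request: call Dedupe once at the lower threshold, then split the returned pairs by score on the client. This is a design alternative rather than documented API behavior; triagePairs is a hypothetical helper and the 0.95 cutoff is only the example value from the text.

```javascript
// Hypothetical helper: splits [index_a, index_b, score] rows into pairs that
// are strong enough to merge automatically and pairs that need human review.
function triagePairs(pairs, autoMergeAt = 0.95) {
  const autoMerge = [];
  const review = [];
  for (const pair of pairs) {
    const score = pair[2];
    (score >= autoMergeAt ? autoMerge : review).push(pair);
  }
  return { autoMerge, review };
}
```

The same function works for string_pairs output, since the score sits in the third position in both formats.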
Messy import cleanup
Before loading a CSV into your system, run Dedupe with normalization flags (remove_punctuation, to_lowercase, use_token_sort) to catch formatting/spelling variants.
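To build intuition for what the normalization flags do to your data, you can preview a rough local equivalent. The service's exact normalization rules are not specified on this page, so treat this as an approximation for illustration only; previewNormalization is a hypothetical helper and does not call the API.

```javascript
// Hypothetical local approximation of the three preprocessing flags.
// The real service may tokenize or strip characters differently.
function previewNormalization(
  text,
  { remove_punctuation = false, to_lowercase = false, use_token_sort = false } = {}
) {
  let out = text;
  if (remove_punctuation) {
    // Drop everything that is not a letter, a digit, or whitespace
    out = out.replace(/[^\p{L}\p{N}\s]/gu, '');
  }
  if (to_lowercase) out = out.toLowerCase();
  if (use_token_sort) {
    // Split on whitespace, sort the tokens, and rejoin with single spaces
    out = out.split(/\s+/).filter(Boolean).sort().join(' ');
  }
  return out.trim();
}
```

With all three flags enabled, "Inc., Apple" and "Apple Inc." both come out as "apple inc" in this approximation, which is the kind of variant the flags are meant to absorb.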
Build clusters client-side
If you need groups rather than pairs, treat returned pairs as edges in a graph and compute connected components (union-find). This produces duplicate clusters without a special API format.
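The connected-components approach described above can be sketched with a small union-find. It takes the number of input records and the rows of an index_pairs response; clusterPairs is a hypothetical helper name.

```javascript
// Groups record indices into duplicate clusters using union-find.
// n is the length of the data array; pairs are [index_a, index_b, score] rows.
function clusterPairs(n, pairs) {
  const parent = Array.from({ length: n }, (_, i) => i);

  const find = (x) => {
    while (parent[x] !== x) {
      parent[x] = parent[parent[x]]; // path halving keeps trees shallow
      x = parent[x];
    }
    return x;
  };

  for (const [a, b] of pairs) {
    const ra = find(a);
    const rb = find(b);
    // Attach the larger root to the smaller so the lowest index is the root
    if (ra !== rb) parent[Math.max(ra, rb)] = Math.min(ra, rb);
  }

  const groups = new Map();
  for (let i = 0; i < n; i++) {
    const root = find(i);
    if (!groups.has(root)) groups.set(root, []);
    groups.get(root).push(i);
  }
  // Records with no duplicate form singleton groups and are dropped
  return [...groups.values()].filter((g) => g.length > 1);
}

console.log(clusterPairs(5, [[0, 1, 0.94], [0, 2, 0.86], [3, 4, 0.79]]));
// → [ [ 0, 1, 2 ], [ 3, 4 ] ]
```

Note that clustering is transitive: if A pairs with B and B pairs with C, all three land in one cluster even when A and C did not pair directly, so a stricter threshold is worth considering before auto-merging clusters.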
Has-a-duplicate check
To flag whether each record has any duplicate above a threshold, set top_n: 1, choose your similarity_threshold, and interpret "has any pair returned" as a boolean.
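Turning "has any pair returned" into a per-record boolean is a short loop over an index_pairs response. A minimal sketch; duplicateFlags is a hypothetical helper and n is the length of the data array you sent.

```javascript
// Hypothetical helper: flags[i] is true when record i appears in at least
// one returned [index_a, index_b, score] pair.
function duplicateFlags(n, indexPairs) {
  const flags = new Array(n).fill(false);
  for (const [a, b] of indexPairs) {
    flags[a] = true;
    flags[b] = true;
  }
  return flags;
}
```

Both sides of every pair are flagged, so the first occurrence of a duplicated record is marked as well as its later variants.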
Custom Solution
Need a custom solution? Our team can help you implement string similarity and deduplication solutions tailored to your needs. Contact us to set up an exploratory call.