Similarity API Documentation

Similarity API is a fast, scalable fuzzy string matching service. It lets you match records across datasets or deduplicate records within a single dataset, and it is built for real-world text where spelling, formatting, and structure vary.

Instead of wiring together libraries and custom pipelines, you send data and receive structured, scored matches. You control how strict or permissive matching should be with a small set of parameters; the system does the heavy lifting.

Why engineers use Similarity API

Scales from a handful of records to millions without special infrastructure
Tunable matching behavior via clear parameters (thresholds, preprocessing, top_n)
Simple REST interface with copy-paste code samples

Two core operations

Reconcile (A → B)

Match records in one dataset against another to link related entries that aren't identical. At its core, this is the familiar "compare strings" problem: you can send two strings to compare, or you can send millions of records across two datasets and retrieve the best matches at scale—all through the same operation.

Example: matching new supplier names from an import to canonical entries in your product catalog when formats differ.

Dedupe (A ↔ A)

Identify and group overlapping records within a single dataset so you can review or merge them with confidence scores.

Example: finding multiple customer signups that represent the same person or organization under slightly different names.

Both operations share the same request structure and options. Parameters like threshold, preprocessing, and top_n determine how strict matching is, how text is standardized, and how many candidates are returned per item.

Authentication

All requests must include your API key as a Bearer token in the HTTP header:

Authorization: Bearer <YOUR_API_KEY>

Your key is available in the dashboard after signup.

Token security & lifecycle

Keep your token secret. Never commit it to repos or expose it in client-side code.
Rotate keys from your dashboard if you suspect exposure.
Tokens remain valid until you rotate or revoke them (we may revoke for abuse or billing issues).
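
One way to keep the key out of source code is to read it from the environment when building request headers. The helper below is a minimal sketch: `buildAuthHeaders` is a hypothetical name, not part of any official client, and the environment variable name simply follows the one used in the samples on this page.

```javascript
// Hypothetical helper (not an official client function): builds the request
// headers from an API key, defaulting to the SIMILARITY_API_KEY env variable.
function buildAuthHeaders(apiKey = process.env.SIMILARITY_API_KEY) {
  if (typeof apiKey !== 'string' || apiKey.trim() === '') {
    // Fail early instead of sending an unauthenticated request.
    throw new Error('Missing API key: set SIMILARITY_API_KEY in the environment');
  }
  return {
    Authorization: `Bearer ${apiKey.trim()}`,
    'Content-Type': 'application/json',
  };
}
```

Failing fast on a missing key tends to give a clearer error than an authentication failure returned by the server.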

Reconcile

The Reconcile endpoint matches strings from one dataset to records in another. It's ideal for situations where you want to link new data to an existing reference source or fuzzily merge two datasets without relying on exact keys.

At its core, you can compare just two strings — or scale the same logic to millions of records. The API handles the matching efficiently so you can focus on how to use the results, not building the underlying matching logic.

POST /reconcile

Request

curl -sS -X POST https://api.similarity-api.com/reconcile \
  -H "Authorization: Bearer $SIMILARITY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data_a": ["Microsoft", "appLE"],
    "data_b": ["Apple Inc.", "Microsoft"],
    "config": {
      "similarity_threshold": 0.75,
      "top_n": 1,
      "remove_punctuation": true,
      "to_lowercase": true,
      "use_token_sort": true,
      "output_format": "flat_table"
    }
  }'

Response (flat_table format)

{
  "status": "success",
  "response_data": [
    {
      "index_a": 0,
      "text_a": "Microsoft",
      "matched": true,
      "index_b": 1,
      "text_b": "Microsoft",
      "score": 1.0,
      "threshold": 0.75
    },
    {
      "index_a": 1,
      "text_a": "appLE",
      "matched": true,
      "index_b": 0,
      "text_b": "Apple Inc.",
      "score": 0.7906,
      "threshold": 0.75
    }
  ]
}

Parameters

data_a

Required. Array of strings representing the list to reconcile (e.g., company names from a spreadsheet that need to be matched).

data_b

Required. Array of strings representing the reference list to match against (e.g., master database or CRM system records).

config

Optional. Configuration object containing the following parameters:

similarity_threshold

Float value between 0 and 1. Controls how similar strings must be to be considered matches. Default is 0.85. Lower thresholds will find more matches but may include false positives.

top_n

Integer value. Maximum number of matches to return per data_a item. Default is 1.

remove_punctuation

Boolean. When true, removes all punctuation before comparison. Useful for company names where punctuation may vary. Default is false.

to_lowercase

Boolean. Converts all text to lowercase before comparison, making matching case-insensitive. Default is false.

use_token_sort

Boolean. When true, sorts tokens within each string before comparison, improving matches for strings with different word orders. Default is false.
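
To build intuition for the three preprocessing flags, the sketch below approximates them locally. It is an illustration only: the function name `normalize` and its option names are hypothetical, and the service's actual normalization rules (for example, exactly which characters count as punctuation) are not specified on this page and may differ.

```javascript
// Illustrative approximation of remove_punctuation, to_lowercase and
// use_token_sort. This is NOT the service's implementation.
function normalize(text, { removePunctuation = false, toLowercase = false, useTokenSort = false } = {}) {
  let out = text;
  // Assumption: "punctuation" means anything that is not a letter, digit or whitespace.
  if (removePunctuation) out = out.replace(/[^\p{L}\p{N}\s]/gu, '');
  if (toLowercase) out = out.toLowerCase();
  // Split on whitespace, sort the tokens, and rejoin with single spaces.
  if (useTokenSort) out = out.split(/\s+/).filter(Boolean).sort().join(' ');
  return out.trim();
}
```

With all three options enabled, `normalize('Inc. Apple', ...)` and `normalize('apple inc', ...)` produce the same string, which is the kind of variation these flags are meant to absorb.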

output_format

Enum specifying the response format:

flat_table (recommended)

Returns one row per data_a item with match details. Most convenient for data analysis.

{
  "status": "success",
  "response_data": [
    {
      "index_a": 0,
      "text_a": "Microsoft",
      "matched": true,
      "index_b": 1,
      "text_b": "Microsoft",
      "score": 1,
      "threshold": 0.75
    },
    {
      "index_a": 1,
      "text_a": "appLE",
      "matched": true,
      "index_b": 0,
      "text_b": "Apple Inc.",
      "score": 0.7906,
      "threshold": 0.75
    }
  ]
}
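
A common first step with flat_table output is to separate rows that found a match from rows that did not. The sketch below assumes that unmatched rows are still returned with `matched` set to false; that shape is not shown in the example above, so treat it as an assumption. `partitionMatches` is a hypothetical helper name.

```javascript
// Splits flat_table rows into matched and unmatched groups.
// Assumption: rows without a match carry "matched": false.
function partitionMatches(rows) {
  const matched = [];
  const unmatched = [];
  for (const row of rows) {
    (row.matched ? matched : unmatched).push(row);
  }
  return { matched, unmatched };
}
```

Called with `res.data.response_data`, this gives one list to link automatically and one list to route to manual review.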

string_pairs

Returns per-row top-N matches with text strings included: [text_a, text_b, score].

{
  "response_data": [
    ["Microsoft", "Microsoft", 1.00],
    ["appLE", "Apple Inc.", 0.79]
  ],
  "status": "success"
}

index_pairs

Returns per-row top-N matches with indices only (more compact): [index_a, index_b, score].

{
  "response_data": [
    [0, 1, 1.00],
    [1, 0, 0.79]
  ],
  "status": "success"
}
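
Because index_pairs omits the text, the original strings have to be looked up from the arrays that were sent. A minimal sketch, where `resolveIndexPairs` is a hypothetical helper name and `dataA` and `dataB` are the same arrays passed as data_a and data_b:

```javascript
// Converts [index_a, index_b, score] triples back into
// [text_a, text_b, score] using the original request arrays.
function resolveIndexPairs(pairs, dataA, dataB) {
  return pairs.map(([indexA, indexB, score]) => [dataA[indexA], dataB[indexB], score]);
}
```

This keeps responses compact for large jobs while still allowing a readable view to be produced on the client.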

Code Samples

const axios = require('axios');

async function reconcile() {
  const url = 'https://api.similarity-api.com/reconcile';
  const token = process.env.SIMILARITY_API_KEY; 

  const payload = {
    data_a: ['Microsoft', 'appLE'],
    data_b: ['Apple Inc.', 'Microsoft'],
    config: {
      similarity_threshold: 0.75,
      top_n: 1,
      remove_punctuation: true,
      to_lowercase: true,
      use_token_sort: true,
      output_format: 'flat_table'
    }
  };

  try {
    const res = await axios.post(url, payload, {
      headers: { Authorization: `Bearer ${token}`, 'Content-Type': 'application/json' }
    });
    console.log('Reconcile results:', res.data);
  } catch (err) {
    console.error('Error:', err.response?.data || err.message);
  }
}

reconcile();

Common Use-cases

String comparison

Compare two strings by sending one-element arrays in data_a and data_b. Useful for quick checks, custom workflows, or manual review.
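
As a sketch, the request body for a two-string comparison can be built like this. `buildComparePayload` is a hypothetical helper; the field names are the ones documented in the Parameters section.

```javascript
// Builds a /reconcile request body that compares exactly two strings.
// Extra config values (e.g. similarity_threshold) can be passed in.
function buildComparePayload(a, b, config = {}) {
  return {
    data_a: [a],
    data_b: [b],
    config: { top_n: 1, ...config },
  };
}
```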

Top-N matching

Return the best N fuzzy matches for each input string by adjusting the top_n parameter. This is ideal for candidate generation or surfacing multiple potential matches for downstream review.
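
When top_n is greater than 1, the pair-based formats can contain several entries for the same data_a item. The sketch below groups index_pairs triples by index_a and orders each group by score. `groupCandidates` is a hypothetical helper, and the code does not rely on the order in which the API returns pairs, since that order is not documented here.

```javascript
// Groups [index_a, index_b, score] triples by index_a.
// Returns a Map: index_a -> candidates sorted by descending score.
function groupCandidates(pairs) {
  const byIndexA = new Map();
  for (const [indexA, indexB, score] of pairs) {
    if (!byIndexA.has(indexA)) byIndexA.set(indexA, []);
    byIndexA.get(indexA).push({ indexB, score });
  }
  for (const candidates of byIndexA.values()) {
    candidates.sort((x, y) => y.score - x.score);
  }
  return byIndexA;
}
```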

Has-a-match checks

Identify whether each input string meets a minimum similarity threshold with any reference record by combining top_n: 1 with a chosen similarity_threshold. The response clearly indicates whether a match was found and at what score.

CRM & reference matching

Reconcile new records (e.g. leads) against an existing database to link accounts, catch duplicates, or unify messy naming.

Fuzzy merges during imports

Join two datasets that don't share clean IDs, without manually engineering a fuzzy join. The API scales to millions of records while maintaining accuracy.

Dedupe

The Dedupe endpoint finds near-duplicate or overlapping records within a single dataset. It's designed for cleaning data imports, merging signups, or detecting repeated entries when exact matching isn't possible.

This operation uses the same fuzzy matching engine as Reconcile, but instead of comparing two separate datasets, it looks within one dataset, identifying groups or pairs of records that are similar above your chosen threshold.

POST /dedupe

Request

curl -X POST https://api.similarity-api.com/dedupe \
  -H "Authorization: Bearer $SIMILARITY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "data": [
      "Apple Inc",
      "Apple Inc.",
      "Apple Incorporated",
      "Microsoft Corporation",
      "Microsoft Corp"
    ],
    "config": {
      "similarity_threshold": 0.75,
      "remove_punctuation": false,
      "to_lowercase": false,
      "use_token_sort": false,
      "output_format": "index_pairs"
    }
  }'

Response (index_pairs format)

{
  "response_data": [
    [0, 1, 0.94],
    [0, 2, 0.86],
    [3, 4, 0.79]
  ],
  "status": "success"
}

Parameters

data

Required. Array of strings to check for duplicates.

config

Optional. Configuration object containing the following parameters:

similarity_threshold

Float value between 0 and 1. Controls how similar strings must be to be considered duplicates. Default is 0.65. Lower thresholds will catch more potential duplicates but may include false positives, while higher thresholds are more precise but might miss some variations.

remove_punctuation

Boolean. When true, removes all punctuation before comparing strings. This is particularly useful for company names and addresses where punctuation may vary (e.g., "Inc" vs "Inc."). Default is false.

to_lowercase

Boolean. Converts all text to lowercase before comparison, making the matching process case-insensitive. Recommended for most general use cases, especially when capitalization isn't meaningful. Default is false.

use_token_sort

Boolean. When true, sorts tokens within each string before comparison, improving matches for strings with different word orders. This is the same flag described under Reconcile.

output_format

Enum specifying the response format:

index_pairs

Returns pairs of indices for two strings that have a similarity score above the provided threshold, along with their specific similarity score. This output format has the shortest computation time.

{
  "response_data": [
    [0, 1, 0.94],
    [0, 2, 0.86],
    [3, 4, 0.79]
  ],
  "status": "success"
}

string_pairs

Returns pairs of strings that have a similarity score above the provided threshold, along with their specific similarity score. This format is more human-readable.

{
  "response_data": [
    ["Apple Inc", "Apple Inc.", 0.94],
    ["Apple Inc", "Apple Incorporated", 0.86],
    ["Microsoft Corporation", "Microsoft Corp", 0.79]
  ],
  "status": "success"
}

deduped_indices

Returns only one index from each group of similar strings, effectively deduplicating the input list by returning representative indices.

{
  "response_data": [0, 3],
  "status": "success"
}

deduped_strings

Returns the deduplicated strings directly, providing a clean list with duplicates removed based on the similarity threshold.

{
  "response_data": ["Apple Inc", "Microsoft Corporation"],
  "status": "success"
}

membership_map

Returns a list of integers with the same length as your input data array. Each position corresponds to the input string at that index, and the value is that string's group ID. Strings with the same group ID belong to the same duplicate cluster. Each group ID is the smallest index within that group, matching the indices returned in deduped_indices.

{
  "response_data": [0, 0, 0, 3, 3],
  "status": "success"
}
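
A membership map can be turned into explicit clusters by grouping input strings on their group ID. A minimal sketch, where `clustersFromMembership` is a hypothetical helper and `data` is the same array sent in the request:

```javascript
// Groups input strings by the group ID found at the same position
// in the membership_map response.
// Returns a Map: group ID -> array of member strings.
function clustersFromMembership(membership, data) {
  const clusters = new Map();
  membership.forEach((groupId, index) => {
    if (!clusters.has(groupId)) clusters.set(groupId, []);
    clusters.get(groupId).push(data[index]);
  });
  return clusters;
}
```

For the example above, group 0 would contain the three Apple variants and group 3 the two Microsoft variants.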

Code Samples

const axios = require('axios');

(async () => {
  const res = await axios.post(
    'https://api.similarity-api.com/dedupe',
    {
      data: [
        'Apple Inc',
        'Apple Inc.',
        'Apple Incorporated',
        'Microsoft Corporation',
        'Microsoft Corp'
      ],
      config: {
        similarity_threshold: 0.75,
        remove_punctuation: false,
        to_lowercase: false,
        use_token_sort: false,
        output_format: 'index_pairs'
      }
    },
    { headers: { Authorization: `Bearer ${process.env.SIMILARITY_API_KEY}` } }
  );
  console.log(res.data);
})();

Common Use-cases

Strict production merges

Set a high threshold (e.g., 0.95) with top_n: 1 to surface only very strong pairs you can safely auto-merge.

Reviewer queues

Lower the threshold (e.g., 0.80) and set top_n > 1 to provide multiple candidates per record for human validation.

Messy import cleanup

Before loading a CSV into your system, run Dedupe with normalization flags (remove_punctuation, to_lowercase, use_token_sort) to catch formatting/spelling variants.

Build clusters client-side

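If you need groups rather than pairs, you can request the membership_map output format, or build the groups yourself: treat returned pairs as edges in a graph and compute connected components (for example, with union-find). The client-side route is useful when you want to filter or re-weight pairs before clustering.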
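
The connected-components approach can be sketched with a small union-find. `clusterPairs` is a hypothetical helper; it takes index_pairs triples and the length of the input array, and labels each record with the smallest index in its cluster, mirroring the convention described for membership_map.

```javascript
// Union-find over dedupe index_pairs.
// pairs: array of [indexA, indexB, score]; n: number of input records.
// Returns an array of length n where each value is the smallest index
// in that record's cluster.
function clusterPairs(pairs, n) {
  const parent = Array.from({ length: n }, (_, i) => i);

  const find = (i) => {
    while (parent[i] !== i) {
      parent[i] = parent[parent[i]]; // path halving keeps trees shallow
      i = parent[i];
    }
    return i;
  };

  for (const [a, b] of pairs) {
    const rootA = find(a);
    const rootB = find(b);
    if (rootA === rootB) continue;
    // Always keep the smaller index as the root, so the root of a
    // cluster is its smallest member.
    if (rootA < rootB) parent[rootB] = rootA;
    else parent[rootA] = rootB;
  }

  return parent.map((_, i) => find(i));
}
```

Note that clustering is transitive: if A matches B and B matches C, all three land in one group even when A and C were not returned as a pair.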

Has-a-duplicate check

To flag whether each record has any duplicate above a threshold, set top_n: 1, choose your similarity_threshold, and interpret "has any pair returned" as a boolean.
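
Given index_pairs output, the per-record flag can be derived as follows. This is a sketch with a hypothetical helper name, `duplicateFlags`; `n` is the length of the data array that was sent.

```javascript
// Marks every record that appears in at least one returned pair.
// pairs: array of [indexA, indexB, score]; n: number of input records.
function duplicateFlags(pairs, n) {
  const flags = new Array(n).fill(false);
  for (const [a, b] of pairs) {
    flags[a] = true;
    flags[b] = true;
  }
  return flags;
}
```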

Custom Solution

Need a custom solution? Our team can help you implement string similarity and deduplication workflows tailored to your needs. Contact us to set up an exploratory call.
