Promo Dedupe Lab – Mobile Friendly

How this demo deduplicates promos (step-by-step)

We normalize text → compare items with three similarity metrics → link near-duplicates → form clusters. Use this as a quick reference.

Exact match → Jaccard ≥ 0.55 → Jaro‑Winkler ≥ 0.90 or Norm‑Lev ≥ 0.80 → Cluster (connected items)

Input: One promo per line.
Preprocess: Lowercase, strip punctuation (keep %, $, +, ., -), collapse spaces. Tokenize words, drop stopwords if enabled, and ignore tokens shorter than Min token length.
Exact‑match dedupe: Identical strings collapse immediately (counter shows how many were merged).
Pairwise scores: Jaccard (token overlap), Jaro‑Winkler (typos and shared prefixes), Norm‑Levenshtein (1 − (edit distance ÷ length of the longer string)).
Thresholding: Any metric at or above its slider value links a pair.
Clustering: Links form a graph; connected items become a near‑duplicate cluster.
Inspect: Tap a heatmap cell for scores + diff.
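
As a rough illustration of the Preprocess and Exact‑match steps above, here is a minimal Python sketch. It is not the demo's actual code. The names `normalize` and `exact_dedupe` are hypothetical, and two details are assumptions: stripped punctuation is replaced by a space rather than deleted, and exact matching runs on the normalized string.

```python
import re
from collections import Counter

# Keep word characters, whitespace, and the symbols the demo preserves: % $ + . -
# (Assumption: anything else is replaced by a space, not deleted outright.)
_STRIP = re.compile(r"[^\w\s%$+.\-]")

def normalize(text: str) -> str:
    """Lowercase, strip punctuation except % $ + . -, and collapse whitespace."""
    text = _STRIP.sub(" ", text.lower())
    return " ".join(text.split())

def exact_dedupe(lines):
    """Return (unique normalized promos in first-seen order, count merged)."""
    counts = Counter()
    for line in lines:
        norm = normalize(line)
        if norm:  # skip blank input lines
            counts[norm] += 1
    unique = list(counts)  # Counter keeps insertion order
    merged = sum(counts.values()) - len(unique)
    return unique, merged

unique, merged = exact_dedupe(["20% OFF!", "20% off", "Free shipping"])
print(unique, merged)
```

With this input, the two "20% off" variants collapse into one entry and the merged counter is 1.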

Each line is one promo. Stopwords and Min token length affect tokenization for Jaccard.
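
To show how these two settings could feed into Jaccard, here is a small Python sketch. It assumes the input is already normalized (lowercased, punctuation stripped). The stopword list is a hypothetical placeholder, not the demo's list, and returning 0.0 for two empty token sets is an assumption chosen so that empty promos never link.

```python
# Hypothetical stopword list for illustration only.
STOPWORDS = {"a", "an", "and", "the", "of", "on", "for", "to", "in", "with"}

def tokenize(text, min_len=2, remove_stopwords=True):
    """Split on whitespace, drop short tokens, optionally drop stopwords."""
    tokens = [t for t in text.split() if len(t) >= min_len]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return set(tokens)

def jaccard(a, b):
    """Size of the intersection divided by size of the union of two token sets."""
    if not a and not b:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(a & b) / len(a | b)

a = tokenize("20% off all shoes")
b = tokenize("20% off on shoes")
print(jaccard(a, b))  # 3 shared tokens out of 4 distinct -> 0.75
```

Raising Min token length or enabling stopword removal shrinks both token sets, which can move the Jaccard score in either direction depending on which tokens disappear.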

Min token length 2

Remove common stopwords

If any metric meets or exceeds its threshold, we link the pair; connected links become clusters. Raise a threshold to reduce merges; lower it to merge more.

Jaccard ≥ 0.55

Jaro-Winkler ≥ 0.90

Norm. Levenshtein ≥ 0.80
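
The two character-level metrics and the "any metric links" rule can be sketched in Python as follows. This uses the standard textbook definitions of Levenshtein distance and Jaro‑Winkler (prefix capped at 4 characters, scaling factor 0.1). Whether the demo uses exactly these variants is an assumption, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) via a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def norm_lev(a, b):
    """1 - (edit distance / length of the longer string)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest

def jaro(a, b):
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                break
    m = sum(a_hit)
    if m == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    t = sum(x != y for x, y in zip(a_seq, b_seq)) // 2  # transpositions
    return (m / len(a) + m / len(b) + (m - t) / m) / 3

def jaro_winkler(a, b, prefix_scale=0.1):
    """Jaro score boosted for a shared prefix of up to 4 characters."""
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):
        if x != y:
            break
        prefix += 1
    return score + prefix * prefix_scale * (1 - score)

# Default slider values from this page.
THRESHOLDS = {"jaccard": 0.55, "jaro_winkler": 0.90, "norm_lev": 0.80}

def is_linked(scores, thresholds=THRESHOLDS):
    """A pair links if ANY metric is at or above its threshold."""
    return any(scores[name] >= limit for name, limit in thresholds.items())
```

Because the rule is an OR across metrics, raising one slider alone may not separate a pair that still passes on another metric.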

Total promos

Exact duplicates merged

Near-dup clusters

Tap any cell to see scores and open the inspector. Optionally show numbers in the grid for quick scanning.

Show numbers in cells

We build a graph where edges connect pairs passing any threshold. Connected components = near‑duplicate clusters.
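
One common way to compute connected components is union-find. The Python sketch below is an illustration under that assumption, not necessarily how the demo does it. The function name `cluster` is hypothetical, and dropping single-item groups is also an assumption about what counts as a near‑duplicate cluster.

```python
def cluster(n, links):
    """Group item indices 0..n-1 into connected components of the link graph.

    links: iterable of (i, j) index pairs that passed at least one threshold.
    Returns only components with two or more items.
    """
    parent = list(range(n))

    def find(x):
        # Follow parent pointers to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in links:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]

# Items 0-1 and 1-2 are linked, so 0, 1, 2 form one cluster even though
# 0 and 2 were never linked directly. Items 3 and 4 stay unclustered.
print(cluster(5, [(0, 1), (1, 2)]))
```

Note the transitive effect: a chain of pairwise links can pull together items that are not very similar to each other, which is one reason tightening thresholds reduces cluster size.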

Scores for a selected pair. Token Overlap lists shared/unique tokens; Diff is a rough word-level comparison.
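
For illustration, the Token Overlap and Diff views could be approximated in Python as below. The use of the standard library's `difflib.SequenceMatcher` is an assumption, as are the function names and the `-word` / `+word` markers; the demo's own diff may differ.

```python
from difflib import SequenceMatcher

def token_overlap(a, b):
    """Shared tokens and tokens unique to each side of the pair."""
    ta, tb = set(a.split()), set(b.split())
    return {
        "shared": sorted(ta & tb),
        "only_a": sorted(ta - tb),
        "only_b": sorted(tb - ta),
    }

def word_diff(a, b):
    """Rough word-level diff: '-word' was removed from a, '+word' added in b."""
    wa, wb = a.split(), b.split()
    out = []
    for op, i1, i2, j1, j2 in SequenceMatcher(None, wa, wb).get_opcodes():
        if op == "equal":
            out.extend(wa[i1:i2])
        else:
            out.extend(f"-{w}" for w in wa[i1:i2])
            out.extend(f"+{w}" for w in wb[j1:j2])
    return out

print(token_overlap("20% off all shoes", "20% off shoes"))
print(word_diff("20% off all shoes", "20% off shoes"))
```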