Promo Dedupe Lab

Token overlap • Levenshtein • Jaro-Winkler • Clustering

How this demo deduplicates promos (step-by-step)

We normalize text → score each pair with several similarity metrics → link near-duplicates → form clusters. Use this as a quick reference.

Exact match → Jaccard ≥ 0.55 or Jaro‑Winkler ≥ 0.90 or Norm‑Lev ≥ 0.80 → Cluster (connected items)
  1. Input: One promo per line.
  2. Preprocess: Lowercase, strip punctuation (keep %, $, +, ., -), collapse spaces. Tokenize words, drop stopwords if enabled, and ignore tokens shorter than Min token length.
  3. Exact‑match dedupe: Identical strings collapse immediately (counter shows how many were merged).
  4. Pairwise scores: Jaccard (tokens), Jaro‑Winkler (typo/prefix), Norm‑Levenshtein (1 − edit distance / max length).
  5. Thresholding: A pair is linked if any one metric meets or exceeds its slider value.
  6. Clustering: Links form a graph; connected items become a near‑duplicate cluster.
  7. Inspect: Tap a heatmap cell for scores + diff.
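
The two character-level scores from step 4 can be sketched in plain Python. This is a minimal illustration, not the demo's own code: the function names are hypothetical, and the Jaro‑Winkler prefix scale (0.1) and prefix cap (4) are the conventional defaults, assumed here because the page does not state them.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute), two rows at a time."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def norm_lev(a: str, b: str) -> float:
    """1 - edit distance / max length. Two empty strings score 1.0 (assumption)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaro(a: str, b: str) -> float:
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    # Characters match only if they sit within this distance of each other.
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit = [False] * len(a)
    b_hit = [False] * len(b)
    matches = 0
    for i, ch in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ch:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i, ch in enumerate(a):
        if a_hit[i]:
            while not b_hit[k]:
                k += 1
            if ch != b[k]:
                out_of_order += 1
            k += 1
    t = out_of_order / 2
    return (matches / len(a) + matches / len(b) + (matches - t) / matches) / 3


def jaro_winkler(a: str, b: str, prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro score boosted for a shared prefix of up to max_prefix characters."""
    j = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a, b):
        if ca != cb or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * prefix_scale * (1 - j)
```

For example, "20% off shoes" and "20% off shoe" differ by one edit over 13 characters, so norm_lev gives 12/13, which is above the default 0.80 threshold.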

Promotions

Each line is one promo. Stopwords and Min token length affect tokenization for Jaccard.
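
As a rough sketch of how the input lines might be normalized, deduplicated, and tokenized for Jaccard: the stopword list below is hypothetical (the demo's actual list is not shown), and running exact-match dedupe on the normalized text rather than the raw line is also an assumption.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "the", "on", "for", "of", "to", "in", "with"}


def normalize(text: str) -> str:
    """Lowercase, strip punctuation except % $ + . -, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"[^\w\s%$+.\-]", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def dedupe_exact(lines):
    """Collapse identical normalized lines; return (unique lines, merged count)."""
    cleaned = [normalize(line) for line in lines if line.strip()]
    unique = list(dict.fromkeys(cleaned))  # keeps first-seen order
    return unique, len(cleaned) - len(unique)


def tokenize(text: str, drop_stopwords: bool = True, min_len: int = 2) -> set:
    """Token set used for Jaccard; short tokens and stopwords are dropped."""
    return {
        tok
        for tok in normalize(text).split()
        if len(tok) >= min_len and not (drop_stopwords and tok in STOPWORDS)
    }


def jaccard(a: set, b: set) -> float:
    """|A ∩ B| / |A ∪ B|; two empty sets score 1.0 (assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


promos = ["20% OFF shoes!", "20% off shoes", "Save $10 on boots"]
unique, merged = dedupe_exact(promos)
score = jaccard(tokenize("20% off shoes"), tokenize("20% off boots"))
```

Here the first two promos normalize to the same string, so one duplicate is merged, and the shoes/boots pair shares two of four distinct tokens for a Jaccard score of 0.5.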
Min token length: 2

Thresholds & Pipeline

If any metric meets its threshold, we link the pair; chains of linked pairs become clusters. Raise a threshold to merge less; lower it to merge more.
Jaccard threshold: 0.55
Jaro‑Winkler threshold: 0.90
Norm‑Levenshtein threshold: 0.80
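
The linking rule can be sketched as a simple "any metric passes" check. The Thresholds class and the score-dictionary keys are hypothetical names chosen for this illustration; the defaults mirror the slider values shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Defaults copied from the page's sliders.
    jaccard: float = 0.55
    jaro_winkler: float = 0.90
    norm_lev: float = 0.80


def is_linked(scores: dict, t: Thresholds = Thresholds()) -> bool:
    """Link the pair if at least one metric reaches its threshold."""
    return (
        scores["jaccard"] >= t.jaccard
        or scores["jaro_winkler"] >= t.jaro_winkler
        or scores["norm_lev"] >= t.norm_lev
    )


pair = {"jaccard": 0.50, "jaro_winkler": 0.93, "norm_lev": 0.70}
linked_default = is_linked(pair)
linked_strict = is_linked(pair, Thresholds(jaro_winkler=0.95))
```

With the defaults, the pair links on Jaro‑Winkler alone; raising that one threshold to 0.95 removes the link. Because the rule is an OR, each slider can only add links when lowered, never remove links created by another metric.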
Total promos: 0
Exact duplicates merged: 0
Near-dup clusters: 0

Similarity Matrix

Tap any cell to see scores and open the inspector. Optionally show numbers in the grid for quick scanning.
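
One way such a matrix could be built is to score every unordered pair once and mirror the result. This is a sketch under assumptions: the page does not say which score the heatmap displays, so the scoring function is passed in, and the whitespace-token Jaccard and text rendering below are stand-ins for illustration.

```python
from itertools import combinations
from typing import Callable, List, Sequence


def similarity_matrix(
    items: Sequence[str], score: Callable[[str, str], float]
) -> List[List[float]]:
    """Symmetric n x n matrix with 1.0 on the diagonal."""
    n = len(items)
    matrix = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        matrix[i][j] = matrix[j][i] = score(items[i], items[j])
    return matrix


def token_jaccard(a: str, b: str) -> float:
    """Stand-in score: Jaccard over whitespace-separated tokens."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def render(matrix: List[List[float]], show_numbers: bool = False) -> str:
    """Text heatmap: darker characters mean higher similarity."""
    shades = " .:*#"
    rows = []
    for row in matrix:
        if show_numbers:
            rows.append(" ".join(f"{v:.2f}" for v in row))
        else:
            rows.append(
                " ".join(shades[min(int(v * len(shades)), len(shades) - 1)] for v in row)
            )
    return "\n".join(rows)


promos = ["20% off shoes", "20% off boots", "free shipping"]
m = similarity_matrix(promos, token_jaccard)
grid = render(m, show_numbers=True)
```

Scoring only the upper triangle halves the work, which matters because the number of pairs grows quadratically with the number of promos.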

Clusters

We build a graph where edges connect pairs passing any threshold. Connected components = near‑duplicate clusters.
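
Connected components can be found with a small union-find structure, sketched below. The function name is hypothetical, and reporting only groups with two or more members as clusters is an assumption about how the demo counts them.

```python
from typing import Iterable, List, Tuple


def cluster(n: int, links: Iterable[Tuple[int, int]]) -> List[List[int]]:
    """Group item indices 0..n-1 into connected components of the link graph."""
    parent = list(range(n))

    def find(x: int) -> int:
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Singletons have no near-duplicate, so they are not reported (assumption).
    return [members for members in groups.values() if len(members) > 1]


clusters = cluster(5, [(0, 1), (1, 2)])
```

Note that linking is transitive: items 0 and 2 end up in the same cluster even though they were never linked directly, so one loose threshold can chain unrelated promos together.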

Pair Inspector

Scores for a selected pair. Token Overlap lists shared/unique tokens; Diff is a rough word-level comparison.

Token Overlap
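
A token-overlap view reduces to three set operations. The sketch below uses hypothetical names and plain whitespace tokens; the demo's own tokenizer settings (stopwords, minimum length) would change which tokens appear.

```python
def token_overlap(a_tokens, b_tokens) -> dict:
    """Split two token collections into shared and one-sided tokens."""
    a, b = set(a_tokens), set(b_tokens)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
    }


overlap = token_overlap("20% off shoes".split(), "20% off boots".split())
```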

Diff (rough, word-level)
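
A rough word-level diff can be produced with Python's standard difflib module. This is an illustrative sketch, not the demo's diff routine; the [-removed-] and {+added+} markers are a notation chosen here.

```python
import difflib


def word_diff(a: str, b: str) -> str:
    """Mark words removed from a as [-w-] and words added in b as {+w+}."""
    a_words, b_words = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=a_words, b=b_words, autojunk=False)
    out = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op == "equal":
            out.extend(a_words[i1:i2])
        else:
            # "replace", "delete", and "insert" all reduce to removals plus additions.
            out.extend(f"[-{w}-]" for w in a_words[i1:i2])
            out.extend(f"{{+{w}+}}" for w in b_words[j1:j2])
    return " ".join(out)


diff = word_diff("20% off shoes", "20% off all boots")
```

Diffing on whole words keeps the output readable for short promos, at the cost of flagging a one-character typo as a full word replacement.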