How this demo deduplicates promos (step-by-step)
We normalize text → compare items in a few different ways → link near-duplicates → form clusters. Use this as a quick-reference.
Exact match
→
Jaccard ≥ 0.55
→
Jaro‑Winkler ≥ 0.90 or Norm‑Lev ≥ 0.80
→
Cluster (connected items)
- Input: One promo per line.
- Preprocess: Lowercase, strip punctuation (keep %, $, +, ., -), collapse spaces. Tokenize words, drop stopwords if enabled, and ignore tokens shorter than Min token length.
- Exact‑match dedupe: Identical strings collapse immediately (counter shows how many were merged).
- Pairwise scores: Jaccard (tokens), Jaro‑Winkler (typo/prefix), Norm‑Levenshtein (1 − edit distance / max length).
- Thresholding: Any metric above its slider links a pair.
- Clustering: Links form a graph; connected items become a near‑duplicate cluster.
- Inspect: Tap a heatmap cell for scores + diff.
Promotions
Each line is one promo. Stopwords and Min token length affect tokenization for Jaccard.
2
Thresholds & Pipeline
If any metric clears its threshold, we link the pair; connected links become clusters. Tighten to reduce merges; loosen to merge more.
0.55
0.90
0.80
0
Total promos
0
Exact duplicates merged
0
Near-dup clusters
Similarity Matrix
Tap any cell to see scores and open the inspector. Optionally show numbers in the grid for quick scanning.
Clusters
We build a graph where edges connect pairs passing any threshold. Connected components = near‑duplicate clusters.
Pair Inspector
Scores for a selected pair. Token Overlap lists shared/unique tokens; Diff is a rough word-level comparison.