Discussion:
near duplicates in short text fields
merkury
2008-08-15 18:10:43 UTC
Hi,


Can anybody tell me how to find near duplicates among a large number
(20 million) of short text labels?

Is there a database tool that does just this?

Here are some examples:

not near:
Rugby Polo - black/white - S; (Angebot von Kabelmeister)
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)


near:
Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)
Shirt Striped - aqua/white - S; (Angebot von)

near:
301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74)

near:
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74)
482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74)



Thanks

merkury
Dejan Sarka
2008-08-16 05:53:05 UTC
Hi!

I suggest you take a look at the SQL Server Integration Services Fuzzy
Grouping and Fuzzy Lookup transformations.
--
Dejan Sarka
http://blogs.solidq.com/EN/dsarka/default.aspx
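
For readers who want to see the general idea behind fuzzy grouping outside SSIS, here is a minimal sketch in Python. It is not the algorithm the SSIS transformations use. It assumes that "near" means a high Jaccard similarity between the word sets of two labels, and the names (`tokens`, `jaccard`, `near_duplicate_pairs`) and the 0.7 threshold are illustrative choices tuned only to the examples in the question. Candidate pairs are limited to labels that share at least one word, which is a naive form of blocking; at 20 million rows very common words would still produce too many candidates, so a stricter blocking key or MinHash/LSH would likely be needed.

```python
import re
from collections import defaultdict
from itertools import combinations

# Example labels taken from the question (index in the comment).
EXAMPLES = [
    "Rugby Polo - black/white - S; (Angebot von Kabelmeister)",         # 0
    "Rugby Shirt Striped - aqua/white - S; (Angebot von Kabelmeister)", # 1
    "Shirt Striped - aqua/white - S; (Angebot von)",                    # 2
    "301 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT BLAU in L (eBay Shop jeanspoint74)",    # 3
    "482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT SCHWARZ in L (eBay Shop jeanspoint74)", # 4
    "482 LA RUGBY SEXY DISCO PARTY POLO T-SHIRT WEISS in M (eBay Shop jeanspoint74)",   # 5
]


def tokens(label):
    """Lower-case the label and return its set of word tokens."""
    return frozenset(re.findall(r"\w+", label.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(labels, threshold=0.7):
    """Return sorted index pairs (i, j), i < j, whose similarity >= threshold.

    Only labels sharing at least one token are compared (naive blocking).
    """
    token_sets = [tokens(label) for label in labels]

    # Inverted index: token -> indices of the labels containing it.
    index = defaultdict(list)
    for i, ts in enumerate(token_sets):
        for tok in ts:
            index[tok].append(i)

    # Every pair of labels that co-occurs under some token is a candidate.
    candidates = set()
    for ids in index.values():
        candidates.update(combinations(ids, 2))

    return sorted(
        (i, j)
        for i, j in candidates
        if jaccard(token_sets[i], token_sets[j]) >= threshold
    )


if __name__ == "__main__":
    for i, j in near_duplicate_pairs(EXAMPLES):
        print(EXAMPLES[i], "<->", EXAMPLES[j])
```

With the assumed threshold, the three "near" pairs from the question score about 0.76 to 0.78 and are reported, while the "not near" pair scores about 0.55 and is not. Real data will almost certainly need a different threshold, and possibly a different similarity measure such as character n-grams or edit distance.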