Use Cases and Similarity Function Recommendations

This section recommends similarity functions for common scenarios, based on empirical best practices. The best-suited similarity function, and especially its parameters, depend heavily on the data. Software AG recommends OneData Matching similarity as the default function.

Entity names, such as the full names of persons or companies, usually require the most sophisticated similarity algorithms. As discussed in many research papers, hybrid similarity measures offer the best matching quality because they combine the strengths of character-based and token-based algorithms. For single names, such as given names, birth names, or surnames of persons, a robust character-based similarity is usually sufficient. Among character-based heuristics, Jaro-Winkler is the quickest and most efficient choice.
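
To make the character-based heuristic concrete, the following is a minimal sketch of the textbook Jaro-Winkler similarity. It is not the OneData implementation. The function names and the defaults prefix_scale=0.1 and max_prefix=4 are assumptions taken from the common formulation of the algorithm.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in the range 0.0 to 1.0."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2
    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))  # 0.9611
```

The prefix boost rewards names that agree at the beginning, which is one reason the measure is often applied to given names and surnames.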

Because full person and company names often contain swapped, added, or missing words, a combination of a token-based similarity with a secondary character-based similarity works best. A hybrid measure therefore usually delivers high-quality matches. For example, you could use Monge-Elkan with Jaro-Winkler as the secondary metric.
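
The hybrid approach can be sketched as follows. This is an illustrative version of the basic Monge-Elkan scheme, not the product's implementation. To keep the example self-contained, the secondary metric is a stand-in based on Python's difflib rather than Jaro-Winkler; any character-based function that returns a value between 0 and 1 can be passed as the inner parameter. All function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Callable


def char_similarity(a: str, b: str) -> float:
    # Stand-in secondary metric (assumption): difflib ratio, NOT Jaro-Winkler.
    return SequenceMatcher(None, a, b).ratio()


def monge_elkan(name_a: str, name_b: str,
                inner: Callable[[str, str], float] = char_similarity) -> float:
    """Average, over the tokens of name_a, of the best match in name_b."""
    tokens_a = name_a.lower().split()
    tokens_b = name_b.lower().split()
    if not tokens_a or not tokens_b:
        return 0.0
    total = sum(max(inner(ta, tb) for tb in tokens_b) for ta in tokens_a)
    return total / len(tokens_a)


def monge_elkan_symmetric(name_a: str, name_b: str,
                          inner: Callable[[str, str], float] = char_similarity) -> float:
    """The basic measure is asymmetric; averaging both directions is one common fix."""
    return (monge_elkan(name_a, name_b, inner) + monge_elkan(name_b, name_a, inner)) / 2


# Swapped words do not lower the score, because tokens are matched independently.
print(monge_elkan("John Smith", "Smith John"))  # 1.0
```

Because each token is compared against every token of the other name, word order does not matter, while the secondary metric absorbs typing errors inside individual words.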

Identifiers such as ZIP codes, phone numbers, and email addresses usually show little variation apart from differences in normalization and format. Values are either very similar and contribute important duplicate information, or they are too dissimilar to provide valuable information. Duplicate ZIP codes, phone numbers, and so on typically differ by only one or two added, missing, or modified characters. Values that contain more changes should not be considered duplicates. A character-based similarity like Damerau-Levenshtein, which also counts a swap of two adjacent characters as a single edit operation, is a good fit for identifiers. Additionally, robust normalization is important for such values. For example, you might want to convert phone numbers to a standardized format before performing the matching operation.
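
As a rough illustration of normalizing first and then comparing with a small edit budget, the sketch below uses the optimal string alignment variant of Damerau-Levenshtein, a common simplification in which an adjacent swap counts as one edit. The function names, the digits-only normalization, and the max_edits threshold of 2 are assumptions chosen for the example, not product settings.

```python
import re


def normalize_phone(raw: str) -> str:
    # Simplistic normalization (assumption): keep digits only.
    return re.sub(r"\D", "", raw)


def osa_distance(a: str, b: str) -> int:
    """Edit distance where insert, delete, substitute, and adjacent swap cost 1."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


def is_probable_duplicate(a: str, b: str, max_edits: int = 2) -> bool:
    # Hypothetical rule: more than max_edits changes means "not a duplicate".
    return osa_distance(a, b) <= max_edits


print(osa_distance("12345", "12435"))  # 1
print(is_probable_duplicate(normalize_phone("06151 92-0"),
                            normalize_phone("(06151) 920")))  # True
```

A production normalization would typically also handle country codes and extensions; the digits-only rule here only demonstrates why formatting differences should be removed before the distance is computed.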

Names of cities, streets, and similar data tend to be slightly more standardized than full person or company names, but they vary more than identifiers like ZIP codes. They require a similarity function that is more flexible than, for example, Levenshtein. For such data, Jaro-Winkler or Sift3 offer quick and reliable results.