Search text based on phonetic similarities.
The following construct(s) refer to this construct:
tf:phonetic(string $searchString) => unspecified
The function tf:phonetic is specific to Tamino. It takes a search string as its argument and returns all strings that are "phonetically equivalent". It can only be used within the scope of the following functions:
tf:containsAdjacentText
tf:containsNearText
tf:containsText
tf:createAdjacentTextReference
tf:createNearTextReference
tf:createTextReference
Tamino performs this search according to a set of rules modeled after the widely known Soundex algorithm. The rules are based on the pronunciation of English, but they also include checks for character combinations that occur in German. The accuracy of the algorithm is therefore highest for English and German, although it can also be used for other languages. However, it is not exact: sometimes it fails to identify words that are homophones, and sometimes it reports a match even though the words are pronounced quite differently. The algorithm works by reducing letters or combinations of letters to their phonetic equivalents according to the following rules:
Letters | Phonetic Equivalent |
---|---|
A, E, I, O, U, Y (initial position) | A |
P, B | B |
F, V, W, (P + H) | F |
G, K, Q, (C + [A, E, H, I, J, K, L, O, Q, R, U, X, Y]) | G |
L | L |
M | M |
N | N |
R | R |
C, S, Z, (D + [C, S, Z]), (X + [C, K, Q]), (T + [C, S, Z]), (S+C), (Z+C) | S |
D, T | D |
(G + G + S) | GS |
(X - [C, K, Q], *) | GS |
H | (ignored) |
Here, "+" denotes two letters appearing in the order shown, "[…]" denotes alternative letters, and "-" denotes exclusion, i.e., two letters appearing together of which the second letter is not one of the letters listed. Finally, "*" denotes any letter.
So "(X + [C, K, Q])" means a sequence of letters consisting of "X" followed by one of the letters "C", "K" or "Q", whereas "(X - [C, K, Q], *)" means a sequence of three letters consisting of "X" followed by any letter other than "C", "K" or "Q", followed by any letter.
More elaborate rules take precedence over simple rules: for example, if a word contains the adjacent letters "P" and "H", the rule reducing the combination "(P + H)" to "F" takes precedence over the two simple rules that reduce "P" to "B" and ignore "H".
Example: "PHONETIC" is interpreted as "(P + H), (O), (N), (E), (T), (I), (C)" and reduced to "FNDS"; the non-initial vowels "O", "E" and "I" contribute nothing to the result.
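The reduction described above can be illustrated with a small Python sketch. It is not Tamino's implementation, only one possible reading of the rule table, and it rests on several assumptions that the table does not settle: vowels outside the initial position and letters not listed in the table are dropped; a combination rule consumes all the letters it names, except the "C" rule, which is treated as a lookahead that replaces only the "C"; and "X" not followed by "C", "K" or "Q" is reduced to "GS" on its own. The names `phonetic_key`, `SINGLE` and `PAIRS` are hypothetical.

```python
VOWELS = set("AEIOUY")

# Single-letter rules from the table. "H" and unlisted letters map to "".
SINGLE = {
    "P": "B", "B": "B",
    "F": "F", "V": "F", "W": "F",
    "G": "G", "K": "G", "Q": "G",
    "L": "L", "M": "M", "N": "N", "R": "R",
    "C": "S", "S": "S", "Z": "S",
    "D": "D", "T": "D",
    "X": "GS",  # assumption: X not followed by C, K or Q
}

# Two-letter rules: first letter -> (allowed second letters, output, consumed).
# Assumption: the C rule is a lookahead (consumes 1), the others consume 2.
PAIRS = {
    "P": ("H", "F", 2),
    "C": ("AEHIJKLOQRUXY", "G", 1),
    "D": ("CSZ", "S", 2),
    "T": ("CSZ", "S", 2),
    "X": ("CKQ", "S", 2),
    "S": ("C", "S", 2),
    "Z": ("C", "S", 2),
}


def phonetic_key(word: str) -> str:
    """Reduce a word to a phonetic key, trying longer rules first."""
    letters = [ch for ch in word.upper() if ch.isalpha()]
    out = []
    i = 0
    while i < len(letters):
        ch = letters[i]
        nxt = letters[i + 1] if i + 1 < len(letters) else ""
        if letters[i:i + 3] == ["G", "G", "S"]:
            out.append("GS")           # three-letter rule (G + G + S)
            i += 3
        elif ch in PAIRS and nxt and nxt in PAIRS[ch][0]:
            _, code, consumed = PAIRS[ch]
            out.append(code)           # combination rule beats single rules
            i += consumed
        elif ch in VOWELS:
            if i == 0:
                out.append("A")        # vowels count only in initial position
            i += 1
        else:
            out.append(SINGLE.get(ch, ""))
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(phonetic_key("PHONETIC"))  # FNDS
```

Under these assumptions, two words are treated as phonetically equivalent when their keys are equal. The rules are tried from longest to shortest, which is what gives "(P + H)" precedence over the single-letter rules for "P" and "H".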
Note:
The value of the server parameter "markup as delimiter" is respected when determining the word tokens. See the documentation of the Tamino Manager for details.
$searchString | string value |
---|---|
Retrieve the names of all patients whose surname sounds like "Meier".
for $a in input()/patient where tf:containsText($a/name/surname, tf:phonetic("Meier")) return $a/name
This query effectively retrieves the names of all patients whose surname is written as "Meier", "Maier", "Mayer", or "Meyer", since these spellings all sound alike.