[pmwiki-users] Search for terms with ss and ß
Petko Yotov
5ko at 5ko.fr
Mon Feb 6 00:40:08 PST 2023
On 05/02/2023 22:07, Hans Bracker wrote:
>> Someone searching for "champs elysees" will also find the correct
>> "Champs Élysées".
>
> okay, I see it works that way with your recipe. But that is because
> "Champs Élysées" is getting saved in pageindex as words "champs" and
> "elysees", so subsequent searches for "champs" and/or "elysees" will
> get the pagename with "Champs Élysées" in the text as result.
>
> But then with TextExtract I am stuck, because it looks then through
> the actual text, row by row, and cannot find the "Élysées", only the
> "Champs". And there is no way TextExtract can construct "Élysées"
> from "elysees".
>
> Nor can I safely assume that a German term with 'ss' is really a
> term with ß substituted.
> But substituting ss for ß is fine.
You might fold each line of the text as well, and match the folded line
against the folded search terms.
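
A rough sketch of that idea, assuming a UTF-8 source file and the mbstring
extension. The names fold_text() and matching_lines() are hypothetical and
are not part of PmWiki, TextExtract or the cookbook recipe, and the
character map is deliberately tiny:

```php
<?php
// Hypothetical helper for illustration only: folds a string to a
// lowercase, unaccented form that is used purely for comparison.
function fold_text($str) {
    $str = mb_strtolower($str, 'UTF-8');
    // Tiny illustrative map; a real one would cover many more characters.
    $map = array('é' => 'e', 'è' => 'e', 'ß' => 'ss');
    return strtr($str, $map);
}

// Returns the ORIGINAL (unfolded) lines whose folded form contains
// every folded search term.
function matching_lines($text, $terms) {
    $folded_terms = array_map('fold_text', $terms);
    $hits = array();
    foreach (explode("\n", $text) as $line) {
        $folded_line = fold_text($line);
        $all = true;
        foreach ($folded_terms as $t) {
            if (strpos($folded_line, $t) === false) { $all = false; break; }
        }
        if ($all) $hits[] = $line;
    }
    return $hits;
}

$text = "Paris has many sights.\n"
      . "The Champs Élysées is an avenue.\n"
      . "Eine Straße in Berlin.";

print_r(matching_lines($text, array('champs', 'elysees')));
print_r(matching_lines($text, array('strasse')));
```

Only the comparison uses the folded strings; the line that is returned is
the original one, so "Élysées" never has to be reconstructed from
"elysees". Folding ß to ss on both sides also avoids having to guess
whether an 'ss' in the text was originally a ß.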
>> Your folding should probably be adapted to the language you actually
>> use. I have added 2 lines to the UnaccentUTF8() function in the
>> cookbook, uncomment them to enable the folding ü->ue that is suitable
>> for German.
>
> Yes that works, thank you! What does this do, and what is the
> connection to German Umlauts?
> $str = preg_replace("/\xcc\x88/", 'e', $str);
> I see that \xcc\x88 stands for the character of the two dots, like the
> dots above o in ö (Umlaut)
Indeed "\xcc\x88" is the UTF-8 representation of the character U+0308
COMBINING DIAERESIS.
There are 2 valid ways to encode a letter with a diacritic: either as one
precomposed character with its own code point (e.g. U+00FC for ü), or as
the plain letter followed by a "combining" diacritic character (u followed
by U+0308).
Most often it is the former, but you don't know which one will be in
your texts, especially if someone copies texts from other sources into
your pages.
In the UnaccentUTF8() function, I have included both.
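
To illustrate the two forms, here is a minimal sketch for the letter ü
only. The function name fold_umlaut_u() is made up for this example, and
the actual UnaccentUTF8() code in the cookbook may well be written
differently:

```php
<?php
// Hypothetical function for illustration; not the recipe's code.
function fold_umlaut_u($str) {
    // Precomposed form: U+00FC (ü) is the bytes C3 BC in UTF-8.
    $str = str_replace("\xc3\xbc", 'ue', $str);
    // Decomposed form: the base letter "u" is already present, so only
    // the combining diaeresis U+0308 (bytes CC 88) is turned into "e".
    $str = preg_replace("/\xcc\x88/", 'e', $str);
    return $str;
}

$precomposed = "M\xc3\xbcnchen";   // M, ü, n, c, h, e, n
$decomposed  = "Mu\xcc\x88nchen";  // M, u, U+0308, n, c, h, e, n

// The two strings look the same on screen but differ byte for byte.
var_dump($precomposed === $decomposed);

// After folding, both should compare equal.
var_dump(fold_umlaut_u($precomposed));
var_dump(fold_umlaut_u($decomposed));
```

This is why replacing only the combining character with 'e' is enough for
the decomposed form: the plain "u" stays in place and gets the "e"
appended right after it.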
Petko