[pmwiki-users] Search for terms with ss and ß

Petko Yotov 5ko at 5ko.fr
Mon Feb 6 00:40:08 PST 2023


On 05/02/2023 22:07, Hans Bracker wrote:
>> Someone searching for "champs elysees" will also find the correct 
>> "Champs Élysées".
> 
> okay, I see it works that way with your recipe. But that is because
> "Champs Élysées" is getting saved in pageindex as  words "champs" and
> "elysees", so subsequent searches for "champs" and/or  "elysees" will
> get the pagename with "Champs Élysées" in the text as result.
> 
> But then with TextExtract I am stuck, because it looks then through
> the actual text, row by row, and cannot find the "Élysées", only the
> "Champs". And there is no way TextExtract can construct  "Élysées"
> from "elysees".
> 
> Nor can I safely assume that a German term with 'ss' is really a
> term with ß substituted.
> But substituting ss for ß is fine.

You might fold each line of the text too, and match the folded line 
against the folded search terms.
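
A minimal sketch of that idea, not actual TextExtract or cookbook code. 
The names fold_text() and matching_lines() are hypothetical, and the 
tiny mapping table only stands in for the much larger one a real 
folding function would need. Non-ASCII letters are written as UTF-8 
byte escapes so the example does not depend on the file encoding.

```php
<?php
// Hypothetical stand-in for a real folding function: a table that is
// just large enough for this example.
function fold_text($str) {
  $map = array(
    "\xc3\x89" => "e",  // É U+00C9
    "\xc3\xa9" => "e",  // é U+00E9
    "\xc3\x9f" => "ss", // ß U+00DF
  );
  // strtr() with an array works on bytes and tries the longest keys
  // first, so multi-byte UTF-8 keys are replaced as whole characters.
  $str = strtr($str, $map);
  // After folding, the letters used in this example are plain ASCII.
  return strtolower($str);
}

// Return the lines of $text whose folded form contains every folded term.
function matching_lines($text, $terms) {
  $folded_terms = array_map('fold_text', $terms);
  $hits = array();
  foreach (explode("\n", $text) as $line) {
    $folded_line = fold_text($line);
    $all = true;
    foreach ($folded_terms as $t) {
      if (strpos($folded_line, $t) === false) { $all = false; break; }
    }
    if ($all) $hits[] = $line; // keep the ORIGINAL line for display
  }
  return $hits;
}
```

Because both sides are folded, nothing has to reconstruct "Élysées" 
from "elysees", and nothing has to guess whether an "ss" was once a 
"ß": the comparison happens only on the folded copies, while the 
original line is what gets returned.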

>> Your folding should probably be adapted to the language you actually 
>> use. I have added 2 lines to the UnaccentUTF8() function in the 
>> cookbook, uncomment them to enable the folding ü->ue that is suitable 
>> for German.
> 
> Yes that works, thank you! What does this do, and what is the
> connection to German Umlauts?
>  $str = preg_replace("/\xcc\x88/", 'e', $str);
> I see that \xcc\x88 stands for the character of the two dots, like the
> dots above o in ö  (Umlaut)

Indeed "\xcc\x88" is the UTF-8 representation of the character U+0308 
COMBINING DIAERESIS.
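
As a small illustration of what the quoted line does on its own 
(the variable names here are only for the example):

```php
<?php
// "u" followed by U+0308 COMBINING DIAERESIS, written as raw UTF-8 bytes.
$decomposed = "Mu\xcc\x88ller";

// The line quoted above: every combining diaeresis becomes a plain "e".
$folded = preg_replace("/\xcc\x88/", 'e', $decomposed);

// bin2hex() shows the two bytes 0xCC 0x88 sitting right after the "u".
$hex = bin2hex("u\xcc\x88");
```

Here $folded is "Mueller" and $hex is "75cc88". In valid UTF-8 the 
byte pair 0xCC 0x88 can only be U+0308, so the pattern does not need 
the /u modifier. Note that the replacement applies to a diaeresis on 
any letter, which is one reason the folding should fit the language.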

There are two valid ways to encode a letter with a diacritic: either 
as a single precomposed character with its own code point, or as the 
plain letter followed by a "combining" diacritic character.

Most often it is the former, but you don't know which one will be in 
your texts, especially if someone copies texts from other sources into 
your pages.

In the UnaccentUTF8() function, I have included both.
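
A rough sketch of what handling both forms can look like for the 
German ü->ue style folding. This is not the cookbook code; the 
function name fold_german() and its short table are assumptions made 
for the example.

```php
<?php
// Hypothetical sketch, not the cookbook's UnaccentUTF8().
function fold_german($str) {
  // 1. Precomposed letters: one code point each (UTF-8 byte escapes).
  $precomposed = array(
    "\xc3\xa4" => "ae", // ä U+00E4
    "\xc3\xb6" => "oe", // ö U+00F6
    "\xc3\xbc" => "ue", // ü U+00FC
    "\xc3\x84" => "Ae", // Ä U+00C4
    "\xc3\x96" => "Oe", // Ö U+00D6
    "\xc3\x9c" => "Ue", // Ü U+00DC
    "\xc3\x9f" => "ss", // ß U+00DF
  );
  $str = strtr($str, $precomposed);

  // 2. Decomposed letters: plain letter + U+0308 COMBINING DIAERESIS.
  return preg_replace("/\xcc\x88/", 'e', $str);
}
```

With this, "M\xc3\xbcller" (precomposed ü) and "Mu\xcc\x88ller" 
(u plus combining diaeresis) both fold to "Mueller", so it no longer 
matters which form was pasted into the page.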

Petko


