[pmwiki-users] Search for terms with ss and ß

Sat Feb 4 03:25:53 PST 2023

This is by design.

In most languages, if one word has an accented letter, the same word 
with a plain letter would either have a different meaning or be a 
grammatical error. We want a search to find all variants, accented and 
not.

In French, the words cote (noun), cote (verb), côte and côté have 
different meanings, in fact several each, and if you install the recipe, 
searching for any of these will find all matches for all of these.

Someone searching for "champs elysees" will also find the correct 
"Champs Élysées".
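
To illustrate the idea, here is a rough sketch in Python. It is not the 
recipe's code (the recipe is PHP and relies on the Transliterator 
class); the names fold() and matches() are made up for this example. It 
decomposes each letter, drops the combining accents and lowercases the 
result, so that query and text are compared in folded form.

```python
import unicodedata

def fold(text):
    """Hypothetical helper: strip combining accents and lowercase."""
    # NFD splits "É" into "E" plus a combining acute accent.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()

def matches(query, text):
    """Hypothetical helper: accent- and case-insensitive substring search."""
    return fold(query) in fold(text)
```

With this, matches("champs elysees", "Avenue des Champs Élysées") is 
true, and cote, côte and côté all fold to the same string.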

In Unicode (and therefore in UTF-8), Classical Diaeresis (Tréma) and 
German Umlaut look exactly the same and use the same characters and 
code points.

In German, the letters with Umlauts are usually collated to "plain 
letter" + "e", for example Jörg->Joerg or Brückner->Brueckner. This is 
not the case for French.

Your folding should probably be adapted to the language you actually 
use. I have added two lines to the UnaccentUTF8() function in the 
cookbook; uncomment them to enable the folding ü->ue that is suitable 
for German.
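
As a rough sketch of what such a language-specific folding does (again 
in Python, not the actual lines from the recipe): the German letters are 
replaced first, then the remaining accents are stripped. The name 
fold_german() and the mapping table are assumptions for this example; 
the ß->ss entry is there only because of the subject of this thread.

```python
import unicodedata

# Assumed mapping for German; adapt it to the language you actually use.
GERMAN_MAP = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}

def fold_german(text):
    """Hypothetical helper: German-style folding, then generic unaccenting."""
    # NFC makes sure "ü" is a single code point before the replacement.
    text = unicodedata.normalize("NFC", text).lower()
    for src, dst in GERMAN_MAP.items():
        text = text.replace(src, dst)
    # Strip whatever accents remain (é, ô, ...).
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

With this folding, Jörg and Joerg match each other, while Bär and Bar 
no longer do. It would be wrong for French, where a tréma does not stand 
for an added "e".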

Petko

On 04/02/2023 08:51, Hans Bracker wrote:
> Hello Petko,
> 
> Friday, February 3, 2023, 3:22:00 PM, you wrote:
> 
>>    https://www.pmwiki.org/wiki/Cookbook/UnaccentUTF8
> 
>> Not sure if it will be enough for you as it also folds to lowercase. 
>> But you can copy this function and adapt it. Maybe simply remove ":: 
>> Lower();" from the argument, or review the documentation for the 
>> Intl/Transliterator class at php.net.
> 
> Thanks, I tried it out, as you put it, and as a customisation for 
> TextExtract.
> I think one needs to be very careful, if one wants to use it.
> For German language, and used as it is, it will give many false
> positives in search results.
> Word pairs like Bär and Bar, Blüten and bluten, Fähre and fahre,
> möchte and mochte, are treated as the same, but have totally different
> meanings. So I would not recommend this recipe for German language
> sites. I can imagine other languages using UTF8 could have similar
> problems.
> 
> As to my TextExtract search for terms with ss and ß:
> I think it may be better if I offer a customisation, with a custom
> array of substitutes.
> That could then also offer substitutes for accented characters, like
> used in Romance languages, but not substitutes for ä, ö, ü, and others,
> which would lead to too many false positive results.
> 
> 
> cheers,
> Hans