[pmwiki-users] Search for terms with ss and ß
Petko Yotov
5ko at 5ko.fr
Sat Feb 4 03:25:53 PST 2023
This is by design.
In most languages, if one word has an accented letter, the same word
with a plain letter would either have a different meaning or be a
grammatical error. We want a search to find all variants, accented and
not.
In French, the words cote (noun), cote (verb), côte and côté have different
meanings, in fact several each, and if you install the recipe, searching
for any one of them will find all matches for all of them.
Someone searching for "champs elysees" will also find the correct
"Champs Élysées".
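As a rough illustration of this kind of folding, here is a minimal Python sketch. It is not the recipe's actual PHP code, and the helper name fold() is made up for the example. It lowercases the text, decomposes it to NFD, and drops the combining marks, so all the variants end up with the same search key:

```python
import unicodedata

def fold(text):
    """Lowercase and strip combining accents (hypothetical helper, not the recipe's code)."""
    decomposed = unicodedata.normalize("NFD", text.lower())
    # Keep only characters that are not combining marks (accents, diaereses, ...).
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# The French variants all fold to the same key.
print({fold(w) for w in ["cote", "côte", "côté", "Cote"]})          # {'cote'}
# An unaccented query matches the correctly accented page text.
print(fold("champs elysees") in fold("Avenue des Champs Élysées"))  # True
```

Both the query and the indexed text have to go through the same fold() for the comparison to work.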
In Unicode (and therefore in UTF-8), the classical diaeresis (tréma) and
the German umlaut look exactly the same and use the same characters and
code points.
In German, the letters with umlauts are usually transliterated as "plain
letter" + "e", for example Jörg->Joerg or Brückner->Brueckner. This is
not the case for French.
Your folding should probably be adapted to the language you actually
use. I have added 2 lines to the UnaccentUTF8() function in the
cookbook; uncomment them to enable the folding ü->ue that is suitable
for German.
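For illustration only, a language-specific step of this sort could look like the following Python sketch. It is not the recipe's PHP code: the names GERMAN_MAP and fold_german() are invented here, and the ß->ss entry is an assumption added to match the subject of this thread. The German replacements are applied first, and generic accent stripping runs afterwards:

```python
import unicodedata

# Assumed German transliterations, applied before generic accent stripping.
GERMAN_MAP = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})

def fold_german(text):
    """Fold text for a German-language search (hypothetical helper)."""
    # NFC first, so that "u" + combining diaeresis becomes the single letter "ü".
    lowered = unicodedata.normalize("NFC", text).lower().translate(GERMAN_MAP)
    # Strip whatever accents remain (é, è, ç, ...).
    decomposed = unicodedata.normalize("NFD", lowered)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_german("Jörg"))                       # joerg
print(fold_german("Brückner"))                   # brueckner
print(fold_german("Bär") == fold_german("Bar"))  # False
```

With this ordering, Bär and Bar no longer collide, while Straße and Strasse fold to the same key.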
Petko
On 04/02/2023 08:51, Hans Bracker wrote:
> Hello Petko,
>
> Friday, February 3, 2023, 3:22:00 PM, you wrote:
>
>> https://www.pmwiki.org/wiki/Cookbook/UnaccentUTF8
>
>> Not sure if it will be enough for you as it also folds to lowercase.
>> But you can copy this function and adapt it. Maybe simply remove "::
>> Lower();" from the argument, or review the documentation for the
>> Intl/Transliterator class at php.net.
>
> Thanks, I tried it out, as you put it, and as a customisation for
> TextExtract.
> I think one needs to be very careful, if one wants to use it.
> For German language, and used as it is, it will give many false
> positives in search results.
> Word pairs like Bär and Bar, Blüten and bluten, Fähre and fahre,
> möchte and mochte, are treated as the same, but have totally different
> meanings. So I would not recommend this recipe for German language
> sites. I can imagine other languages using UTF8 could have similar
> problems.
>
> As to my TextExtract search for terms with ss and ß:
> I think it may be better if I offer a customisation, with a custom
> array of substitutes.
> That could then also offer substitutes for accented characters, like
> used in Romance languages, but not substitutes for ä, ö, ü, and others,
> which would lead to too many false positive results.
>
>
> cheers,
> Hans