[pmwiki-users] Issues with non basic latin in uploads file names

Petko Yotov 5ko at 5ko.fr
Wed Feb 22 22:41:31 PST 2023


On 23/02/2023 01:30, kirpi at kirpi.it wrote:
> I have a twofold problem with uploads file names.
> 
> First issue.
> My install is UTF-8 enabled but, even if allowing the display of all
> languages and all alphabets in pages is fine, UTF-8 file names in
> uploads are a pain as they create several issues once they are
> uploaded to my shared hosting server (example: I cannot delete them
> anymore, not even via FileZilla, nor rename them).
> By following some hints[1] and some search[2] I tried and put together
> an easy way to replace accented characters as well as any problematic
> characters in uploads. Although apparently verbose, this "table" seems
> straightforward to understand, extend and adapt to one's need.
> 
> 
> $UploadNameChars = "-\\w. "; # default: allow dash, letters, digits,
>                              # underscore, and dots (no spaces)

This appears to allow a space, but later $MakeUploadNamePatterns 
replaces the spaces with underscores.

There is one problem with the '\w' ("word") character type: its meaning 
may change depending on the locale. On a server with the English (UK) 
locale \w means [a-zA-Z0-9_], but with the English (NZ) locale it may 
include 5 accented Māori characters, and with a French locale it may 
also include 10 different accented characters (but not all possible 
accented characters). Even this sometimes behaves inconsistently on my 
own servers.

So I usually replace '\w' with 'a-zA-Z0-9_' to make it clear I only want 
plain Latin characters.
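
In config.php that replacement might look like the following. This is a 
minimal sketch that keeps the dash, dot and space from the quoted 
setting and only swaps '\w' for an explicit ASCII range:

```php
# config.php: allow only plain ASCII letters, digits and underscore,
# plus dash, dot and space. Spaces are later turned into underscores
# by $MakeUploadNamePatterns.
$UploadNameChars = "-a-zA-Z0-9_. ";
```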

I suspect that when $UploadNameChars was implemented, it was meant to 
include plain letters rather than locale letters. But it is as it is 
now, and I don't want to change the PmWiki default because it might 
inconvenience administrators with existing wikis.


> $MakeUploadNamePatterns = array(
> 'Š'=>'S', 'š'=>'s', 'Ž'=>'Z', 'ž'=>'z', 'À'=>'A', 'Á'=>'A', 'Â'=>'A',
> 'Ã'=>'A', 'Ä'=>'A', 'Å'=>'A', 'Æ'=>'AE', 'Ç'=>'C', 'È'=>'E', 'É'=>'E',
> 'Ê'=>'E', 'Ë'=>'E', 'Ì'=>'I', 'Í'=>'I', 'Î'=>'I', 'Ï'=>'I', 'Ñ'=>'N',
> 'Ò'=>'O', 'Ó'=>'O', 'Ô'=>'O', 'Õ'=>'O', 'Ö'=>'O', 'Ø'=>'O', 'Ù'=>'U',
>     'Ú'=>'U', 'Û'=>'U', 'Ü'=>'U', 'Ý'=>'Y', 'Þ'=>'B', 'ß'=>'ss', 'à'=>'a',
>     'á'=>'a', 'â'=>'a', 'ã'=>'a', 'ä'=>'a', 'å'=>'a', 'æ'=>'ae', 'ç'=>'c',
>     'è'=>'e', 'é'=>'e', 'ê'=>'e', 'ë'=>'e', 'ì'=>'i', 'í'=>'i', 'î'=>'i',
>     'ï'=>'i', 'ð'=>'o', 'ñ'=>'n', 'ò'=>'o', 'ó'=>'o', 'ô'=>'o', 'õ'=>'o',
>     'ö'=>'o', 'ø'=>'o', 'ù'=>'u', 'ú'=>'u', 'û'=>'u', 'ý'=>'y', 'þ'=>'b',
>     'ÿ'=>'y', 'Ğ'=>'G', 'İ'=>'I', 'Ş'=>'S', 'ğ'=>'g', 'ı'=>'i', 'ş'=>'s',
>     'ü'=>'u', 'ă'=>'a', 'Ă'=>'A', 'ș'=>'s', 'Ș'=>'S', 'ț'=>'t', 'Ț'=>'T',

The array keys above should be regular expression patterns like '/É/'.
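
As an illustration, here are a few entries of the table rewritten with 
pattern keys. The variable name $demoPatterns and the sample file name 
are made up for this sketch, and preg_replace() is called directly only 
to show the effect; in the wiki, PmWiki applies the patterns itself.

```php
<?php
# Sketch: the same kind of mapping, with regular-expression keys.
# Only a few entries are shown; the rest follow the same form.
$demoPatterns = array(
  '/É/' => 'E',
  '/é/' => 'e',
  '/ß/' => 'ss',
);

# Apply the patterns to a sample name to see the result.
$name = 'Résumé_Straße.pdf';
echo preg_replace(array_keys($demoPatterns), array_values($demoPatterns), $name), "\n";
```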

>     "/[^$UploadNameChars]/" => '',    # strip all not-allowed characters
>     '/(\\.[^.]*)$/' => 'cb_tolower',  # convert extension to lowercase
>     '/^[^[:alnum:]_]+/' => '',        # strip initial spaces, dashes, dots
>     '/[^[:alnum:]_]+$/' => '',        # strip trailing spaces, dashes, dots
>     '/ +/' => '_');                   # replace space(s) with underscore
> 
> 
> I did not yet try it in config.php because I am afraid to screw
> something, and prefer to ask before doing some harm.
> Would it make sense to use the above code, please?
> 
> Still there might be issues, like with "ё" which could perhaps be
> converted into "e" in some cases and "io" in others; but this is where

The characters ë (Latin e-tréma) and ё (Cyrillic yo) may look similar 
in most fonts, but they are not at the same code point, so there is no 
need to worry about this.

However, in German a "vowel with umlaut" like "ü" would be folded to 
"ue", while in French the same letter (same code point) would be folded 
to "u".

It becomes messy because a "vowel with umlaut" (one character) can also 
be written as "vowel followed by combining diaeresis" (two characters). 
Both are valid. With your patterns above this is not a problem, as the 
combining diacritics will be removed, but for German text the umlauts 
will be lost. :-)
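
The two spellings of "ü" can be compared in a short standalone script. 
This is only a sketch: the $german array and the sample name are 
assumptions made for the example, not PmWiki settings.

```php
<?php
# "ü" can be one code point (U+00FC) or "u" + combining diaeresis (U+0308).
$precomposed = "\u{00FC}";   // ü as a single character
$decomposed  = "u\u{0308}";  // u followed by a combining diaeresis

var_dump($precomposed === $decomposed);  // bool(false): the bytes differ

# Stripping everything outside plain ASCII drops the combining mark,
# so the decomposed form silently becomes a bare "u".
echo preg_replace('/[^-a-zA-Z0-9_. ]/', '', "M{$decomposed}nchen.pdf"), "\n";  // Munchen.pdf

# German-style folding has to handle both forms explicitly.
$german = array(
  '/\x{00FC}/u'  => 'ue',  // precomposed ü
  '/u\x{0308}/u' => 'ue',  // decomposed ü
);
foreach (array($precomposed, $decomposed) as $u) {
  // Both iterations print "Muenchen.pdf".
  echo preg_replace(array_keys($german), array_values($german), "M{$u}nchen.pdf"), "\n";
}
```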


> the "table" comes into play: it will be easy to spot specific letters
> and adapt them to one's need (be it mainly transliterating or just
> getting a usable file name somehow), while adding more characters if
> required[3].
> ----
> 
> Now the second issue: we cannot map everything. Think Chinese as an 
> example.
> Having file names in "random alphabets" is to me a huge problem
> because both some software and some user of my website will end up
> being stuck somewhere. It happened already too many times. Imagine I
> upload this file 王毅与普京会晤_中俄关系稳如泰山_百度搜索.pdf it will be a pain for most
> of the people in the world to handle it.
> I am perhaps too old, but sticking more or less to a basic a-zA-Z1-0

1-0 is invalid, as 0 is before 1 in the character set. You probably mean 
to use 0-9.
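
A quick way to see the difference is to try both ranges in a standalone 
script. The sample file name is made up, and the @ operator is used 
only to silence the compilation warning of the invalid pattern.

```php
<?php
# A character class range must go from the lower to the higher code point.
$good = @preg_match('/^[a-zA-Z0-9]+$/', 'file2023');  // 1: the name matches
$bad  = @preg_match('/^[a-zA-Z1-0]+$/', 'file2023');  // false: the pattern does not compile

var_dump($good, $bad);
```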

> group of characters is a safer choice, it is more inclusive in some
> way: I am 99% sure that anybody and any system can handle that file if
> renamed in a basic English alphabet plus some numbers.
> 
> I am not sure how to solve this, but I guess I would like to tell the
> system that, if there is no specific map set in the wiki for such
> characters (see issue one), then any random generated name would be
> better than the original Chinese (or whatever). At least I can rename
> the files afterwards.

See below for FileZilla and character sets.

A randomly generated file name is better achieved with a custom 
$UploadVerifyFunction, which can rename the file while it is uploaded, 
than with $MakeUploadNamePatterns, since the latter will also affect 
existing links to files, and possibly file listings.

This is a little more advanced and cannot be a generic function: it has 
to be adapted to your specification, your current wiki usage and 
workflows. Let me know if you need assistance with it.
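
The naming part alone could be sketched as below. fallback_upload_name() 
is a hypothetical helper, not a PmWiki function, and the "upload_" 
prefix is an arbitrary choice. Hooking it into a custom 
$UploadVerifyFunction is left out on purpose, because that depends on 
the wiki's setup and workflows.

```php
<?php
# Hypothetical helper, not part of PmWiki: return the name unchanged if it
# is plain ASCII, otherwise generate a random name and keep the extension.
function fallback_upload_name($name) {
  if (preg_match('/^[-a-zA-Z0-9_.]+$/', $name)) return $name;

  # Keep a short alphanumeric extension if there is one, in lowercase.
  $ext = preg_match('/\.([a-zA-Z0-9]{1,8})$/', $name, $m)
       ? '.' . strtolower($m[1]) : '';

  # 6 random bytes give 12 hexadecimal digits.
  return 'upload_' . bin2hex(random_bytes(6)) . $ext;
}

echo fallback_upload_name('report-2023.pdf'), "\n";    // unchanged
echo fallback_upload_name('自由定制的风格.pdf'), "\n";  // upload_ + 12 hex digits + .pdf
```

A random name avoids collisions between unrelated uploads, but it also 
means the uploader has to look up the new name before linking to the 
file.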


> I often happen to upload images by simply dragging them from web pages
> (or other sources) into my wiki, and I do not even know their file
> names.

Do you drag them from a web page and drop them in a DDMU dropzone? I 
didn't know this was possible - in fact it still isn't in my browser.

Or do you drop them into FileZilla?

> Quite often I end up with exotic names that are stuck in my
> folders; impossible to rename or delete them, as I get "invalid
> attachment name" or "PmWiki can't process your request, no such
> attachment".

'invalid attachment name' appears if you try to upload a file whose 
name matches an entry in $UploadBlacklist (.php, .pl or .cgi in the 
middle of a file name, which may still be executed by the server).

I cannot find this message: "PmWiki can't process your request, no such 
attachment". The first part comes from the Abort() function, but I don't 
see "no such attachment".

Is it "?requested file not found" ? This may come from HandleDownload() 
when a file name cannot be found.


> Even FileZilla cannot rename or delete them.

FileZilla has an option to change the character set. In the site 
manager, when you select a site on the left, on the right there are 
tabs, the last one is "Charset".

If you cannot rename or delete files on the server because of invalid 
filenames, try changing the character set from Automatic to an 8-bit 
encoding like ISO-8859-1, then reconnect.

You can do this at least temporarily in order to delete or rename the 
lost files.

> I would like to avoid such issues by making sure that files are
> properly renamed to basic latin before being uploaded.
> But in this case I would not know how to.

When you upload such a file, how do you link to it from the wiki?

Do you ever type [[Attach:自由定制的风格.pdf]] in your page?

Or do you simply have (:attachlist:) or (:thumblist:) / Mini:* that will 
list all files?


Modifying $MakeUploadNamePatterns may break some links from your wiki 
pages to existing attachments.


> not understand why $UploadNameChars was not left to default by Petko
> in that case.

Because of the '\w' issue in different locales.

Petko


