[collectd] The joy of (different) encodings

Staněk Luboš kolektor(a)atlas.cz
Mon Nov 27 20:19:15 CET 2006


Hi Florian,
I always seem to step into trouble. I do not consider encodings to be a
joy. :)
But I have had to handle such tasks since my earliest days with computers.


Florian Forster wrote:
> Hi Lubos,
> 
> I've broken the thread, since ultimately this isn't an ignorelist
> problem.
> 
> On Fri, Nov 24, 2006 at 12:29:36PM +0100, Luboš Staněk wrote:
>> There is another problem. My regex implementation is for 8-bit
>> characters. It will work with UTF-8 strings until you want to check
>> special character properties.
> 
> As far as I know, the POSIX regexen use the `locale' setting to
> determine the character set. This may or may not match the actual
> encoding of the config file, which might cause problems. I see two
> possible ways around this:
> 1) Determine the file's encoding, temporarily set the locale and parse the
>    file. This, however, may cause problems later on, and I don't think
>    setting the locale and never changing it back is a good thing.
> 2) Convert all external data (config file, data read, network stuff) to
>    our current locale. This may restrict the useable characters, but
>    it's now the user's responsibility to set the correct locale, as with
>    all other programs.
> 

Yes, you are right. The regexp functions use the LC_CTYPE locale category.
I do not know locales well enough to judge the rest.
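
To illustrate the locale dependence, here is a minimal sketch, assuming a
hypothetical helper is_alpha_word() (it is not part of collectd). It sets
LC_CTYPE, compiles a POSIX extended regex with a character class and matches
one string against it. Note that setlocale() changes process-wide state,
which is exactly the side effect discussed above.

```c
#include <assert.h>
#include <locale.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper for illustration only.
 * Returns 1 if `s` consists solely of [[:alpha:]] characters under the
 * given LC_CTYPE locale, 0 if it does not, and -1 if the locale is not
 * available or the pattern cannot be compiled. */
int is_alpha_word(const char *locale_name, const char *s)
{
    regex_t re;
    int rc;

    assert(locale_name != NULL && s != NULL);

    /* The character classes used by regcomp() follow LC_CTYPE. */
    if (setlocale(LC_CTYPE, locale_name) == NULL)
        return -1;

    if (regcomp(&re, "^[[:alpha:]]+$", REG_EXTENDED | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, s, 0, NULL, 0);
    regfree(&re);

    return rc == 0 ? 1 : 0;
}
```

In the "C" locale, is_alpha_word("C", "Lubos") returns 1, while the UTF-8
bytes of a non-ASCII letter are not treated as letters. With a UTF-8 locale
installed (for example "cs_CZ.UTF-8", which is an assumption about the
system), the same string may match instead.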


> Also, since strings are exchanged over the network, either the used
> encoding needs to be transferred, or we should use a unified encoding,
> such as UTF-8. Since UTF-8 is the future and it's hip and it's colorful
> (think: executive talk), I'd prefer it.
> 

I would prefer it too.


>> You may object that this will be rare. With HAL-mounted media
>> it is possible; take into account that a UDF DVD has a Unicode volume
>> label.
> 
> That's right, it _is_ rare. We could substitute eight-bit characters
> with something else, e.g. a question mark as many other applications do,
> to save users from weird characters being displayed. That should be
> trivial to implement.
> 
> Any thoughts are welcome. Regards,
> -octo
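
The substitution suggested above could look roughly like the following
byte-wise sketch. The helper name sanitize_ascii() is hypothetical and not
part of collectd, and treating everything outside printable 7-bit ASCII as
"weird" is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for illustration only.
 * Replaces every byte outside printable 7-bit ASCII (0x20..0x7e) in the
 * NUL-terminated string `buf` with `subst`, in place.
 * Returns the number of bytes that were replaced. */
size_t sanitize_ascii(char *buf, char subst)
{
    size_t replaced = 0;
    unsigned char *p;

    assert(buf != NULL);

    for (p = (unsigned char *)buf; *p != '\0'; p++) {
        if (*p < 0x20 || *p > 0x7e) {
            *p = (unsigned char)subst;
            replaced++;
        }
    }
    return replaced;
}
```

Because the helper works on bytes, one multi-byte UTF-8 character turns into
several substitute characters: a two-byte character such as "š" becomes "??".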

I did a quick analysis.
Different encodings could appear in any plugin's configuration in
collectd.conf, mainly in the ignorelist, in the strings of collected
entities, and in filenames.

Could the following be a usable concept?

1) In 'Local', 'Log' and 'Client' mode we could expect collectd.conf in
the daemon's locale, and plugins would report in the daemon's locale. This
is almost the current state; the user is responsible for setting it all up.
The 'Client' would report the locale of its data to the server.

2) In server mode we could work in the daemon's locale, get data from
clients, convert it from the reported locale if necessary, and use Unicode
filenames. The conversion should handle character substitution (such as
'?' or '_').
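
The server-side conversion in point 2 could be sketched with the POSIX
iconv interface as below. This is only an illustration under assumptions:
the helper name to_utf8() is hypothetical, the target is fixed to UTF-8,
the client is assumed to report a charset name that iconv_open() accepts,
and '?' is used as the substitute.

```c
#include <assert.h>
#include <errno.h>
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper for illustration only.
 * Converts the NUL-terminated string `in` from `from_charset` to UTF-8 and
 * stores the NUL-terminated result in `out`. Input bytes that cannot be
 * converted are replaced by '?'.
 * Returns 0 on success, -1 if the charset is unknown or `out` is too small. */
int to_utf8(const char *from_charset, const char *in,
            char *out, size_t out_size)
{
    iconv_t cd;
    char *src = (char *)in;  /* iconv() wants char **, input is not modified */
    size_t src_left;
    char *dst = out;
    size_t dst_left;
    int status = 0;

    assert(from_charset != NULL && in != NULL && out != NULL);
    if (out_size == 0)
        return -1;
    out[0] = '\0';
    src_left = strlen(in);
    dst_left = out_size - 1;  /* keep one byte for the terminating NUL */

    cd = iconv_open("UTF-8", from_charset);
    if (cd == (iconv_t)-1)
        return -1;

    while (src_left > 0) {
        if (iconv(cd, &src, &src_left, &dst, &dst_left) != (size_t)-1)
            break;  /* all input converted */

        if ((errno == EILSEQ || errno == EINVAL) && dst_left > 0) {
            /* Invalid or truncated input: skip one byte, emit '?'. */
            *dst++ = '?';
            dst_left--;
            src++;
            src_left--;
        } else {
            status = -1;  /* output buffer too small (E2BIG) */
            break;
        }
    }

    *dst = '\0';
    iconv_close(cd);
    return status;
}
```

A client running in an ISO-8859-2 locale would then be handled with a call
such as to_utf8("ISO-8859-2", name, buf, sizeof buf). Which charset names
are accepted depends on the iconv implementation of the server's system.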

I think it is more likely that a server runs on a better and more current
system than the clients.

Best regards,
Lubos


