[collectd] patch for email stats from postfix and amavisd-new, dns stats from powerdns

Luke Heberling collectd at c-ware.com
Thu Nov 29 20:28:33 CET 2007


Florian Forster wrote:

> Why primes? 2^k-1 I would have understood, basically anything that
> has to with padding and avoiding extra bytes, but primes? What do
> they have to do with it?

It's for decent table distribution in the face of lesser quality hash
routines which often have a bias. Even good quality hash routines have
been shown to have bias with particular data sets. There was a fairly
significant amount of attention paid to this in the java community a
while back with the release of the 1.3 ? jdk. The Hashtable class had
always used prime numbers for its hash table length, but it was
superseded by the HashMap class which always uses powers of two.
Apparently this was because of regressions in the efficiency of modulo
operations associated with the hotspot compiler. Long story short they
found that the % operator was much more efficient on power of two table
lengths, and that biased hash functions could be effectively neutralized
by rehashing; hashing the hash, as it were. (1)

In related news, I found a comparison of hash functions that discusses
this issue. (2) This guy seems to have certainly done his homework. His
conclusion is that "Table lengths should always be a power of 2 because
that's faster than prime lengths and all acceptable hashes allow it". In
light of this, I suggest we use power of two table lengths and either
rehash the hash or just rely on the programmer to use a high quality
hash function. The rehash function used by the java HashMap may be
accessible to us under the gpl license. There's actually two of them, an
'new' and an 'old' function. The new one may have even better
distribution and both are very fast.

(1) http://www.javaspecialists.co.za/archive/Issue054.html
(2) http://www.burtleburtle.net/bob/hash/doobs.html

> I doubt that anyone needs more than 419e6 entries in some table, but
> why not make this dynamic? History has shown that limiting a list or
> counter is problematic. Think Y2K-bug ;)

Yes, artificial limits are annoying. Larry Wall would be proud. We can
maybe get a twofer by using power of two table lengths.
Off-topic, but I'd make the silly distinction that I the 491e6 limit is
arrived at by determining that the library is well nigh useless long
before it is reached (3) and the y2k limit was arrived at by people who
determined that when it was reached it would be somebody else's problem. :-)

(3) By george, who has half a billion "most recently used" objects? even
half a million is pretty ridiculous. If you need that much storage use a
regular hashtable and forego the linked list overhead.

> Maybe the right way to do this is to implement a method that returns
> the data in lines, similar to fgets. This function could check if the
> file was moved when it reads an end-of-file, open the new file and
> return the first line of that file.

I like this approach because it could probably hide the FILE* pointer
entirely.  Which brings to mind another method which goes a little
further. We could abstract even the read loop using a function pointer.
By way of example:

from header:
    cu_tail *create( char *name, int flags, int buffer_len );
    void destroy( cu_tail *tail );
    void tail ( cu_tail *tail, process_line_func func );
    char *buffer( cu_tail *tail );
    int buffer_len( cu_tail *tail );

    /* return non-zero to break processing */
    typedef int (*process_line_func)(cu_tail *tail);

usage:
    int main() {
       cu_tail *ptail = create( "/var/log/mail.log", 0, 256 );
       while( 1 ) {
          tail( ptail, process_line );
          sleep( 5 );
       }
    }

    int process_line( cu_tail *tail ) {
       char *buffer = buffer( tail );
       int buffer_len = buffer_len( tail );
       /* do something */
       return is_failure;
    }

I'd be happy to present an api and implementation if you find this to be
a good method. By the way, what is the cu_ prefix to the cu_tail_*
functions in your example? I'm presuming "common utility" or similar as
a modifier to create a unique name in the case it is used in projects
other than collectd. Is this a prefix which would be well served on the
mru_* functions?

Luke Heberling




More information about the collectd mailing list