mIRC Home    About    Download    Register    News    Help

Print Thread
#156758 20/08/06 06:35 AM
Joined: Jun 2003
Posts: 12
Y
Yil Offline OP
Pikka bird
OP Offline
Pikka bird
Y
Joined: Jun 2003
Posts: 12
I'm concerned that the $hash function doesn't have enough randomness to it. I first noticed this when benchmarking to see if there was a speed benefit to using local aliases. Basically I created a whole file of empty aliases which were similiarly named and noticed performance calling one of those aliases increased DRAMATICALLY based on the number of aliases in the file. Since I presume the same hash function is used internally for everything from variables to aliases I started testing this hypothesis using the $hash function.

As I understand it the $hash function takes a string and the number of bits to return. However assuming hash tables of roughly 10 bits or 1024 entries I was shocked to discover that every single letter/number/etc hashes to the value 1. In fact it's not until you add 17 bits of precision (128k) that it changes from the value 1. I don't think anybody is using 128k sized hash tables which means for all intents and purposes every single letter variable or alias is a hash collision if the result of the hash function isn't further modified. Given that it takes a bit parameter my guess is it isn't...

For the record every 2 letter/number/etc combination appears to collide at 2 for 10 bits, but starts to diverge very slowly starting at 11bits. Still very bad!

At 3 or more letters you start to see some randomness, but it's still somewhat limited because the last characters of each string contribute very little to the hash! For example: test-1 through test-9 are the same hash at 10bits, test-10 through test-99 are the same, test-100 through test-199 are the same. Test-1aa through test-1ZZ are all the same and only 1 off from the 100! In fact it's nearly impossible to get the last character in the string to modify the value of the hash for any reasonable bit size argument to the $hash function. I'm pretty sure that's the DEFINITION of a bad hash algorithm.

For the record 1000 aliases named test-#### suffered a bad performance hit, 1000 totally randomly generated alias names had little effect on performance.

I'm implementing stacks/temp storage in a hash using numbers so I'm guessing that I have insane hash collisions since the single digit numbers are all colliding and the 2 digits are all colliding. Given the poor parsing performance of mIRC it's probably hiding this, but why not choose a better hash algorithm...

Check out: General Purpose Hash Functions

Max bucket length would be a good property to use with $hget()...

Do scripted hash tables actually use the size specified in hmake or do they scale up to the next bit size? Since the $hash function isn't introducing enough randomness into the high bits it won't really help for smallish strings...

#156759 21/08/06 05:21 AM
Joined: Jan 2004
Posts: 2,127
Hoopy frood
Offline
Hoopy frood
Joined: Jan 2004
Posts: 2,127
the $hash function is definitely not 1-way, and is easy to create strings to match a specific hash.

//echo -a $base($hash(ABC,24),10,16)

gives: 414243
... which are the hex representation of the 3 bytes being hashed.

//echo -a $base($hash(ABCFED,24),10,16)
gives: 878787

Increase the 1st character by 1 and decrease the 4th by 1, and //echo -a $base($hash(BBCEED,24),10,16) gives the same output.
Don't bother using $hash(string,N) with N larger than 24, since it simply shifts the bits to the left. Instead, you could try using the $md5 hash that's been available since 6.03 - If the hash is too long, then grab an arbitrary length from the front or end.

#156760 23/08/06 10:25 PM
Joined: Jun 2003
Posts: 12
Y
Yil Offline OP
Pikka bird
OP Offline
Pikka bird
Y
Joined: Jun 2003
Posts: 12
I think you completely missed the point. I'm not trying to use the $hash function itself, I'm showing how I essentially proved that every single letter alias/hashtable entry collides because of the crappy hash function. Every two letter alias/hashtable entry probably collides for reasonable sized hash tables. I also showed that generic prefix-suffix names have serious collision issues unless suffix is greater than 2 characters.

Imagine the performance issues this is causing! A simple change to the hash function would fix this.

#156761 24/08/06 03:28 AM
Joined: Jan 2004
Posts: 2,127
Hoopy frood
Offline
Hoopy frood
Joined: Jan 2004
Posts: 2,127
They can't change the way $hash works, to maintain compatibility with older versions. If any changes, they would need to add an additional hash function. EIther that, or you can use $md5 or $crc. $md5 is more secure, being a one-way hash, but $crc is good for catching bit errors, since a 1-bit change in a file is guaranteed to change the hash, while $md5 and other crypto hashes don't guarantee this.

//echo -a $left($md5(string),digits)

//echo -a $crc(string,0)

#156762 24/08/06 06:37 AM
Joined: Sep 2003
Posts: 4,230
D
Hoopy frood
Offline
Hoopy frood
D
Joined: Sep 2003
Posts: 4,230
He isnt talking about the OUTPUT of the hash function, rather the internal workings of the hashing algorithim. Hes saying (i dont agrree or disagree as i havent tested and dont know) that with similar named ITEMS in a hash table alot of them well have identicle lookup values (collides).
Amagine a 1024 entry entry hash table /HMAKE TABLE 1024, (0-1023). This well hash down any ITENNAME to a 10bit value (0 to 1023), that slot in the table is then looked at to see if its already in use, and if it is a pointer to the next free slot after 1023 is inserted into it, so that now 2 ITENAMES with the same hashvalue exist in the table, if a 3rd one is created it well be allocated a new free space and its linked pointer added to the previous last one in the chain.
Now what hes saying I gather is there are a huge number of collisions of these slot numbers if the ITEMNAMES are similar so using ITEMNAMES like BLAHWHO.NNN may not be the bext soultion. (personally im not sure exactly what item names hes saying are bad and which ones arent, or if its just if u use a two smaller hash table size or what)

#156763 24/08/06 12:35 PM
Joined: Dec 2002
Posts: 2,962
S
Hoopy frood
Offline
Hoopy frood
S
Joined: Dec 2002
Posts: 2,962
Yil isn't just talking about the hashing algorithm used internally for hash tables, he's talking about the $hash() identifier and assuming it's the same as the internal hash table algorithm. maroon is pointing out that even if Yil's assumptions are correct and the hash table algorithm should be changed, Khaled can't simply update the $hash() identifier too as it would break compatability with existing scripts.


Spelling mistakes, grammatical errors, and stupid comments are intentional.
#156764 24/08/06 02:38 PM
Joined: Sep 2003
Posts: 4,230
D
Hoopy frood
Offline
Hoopy frood
D
Joined: Sep 2003
Posts: 4,230
yes, i read that, my percerption of what he was saying, is that he begain with that, and then progressed onto the believe (im not saying i agree here) that hash tables hashing isnt overly good, and should be corrected. The later being his main thrust of his concern, i didnt expect he was of the opion the hash output could just be altered, or if he was he was pretty niave to think that.

Before I accept that what he thinks about the hashing is correct, I would like to see some of the tests he performed, to come to that solution. I personally find that considering the speed the rest of a mirc script runs at its hardly an overriding problem. smile


Link Copied to Clipboard