Originally Posted By: Khaled
Originally Posted By: maroon
I don't see how this can be blamed on fnv1a

Oh the irony :-) It turns out that it is in fact due to fnv1a.



I was trying to say the slowdown can't be blamed on the speed of using $hash() vs the speed of using $fnv1a(), because - as you said - the time spent hashing was a small portion of the task done by /hadd or $hget(table,item). FNV1a should be faster than $hash, because there's not much room for $hash to be quicker than FNV1a's 1 xor + 1 mul per byte, as long as the MUL isn't being calculated using floating point math.

I had seen the lack of avalanching in FNV1a, where the hash of 'a' through 'z' produces hashes where the top 20 bits rarely change. However, in spite of this, the hash MOD 101 looks like it does a good job of assigning them to a variety of buckets.

Quote:

string a fnv1a: E40C292C bucket 11
string b fnv1a: E70C2DE5 bucket 23
string c fnv1a: E60C2C52 bucket 19
string d fnv1a: E10C2473 bucket 100
string e fnv1a: E00C22E0 bucket 96
string f fnv1a: E30C2799 bucket 7
string g fnv1a: E20C2606 bucket 3
string h fnv1a: ED0C3757 bucket 47
string i fnv1a: EC0C35C4 bucket 43
string j fnv1a: EF0C3A7D bucket 55
string k fnv1a: EE0C38EA bucket 51


However FNV1a is supposed to a low collision rate, where different strings have the same hash, which is a property that $hash() doesn't have. It's easy to find many such strings:

Code:
//echo -a $base($hash(abcdef,32),10,16) same as $base($hash(abddee,32),10,16)


Can you confirm how mIRC has been assigning items to buckets prior to the change to FNV1a? By looking at the order in which $hget(table,N).item sorts the items, I would expect $calc(1+($hash(itemname,32) % 101)) to be sequentially changing as you work your way through that list from the first through last N.

I do see the newest beta sorting items in the $hget(table,N).item order based on this MOD calculation if it uses the FNV1a-32 hash. But it doesn't work for me in the older versions when I substitute $hash(item,32) in place of FNV1a.

With items in 1897 assigned to buckets 1-101 using (1+(FNV1a(itemname) % 101)) I wouldn't think the lack of avalanche causes the hash of similar strings to have a bias toward being the same value MOD 101.

However I still see evidence that FNV1a isn't responsible for the delay shown by hadd_test:

1. If the earlier /hadd_test alias is changed so the identical portion of the string is lengthened and moved to the tail end of the strings, any possible FNV1a-vs-$hash() speed is still a small portion of the overhead time, and the string at the end gives plenty of time for the avalanche to happen, making it even more unlikely that similar strings get placed into the same bucket causing the slowdown. I'm still seeing the same 30% slowdown when the alias changes the item names from $(item,%i) to $+(%i,evenlongeritemname), where 18 extra characters surely gives the avalanche a chance to spread the slight differences in input strings into diverse hashes that would be even less likely to be associated with similar remainders MOD 101. In my alias, ALL the item names in this example were short + similar, so there should be minimal time difference in finding the difference between item12345/item12445 vs having to tell the difference between item12345/item12355. Surely any problem with avalanching should have been even worse with $hash:

Code:
//var %bits 32 , %a $hash(mIRC,%bits) , %b $hash(mIRD,%bits) | echo -a $base(%a,10,2,32) bucket $calc(%a % 101) | echo -a $base(%b,10,2,32) bucket $calc(%b % 101)



This example changed only 1 bit of the output, and it's not difficult to find many such examples. But since $hash(anything,32) always returns the lowest 8 bits as zeroes, dividing by 101 will usually shuffle them to buckets far apart. Even if the hashing algorithm were $hash(string,24) where the changed bit is the last bit, the sequential output would have assigned them (MOD 101) to different buckets, and I would expect FNV1a to do even better.

2. As my prior post showed, I was seeing much more of the slowdown between 752.notbeta-vs-1438 than I was seeing between beta 1438-vs-1897. Since 1897 shuffles $hget(table,N).item into a different sequence than not-beta does, but 1897 does the same sort order as not-beta, the slowdown in 1438 can't be blamed on FNV1a.

3. If the slowness were caused by FNV1a shuffling more similar strings into the same bucket, that could explain the increase in time required by $hget(table,item) to locate the item within the bucket, but wouldn't explain the extra 30% time to /hadd them into the bucket - unless there's also some kind of bucket indexing that requires extra time when strings within 1 bucket are too similar.

I've not yet replicated the poor bucket shuffling you're seeing. For example, when I take the FNV1a hash of 'a' through 'z' then MOD by 101, I'm finding these 26 hashes assigned to 26 different buckets.

-

FYI Background

Below is an alias that matches the test vectors at the FNV site. It would be a lot faster using the assembler code at that site which doesn't need to use extra calculations to evade the 2^52 limit of floating point math. The quote near the top of the post is created by:

Code:
//var %i $asc(a) | while (%i isnum 1-122) { var %string $chr(%i) | echo -a string %string fnv1a: $fnv1a-32(%string,h) bucket $calc(1+($fnv1a-32(%string) % 101)) | inc %i }



Code:
alias fnv1a-32 {
  noop $regsubex(temp,$1,,,&temp) | var %len $bvar(&temp,0) , %i 0 , %hash 2166136261
  while (%i < %len) {
    !inc %i
    !var %hash $xor(%hash,$bvar(&temp,%i))
    !var %hash $calc(( (%hash % 256) * 16777216 + %hash * 403) % 4294967296 )
  }
  if (h isin $2) return $base(%hash,10,16,8) | return %hash
} ; by maroon 2018
; If not for the 2^52 limit, the MUL could have been $calc((%hash * 16777619) % 4294967296)
; because the bits of the product above 2^32 aren't needed. $fnv1a-32(string,[h]) h=hash as hex


Below is how I see (FNV1a MOD 101) used in beta 1897 to assign items to buckets, but can't see the same thing in older versions using $hash(item,32). Regardless whether the item-names are random or similar, $hget(table,N).item in 1897 lists them in bucket order regardless of the order of creation. Similar input strings seem no more likely to be assigned to the same bucket than when they're randomly different.

Code:
//hfree -sw test | hmake -s test 101 | var %i 20 , %N 1 | while (%i) { hadd test $base($rand(1,4503599627370495),10,36) | dec %i } | while ($hget(test,%N).item) { echo 4 -a $ord(%N) item is $v1 bucket: $calc(1+($fnv1a-32($v1) % 101)) | inc %N }