|
Joined: Dec 2008
Posts: 1,515
Hoopy frood
|
OP
Hoopy frood
Joined: Dec 2008
Posts: 1,515 |
I am not sure if this is really a bug or feature or something else but it seems i cannot get the correct order when creating items in a hash, it seems to return what ever number wants instead the first added, i will provide a code example to reproduce. Example: //var %i = 1 | while (%i <= 1000) { hadd -m TEST test_ $+ %i 1 | inc %i } | echo -a IS: $hget(TEST,1).item It suposed to return: test_1 instead of test_789(tested also and with 1438 beta, same results) - Thanks!
|
|
|
|
Joined: Jan 2004
Posts: 2,127
Hoopy frood
|
Hoopy frood
Joined: Jan 2004
Posts: 2,127 |
That's because your use of -m instead of -m1 is using the default 100 buckets. When N buckets is greater than 1, it's going to assign them a sequence number that appears to be random, but that's caused by the items being shuffled out to the different buckets using a pseudo-random algorithm based on the itemname. It appears that the actual number of buckets is always an odd number, in spite of the value returned by $hget(table).size, because you always get the same shuffling of items for N as you get from $or(N,1). (It's possible it does $and(N,-2) but I doubt it, and don't know how to check which it is.) If your example had used buckets=1, you'd find that the 1st item is always the latest one created:
//hfree -w TEST | var %i = 1 | while (%i <= 1000) { hadd -m1 TEST test_ $+ %i 1 | inc %i } | echo -a IS: $hget(TEST,1).item
It appears /hadd creates new items by inserting them to the front of the sequential list. But if the item already exists, it instead gets updated in its current bucket, but I'm not sure whether it's deleted then added to the front of the bucket or continues to use its existing spot in the bucket. If you want $hget(table,X).item to be item name test_X, you need to create them in reverse order:
//hfree -sw test | hmake -s test 1 | var %i 20 | while (%i) { hadd test item $+ %i data | dec %i } | var %N 1 | while ($hget(test,%N).item) { echo 4 -a $ord(%N) item is $hget(test,%N).item | inc %N }
If you change the buckets 1 to be 100 you'll see that the items are created out of sequence, which is how you ended up with the 1st item being 783. If you increase %i from 20 to also be 100, you'll see it appear more shuffled. The fact that N=100 and i=20 doesn't appear randomly shuffled tells me that the items are put in buckets using a simple method, and is not using a 'real' hash function like $sha1, nor even $crc. It appears to use a case-insensitive version of the itemname, because changing your example from using itemnames starting with "test_" and "TEst_" give the same sequence, but other strings have a huge chance of having something else be your 1st item. If you need to hsave then hload your buckets=1 table, this causes your list to be in reverse order than created the next time you /hload, because it looks like /hload begins at the front of the file and /hadd's the info, then continues to the end of the file. To get your /hload'ed data to be in the original 'correct' order, you'd need to either /hload + /hsave + /hfree + /hload, or /hload it into a dummy table and clone it from there. To fix your issue would probably require a new switch for /hadd and /hload to append new items instead of inserting them. I see the -a switch is available.
Last edited by maroon; 13/10/18 10:24 PM.
|
|
|
|
Joined: Dec 2002
Posts: 5,493
Hoopy frood
|
Hoopy frood
Joined: Dec 2002
Posts: 5,493 |
Thanks for your bug report. This is by design. Items in hash tables are stored in a completely random order. That is how hash tables work. No switch can be added to make them save in any particular order.
|
|
|
|
Joined: Aug 2003
Posts: 320
Pan-dimensional mouse
|
Pan-dimensional mouse
Joined: Aug 2003
Posts: 320 |
Yes - this is how hash tables work (hence the term "hash"). An explanation of how hash tables works follows - but see the solution at the end. A hash table works by creating a number of slots (which you can specify). The item name is fed into a hashing algorithm to select which slot to put an item on. Each slot is then a linked list of items and your new item is inserted at the very beginning of the linked list of the slot that the hash algorithm selects. When you want to search for the data by item name, mIRC uses the same algorithm to choose the slot and then iterates down the linked list to see if it can find the item you want. When you do a positional $hget for the first item, mIRC gives you the first item in the linked list on the first slot. But since it is unlikely that the first item you added hashed to the first slot, you won't get that - instead you get the first item whose hash points to the first slot. The default number of slots if you do not specify it explicitly is 100, which actually results in 101 slots because mIRC always makes the number of slots an odd number because of the way that the hashing algorithm works. The surprise is that (for 101 slots) that none of the first 788 items hashed to the first slot when you might expect that to have happened a lot sooner. Now, if you have followed my explanation you might have spotted two things: 1. Performance If you have too few slots for the number of items, then a look up by item name will pick the slot quite fast, but will then have to iterate down the linked list to find the item wanted which could be a lengthy CPU operation if the linked list is a long one. In general terms, the more slots you have, the shorter the average length of the linked list. (However you cannot guarantee that the items will be distributed evenly over the available slots. The worst case scenario is that by chance all the item names hash down to the same slot and you get a single very long linked list on that slot and empty linked lists on the other slots. But given a reasonably random set of item names, usually they will be reasonably evenly distributed across slots.) I have not benchmarked the performance for a random lookup in a hash table with random item names depending on the slot size - but my guess is that the best performance will be when the number of slots is close to the number of items in the table. Khaled may have done some benchmarking when he suggests that you use a number of slots about 1/10 the number of items, but it seems to me that might mean having linked lists between (say) 5 and 20 long - whereas with the number of slots close to the number of items the linked lists should only be (say) 1 to 5 items long. Wikichip suggests that the optimum number of slots is 1.282x the number of entries. Each slot only takes a very small number of bytes - far smaller than the bytes required to store the items themselves - so in most cases it seems to me that there is little downside to using a relatively large number of slots. (The maximum number of slots is 10,000 (i.e. actually 10,001) which uses 40K of memory - and that is not a huge amount in today's PCs, so I tend not to worry too much about excess slots.) HOWEVER that applies only to lookup by item name. $hfind also allows you to search by item name using the item name as a wildcard match to text, or as wildcard text matching the item name, or using the item name as a regex to match to text or as regex text to match the item name. Since there seems to be no way to use a hash determine which slot the item might be on, it seems to me that it is likely that mIRC has to iterate across all linked lists on all slots to complete the find, in which case the performance is unlikely to be effected by the number of slots. Ditto when you do any of the above searches on the data rather than the item. So for $hfind which iterates across all entries regardless of number of slots, you might as well use 1 slot. But if you are using $hget to do a hash lookup based on a specific item, then you should use as large a number of slots as you think you need. 2. If you have digested all of the above then you may have realised that you can make a hash table store items sequentially so that you can iterate over them as the original poster would like to by using -m1 to create a single slot hash table and so a single linked list. Since each item is added to the beginning of the list, the sequence they are added is preserved (but reversed) when you iterate using $hget(table,n) in other words you can either store them in reverse order or access them in reverse order: //hfree TEST | //var %i = 1 | while (%i <= 1000) { hadd -m1 TEST test_ $+ %i 1 | inc %i } | echo -a Item 1: $hget(TEST,1).item Item 1000: $hget(TEST,1000).item
//hfree TEST | //var %i = 1000 | while (%i) { hadd -m1 TEST test_ $+ %i 1 | dec %i } | echo -a Item 1: $hget(TEST,1).item Item 1000: $hget(TEST,1000).item
Last edited by Protopia; 13/10/18 08:18 PM.
|
|
|
|
Joined: Aug 2003
Posts: 320
Pan-dimensional mouse
|
Pan-dimensional mouse
Joined: Aug 2003
Posts: 320 |
To fix your issue would probably require a new switch for /hadd and /hload to append new items instead of inserting them. I see the -a switch is available. The reason that mIRC inserts at the beginning of the linked-list is because it is an easy, low cost thing to do. To insert entries at the end of the list you would either need to iterate all the way to the end of the list which would kill performance, or double the number of bytes for each slot in order to hold the pointer to the end of the list (which on today's modern PCs may not be that much of an issue).
|
|
|
|
Joined: Aug 2003
Posts: 320
Pan-dimensional mouse
|
Pan-dimensional mouse
Joined: Aug 2003
Posts: 320 |
P.S. I have no idea how to calculate the number of slots to get the best performance from a hash table with N entries. But the following analogy suggests that a significantly higher number of slots should be used than anyone has previously suggested.
Q: How many slots are needed for a hash table with 23 entries in order to have a probability > 50% that every entry will be in a separate slot? A: 366
I know this sounds improbable, but the reverse of this is a well known mathematical puzzle:
Q: How many students do you need in a class to have a probability that two or more share a birthday > 50%? A: 23
Obviously, the overhead to iterate over two or five or even ten entries on a slot is not going to be that significant, and the mathematics is much more complicated, but the moral seems reasonably clear: if you are doing hash table look-ups by item name, then using a large number of slots might help performance a lot without using a huge amount of memory.
OTOH if you are iterating using $hget(table,n) or using $hfind - both of which need to iterate over the entire table - you might as well use -m1 for a single slot and save the memory overhead of a hash table which will never be used.
|
|
|
|
Joined: Jan 2004
Posts: 2,127
Hoopy frood
|
Hoopy frood
Joined: Jan 2004
Posts: 2,127 |
Thanks for the reply, Protopia. A few points about yours and Khaled's posts. 1. Some readers may be confused because you use "slot" to have the same meaning as /help uses "buckets". Just like other references may use the 'key' term in the way /help uses 'item'. 2. Yes you can place items in forward order by /hadd'ing them in reverse order, which was what my example showed. But that's only for the initial creation, and as my examples above showed, as you /hsave and /hload the table, the item order gets flipped. So you either need to do the /hsave + /hload twice, or have a temp table you use to un-flip the order. 3. Khaled posted: Items in hash tables are stored in a completely random order.
They are shuffled into buckets in a way that appears random, especially if the item names are not similar to each other. However they're not truly random because the same combination of item names and buckets always results in them being in the same sequence.
//hfree -sw test | hmake -s test 100 | var %i 1 , %a | while (%i isnum 1-999) { hadd test item $+ $base(%i,10,10,3) data | inc %i } | var %n 1 | while ($hget(test,%n).item) { var %a $sha1(%a $v1) | inc %n } | echo -a hash of item sequence %a
This creates a unique sha1 hash based on the sequence the 999 items are distributed into buckets/linked-chains. Repeating the same command always results in the identical sort-order. I'm sure it's because it's best to have the number of buckets and the number of items be relatively prime to each other, and being an odd number is deemed 'close-enough'. The default 100 behaving like N=101 buckets, which is a prime number. The max 10000 behaves like 10001 which is not prime, as the largest prime <= 10001 is 9973. This alias also shows that buckets=even_number and buckets=even_number+1 always produce the same order. There's precedence for the order to change depending on the version, as the order in 6.35 is different than it currently is. The allocation of items to buckets doesn't appear to be a sophisticated algorithm, or even something simple like $crc(). It's possible to have combinations of item-count and buckets which results in items not appearing to be very well shuffled at all, which you wouldn't expect from something that's trying to pseudo-randomly assign items to buckets. For example, for most of this list it simply swaps an even item with even+1: //clear | hfree -sw test | hmake -s test 101 | var %i 20 | while (%i) { hadd test item $+ %i data | dec %i } | var %N 1 | while ($hget(test,%N).item) { echo 4 -a $ord(%N) item is $hget(test,%N).item | inc %N }
Based on the name of the identifier, I'm wondering if $hash() is used when assigning buckets to each item, since it's very prone to having identical 24-bit hashes to similar strings. No switch can be added to make them save in any particular order.
Having a switch for /hload which begins with the last pair of lines in the file then works its way forward would simulate such a switch, though I suppose there might be a performance hit for doing so. Either that, or people would be forced to script their own fake /hload alias which would use $read() and /hadd to load the disk file into the same order in which the items were /hsave'ed.
|
|
|
|
Joined: Dec 2002
Posts: 5,493
Hoopy frood
|
Hoopy frood
Joined: Dec 2002
Posts: 5,493 |
They are shuffled into buckets in a way that appears random, especially if the item names are not similar to each other. However they're not truly random because the same combination of item names and buckets always results in them being in the same sequence. Well, from the perspective of the scripter, they are random. You should not depend on items in hash tables being stored in any particular order. As this same topic has been discussed so many times, I will be adding a note to the help file. Thanks for your comments everyone.
|
|
|
|
Joined: Feb 2003
Posts: 2,812
Hoopy frood
|
Hoopy frood
Joined: Feb 2003
Posts: 2,812 |
Based on the name of the identifier, I'm wondering if $hash() is used when assigning buckets to each item, since it's very prone to having identical 24-bit hashes to similar strings. maroon: Yes, $hash(,32) is used for hash tables. Also, I wasn't able to reproduce successive /hload and /hsave operations to reverse the ordering of the items as they appear in the file. I might not be understanding it right. //var %id = testing.hlist | hmake %id | hadd %id a a | hadd %id b b | hadd %id c c | hsave %id %id | loadbuf -api %id | hfree %id | hmake %id | hload %id %id | remove %id | hsave %id %id | loadbuf -api %id | remove %id | hfree %id
Well. At least I won lunch. Good philosophy, see good in bad, I like!
|
|
|
|
Joined: Dec 2002
Posts: 5,493
Hoopy frood
|
Hoopy frood
Joined: Dec 2002
Posts: 5,493 |
maroon: Yes, $hash(,32) is used for hash tables. Hash tables use a different hashing algorithm. They may have used the same method long ago though. I haven't tested either in a long time. Neither are likely to achieve good results with SMhasher. The algorithm for $hash() cannot be changed since it has been in use for so long and its results may be relied on by scripts. The algorithm for hash tables could be changed. The current algorithm was chosen at the time because it was simple, fast, and "good enough".
|
|
|
|
Joined: Jan 2004
Posts: 2,127
Hoopy frood
|
Hoopy frood
Joined: Jan 2004
Posts: 2,127 |
Raccoon, your example didn't use buckets=1, which is the only case where items remain in a sequential pattern in the order they were created.
//var %id = testing.hlist | hmake %id 1 | hadd %id a a | hadd %id b b | hadd %id c c | hsave %id %id | loadbuf -api %id | hfree %id | hmake %id 1 | hload %id %id | remove %id | hsave %id %id | loadbuf -api %id | remove %id | hfree %id
It appears the physical order on disk depends on the number of buckets. Buckets=100 leaves the order unchanged. Buckets=1 flip the order. But in this next example, it looks like most of the other buckets keep the same order generally, but most items are flipped with nearby items. //var %buckets 5 , %id test.dat | hfree -w %id | hmake -s %id %buckets | var %i 10 | while (%i) { hadd %id item $+ %i 1 | dec %i } | hsave %id %id | filter -fs %id *item* | hfree %id | hmake -s %id %buckets | hload %id %id | hsave %id %id | filter -fs %id *item*
Khaled wrote
from the perspective of the scripter, they are random. You should not depend on items in hash tables being stored in any particular order.
I wasn't trying to depending on items being in a specific shuffled order when buckets is greater than 1, I was saying that repeating the same combo of buckets/names+number-of-items causes the identical pattern each time it's tried, and the example of a small quantity of similarly named items showed the order doesn't appear very random. But yes, different mIRC versions have different shuffling patterns than others. For the code in my 2nd post, or changing the last code in my 1st post to use buckets=101, the current version and 6.35 have different shuffling order from each other, but each version consistently shuffles the same things the same way each time. I had incorrectly left a "not" in my previous post, it should be "especially if the item names are not similar". When changing the buckets from 1 to 100 in the 2nd example of my 1st post, the items look like they're not shuffled very well at all. 8,9 6,7 4,5 2,3 1... with the 5-length and 6-length item names being shuffled as separate groups. The similar item names appear to be causing the lack of shuffling, as $hash() does not have very "random-looking" output. The article on $hash at wikichip shows some of the problems with $hash output; including poor distribution of output values, output values heavily dependant on input length, and changing 1 bit of the input changes very few bits of the output. https://en.wikichip.org/wiki/mirc/identifiers/$hash Bob Jenkins has a few varieties of hashes which are pretty fast, and might be a good replacement for $hash(). I've created aliases for his "Lookup3" and "One at a time" hashes at https://mircscripts.net/WHkiR and there is C source code around for both functions. The 1-at-a-time is probably the better choice for a replacement of $hash in bucketing hash items, as Lookup3 handles bytes in groups of 12, and might be more suited for longer strings - though hard to know without benchmarking them. $hget won't tell if these items are getting placed into the same bucket, or just happen to be in nearby buckets. If it's the latter, this is probably fine, since the goal of the hash assignment to buckets is to spread them into different buckets in a smooth frequency distribution and be able to quickly find which bucket something was already placed. But if the current hashing is not doing a good job at evenly distributing 10k items into 1k buckets, especially with short input strings similar to each other, then perhaps a different hash algorithm is needed.
|
|
|
|
Joined: Dec 2002
Posts: 5,493
Hoopy frood
|
Hoopy frood
Joined: Dec 2002
Posts: 5,493 |
Bob Jenkins has a few varieties of hashes which are pretty fast, and might be a good replacement for $hash(). Thanks for the suggestion. I performed some tests to check hashing algorithm speed/distribution, with different lengths/types of data, eg. short/long strings, numbers, different table sizes, etc. The current hashing algorithm distributes okay in some situations but is awful in others. I tested different algorithms, eg. lookup3, DJB2, FarmHash, xxHash, MurmurHash, FNV-1a, etc. and have settled on FNV-1a for simplicity. In all of the above tests, it worked just as well as the more complex algorithms and had good distribution. It is difficult to say how this change will affect hash table performance in general - most overhead is due to the scripting language and other features. But in specific contexts and with certain types of data, it could make a big difference. This will be in the next beta.
|
|
|
|
Joined: Aug 2003
Posts: 320
Pan-dimensional mouse
|
Pan-dimensional mouse
Joined: Aug 2003
Posts: 320 |
Leaving aside the issues of interpreted scripting having inherent overhead, I suspect that the biggest overhead with hash tables is iterating over the items in the slot (or if you are indexing by position or using $hfind iterating over the whole table).
Clearly the fastest lookup is a hash lookup where every occupied slot has only a single entry, avoiding iteration. In mathematics there is a well know "birthday problem" which says that it only takes 23 people for the probability that 2 or more of them share a birthday to be > 50%. This is equivalent to saying that for a hash table of 23 items, you would need 366 slots for the probability of every slot having at most one item to be > 50%. On the other hand, since the maximum 10,000 slots only takes 40KB of memory, on modern computers it is not that big a problem to use that value. The only downside is that in a non-hashed lookup you are going to be iterating over a lot of empty slots which has its own overhead.
Alternatively, it is easy to calculate the probability that, for a table of M items and N slots, every slot is used. ((N-1)/N)^M It turns out that you need 1.444 slots per item for there to be a 50% probability of all slots being used.
So my advice would be:
a. If you are only doing non-hashed lookups, use a slots value of 1.
b. If you are only hashed lookups, then use a slots value of 10,000.
c. If you are doing both hashed lookups and non-hashed lookups use slots = estimated-items * 1.444.
BUT...
If we are looking for optimising non-hashed table lookups, then my suggestion would be to allow the scripter to choose an alternative table implementation using flags to create a hash table stored internally:
a. An array of pointers to the item/data sequenced on item-name using binary search to locate the item. b. An array of pointers to the item/data sequenced on data using binary search. c. Use a. as an alternative to the current implementation for the list for each slot. d. As previously suggested, enhance the existing implementation by storing the first and last items of the linked list in each slot (using 8 bytes instead of 4) and add new items to the end of the list instead of the beginning - so that slots=1 stores the table exactly sequentially.
And aside from the overheads of adding or deleting an item from three lists (and the memory of course), you could even allow hashing, and the above two additional methods to be used in combination.
This would allow mIRC to have an arsenal of optimisation techniques to improve the speed of non-hashed lookups. For example a wild-match on item which doesn't have */? as the first character could be optimised to use 2 binary searches to find the beginning and end of the section of the sequenced array that holds the relevant items, and then search only within that section. Or a search with items as the wild-match could do something similar to look for table entries that start with */?/first character and search only those parts of the table.
As far as I can see, this would be fully functionally backwards and forwards compatible, and scripts written for the new functionality would only have a performance issue on older mIRC versions.
What do people think?
|
|
|
|
Joined: Dec 2002
Posts: 5,493
Hoopy frood
|
Hoopy frood
Joined: Dec 2002
Posts: 5,493 |
There are no plans to make any other changes to hash tables at this time. The only change I will be making is to the hashing algorithm. This change will be in the next beta.
|
|
|
|
Joined: Aug 2003
Posts: 320
Pan-dimensional mouse
|
Pan-dimensional mouse
Joined: Aug 2003
Posts: 320 |
Khaled - That's fine - I know there is a big difference between a small performance tweak to a hashing algorithm and a significant functional enhancement. And I don't have any specific need for this either - it was simply an observation.
But it might be a nice if you could bear it in mind for a future enhancement.
Last edited by Protopia; 15/10/18 04:44 PM.
|
|
|
|
Joined: Jan 2004
Posts: 2,127
Hoopy frood
|
Hoopy frood
Joined: Jan 2004
Posts: 2,127 |
Thanks. FNV looks fast because it's so simplistic. Just 1 mul and 1 xor. CPU Multiply isn't as slow compared to Add/Shift as it used to be, so that helps FNV, compared to One-At-a-Time trying to avoid MUL by using Add/Shift instead. I would expect Jenkins' Lookup3 to be slower, because it uses 12-byte blocks, so it wastes a lot of effort for item names whose lengths aren't at/just-below a multiple of 12.
Does the switch to FNV mean we'll have a new $fnv() identifier too? If so, would it be limited to the 32-bit variant due to avoiding big-number arithmetic? It looks like FNV-32 multiplies to a quotient as large as 56 bits, but FNV-64 would have a quotient up to 104 bits. I know some mIRC interal functions are able to have accuracy better than the 2^52 limit in $calc().
|
|
|
|
Joined: Jan 2004
Posts: 2,127
Hoopy frood
|
Hoopy frood
Joined: Jan 2004
Posts: 2,127 |
It is difficult to say how this change will affect hash table performance in general - most overhead is due to the scripting language and other features.
Compared to the no-beta 7.52, in this test beta 1897 benchmarked slower. The speed difference is around 30% at buckets=100, but the speed is much closer at greater and fewer buckets, so I don't see how this can be blamed on fnv1a. The speed in the prior beta is in-between the speed from the newest beta vs the no-beta. 7.52 1897 buckets 1 = hadd 29079 ms hget 28673 ms 7.52 1897 buckets 2 = hadd 21699 ms hget 23245 ms 7.52 1897 buckets 11 = hadd 11544 ms hget 13058 ms 7.52 1897 buckets 51 = hadd 5740 ms hget 6677 ms 7.52 1897 buckets 100 = hadd 4290 ms hget 5569 ms 7.52 1897 buckets 10000 = hadd 2886 ms hget 4180 ms 7.52 1438 buckets 100 = hadd 3837 ms hget 4992 ms 7.52 buckets 1 = hadd 29469 ms hget 28860 ms 7.52 buckets 2 = hadd 21778 ms hget 21933 ms 7.52 buckets 11 = hadd 10655 ms hget 11965 ms 7.52 buckets 51 = hadd 4306 ms hget 5164 ms 7.52 buckets 100 = hadd 3323 ms hget 4228 ms 7.52 buckets 10000 = hadd 2730 ms hget 3666 ms 6.35 buckets 1 = hadd 179573 ms hget 175438 ms 6.35 buckets 100 = hadd 4602 ms hget 5881 ms 6.35 buckets 10000 = hadd 3073 ms hget 4119 ms alias hadd_test {
hfree -w hashtest | hmake hashtest $1
var %j 99999 , %i %j , %ticks $ticks
while (%i) {
hadd hashtest item $+ %i
dec %i
}
var %msg hadd $calc($ticks - %ticks) ms , %ticks $ticks
while (%j) {
var %a $hget(hashtest,item $+ %j)
dec %j
}
echo -a $version $beta buckets $hget(hashtest).size = %msg hget $calc($ticks - %ticks) ms
}
|
|
|
|
Joined: Dec 2002
Posts: 5,493
Hoopy frood
|
Hoopy frood
Joined: Dec 2002
Posts: 5,493 |
I don't see how this can be blamed on fnv1a Oh the irony :-) It turns out that it is in fact due to fnv1a. I reverted to 1438 and just added the fnv1a hashing algorithm to it to reproduce the issue. It took some effort to narrow it down, and some performance profiling. The older hashing algorithm distributes data unevenly across buckets in the hash table, so some buckets have far fewer/greater items in them than others, however it is distributing individual items better. At least, with certain combinations of data. With some combinations of data, it leaves many buckets empty. The fnv1a algorithm distributes items more evenly across all buckets, however, due to its lack of avalanching, more similar items are being stored in the same buckets. This means that when a hash matches a bucket and it has to be searched, even though there are fewer items in it (because they're more evenly distributed across buckets), the string comparison call takes longer as it is matching more characters per item because they are more similar, even though there are fewer string comparisons. Unfortunately, every other hashing algorithm I have tried, eg. murmur3, xxhfast32, lookup3, takes more time to create the hash, at least for short strings. So it looks like we may just have to stick with the older algorithm.
|
|
|
|
Joined: Aug 2003
Posts: 320
Pan-dimensional mouse
|
Pan-dimensional mouse
Joined: Aug 2003
Posts: 320 |
Oh the irony :-) The older hashing algorithm distributes data unevenly across buckets in the hash table, so some buckets have far fewer/greater items in them than others, however it is distributing individual items better. At least, with certain combinations of data. With some combinations of data, it leaves many buckets empty.
The fnv1a algorithm distributes items more evenly across all buckets, however, due to its lack of avalanching, more similar items are being stored in the same buckets. This means that when a hash matches a bucket and it has to be searched, even though there are fewer items in it (because they're more evenly distributed across buckets), the string comparison call takes longer as it is matching more characters per item because they are more similar, even though there are fewer string comparisons.
All I can say is that you have put a lot of work into analysing the impact of this proposed change - and we (the mIRC users) are lucky to have someone as dedicated and diligent maintaining this tool.
|
|
|
|
Joined: Jul 2006
Posts: 4,188
Hoopy frood
|
Hoopy frood
Joined: Jul 2006
Posts: 4,188 |
What about adding a new switch to go with -m, to specify the name of the hashing algorithm to be used? Or maybe just a switch to specify the fnv1a algorithm without supporting any other if they are not so great.
#mircscripting @ irc.swiftirc.net == the best mIRC help channel
|
|
|
|
|