Posted By: westor /hadd does not save items in correct order - 13/10/18 04:30 PM
I am not sure if this is really a bug or a feature or something else, but it seems I cannot get the correct order when creating items in a hash table: it returns whatever item it wants instead of the first one added. I will provide a code example to reproduce it.

Example:

Code:
//var %i = 1 | while (%i <= 1000) { hadd -m TEST test_ $+ %i 1 | inc %i } | echo -a IS: $hget(TEST,1).item

It is supposed to return test_1, but instead returns test_789.

(also tested with the 1438 beta; same results)

- Thanks!
That's because your use of -m instead of -m1 is using the default 100 buckets. When N buckets is greater than 1, it's going to assign them a sequence number that appears to be random, but that's caused by the items being shuffled out to the different buckets using a pseudo-random algorithm based on the itemname.

It appears that the actual number of buckets is always an odd number, in spite of the value returned by $hget(table).size, because you always get the same shuffling of items for N as you get for $or(N,1). (It's possible it instead does $and(N,-2), but I doubt it, and I don't know how to check which it is.)

If your example had used buckets=1, you'd find that the 1st item is always the latest one created:

Code:
//hfree -w TEST | var %i = 1 | while (%i <= 1000) { hadd -m1 TEST test_ $+ %i 1 | inc %i } | echo -a IS: $hget(TEST,1).item



It appears /hadd creates new items by inserting them at the front of the sequential list. But if the item already exists, it instead gets updated in its current bucket, though I'm not sure whether it's deleted and re-added to the front of the bucket or keeps its existing spot in the bucket.

If you want $hget(table,X).item to be item name test_X, you need to create them in reverse order:

Code:
//hfree -sw test | hmake -s test 1 | var %i 20 | while (%i) { hadd test item $+ %i data | dec %i } | var %N 1 | while ($hget(test,%N).item) { echo 4 -a $ord(%N) item is $hget(test,%N).item | inc %N }



If you change the buckets from 1 to 100, you'll see that the items are created out of sequence, which is how you ended up with test_789 as the 1st item. If you also increase %i from 20 to 100, the result appears more shuffled. The fact that N=100 with i=20 doesn't appear randomly shuffled tells me that the items are put into buckets using a simple method, not a 'real' hash function like $sha1, nor even $crc. It appears to use a case-insensitive version of the itemname, because changing your example's itemnames from starting with "test_" to "TEst_" gives the same sequence, but other strings have a huge chance of producing a different 1st item.

If you need to /hsave then /hload your buckets=1 table, the list ends up in the reverse of its created order the next time you /hload, because it looks like /hload begins at the front of the file and /hadd's the info, continuing to the end of the file.

To get your /hload'ed data to be in the original 'correct' order, you'd need to either /hload + /hsave + /hfree + /hload, or /hload it into a dummy table and clone it from there.

To fix your issue would probably require a new switch for /hadd and /hload to append new items instead of inserting them. I see the -a switch is available.
Thanks for your bug report. This is by design. Items in hash tables are stored in a completely random order. That is how hash tables work. No switch can be added to make them save in any particular order.
Yes - this is how hash tables work (hence the term "hash"). An explanation of how hash tables work follows - but see the solution at the end.

A hash table works by creating a number of slots (which you can specify). The item name is fed into a hashing algorithm to select which slot to put an item on. Each slot is then a linked list of items and your new item is inserted at the very beginning of the linked list of the slot that the hash algorithm selects.

When you want to search for the data by item name, mIRC uses the same algorithm to choose the slot and then iterates down the linked list to see if it can find the item you want.

When you do a positional $hget for the first item, mIRC gives you the first item in the linked list on the first slot. But since it is unlikely that the first item you added hashed to the first slot, you won't get that - instead you get the most recently added item whose hash points to the first occupied slot.
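To make the mechanism concrete, here is a toy sketch in C. It is my own illustration under the assumptions described above, not mIRC's actual source, and the hash function here is deliberately arbitrary:

Code:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

#define NSLOTS 101

typedef struct item {
    char name[32];
    struct item *next;
} item;

static item *slots[NSLOTS];

/* stand-in hash: anything itemname-dependent will do for the demo */
static unsigned slot_for(const char *name) {
    unsigned h = 0;
    while (*name) h = h * 31 + (unsigned char)*name++;
    return h % NSLOTS;
}

/* like /hadd: pick a slot, insert at the HEAD of its linked list */
static void add(const char *name) {
    unsigned s = slot_for(name);
    item *it = malloc(sizeof *it);
    strcpy(it->name, name);
    it->next = slots[s];
    slots[s] = it;
}

/* like $hget(table,n).item: walk the slots in order, and each slot's
 * list from its head, counting items until the n'th is reached */
static const char *nth(int n) {
    for (int s = 0; s < NSLOTS; s++)
        for (item *it = slots[s]; it; it = it->next)
            if (--n == 0) return it->name;
    return NULL;
}

int main(void) {
    char buf[32];
    for (int i = 1; i <= 1000; i++) {
        sprintf(buf, "test_%d", i);
        add(buf);
    }
    /* prints the head of the first occupied slot: the LAST item that
     * hashed there, almost certainly not test_1 */
    printf("first positional item: %s\n", nth(1));
    return 0;
}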

The default number of slots if you do not specify it explicitly is 100, which actually results in 101 slots because mIRC always makes the number of slots an odd number because of the way that the hashing algorithm works.

The surprise (for 101 slots) is that none of the 211 items added after test_789 hashed to that same slot, when you might expect that to have happened a lot sooner.

Now, if you have followed my explanation you might have spotted two things:

1. Performance: If you have too few slots for the number of items, then a lookup by item name will pick the slot quite fast, but will then have to iterate down the linked list to find the wanted item, which could be a lengthy CPU operation if the linked list is a long one.

In general terms, the more slots you have, the shorter the average length of the linked list.

(However you cannot guarantee that the items will be distributed evenly over the available slots. The worst case scenario is that by chance all the item names hash down to the same slot and you get a single very long linked list on that slot and empty linked lists on the other slots. But given a reasonably random set of item names, usually they will be reasonably evenly distributed across slots.)

I have not benchmarked the performance of a random lookup (with random item names) as the number of slots varies - but my guess is that the best performance will come when the number of slots is close to the number of items in the table. Khaled may have done some benchmarking when he suggests using a number of slots about 1/10 the number of items, but it seems to me that would mean linked lists between (say) 5 and 20 items long - whereas with the number of slots close to the number of items, the linked lists should only be (say) 1 to 5 items long. Wikichip suggests that the optimum number of slots is 1.282x the number of entries.

Each slot only takes a very small number of bytes - far smaller than the bytes required to store the items themselves - so in most cases it seems to me that there is little downside to using a relatively large number of slots. (The maximum number of slots is 10,000 (i.e. actually 10,001), which uses 40K of memory - not a huge amount in today's PCs, so I tend not to worry too much about excess slots.)

HOWEVER, that applies only to lookup by item name. $hfind also allows you to search using the item name as a wildcard match against text, or as wildcard text matching the item name, or using the item name as a regex to match against text, or as regex text to match the item name. Since there seems to be no way to use a hash to determine which slot such an item might be on, it seems likely that mIRC has to iterate across all linked lists on all slots to complete the find, in which case the performance is unlikely to be affected by the number of slots. Ditto when you do any of the above searches on the data rather than the item.

So for $hfind which iterates across all entries regardless of number of slots, you might as well use 1 slot.

But if you are using $hget to do a hash lookup based on a specific item, then you should use as large a number of slots as you think you need.

2. Sequential access: If you have digested all of the above, then you may have realised that you can make a hash table store items sequentially - so that you can iterate over them as the original poster would like - by using -m1 to create a single-slot hash table and therefore a single linked list. Since each item is added to the beginning of the list, the sequence they are added in is preserved (but reversed) when you iterate using $hget(table,n). In other words, you can either store them in reverse order or access them in reverse order:

Code:
//hfree TEST | //var %i = 1 | while (%i <= 1000) { hadd -m1 TEST test_ $+ %i 1 | inc %i } | echo -a Item 1: $hget(TEST,1).item Item 1000: $hget(TEST,1000).item
//hfree TEST | //var %i = 1000 | while (%i) { hadd -m1 TEST test_ $+ %i 1 | dec %i } | echo -a Item 1: $hget(TEST,1).item Item 1000: $hget(TEST,1000).item
Originally Posted By: maroon
To fix your issue would probably require a new switch for /hadd and /hload to append new items instead of inserting them. I see the -a switch is available.

The reason mIRC inserts at the beginning of the linked-list is that it is an easy, low-cost thing to do. To insert entries at the end of the list, you would either need to iterate all the way to the end of the list, which would kill performance, or double the number of bytes for each slot in order to hold a pointer to the end of the list (which on today's modern PCs may not be that much of an issue).
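For what it's worth, keeping that one extra pointer per slot is the textbook fix, making appends O(1). A hypothetical sketch (the names are mine, not mIRC's):

Code:
typedef struct node { struct node *next; /* item name/data ... */ } node;

typedef struct slot {
    node *head;   /* existing pointer: start of the linked list */
    node *tail;   /* the extra pointer: end of the list */
} slot;

/* O(1) append, preserving creation order within the slot */
void append(slot *s, node *n) {
    n->next = NULL;
    if (s->tail) s->tail->next = n;   /* link after the current last node */
    else s->head = n;                 /* list was empty */
    s->tail = n;
}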
P.S. I have no idea how to calculate the number of slots to get the best performance from a hash table with N entries. But the following analogy suggests that a significantly higher number of slots should be used than anyone has previously suggested.

Q: How many slots are needed for a hash table with 23 entries in order to have a probability > 50% that every entry will be in a separate slot?
A: 366

I know this sounds improbable, but the reverse of this is a well known mathematical puzzle:

Q: How many students do you need in a class to have a probability that two or more share a birthday > 50%?
A: 23
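Both answers come from the same product; in LaTeX notation (a sketch of the standard argument, for M items and N slots):

Code:
% probability that M items land in M distinct slots out of N:
P = \prod_{k=0}^{M-1} \frac{N-k}{N}
  = \frac{N}{N} \cdot \frac{N-1}{N} \cdots \frac{N-M+1}{N}
% with N = 365, M = 23 this is about 0.493, i.e. a 50.7% chance of a
% shared birthday -- the classic puzzle; the slot question above is the
% same product read in the other direction.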

Obviously, the overhead to iterate over two or five or even ten entries on a slot is not going to be that significant, and the mathematics is much more complicated, but the moral seems reasonably clear: if you are doing hash table look-ups by item name, then using a large number of slots might help performance a lot without using a huge amount of memory.

OTOH if you are iterating using $hget(table,n) or using $hfind - both of which need to iterate over the entire table - you might as well use -m1 for a single slot and save the memory overhead of a hash table which will never be used.
Thanks for the reply, Protopia. A few points about yours and Khaled's posts.

1. Some readers may be confused because you use "slot" with the same meaning as /help uses "buckets" - just as other references may use the term 'key' in the way /help uses 'item'.

2. Yes, you can place items in forward order by /hadd'ing them in reverse order, which is what my example showed. But that only holds for the initial creation; as my examples above showed, each time you /hsave and /hload the table, the item order gets flipped. So you either need to do the /hsave + /hload twice, or use a temp table to un-flip the order.

3. Khaled posted:

Quote:

Items in hash tables are stored in a completely random order.


They are shuffled into buckets in a way that appears random, especially if the item names are not similar to each other. However they're not truly random because the same combination of item names and buckets always results in them being in the same sequence.

Code:
//hfree -sw test | hmake -s test 100 | var %i 1 , %a | while (%i isnum 1-999) { hadd test item $+ $base(%i,10,10,3) data | inc %i } | var %n 1 | while ($hget(test,%n).item) { var %a $sha1(%a $v1) | inc %n } | echo -a hash of item sequence %a



This creates a unique sha1 hash based on the sequence in which the 999 items are distributed into buckets/linked-chains. Repeating the same command always results in the identical sort order. I'm sure this is because it's best to have the number of buckets and the number of items be relatively prime to each other, and being an odd number is deemed close enough. The default 100 behaves like N=101 buckets, which is a prime number. The max 10000 behaves like 10001, which is not prime; the largest prime <= 10001 is 9973.

This alias also shows that buckets=even_number and buckets=even_number+1 always produce the same order. There's precedent for the order changing between versions, as the order in 6.35 is different than it currently is.

The allocation of items to buckets doesn't appear to be a sophisticated algorithm, or even something simple like $crc(). It's possible to have combinations of item-count and buckets which result in items not appearing to be well shuffled at all, which you wouldn't expect from something that's trying to pseudo-randomly assign items to buckets. For example, for most of this list it simply swaps an even item with even+1:

Code:
//clear | hfree -sw test | hmake -s test 101 | var %i 20 | while (%i) { hadd test item $+ %i data | dec %i } | var %N 1 | while ($hget(test,%N).item) { echo 4 -a $ord(%N) item is $hget(test,%N).item | inc %N }



Based on the name of the identifier, I'm wondering if $hash() is used when assigning buckets to each item, since it's very prone to having identical 24-bit hashes to similar strings.

Quote:

No switch can be added to make them save in any particular order.


Having a switch for /hload which begins with the last pair of lines in the file then works its way forward would simulate such a switch, though I suppose there might be a performance hit for doing so. Either that, or people would be forced to script their own fake /hload alias which would use $read() and /hadd to load the disk file into the same order in which the items were /hsave'ed.
Quote:
They are shuffled into buckets in a way that appears random, especially if the item names are not similar to each other. However they're not truly random because the same combination of item names and buckets always results in them being in the same sequence.

Well, from the perspective of the scripter, they are random. You should not depend on items in hash tables being stored in any particular order. As this same topic has been discussed so many times, I will be adding a note to the help file.

Thanks for your comments everyone.
Originally Posted By: maroon
Based on the name of the identifier, I'm wondering if $hash() is used when assigning buckets to each item, since it's very prone to having identical 24-bit hashes to similar strings.

maroon: Yes, $hash(,32) is used for hash tables.

Also, I wasn't able to reproduce successive /hload and /hsave operations reversing the ordering of the items as they appear in the file. I might not be understanding it right.

Code:
//var %id = testing.hlist | hmake %id | hadd %id a a | hadd %id b b | hadd %id c c | hsave %id %id | loadbuf -api %id | hfree %id | hmake %id | hload %id %id | remove %id | hsave %id %id | loadbuf -api %id | remove %id | hfree %id
Quote:
maroon: Yes, $hash(,32) is used for hash tables.

Hash tables use a different hashing algorithm. They may have used the same method long ago, though. I haven't tested either in a long time. Neither is likely to achieve good results with SMHasher.

The algorithm for $hash() cannot be changed since it has been in use for so long and its results may be relied on by scripts.

The algorithm for hash tables could be changed. The current algorithm was chosen at the time because it was simple, fast, and "good enough".
Raccoon, your example didn't use buckets=1, which is the only case where items remain in a sequential pattern in the order they were created.

Code:
//var %id = testing.hlist | hmake %id 1 | hadd %id a a | hadd %id b b | hadd %id c c | hsave %id %id | loadbuf -api %id | hfree %id | hmake %id 1 | hload %id %id | remove %id | hsave %id %id | loadbuf -api %id | remove %id | hfree %id



It appears the physical order on disk depends on the number of buckets. Buckets=100 leaves the order unchanged. Buckets=1 flips the order. But in this next example, it looks like most of the other bucket counts generally keep the same order, except that most items are flipped with nearby items.

Code:
//var %buckets 5 , %id test.dat | hfree -w %id | hmake -s %id %buckets | var %i 10 | while (%i) { hadd %id item $+ %i 1 | dec %i } | hsave %id %id | filter -fs %id *item* | hfree %id | hmake -s %id %buckets | hload %id %id | hsave %id %id | filter -fs %id *item*



Quote:

Khaled wrote

from the perspective of the scripter, they are random. You should not depend on items in hash tables being stored in any particular order.


I wasn't trying to depend on items being in a specific shuffled order when buckets is greater than 1. I was saying that repeating the same combination of buckets and names/number-of-items produces the identical pattern each time, and that the example with a small quantity of similarly named items showed the order doesn't appear very random. But yes, different mIRC versions shuffle differently. For the code in my 2nd post, or changing the last code in my 1st post to use buckets=101, the current version and 6.35 shuffle in different orders from each other, but each version consistently shuffles the same things the same way each time.

I had incorrectly omitted a "not" in my previous post: it should read "especially if the item names are not similar". When changing the buckets from 1 to 100 in the 2nd example of my 1st post, the items look like they're not shuffled very well at all: 8,9 6,7 4,5 2,3 1... with the 5-length and 6-length item names being shuffled as separate groups.

The similar item names appear to be causing the lack of shuffling, as $hash() does not produce very "random-looking" output. The article on $hash at wikichip shows some of the problems with $hash output, including poor distribution of output values, output values heavily dependent on input length, and changing 1 bit of the input changing very few bits of the output.

https://en.wikichip.org/wiki/mirc/identifiers/$hash

Bob Jenkins has a few varieties of hashes which are pretty fast and might be a good replacement for $hash(). I've created aliases for his "Lookup3" and "One at a time" hashes at https://mircscripts.net/WHkiR and there is C source code around for both functions. The 1-at-a-time is probably the better choice for bucketing hash items, as Lookup3 handles bytes in groups of 12 and might be better suited to longer strings - though it's hard to know without benchmarking them.
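For reference, the one-at-a-time function is short enough to quote in full. This is my transcription of the widely published C version (worth checking against Jenkins' own page before relying on it):

Code:
#include <stdint.h>
#include <stddef.h>

/* Bob Jenkins' one-at-a-time hash: one add plus shifts/xors per byte,
 * then a final mix to finish avalanching the last few bytes */
uint32_t one_at_a_time(const unsigned char *key, size_t len) {
    uint32_t h = 0;
    for (size_t i = 0; i < len; i++) {
        h += key[i];
        h += h << 10;
        h ^= h >> 6;
    }
    h += h << 3;
    h ^= h >> 11;
    h += h << 15;
    return h;
}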

$hget won't tell us whether these items are getting placed into the same bucket or just happen to be in nearby buckets. If it's the latter, this is probably fine, since the goal of the hash assignment is to spread items across buckets in a smooth frequency distribution and to quickly find the bucket where something was already placed. But if the current hashing is not doing a good job of evenly distributing 10k items into 1k buckets, especially with short input strings similar to each other, then perhaps a different hash algorithm is needed.
Quote:
Bob Jenkins has a few varieties of hashes which are pretty fast, and might be a good replacement for $hash().

Thanks for the suggestion. I performed some tests to check hashing algorithm speed/distribution, with different lengths/types of data, eg. short/long strings, numbers, different table sizes, etc. The current hashing algorithm distributes okay in some situations but is awful in others. I tested different algorithms, eg. lookup3, DJB2, FarmHash, xxHash, MurmurHash, FNV-1a, etc. and have settled on FNV-1a for simplicity. In all of the above tests, it worked just as well as the more complex algorithms and had good distribution. It is difficult to say how this change will affect hash table performance in general - most overhead is due to the scripting language and other features. But in specific contexts and with certain types of data, it could make a big difference. This will be in the next beta.
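For readers following along, 32-bit FNV-1a really is tiny. Below is a sketch using the published reference constants; the bucket-selection line reflects this thread's assumption (hash % buckets), not confirmed internals:

Code:
#include <stdint.h>
#include <stddef.h>

/* FNV-1a, 32-bit: xor in each byte, then multiply by the FNV prime */
uint32_t fnv1a_32(const unsigned char *key, size_t len) {
    uint32_t h = 2166136261u;           /* offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= key[i];
        h *= 16777619u;                 /* FNV prime */
    }
    return h;
}

/* assumed bucket selection, per the discussion later in this thread */
unsigned bucket_for(const unsigned char *key, size_t len, unsigned nbuckets) {
    return 1 + fnv1a_32(key, len) % nbuckets;
}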
Leaving aside the issues of interpreted scripting having inherent overhead, I suspect that the biggest overhead with hash tables is iterating over the items in the slot (or if you are indexing by position or using $hfind iterating over the whole table).

Clearly the fastest lookup is a hash lookup where every occupied slot has only a single entry, avoiding iteration. In mathematics there is a well-known "birthday problem" which says that it only takes 23 people for the probability that 2 or more of them share a birthday to be > 50%. This is equivalent to saying that for a hash table of 23 items, you would need 366 slots for the probability of every slot having at most one item to be > 50%. On the other hand, since the maximum 10,000 slots only takes 40KB of memory, on modern computers it is not that big a problem to use that value. The only downside is that in a non-hashed lookup you are going to be iterating over a lot of empty slots, which has its own overhead.

Alternatively, it is easy to calculate the probability that, for a table of M items and N slots, a given slot is left empty: ((N-1)/N)^M. It turns out that at about 1.444 slots per item, a given slot has a 50% probability of being used.
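The 1.444 drops out of the exponential approximation (a sketch, in LaTeX notation):

Code:
P(\text{a given slot is empty}) = \left(\frac{N-1}{N}\right)^{M} \approx e^{-M/N}
% setting this to 1/2:
e^{-M/N} = \tfrac{1}{2} \iff \frac{N}{M} = \frac{1}{\ln 2} \approx 1.4427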

So my advice would be:

a. If you are only doing non-hashed lookups, use a slots value of 1.

b. If you are only doing hashed lookups, then use a slots value of 10,000.

c. If you are doing both hashed lookups and non-hashed lookups use slots = estimated-items * 1.444.

BUT...

If we are looking to optimise non-hashed table lookups, then my suggestion would be to allow the scripter to use flags to choose an alternative internal table implementation:

a. An array of pointers to the item/data sequenced on item-name using binary search to locate the item.
b. An array of pointers to the item/data sequenced on data using binary search.
c. Use a. as an alternative to the current implementation for the list for each slot.
d. As previously suggested, enhance the existing implementation by storing the first and last items of the linked list in each slot (using 8 bytes instead of 4) and add new items to the end of the list instead of the beginning - so that slots=1 stores the table exactly sequentially.

And aside from the overheads of adding or deleting an item from three lists (and the memory of course), you could even allow hashing, and the above two additional methods to be used in combination.

This would allow mIRC to have an arsenal of optimisation techniques to improve the speed of non-hashed lookups. For example a wild-match on item which doesn't have */? as the first character could be optimised to use 2 binary searches to find the beginning and end of the section of the sequenced array that holds the relevant items, and then search only within that section. Or a search with items as the wild-match could do something similar to look for table entries that start with */?/first character and search only those parts of the table.

As far as I can see, this would be fully functionally backwards and forwards compatible, and scripts written for the new functionality would only have a performance issue on older mIRC versions.

What do people think?
There are no plans to make any other changes to hash tables at this time. The only change I will be making is to the hashing algorithm. This change will be in the next beta.
Khaled - That's fine - I know there is a big difference between a small performance tweak to a hashing algorithm and a significant functional enhancement. And I don't have any specific need for this either - it was simply an observation.

But it might be a nice if you could bear it in mind for a future enhancement.
Thanks. FNV looks fast because it's so simplistic: just 1 mul and 1 xor per byte. CPU multiply isn't as slow relative to add/shift as it used to be, which helps FNV compared to One-At-a-Time, which avoids MUL by using add/shift instead. I would expect Jenkins' Lookup3 to be slower, because it processes bytes in 12-byte blocks, so it wastes a lot of effort on item names whose lengths aren't at or just below a multiple of 12.

Does the switch to FNV mean we'll have a new $fnv() identifier too? If so, would it be limited to the 32-bit variant to avoid big-number arithmetic? It looks like FNV-32 multiplies to a product as large as 56 bits, but FNV-64 would have a product up to 104 bits. I know some mIRC internal functions are able to have accuracy better than the 2^52 limit in $calc().
Quote:

It is difficult to say how this change will affect hash table performance in general - most overhead is due to the scripting language and other features.


Compared to the no-beta 7.52, beta 1897 benchmarked slower in this test. The speed difference is around 30% at buckets=100, but the speeds are much closer at both greater and fewer buckets, so I don't see how this can be blamed on fnv1a. The speed of the prior beta is in-between the newest beta and the no-beta.

7.52 1897 buckets 1 = hadd 29079 ms hget 28673 ms
7.52 1897 buckets 2 = hadd 21699 ms hget 23245 ms
7.52 1897 buckets 11 = hadd 11544 ms hget 13058 ms
7.52 1897 buckets 51 = hadd 5740 ms hget 6677 ms
7.52 1897 buckets 100 = hadd 4290 ms hget 5569 ms
7.52 1897 buckets 10000 = hadd 2886 ms hget 4180 ms

7.52 1438 buckets 100 = hadd 3837 ms hget 4992 ms

7.52 buckets 1 = hadd 29469 ms hget 28860 ms
7.52 buckets 2 = hadd 21778 ms hget 21933 ms
7.52 buckets 11 = hadd 10655 ms hget 11965 ms
7.52 buckets 51 = hadd 4306 ms hget 5164 ms
7.52 buckets 100 = hadd 3323 ms hget 4228 ms
7.52 buckets 10000 = hadd 2730 ms hget 3666 ms

6.35 buckets 1 = hadd 179573 ms hget 175438 ms
6.35 buckets 100 = hadd 4602 ms hget 5881 ms
6.35 buckets 10000 = hadd 3073 ms hget 4119 ms

Code:
alias hadd_test {
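  ; usage: /hadd_test <buckets> - times 99,999 /hadd's, then 99,999 $hget lookups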
  hfree -w hashtest | hmake hashtest $1
  var %j 99999 , %i %j , %ticks $ticks
  while (%i) {
    hadd hashtest item $+ %i
    dec %i
  }
  var %msg hadd $calc($ticks - %ticks) ms , %ticks $ticks
  while (%j) {
    var %a $hget(hashtest,item $+ %j)
    dec %j
  }
  echo -a $version $beta buckets $hget(hashtest).size = %msg hget $calc($ticks - %ticks) ms
}
Quote:
I don't see how this can be blamed on fnv1a

Oh the irony :-) It turns out that it is in fact due to fnv1a. I reverted to 1438 and just added the fnv1a hashing algorithm to it to reproduce the issue. It took some effort to narrow it down, and some performance profiling.

The older hashing algorithm distributes data unevenly across buckets in the hash table, so some buckets have far fewer/greater items in them than others, however it is distributing individual items better. At least, with certain combinations of data. With some combinations of data, it leaves many buckets empty.

The fnv1a algorithm distributes items more evenly across all buckets, however, due to its lack of avalanching, more similar items are being stored in the same buckets. This means that when a hash matches a bucket and it has to be searched, even though there are fewer items in it (because they're more evenly distributed across buckets), the string comparison call takes longer as it is matching more characters per item because they are more similar, even though there are fewer string comparisons.

Unfortunately, every other hashing algorithm I have tried, eg. murmur3, xxhfast32, lookup3, takes more time to create the hash, at least for short strings. So it looks like we may just have to stick with the older algorithm.
Originally Posted By: Khaled

Oh the irony :-) The older hashing algorithm distributes data unevenly across buckets in the hash table, so some buckets have far fewer/greater items in them than others, however it is distributing individual items better. At least, with certain combinations of data. With some combinations of data, it leaves many buckets empty.

The fnv1a algorithm distributes items more evenly across all buckets, however, due to its lack of avalanching, more similar items are being stored in the same buckets. This means that when a hash matches a bucket and it has to be searched, even though there are fewer items in it (because they're more evenly distributed across buckets), the string comparison call takes longer as it is matching more characters per item because they are more similar, even though there are fewer string comparisons.

All I can say is that you have put a lot of work into analysing the impact of this proposed change - and we (the mIRC users) are lucky to have someone as dedicated and diligent maintaining this tool.
Posted By: Wims Re: /hadd does not save items in correct order - 19/10/18 02:59 PM
What about adding a new switch to go with -m, to specify the name of the hashing algorithm to be used? Or maybe just a switch to specify the fnv1a algorithm without supporting any other if they are not so great.
Originally Posted By: Wims
What about adding a new switch to go with -m, to specify the name of the hashing algorithm to be used? Or maybe just a switch to specify the fnv1a algorithm without supporting any other if they are not so great.

I am sure that Khaled could do this if it genuinely made sense, however:

1. Unless the user has a way of knowing their data in advance, and also a way of determining which hash algorithm gives the best spread, how would they have any way to determine which flag to use? You would then have to provide an identifier for each hashing algorithm specifically so a user could pre-analyse their data to determine the best one - and then Khaled would no longer have any freedom to innovate to improve hash table performance.

2. The performance gains are, it seems, fairly marginal - at least when compared to using a binary-search algorithm instead of iterating down the entire linked-list of a slot. So if we are going to look for performance (and functional) improvements, my vote would be to spend the energy on that rather than flags to allow a user to choose an alternative hashing algorithm.

P.S. If we are looking for ways to improve hash table performance (other than binary search), then a couple of ideas off the top of my head...:

a. If not already in the code, other performance gains could be made by caching the results of the last $hget(table,n) or $hfind(table,needle,n) (and trashing the cache if /hadd is used), so that when you make a subsequent call for the same search but for a later position n+m, you start from where you left off rather than from the beginning of the table. (Though this would have the oddity that an incrementing loop would be far faster than a decrementing loop.)

b. Another possibility for massive performance improvements, which literally just occurred to me, would be to get mIRC to use the 3D graphics chip parallel processing in most modern PCs to do non-hashed lookups. (Though that is probably way too much programming effort to be worthwhile.)
Originally Posted By: Wims
What about adding a new switch to go with -m, to specify the name of the hashing algorithm to be used? Or maybe just a switch to specify the fnv1a algorithm without supporting any other if they are not so great.


It would be good and interesting to have the opportunity to specify the algorithm you want to use for hashing, but I "think" it will take a decade to be added because it will be very hard.

Example:

On /hadd and /hmake:

-aN = If N is not given, use the default. If N = 1 then use X algorithm, if N = 2 then Y algorithm, etc...
Posted By: Wims Re: /hadd does not save items in correct order - 19/10/18 04:19 PM
Quote:
I am sure that Khaled could do this if it genuinely made sense
Well, he said that the current algorithm was not great in some cases; that's where the other algorithm might perform better. Hash tables are almost never used with unknown data: you typically associate hardcoded, known strings with unknown data, but you also typically know the kind of data you're using, and that's what you would (typically) use to know which 'flag' to use - no need to pre-analyse.

Your suggestion may not be bad but this thread was mainly about the way items are distributed, which is related to the algorithm, so my suggestion seemed fair.
Originally Posted By: Wims
Your suggestion may not be bad but this thread was mainly about the way items are distributed, which is related to the algorithm, so my suggestion seemed fair.
Whilst this thread was initially about why hashing means you lose the sequence in which you stored items, it very quickly became a discussion about performance.

I was not suggesting that you shouldn't make a suggestion about a flag to allow the scripter to choose one of several hashing algorithms, just that if you are going to do this then you need to:

a. Give the scripter the tools to analyse the data in order to determine the best algorithm to use for their data; and

b. At the moment Khaled can change the hashing algorithm to improve performance without breaking any scripts. If you give the user the choice / tools to analyse, then you are creating another set of constraints that have to be met in the future to maintain backwards compatibility; and

c. A different hashing algorithm might make small differences to performance for hashed lookups - but as I attempt to show below, a lot of lookups are unhashed; both hashed and unhashed lookups have to iterate through linked lists, which is pretty slow and therefore has scope for significant performance improvements; and in a lot of cases scripts run the same $hget(table,n) or $hfind lookup multiple times, which again has scope for significant performance improvements. IMO such improvements could dwarf those gained by tweaking the hashing algorithm.

Originally Posted By: Khaled
The older hashing algorithm distributes data unevenly across buckets in the hash table, so some buckets have far fewer/greater items in them than others, however it is distributing individual items better. At least, with certain combinations of data. With some combinations of data, it leaves many buckets empty.

The fnv1a algorithm distributes items more evenly across all buckets, however, due to its lack of avalanching, more similar items are being stored in the same buckets. This means that when a hash matches a bucket and it has to be searched, even though there are fewer items in it (because they're more evenly distributed across buckets), the string comparison call takes longer as it is matching more characters per item because they are more similar, even though there are fewer string comparisons.

Unfortunately, every other hashing algorithm I have tried, eg. murmur3, xxhfast32, lookup3, takes more time to create the hash, at least for short strings. So it looks like we may just have to stick with the older algorithm.

Analysing my own scripts, I find that I use hash lookups when I am storing data by nickname or channel - so the item names are relatively short strings - but much more often I want to store requests in a FIFO queue (using $hget(table,n)) or to perform wildcard matches (using $hfind), both of which are non-hashed lookups.

So I think Khaled is right to look at the cost of calculating the hash for smaller strings (like nicks or channels) rather than longer ones, BUT the cost of the hash algorithm needs to be weighed against the benefit of having more evenly sized lists to iterate over, rather than looked at alone.

There are two different scenarios when the hashing algorithm is uneven:

a. It still spreads items over all slots, but the lengths are uneven;
b. As a. but worse so that some slots are left completely empty.

In a., the average length of the lists is the same as if the items were evenly spread, but... because the number of hits on a particular slot is proportional to the length of its list, and the time taken to iterate over the list is also proportional to the length of the list, the performance hit is likely to grow with the square of the variation in length (see the sketch below).

In b., the above is exaggerated because the lists in empty slots are never searched.

So, the greater the variance in distribution, the bigger the performance hit and the more you can afford to increase the cost of the hashing algorithm.
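To put a rough formula on that "square" (my sketch of the standard expected-cost argument, in LaTeX notation, not anything measured in mIRC):

Code:
% s slots, n items, list lengths L_1..L_s with \sum_i L_i = n.
% A successful lookup lands in list i with probability L_i/n and walks
% about L_i/2 nodes, so:
E[\text{cost}] \approx \sum_{i=1}^{s} \frac{L_i}{n}\cdot\frac{L_i}{2}
  = \frac{1}{2n}\sum_{i=1}^{s} L_i^{2}
  = \frac{n}{2s}\left(1 + \frac{\sigma^{2}}{\mu^{2}}\right)
% where \mu = n/s and \sigma^2 are the mean and variance of the lengths:
% the penalty over an even spread grows with the squared relative variance.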

I haven't looked at the specifics of the current hashing algorithm compared to the alternatives, but MurmurHash3 is the one people seem to recommend as being both even and fast.

HOWEVER... as I suggested above, I am not at all sure that we are focusing efforts on the areas where performance improvements will have a real-life benefit...

HASH TABLE SIZE
It seems to me that we should be less concerned (though definitely not unconcerned) about significant performance changes for small hash tables, which have quick lookups - if, say, a 2ms lookup increases to 5ms or decreases to 1ms, is a 1ms-3ms change going to be noticeable to anyone in a real-life application (as opposed to a somewhat artificial benchmark)?

Instead we should try to optimise for lookups on large tables which could take a significant time - a 50% reduction in a lookup on a huge hash table which takes 0.5s to run might well be noticeable by the user.

ITERATION - EXACT MATCH
As I pointed out before, iterating through a large linked list has a big cost. For hashed (i.e. exact match $hget) lookups, at least you are iterating only through one slot containing only a small portion of the data - yes, you want to optimise this by having an evenly distributed hash function, but the time needed for a hash lookup in a table with n items and s slots is O(n/s).

But for a sequential $hget(table,n) lookup, and for all $hfind lookups, you are iterating over the entire table, O(n), and there is a small overhead for iterating across slots as well as within the linked list for each slot. So, focusing on the functionality with the worst performance, it seems to me that unhashed lookups have the greatest scope for performance gains.

To put this in context:

* an iterative search is O(n), so an iterative $hfind search of a 10,000 entry table is going to take between an average of 5,000 iterations if it is matched and 10,000 iterations if unmatched; but
* a search through a self-balancing binary tree is O($log2(n)), so a similar search using one of these would take c. 13 iterations to either find or not find a match (and 10,000 to 13 iterations is a 99.8% reduction).

You might even find that an exact match lookup on a binary tree is as fast in practice as on a hash lookup because you spend the time looking rather than calculating the hash.

ITERATING - NON-EXACT MATCH
For wildcard / regex matches, you can't use a binary search directly and you still have to iterate. But if the first character is not a wildcard character (* or ?), you can take the string up to the first wildcard and use the binary tree search to find the first and last possible nodes you need to iterate over - so instead of iterating over the whole tree, i.e. 10,000 iterations, you use 13 iterations to find the starting node and 13 to find the ending node, perhaps leaving 100 nodes to iterate over, for a total of 126 iterations - still a saving of almost 99%.
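A sketch of that prefix trick over a sorted array of item names (hypothetical C; here the start is found with one binary search and the scan simply stops when the prefix no longer matches, though the end could equally be found with a second binary search):

Code:
#include <string.h>

/* first index in sorted names[0..n-1] whose entry compares >= the prefix */
static int lower_bound(char **names, int n, const char *prefix, size_t plen) {
    int lo = 0, hi = n;
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (strncmp(names[mid], prefix, plen) < 0) lo = mid + 1;
        else hi = mid;
    }
    return lo;
}

/* for a pattern like "test_*", only entries starting with "test_" can
 * match, so scan just that contiguous range instead of the whole table */
void wild_scan(char **names, int n, const char *prefix) {
    size_t plen = strlen(prefix);
    for (int i = lower_bound(names, n, prefix, plen);
         i < n && strncmp(names[i], prefix, plen) == 0; i++) {
        /* run the full wildcard/regex match against names[i] here */
    }
}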

MIXED OPERATIONS
As stated in a previous post, you can even maintain multiple ways to access the data at the same time:

a. -m1 - turn hashing off (as now)
b. -i - turn on binary tree for item name (as well as hash)
c. -d - turn on binary tree for data (as well as hash)
d. -o - use a binary tree per hash slot instead of a linked list - for -m1 this gives you a table ordered by item name rather than by the sequence in which items were added - and this could even be the default for >1 slots, since order is not defined in that situation.

Yes - more overhead to maintain the multiple structures when you add items, but since adding an item requires a search to see if the item already exists (currently an iteration), then overall the performance adding items could still end up better.

CACHED ITERATION
When you use $hget(table,n) or $hfind, you will often do so in an iterative loop. In these situations (i.e. finding the n'th wildcard match or the n'th sequential item), you can cache the last search parameters and position, and provided the next call asks for the n'th or greater item, you can continue searching from the point you previously reached rather than counting from the beginning. This would also e.g. give you quick access to the data for the item you just found.
Originally Posted By: Khaled
Originally Posted By: maroon
I don't see how this can be blamed on fnv1a

Oh the irony :-) It turns out that it is in fact due to fnv1a.



I was trying to say the slowdown can't be blamed on the speed of using $hash() vs the speed of using $fnv1a(), because - as you said - the time spent hashing was a small portion of the task done by /hadd or $hget(table,item). FNV1a should be faster than $hash, because there's not much room for $hash to be quicker than FNV1a's 1 xor + 1 mul per byte, as long as the MUL isn't being calculated using floating point math.

I had seen the lack of avalanching in FNV1a, where the hash of 'a' through 'z' produces hashes where the top 20 bits rarely change. However, in spite of this, the hash MOD 101 looks like it does a good job of assigning them to a variety of buckets.

Quote:

string a fnv1a: E40C292C bucket 11
string b fnv1a: E70C2DE5 bucket 23
string c fnv1a: E60C2C52 bucket 19
string d fnv1a: E10C2473 bucket 100
string e fnv1a: E00C22E0 bucket 96
string f fnv1a: E30C2799 bucket 7
string g fnv1a: E20C2606 bucket 3
string h fnv1a: ED0C3757 bucket 47
string i fnv1a: EC0C35C4 bucket 43
string j fnv1a: EF0C3A7D bucket 55
string k fnv1a: EE0C38EA bucket 51


However, FNV1a is supposed to have a low collision rate (different strings producing the same hash), which is a property that $hash() doesn't have. It's easy to find many such colliding strings for $hash():

Code:
//echo -a $base($hash(abcdef,32),10,16) same as $base($hash(abddee,32),10,16)


Can you confirm how mIRC was assigning items to buckets prior to the change to FNV1a? Judging by the order in which $hget(table,N).item sorts the items, I would expect $calc(1+($hash(itemname,32) % 101)) to increase sequentially as you work your way through that list from the first N to the last.

I do see the newest beta sorting items in the $hget(table,N).item order based on this MOD calculation if it uses the FNV1a-32 hash. But it doesn't work for me in the older versions when I substitute $hash(item,32) in place of FNV1a.

With items in 1897 assigned to buckets 1-101 using (1+(FNV1a(itemname) % 101)) I wouldn't think the lack of avalanche causes the hash of similar strings to have a bias toward being the same value MOD 101.

However I still see evidence that FNV1a isn't responsible for the delay shown by hadd_test:

1. If the earlier /hadd_test alias is changed so that the identical portion of the string is lengthened and moved to the tail end, any possible FNV1a-vs-$hash() speed difference is still a small portion of the overhead time, and the identical tail gives plenty of time for the avalanche to happen, making it even more unlikely that similar strings get placed into the same bucket and cause the slowdown. I still see the same 30% slowdown when the alias changes the item names from $+(item,%i) to $+(%i,evenlongeritemname), where the 18 extra characters surely give the avalanche a chance to spread the slight differences in input strings into diverse hashes, even less likely to share remainders MOD 101. In my alias, ALL the item names were short and similar, so there should be minimal time difference between distinguishing item12345/item12445 and distinguishing item12345/item12355. Surely any problem with avalanching should have been even worse with $hash:

Code:
//var %bits 32 , %a $hash(mIRC,%bits) , %b $hash(mIRD,%bits) | echo -a $base(%a,10,2,32) bucket $calc(%a % 101) | echo -a $base(%b,10,2,32) bucket $calc(%b % 101)



This example changed only 1 bit of the output, and it's not difficult to find many such examples. But since $hash(anything,32) always returns the lowest 8 bits as zeroes, dividing by 101 will usually shuffle them to buckets far apart. Even if the hashing algorithm were $hash(string,24) where the changed bit is the last bit, the sequential output would have assigned them (MOD 101) to different buckets, and I would expect FNV1a to do even better.

2. As my prior post showed, I was seeing much more of the slowdown between the 7.52 no-beta and 1438 than between betas 1438 and 1897. Since 1897 shuffles $hget(table,N).item into a different sequence than the no-beta does, but 1438 uses the same sort order as the no-beta, the slowdown in 1438 can't be blamed on FNV1a.

3. If the slowness were caused by FNV1a shuffling more similar strings into the same bucket, that could explain the increase in time required by $hget(table,item) to locate the item within the bucket, but wouldn't explain the extra 30% time to /hadd them into the bucket - unless there's also some kind of bucket indexing that requires extra time when strings within 1 bucket are too similar.

I've not yet replicated the poor bucket shuffling you're seeing. For example, when I take the FNV1a hash of 'a' through 'z' then MOD by 101, I'm finding these 26 hashes assigned to 26 different buckets.

-

FYI Background

Below is an alias that matches the test vectors at the FNV site. It would be a lot faster using the assembler code at that site which doesn't need to use extra calculations to evade the 2^52 limit of floating point math. The quote near the top of the post is created by:

Code:
//var %i $asc(a) | while (%i isnum 1-122) { var %string $chr(%i) | echo -a string %string fnv1a: $fnv1a-32(%string,h) bucket $calc(1+($fnv1a-32(%string) % 101)) | inc %i }



Code:
alias fnv1a-32 {
  noop $regsubex(temp,$1,,,&temp) | var %len $bvar(&temp,0) , %i 0 , %hash 2166136261
  while (%i < %len) {
    !inc %i
    !var %hash $xor(%hash,$bvar(&temp,%i))
    !var %hash $calc(( (%hash % 256) * 16777216 + %hash * 403) % 4294967296 )
  }
  if (h isin $2) return $base(%hash,10,16,8) | return %hash
} ; by maroon 2018
; If not for the 2^52 limit, the MUL could have been $calc((%hash * 16777619) % 4294967296)
; because the bits of the product above 2^32 aren't needed. $fnv1a-32(string,[h]) h=hash as hex


Below is how I see (FNV1a MOD 101) used in beta 1897 to assign items to buckets, though I can't see the same thing in older versions using $hash(item,32). Regardless of whether the item names are random or similar, $hget(table,N).item in 1897 lists them in bucket order regardless of the order of creation. Similar input strings seem no more likely to be assigned to the same bucket than randomly different ones.

Code:
//hfree -sw test | hmake -s test 101 | var %i 20 , %N 1 | while (%i) { hadd test $base($rand(1,4503599627370495),10,36) | dec %i } | while ($hget(test,%N).item) { echo 4 -a $ord(%N) item is $v1 bucket: $calc(1+($fnv1a-32($v1) % 101)) | inc %N }

Thanks for your comments, unfortunately I have already spent a lot of time testing this. As you know, the performance of a hash table depends on a combination of factors, including hash table size, number of items added, hash algorithm, whether the table sizes needs to be prime/odd/even, item lengths, item content, and so on. You cannot test these in isolation. In your tests, FNV1A performed better/worse at certain table sizes. That is simply how it worked out for the type and amount of data you used. Remember, no other hash table code was changed, just the hash algorithm. I may in future add a switch to the /hmake command that allows you to set the hash algorithm. However, for now, the beta has reverted to the older hash algorithm.
In the /hadd_test alias above, the 2230 beta is back to the 1438 beta's performance, which is better than 1897's but still noticeably slower than the no-beta. Again, I don't believe the differences in performance were caused by the choice of hash function, because they affected the time needed to /hadd the items as well as to $hget() them later, and I'm assuming the choice of hash wouldn't be responsible for the /hadd delay.

* Regarding the change to forcing the bucket count to be a prime: would you consider either treating N=1 as a prime, or adding a new switch that lets the hash table keep the requested bucket count (other than perhaps requiring it to be odd, as before)?

Buckets=1 is now not permitted for those wishing to keep items in the original order, because N=1 is now bumped up to buckets=3.

* Could you change the max from 10000 to 10007?

It now forces N=9994 through 10000 to use 10007 buckets. Raising the maximum to 10007 would prevent "hmake table2 $hget(table1).size" from being an error when the .size prop reports 10007 buckets. I'm not familiar enough with hash tables to know whether the largest prime below 2^14 (16381) would be a more appropriate max.

Code:
//hfree -sw test | hmake -s test 101 | var %i 20 , %N 1 | while (%i) { hadd test $base($rand(1,4503599627370495),10,36) | dec %i } | while ($hget(test,%N).item) { echo 4 -a $ord(%N) item is $v1 bucket: $calc(1+($fnv1a-32-mod($v1) % 101)) | inc %N }



As for FNV1a being modified, I found a modification mentioned at Brett Mulvey's website, and when using the FNV1a-32-Mod alias below, the calculated buckets increase as you cycle through the sequential $hget(table,N).item's. When I use FNV1a-32 in place of FNV1a-32-mod in this alias, that's the same order as the items were sorted in beta 1897.

However, when I substitute $hash($v1,32) in place of $fnv1a-32-mod($v1), the calculation of the bucket-number doesn't increase, indicating that unmodified $hash() isn't the method used to assign items to buckets.

* Was the $hash() identifier intended to be the same method used to assign items to buckets, or was it intentionally modified for use with hash tables? I'm mentioning this in case this might be a source of uneven assignments to buckets in the past.

* If FNV ends up making the cut, would we get some kind of $fnv() identifier to wean people away from using $hash()? It could either be a new identifier, or a new parameter where 0 is the default old hash, while 1 and 2 would be fnv1a with/without the modification. True, it can be simulated, but it is significantly slower as an alias.

Code:
alias fnv1a-32-mod {
  noop $regsubex(temp,$1,,,&temp) | var %len $bvar(&temp,0) , %i 0 , %hash 2166136261
  while (%i < %len) {
    !inc %i
    !var %hash $xor(%hash,$bvar(&temp,%i))
    !var %hash $calc(( (%hash % 256) * 16777216 + %hash * 403) % 4294967296 )
  }
  var %hash $calc((%hash * 8193) % 4294967296)
  var %hash $calc(($xor(%hash,$calc(%hash /128))    *    9) % 4294967296)
  var %hash $calc(($xor(%hash,$calc(%hash /131072)) *   33) % 4294967296)
  if (h isin $2) return $base(%hash,10,16,8) | return %hash
} ; by maroon 2018
; If not for the 2^52 limit, the MUL could have been $calc((%hash * 16777619) % 4294967296)
; because the bits of the product above 2^32 aren't needed. $fnv1a-32(string,[h]) h=hash as hex
; is identical to original FNV1a except adding the following operations after the string is hashed.
;  hash += hash << 13; (same as hash * 8193)
;  hash ^= hash >> 7;
;  hash += hash << 3; (same as hash * 9)
;  hash ^= hash >> 17;
;  hash += hash << 5; (same as hash * 33)
As I have said previously, there is nothing wrong with discussions about how to make the hashing algorithm work more efficiently (either in terms of spread across buckets or speed), but these will have relatively minor impacts on hash table speed.

If we are going to expend effort on speeding up hash tables, then surely it makes sense to focus that effort on whatever area will speed up hash tables the most.

Linked-list iteration is probably THE SINGLE MOST EXECUTED ALGORITHM in mIRC's hash tables, and is the one which contributes by far the most to slow hash table performance for large hash tables because it is O(n).

It is used in /hadd, /hdel and $hget(table,item) to iterate over a single bucket to find an item or check whether it exists, and in $hget(table,n).item and $hfind to iterate over the entire table to count or find an item.

So, if we are looking for significant real-world performance improvements for large hash tables (as opposed to small ones, as measured in artificial workloads / benchmarks) then shouldn't we be looking at how to avoid iterating over linked lists?
Your code reminded me: do you think $calc() would benefit from built-in bit shifts and other bitwise operators?
Quote:
In the /hadd_test alias above, the 2230 beta is back to the performance of the 1438 beta for it, which is better than 1897 was, but is still noticeably slower than the no-beta was.

This is actually due to item 42 in beta.txt. The memory change required to handle this results in a small slowdown in scripted hash tables.

Quote:
Again, I don't believe the differences in performance were caused by the choice of hash function, because it was affecting the time needed to /hadd the items as well as $hget()'ing them later, and I'm assuming the choice of hash wouldn't be responsible for the /hadd delay.

I have come to accept that you will not change your mind on this :-)

Quote:
Regarding the change to forcing the bucket count to be a prime. Would you consider either allowing it to consider N=1 to be a prime, or allowing a new switch that allows the hash table to not be forced away from the requested bucket count?

Internally, the hash table has always been forced away from the requested bucket count ie. it was always made odd. This cannot be changed. $hget().size always returned the original requested size to minimize dissonance :-) Now it returns the actual size. As for N=1, see next answer.

Quote:
It now is not permitting buckets=1 for those wishing to keep items in the original order, because N=1 is now flipped to use buckets=3.

If I do this, this means that all future optimizations I make will need to cater for this behaviour, even though it is not meant to be a feature of hash tables. This is why I keep stating that you should not expect hash tables to preserve order.

In fact, one of the optimizations I was going to make, which repositions items closer to the top of the list when they are retrieved to make future retrieval faster, would undermine order even in a one slot hash table. I had planned to do this for the next beta.
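That repositioning is the classic "move-to-front" heuristic on a linked list. A hypothetical sketch of the idea (again, not mIRC's source):

Code:
#include <string.h>

typedef struct node { char name[32]; struct node *next; } node;

/* on a successful search, unlink the found node and relink it at the
 * head of its slot, so recently used items are found faster next time --
 * which is also exactly why positional order would no longer hold */
node *find_mtf(node **head, const char *name) {
    node *prev = NULL;
    for (node *it = *head; it; prev = it, it = it->next) {
        if (strcmp(it->name, name) == 0) {
            if (prev) {                 /* not already at the head */
                prev->next = it->next;  /* unlink */
                it->next = *head;       /* relink at the head */
                *head = it;
            }
            return it;
        }
    }
    return NULL;
}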

Quote:
As for FNV1a being modified, I found a modification mentioned at Brett Mulvey's website, and when using the FNV1a-32-Mod alias below, the calculated buckets increase as you cycle through the sequential $hget(table,N).item's. When I use FNV1a-32 in place of FNV1a-32-mod in this alias, that's the same order as the items were sorted in beta 1897.

Yes, it is using FNV1a-32-mod that performs better avalanching, the same as your script below.

Quote:
Was the $hash() identifier intended to be the same method used to assign items to buckets

$hash() has nothing to do with hash tables.
Originally Posted By: Khaled
Originally Posted By: maroon
It now is not permitting buckets=1 for those wishing to keep items in the original order, because N=1 is now flipped to use buckets=3.

If I do this, this means that all future optimizations I make will need to cater for this behaviour, even though it is not meant to be a feature of hash tables. This is why I keep stating that you should not expect hash tables to preserve order.

In fact, one of the optimizations I was going to make, which repositions items closer to the top of the list when they are retrieved to make future retrieval faster, would undermine order even in a one slot hash table. I had planned to do this for the next beta.

1. To change how the linked_list is ordered will break backwards compatibility for any scripts which rely on order.

2. To not allow slots = 1 will break backwards compatibility for any scripts which rely on slots = 1 being ordered e.g. for queuing / stacks.

3. If you move an item in the linked-list, then you will also change the position in iterable use so e.g. $hget(table,11).item will suddenly become $hget(table,1).item, and this will break backwards compatibility.

4. In real-world use of hash tables, is there any evidence that moving the most recently accessed item to the beginning of the linked list will improve performance?

As an alternative, for each table how about caching the last $hget / $hfind parameters and linked-list location and intelligently deciding whether you can use those to speed things up.

In the real world you commonly get both item and data, so $hget(table,n).item followed by $hget(table,n).data would be faster because you could go straight to the cached linked-list location. Similarly, it is common to iterate through the table with $hget(table,n).item or $hfind, so $hget(table,n).data followed by $hget(table,n + m).data would be faster because you could start iterating from the point you reached last time. The same would work for $hfind.

You might also want to separately cache the parameters and results of any $hget(table,0).item or $hfind(...,0), so that loops which keep asking for these values would not need to iterate the whole list every time they are called.

These cached parameters / locations / results would be invalidated any time you add, change or delete records in the hash table.

These optimisations would work for all loops which iterate forwards, i.e.:
Code:
var %i 1 | while (%i <= $hget(table,0).item) { echo $hget(table,%i).item = $hget(table,%i).data | inc %i }
var %i 1 | while ($hget(table,%i).item) { echo $v1 = $hget(table,%i).data | inc %i }
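
(As an aside, scripters can already get part of the $hget(table,0).item caching benefit themselves by hoisting the count into a variable rather than re-evaluating it on every pass; a minimal sketch in the same style:)

Code:
var %n = $hget(table,0).item, %i = 1 | while (%i <= %n) { echo $hget(table,%i).item = $hget(table,%i).data | inc %i }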


Personally, I tend to code differently, i.e.:
Code:
var %i $hget(table,0).item | while (%i) { echo $hget(table,%i).item = $hget(table,%i).data | dec %i }

and these optimisations would not work for any descending loop. But they would be unlikely to slow them down noticeably, and for loops where the sequence is unimportant, recoding to take advantage of performance improvements would be simple.

Indeed, the only use case where you couldn't recode is when you want to access items in the exact sequence you added them with slots=1. Because mIRC currently adds to the beginning of the linked list, it presumably holds no pointer to the last item (and each item may only have a forward pointer), so decreasing indexes would be more difficult to optimise. If you wanted to optimise decreasing loops, you would need an /hmake or /hadd flag which tells mIRC to hold a pointer to the end of each linked list and a backward pointer in each item, so that the list can be iterated in both directions. This would let mIRC decide, e.g. for $hget(table,n).item followed by $hget(table,n - m).item, whether to iterate backwards over the linked list from the cached location or to iterate from scratch from the beginning. (That choice could be based, e.g., on whether n - m is closer to 0 or to n, but I suspect the scripter might want to influence it, so we would need syntax for that.)

All that said, in the real world I suspect that hash tables used e.g. as a FIFO queue - where you want to iterate in the sequence added, or access e.g. $hget(table,$hget(table,0).item).data - would tend to be relatively small compared with other types of table where you want to iterate but don't care about sequence, so personally I wouldn't worry about optimising decreasing loops.
P.S. I am unclear what the benefit is of forcing the number of slots to be a prime number.

If (and it is a big if) the low-order part of the hash value is truly evenly distributed, then taking it modulo any bucket count should also give an even distribution.

Of course, that "if" may well not be true.

However again I would argue that we should not be bothering with changing the slots number unless we can demonstrate significant real-world performance improvements.
Quote:
I am unclear what the benefit is of forcing the number of slots to be a prime number.

As you say, the distribution may not be even; that is why.
Quote:
1. Changing how the linked list is ordered will break backwards compatibility for any scripts which rely on order.

Which is why I have stated many times that you should not depend on the order of hash table items.

Quote:
2. Not allowing slots = 1 will break backwards compatibility for any scripts which rely on slots = 1 being ordered, e.g. for queues / stacks.

Again, same as above.

Quote:
3. If you move an item in the linked list, then you will also change its position for iterated access, so e.g. $hget(table,11).item will suddenly become $hget(table,1).item, and this will break backwards compatibility.

Yes, as above again.

Quote:
4. In real-world use of hash tables, is there any evidence that moving the most recently accessed item to the beginning of the linked list will improve performance?

It is an optimization that is generally recommended to speed up access for items that are retrieved more often than others.

If I decide to make my hash table implementation return items in a specific order in particular contexts, this means that I will need to support this in all future hash table improvements, or if/when I change to a new/better hash table implementation. That simply is not practical. Especially since I have repeatedly said, many times, you should not depend on the hash table preserving order since that is not a normal feature of hash tables.
Quote:
Linked-list iteration is probably THE SINGLE MOST EXECUTED ALGORITHM in mIRC's hash tables

There are no plans to make any changes to the core hash table routines, so this will continue to be a linked list hash table. For now, I am simply optimizing the hash table code. That is all.
Originally Posted By: Khaled
Quote:
I am unclear what the benefit is of forcing the number of slots to be a prime number.

As you say, the distribution may not be evenly distributed, that is why.

True, but will a prime modulus be better than any other modulus at distributing the items amongst the slots?

I can see why an even modulus might give a bigger bias than an odd modulus, e.g. if the least significant bit of the hash was always 0, but I cannot see why a prime modulus > 2 would give any better distribution than any odd modulus > 2.
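
To illustrate the even-modulus half of that, here is a small made-up demonstration in which every hash value has its least significant bit clear (all values even): with 8 buckets only buckets 0/2/4/6 are ever hit, while 7 buckets still reaches all of them:

Code:
//var %i = 1 | while (%i <= 16) { var %h = $calc(%i * 2) | echo -a hash %h goes to bucket $calc(%h % 8) of 8, bucket $calc(%h % 7) of 7 | inc %i }

Note this only shows even vs. odd; it says nothing about prime vs. non-prime odd, which is the open question.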
Quote:
True, but will a prime modulus be better than any other modulus at distributing the items amongst the slots?

That is the general theory.
Originally Posted By: Khaled
Originally Posted By: Protopia
1. Changing how the linked list is ordered will break backwards compatibility for any scripts which rely on order.

Which is why I have stated many times that you should not depend on the order of hash table items.

In the real world, people have relied on this order even though you stated in some places (though NOT in the help file) that they should not rely on it. So in the real world this change will break scripts.

Originally Posted By: Khaled
Originally Posted By: Protopia
4. In real-world use of hash tables, is there any evidence that moving the most recently accessed item to the beginning of the linked list will improve performance?

It is an optimization that is generally recommended to speed up access for items that are retrieved more often than others.

I can see no reason why you shouldn't move it to the first position in the slot if slots > 1. The cost of doing this is very small, though I am unsure just how often it will produce noticeable benefits in the real world (i.e. how often a script will do an $hget(table,item) for the same item in quick succession).

Originally Posted By: Khaled
If I decide to make my hash table implementation return items in a specific order in particular contexts, this means that I will need to support this in all future hash table improvements, or if/when I change to a new/better hash table implementation. That simply is not practical. Especially since I have repeatedly said, many times, you should not depend on the hash table preserving order since that is not a normal feature of hash tables.

The only thing I am suggesting is that the existing functionality for ordered recall with slots=1 is retained. I am not proposing any changes which will require items to be delivered in any sequence if slots > 1 - and indeed in a previous post I pointed out that users like me and others should NOT try to box you in to e.g. specific hash routines which would then restrict your ability to improve code, so we are agreed on this.

However depending on implementation, alternative search mechanisms enabled by flags might in practice deliver items in a sequence which people then come to rely on (like slots = 1 did) - and if you ever decide to implement these, you should bear this in mind. That said, if such a sequence is intrinsic to how the items are stored, then it would be unlikely that any future changes would impact that, and perhaps you could accept those restrictions.

It is of course your choice as to what you decide to implement - but in my defence I would say that all of my suggestions have been designed to be simple to implement and to preserve backward compatibility whilst delivering significant performance improvements for large hash tables.

(I do not have any code which needs large hash tables, and therefore I am not seeing any performance issues even with non-hashed access, so I am personally not that worried about whether you implement them or not. My suggestions are only for the benefit of anyone who does use large hash tables, and particularly anyone who uses unhashed access to such tables.)
P.S. Has anyone looked at Paul Hsieh's SuperFastHash ( http://www.azillionmonkeys.com/qed/hash.html ) or Bob Jenkins' even later and faster lookup3, both of which claim to be both fast and very evenly distributed?
Originally Posted By: Khaled
Quote:
True, but will a prime modulus be better than any other modulus at distributing the items amongst the slots?

That is the general theory.

I have been looking into this a little, and that theory seems to be dependent on:

a. Whether the hashing algorithm is one based on powers of a prime (often 31), in which case you don't want your modulus to be a multiple of that prime; and

b. Hash tables which probe with a stride when a slot is already occupied (rather than keeping a linked list per slot, as mIRC does).

Neither of these seem to apply in this discussion.
Yes, I have already researched this. A prime can be used as a precaution in case the hash function is not reliable. Thanks for looking into it though.
Do you have a table of pre-calculated primes, or is it discovering them on demand? And is there a potential for slow-down when working with 600,000+ item lists?
No, there is no point in using a table of pre-calculated primes in this situation as it has little effect on performance.
Quote:
Has anyone looked at Paul Hsieh's SuperFastHash ( http://www.azillionmonkeys.com/qed/hash.html ) or Bob Jenkins' even later and faster lookup3, both of which claim to be both fast and very evenly distributed?

I tested these as well as a large number of other hash functions. I even downloaded and compiled SMHasher and performed tests. When testing the hash tables in mIRC using direct calls to the hash table routines, FNV1a-mod was better in both distribution and speed than any of the hashes tested so far. It has its flaws, e.g. zero sensitivity, but that is unlikely to be an issue in practice. It's also satisfyingly simple :-) That is why I have chosen it.
Originally Posted By: Protopia
Originally Posted By: Khaled
Originally Posted By: Protopia
1. To change how the linked_list is ordered will break backwards compatibility for any scripts which rely on order.

Which is why I have stated many times that you should not depend on the order of hash table items.

In the real world, people have relied on this order even if you stated that they should not rely on it in some places (though NOT in the help file). So in the real world this change will break scripts.

Khaled - Upon consideration I would actually go further than I did above and say that there is a pressing need for scripters to be able to store unlimited data in a sequenced list.

Think about the difference between Python dicts (access by name) vs. lists (access by position), or JavaScript Objects (access by name) vs. Arrays (access by position). If these languages have gone to the trouble of providing entirely different ways of storing these two types of data, there is clearly a strong need in any language to provide both.

If you take away the current implementation whereby -m1 is sequenced, then you will leave mSL without any means of storing large amounts of sequenced data.
Posted By: Wims Re: /hadd does not save items in correct order - 04/11/18 03:12 PM
Quote:
If you take away the current implementation whereby -m1 is sequenced, then you will leave mSL without any means of storing large amounts of sequenced data.
Although it doesn't support binvars, a (hidden) @window is an ordered list data structure; this is what scripters should be using if they want an ordered list. If you need one or more items in the list to be a binvar, you could mark those lines specially: if your data is known and never starts with &, for example, you can name the line &item, where item is an item name in a known hash table holding the binvar (you could even put the table name in the line to make it generic), and get a working list. The performance is not the same, but it can be done. It is true that mIRC lacks data structures, specifically arrays, but it is also true that hash tables are unordered; it is only unfortunate that the implementation, plus this not being documented properly, resulted in some people relying on it for order.
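
For illustration, a minimal sketch of the @window approach (the window name is made up): /aline appends, /iline inserts at a given position, /dline deletes, and $line reads lines back in order, which is exactly what a hash table cannot promise:

Code:
//window -h @list | aline @list first | aline @list second | iline @list 2 inserted | echo -a total: $line(@list,0) lines, line 2 is $line(@list,2) | window -c @list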

Breaking the buckets=1 behavior is not nice; ideally a new set of identifiers and commands like /hmake2 could be added to preserve the current hash table behavior, but that's not practical in itself. I'm for the change though: let's have a proper unordered associative array. But then, yes, array support is needed.
Originally Posted By: Wims
Although it doesn't support binvars, a (hidden) @window is an ordered list data structure; this is what scripters should be using if they want an ordered list.

1. No such thing as a hidden window - hidden windows still appear in the window list.

2. How does hidden window performance compare with hash tables -m1?
Posted By: Wims Re: /hadd does not save items in correct order - 04/11/18 04:02 PM
Hidden is just a state of the window: it's faster to write to a hidden window, and you don't need such a window to be visible if it's used as a list data structure, which is what I meant.
I don't know about the performance.
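
For what it's worth, a rough way to measure this is a $ticks comparison of the two approaches. A hypothetical sketch (the table and window names are made up, $ticks is coarse, and results will vary by machine and item count):

Code:
//var %t = $ticks, %i = 1 | while (%i <= 10000) { hadd -m1 bench item $+ %i 1 | inc %i } | echo -a buckets=1 hash: $calc($ticks - %t) ms | hfree bench | var %t = $ticks, %i = 1 | window -h @bench | while (%i <= 10000) { aline @bench item $+ %i | inc %i } | echo -a hidden @window: $calc($ticks - %t) ms | window -c @bench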
Originally Posted By: Wims
It is true that mIRC lacks data structure, specifically array, but it's also true that hash table are unordered, it's only unfortunate that the implementation + it not being documented properly resulted in some people relying on this to get order.

I think if you look back over the years, a lot of architecturally and aesthetically poor decisions have been made in the interests of preserving backwards compatibility. Why should this be different?

Originally Posted By: Wims
Breaking the buckets=1 behavior is not nice;

Exactly. Backwards compatibility is a watchword of mIRC's history.

Originally Posted By: Wims
Ideally a new set of identifiers and commands like /hmake2 could be added to preserve the current hash table behavior, but that's not practical in itself.

How about a new set of commands and identifiers for a new, better implementation of hash tables, keeping the existing ones for backward compatibility - or, as previously proposed, use the same commands and identifiers but with additional hmake flags to define alternative internal processing.

Originally Posted By: Wims
let's have a proper unordered associative array.

We already have an unordered associative array - the hash table. Aside from performance issues (which, as I have already said several times, are down to the use of a linked list per bucket far more than to a slow or uneven hashing algorithm), what we have is fully functional.

We also have a proper, reasonably well performing ordered array solution for small amounts of data - tokenised strings.
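
For example, a quick token-string sketch using comma (character 44) as the delimiter: $addtok appends, $instok inserts at a position, and $gettok can return a single token or a slice in one call:

Code:
//var %i = 1, %list | while (%i <= 5) { var %list = $addtok(%list,item $+ %i,44) | inc %i } | var %list = $instok(%list,new,3,44) | echo -a count: $numtok(%list,44) slice 2-4: $gettok(%list,2-4,44)

The size limit referred to is mIRC's maximum line/variable length, which is why this only suits small amounts of data.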

What we don't have are:

1. A well performing ordered array solution that is not size limited.

2. Ability to do as JSON does and nest associative or ordered tables inside one another.

My wish list for an ordered list equivalent to existing hash tables would be:

a. Commands to insert and sort.
b. Nice to be able to slice and to transfer slices to / from token strings in a single statement without coded loops.
c. Nice to have ability to keep it auto sorted by data values.
d. Would be nice to have mixed associative / ordered tables which could be positional or auto-sorted on item OR data.

In terms of implementation, might be worth referring to the ancient IBM VSAM "database" which (if memory serves me) worked as follows:

1. The index was a linked list of fixed-size array blocks, each of which could hold up to a certain number of pointers to data (or name / data) records. (Each block held links to the next and previous blocks, and a counter of how many records were in the block.) Deleting a record meant moving the pointers down one; inserting a record meant moving them up one. If a block was full, it would be split into two blocks, each half full. There was also a command to reorganise the index if you had a lot of almost-empty blocks, in order to make all the index blocks x% full.

2. Whether you inserted by position or allowed it to determine position based on a sort order was a user choice.

3. Aside from the need to move (on average 1.5k of) memory up or down by a few bytes every time you delete or insert, this would be relatively well performing.

For positional access, the correct block would be found by iterating over the linked list of blocks and summing the number of items in each block. Once you have the right block, the memory position within the block can be calculated. Since you can probably store c. 1000 pointers in a 4k block, this would be c. 1000x faster than iterating over a linked list of single items (as with a single-bucket hash list). (Of course they don't have to be 4k blocks - you could use any block size (say 1k or 16k) depending on the performance characteristics you were looking for.)

For auto-sorted access, you find the right block by looking at the last entry in each block, and can then use a binary search within the correct block to find the exact value - so also pretty well performing (see the sketch after this list).

Wildcard searches might still need to iterate over the whole table, but leading non-wild characters could allow you to subset which parts of the table to iterate over.

4. This algorithm can be used not only at a table-wide level, but you can also use it at a bucket level for buckets >=3 to store records in either item or data sequence and so do a binary search rather than an iteration to find the correct item.

NOTE: Since HADD has to do an HGET in order to determine whether to replace an existing record or add a new one, the performance improvements would likely apply to most types of hash table command / identifier.

5. Subject to the overhead of storing and maintaining the indexes, you could actually combine hashed, item-sorted, data-sorted and positional access methods on a single table, using hmake flags to indicate which types of access to create internally.

The only new commands needed would be hinsert and hsort.

6. This would create extended functionality with significantly better performance and WITHOUT losing backwards compatibility.
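
To illustrate the binary-search idea from point 3 above, here is a minimal, hypothetical sketch against a sorted hidden @window standing in for one sorted block; it assumes the lines hold numeric values in ascending order (the alias and window names are made up):

Code:
alias findsorted {
  ; binary search the sorted numeric @window named in $1 for the value $2
  ; returns the matching line number, or 0 if not found
  var %lo = 1, %hi = $line($1,0)
  while (%lo <= %hi) {
    var %mid = $int($calc((%lo + %hi) / 2))
    if ($line($1,%mid) == $2) { return %mid }
    elseif ($line($1,%mid) < $2) { var %lo = $calc(%mid + 1) }
    else { var %hi = $calc(%mid - 1) }
  }
  return 0
}

Usage: //window -h @nums | aline @nums 3 | aline @nums 7 | aline @nums 12 | echo -a found at line $findsorted(@nums,7) | window -c @nums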
Originally Posted By: Wims
Hidden is just a state of the window: it's faster to write to a hidden window, and you don't need such a window to be visible if it's used as a list data structure, which is what I meant.
I don't know about the performance.

I am currently writing scripts which need unlimited sequenced data - I might try a hidden window technique rather than a buckets=1 hash table.
Posted By: Wims Re: /hadd does not save items in correct order - 04/11/18 05:30 PM
Quote:
How about a new set of commands and identifiers for a new better implementation of hash tables, keeping the existing ones for backward compatibility
That's what I meant; I said it in the sense that a new set would allow the old one to be preserved.
Originally Posted By: Wims
Quote:
How about a new set of commands and identifiers for a new better implementation of hash tables, keeping the existing ones for backward compatibility
That's what I meant; I said it in the sense that a new set would allow the old one to be preserved.

Yes - but even better would be to NOT create a competing set of commands / identifiers, which would then mean having to maintain both sets for evermore, but instead to leave the existing set as it is and extend it by use of flags (which is what has generally been done in the past).

But of course, all this is dependent on Khaled being willing to develop any changes, and I think he has made his position quite clear on what he is and isn't prepared to do at the present time.
Okay, I have extended the hash table code to allow custom options per table, so different rules can now be applied to different tables. For backwards compatibility, the next beta will allow single slot hash tables and will disable any optimizations that affect the order of items.
Originally Posted By: Khaled
Okay, I have extended the hash table code to allow custom options per table, so different rules can now be applied to different tables.

I would be very interested to hear details of these changes if you are willing to spend the time to describe them.

Originally Posted By: Khaled
For backwards compatibility, the next beta will allow single slot hash tables and will disable any optimizations that affect the order of items.

I think that this is a good decision.
I am not suggesting that another attempt to improve hash tables should be made now, but I just wanted to note for future consideration that Facebook has just open-sourced the C++ code for its F14 hash table, which is claimed to minimise hash collisions while being extremely fast.