Originally Posted By: Wims
Your suggestion may not be bad but this thread was mainly about the way items are distributed, which is related to the algorithm, so my suggestion seemed fair.
Whilst this thread was initially about why hashing means you lose the sequence in which items were stored, it very quickly became a discussion about performance.

I was not suggesting that you shouldn't make a suggestion about a flag to allow the scripter to choose one of several hashing algorithms, just that if you are going to do this then you need to:

a. Give the scripter the tools to analyse the data in order to determine the best algorithm to use for their data; and

b. At the moment Khaled can change the hashing algorithm to improve performance without any scripts breaking. If you give the user the choice / tools to analyse, then you are creating another set of constraints that you have to meet in the future to maintain backwards compatibility; and

c. A different hashing algorithm might make small differences to the performance of hashed lookups - but as I attempt to show below, a lot of lookups are unhashed, and both hashed and unhashed lookups have to iterate through linked lists, which is pretty slow and therefore has scope for significant performance improvements. In a lot of cases, scripts also run the same $hget(table,n) or $hfind lookup multiple times, which again has scope for significant performance improvements. IMO such performance improvements could dwarf those gained by tweaking the hashing algorithm.

Originally Posted By: Khaled
The older hashing algorithm distributes data unevenly across buckets in the hash table, so some buckets have far fewer or more items in them than others; however, it distributes individual items better. At least, with certain combinations of data. With some combinations of data, it leaves many buckets empty.

The fnv1a algorithm distributes items more evenly across all buckets, however, due to its lack of avalanching, more similar items are being stored in the same buckets. This means that when a hash matches a bucket and it has to be searched, even though there are fewer items in it (because they're more evenly distributed across buckets), the string comparison call takes longer as it is matching more characters per item because they are more similar, even though there are fewer string comparisons.

Unfortunately, every other hashing algorithm I have tried, eg. murmur3, xxhfast32, lookup3, takes more time to create the hash, at least for short strings. So it looks like we may just have to stick with the older algorithm.
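For anyone unfamiliar with it, FNV-1a is tiny, which is why it is so cheap on short strings. Here is a minimal sketch in Python (the real implementation would be in C, and the 101-slot table size is purely an illustrative assumption):

```python
def fnv1a_32(data: bytes) -> int:
    """32-bit FNV-1a: XOR each byte into the hash, then multiply by the FNV prime."""
    h = 0x811C9DC5                             # FNV-1a offset basis
    for b in data:
        h ^= b
        h = (h * 0x01000193) & 0xFFFFFFFF      # FNV prime, truncated to 32 bits
    return h

def bucket(item: str, slots: int = 101) -> int:
    """Map an item name to a hash table slot (slot count is an assumption)."""
    return fnv1a_32(item.encode()) % slots
```

The lack of avalanching Khaled describes comes from the single XOR-multiply step per byte: similar inputs diverge slowly, so similar item names can end up clustered together.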

Analysing my own scripts, I find that I use hashed matches when I am storing data by nickname or channel - so the item names are relatively short strings. But much more often I want to store requests in a FIFO queue (using $hget(table,n)) or to perform wildcard matches (using $hfind), both of which are non-hashed lookups.

So, I think Khaled is right to look at the cost of calculating the hash for smaller strings (like nicks or channels) rather than longer ones, BUT the cost of the hash algorithm needs to be weighed against the benefit of having more evenly sized lists to iterate over, rather than being considered on its own.

There are two different scenarios when the hashing algorithm is uneven:

a. It still spreads items over all slots, but the lengths are uneven;
b. As in a., but worse: some slots are left completely empty.

In a., the average list length is the same as if the items were evenly spread, but... because the number of hits on a particular slot is proportional to the length of its list, and the time taken to iterate over that list is also proportional to its length, the performance hit is likely to grow with the square of the variation in length.

In b., the above is exaggerated because empty slots are never searched, so every lookup lands on one of the longer lists.

So, the greater the variance in distribution, the bigger the performance hit and the more you can afford to increase the cost of the hashing algorithm.
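The squared effect is easy to demonstrate. A sketch in Python (the bucket counts below are invented purely for illustration): the average cost of a successful lookup is weighted by how likely a lookup is to land in each slot, which is itself proportional to that slot's length.

```python
def expected_lookups(bucket_lengths):
    """Average comparisons to find a random stored item: an item in a list of
    length L takes (L + 1) / 2 comparisons on average, and the chance of a
    lookup landing in that list is L / n - so cost scales with sum(L**2)."""
    n = sum(bucket_lengths)
    return sum(L * (L + 1) / 2 for L in bucket_lengths) / n

even   = [10] * 100              # 1,000 items spread evenly over 100 slots
uneven = [50] * 10 + [5] * 100   # the same 1,000 items, with 10 hot slots

even_cost = expected_lookups(even)      # 5.5 comparisons per lookup
uneven_cost = expected_lookups(uneven)  # 14.25 - much worse, same item count
```

Both tables hold the same number of items; only the spread differs, yet the uneven one averages roughly 2.6x the comparisons per lookup.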

I haven't looked at the specifics of the current hashing algorithm compared to the alternatives, but MurmurHash3 is the one people seem to recommend as being both even and fast.

HOWEVER... as I suggested above, I am not at all sure that we are focusing efforts on the areas where performance improvements will have a real-life benefit...

HASH TABLE SIZE
It seems to me that we should be less concerned (but definitely not unconcerned) about significant performance changes for small hash tables, which have quick lookups - if, say, a 2ms lookup increases to 5ms or decreases to 1ms, is a 1ms-3ms change going to be noticeable to anyone in a real-life application (as opposed to a somewhat artificial benchmark)?

Instead we should try to optimise for lookups on large tables, which could take a significant time - a 50% reduction in a lookup on a huge hash table which takes 0.5s to run might well be noticeable to the user.

ITERATION - EXACT MATCH
As I pointed out before, iterating through a large linked list has a big cost. For hashed (i.e. exact match $hget) lookups, at least you are iterating only through one slot containing only a small portion of the data - yes, you want to optimise this by having an evenly distributed hash function, but the time needed for a hash lookup in a table with n items and s slots is O(n/s).

But for a sequential $hget(table,n) lookup, and for all $hfind lookups, you are iterating over the entire table - O(n) - with a small additional overhead for iterating across slots as well as within each slot's linked list. So, on the basis of focusing on the functionality with the worst performance, it seems to me that unhashed lookups have the greatest scope for performance gains.

To put this in context:

* an iterative search is O(n), so an iterative $hfind search of a 10,000-entry table is going to take an average of 5,000 iterations if a match exists, and 10,000 iterations if not; but
* a search through a self-balancing binary tree is O($log2(n)), so a similar search using one of these would take c. 13 iterations to either find or not find a match (and 10,000 to 13 iterations is a 99.8% reduction).
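To sanity-check those numbers, a quick sketch in Python, using a sorted list and bisect as a stand-in for a self-balancing tree (both give the same O($log2(n)) search behaviour; the item names are invented for illustration):

```python
import bisect
import math

# A sorted "table" of 10,000 item names.
items = sorted(f"item{i:05d}" for i in range(10_000))

# Worst-case comparisons for a binary search over n sorted items:
n = len(items)
worst = math.ceil(math.log2(n))   # 14 for n = 10,000 (i.e. c. 13-14 iterations)

# An exact-match lookup touches only O(log2 n) entries, matched or not:
i = bisect.bisect_left(items, "item07500")
found = i < n and items[i] == "item07500"
```

Whether the match exists or not, the search bottoms out after the same handful of steps - which is where the 99.8% reduction over a 10,000-step scan comes from.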

You might even find that an exact-match lookup on a binary tree is almost as fast in practice as a hashed lookup, because you spend the time searching rather than calculating the hash.

ITERATION - NON-EXACT MATCH
For wildcard / regex matches, you can't use a binary search directly and you still have to iterate. But if the first character is not a wildcard character (* or ?), you can take the string up to the first wildcard and use the binary tree to find the first and last possible nodes you need to iterate over. So instead of iterating over the whole tree (10,000 iterations), you use 13 iterations to find the starting node and 13 to find the ending node, perhaps leaving 100 nodes to iterate over, for a total of 126 iterations - still a saving of almost 99%.
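The prefix trick can be sketched with a sorted list standing in for the binary tree (the item names and the "\uffff" upper-bound sentinel are assumptions for illustration, not how mIRC does it):

```python
import bisect

def prefix_range(sorted_names, pattern):
    """Binary-search for the range of names matching the literal prefix
    that precedes the first wildcard character in the pattern."""
    cut = min((pattern.index(c) for c in "*?" if c in pattern),
              default=len(pattern))
    prefix = pattern[:cut]
    if not prefix:                        # pattern starts with a wildcard:
        return 0, len(sorted_names)       # no choice but to scan everything
    lo = bisect.bisect_left(sorted_names, prefix)
    hi = bisect.bisect_right(sorted_names, prefix + "\uffff")
    return lo, hi

names = sorted(f"nick{i:04d}" for i in range(10_000))
lo, hi = prefix_range(names, "nick12*")   # narrows 10,000 names to nick1200..nick1299
```

Only the names inside [lo, hi) then need the full wildcard test; everything outside the range cannot possibly match the literal prefix.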

MIXED OPERATIONS
As stated in a previous post, you can even maintain multiple ways to access the data at the same time:

a. -m1 - turn hashing off (as now)
b. -i - turn on binary tree for item name (as well as hash)
c. -d - turn on binary tree for data (as well as hash)
d. -o - use a binary tree per hash slot instead of a linked list - with -m1 this gives you a table ordered by item name rather than by the sequence in which items were added - and this could even be the default for tables with more than one slot, since order is not defined in that situation

Yes - there is more overhead to maintain the multiple structures when you add items; but since adding an item requires a search to see whether the item already exists (currently an iteration), the overall performance of adding items could still end up better.

CACHED ITERATION
When you use $hget(table,n) or $hfind, you will often use these in an iterative loop. In these situations where you are iterating (i.e. to find the n'th wildcard match or the n'th sequential item), you can cache the last search parameters and position; provided that the next call asks for the n'th or a later item, you can then continue searching from the point you previously reached rather than counting from the beginning again. This would also, for example, give you quick access to the data for the item you just found.
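A sketch of what such caching could look like (the class and all names are mine, invented for illustration, not mIRC's internals):

```python
import fnmatch

class CachedFind:
    """Cache the last wildcard-scan position so that a request for the
    n'th-or-later match resumes where the previous scan stopped."""

    def __init__(self, items):
        self.items = items
        self._last = None                 # (pattern, n, index) of previous hit

    def find(self, pattern, n):
        start, count = 0, 0
        if self._last and self._last[0] == pattern:
            prev_n, prev_i = self._last[1], self._last[2]
            if prev_n == n:
                return self.items[prev_i]            # exact repeat: cached
            if prev_n < n:
                start, count = prev_i + 1, prev_n    # resume past last hit
        for i in range(start, len(self.items)):
            if fnmatch.fnmatchcase(self.items[i], pattern):
                count += 1
                if count == n:
                    self._last = (pattern, n, i)
                    return self.items[i]
        return None

q = CachedFind([f"nick{i}" for i in range(10_000)])
first = q.find("nick1*", 1)   # scans from the start to the first match
second = q.find("nick1*", 2)  # resumes just after the first match
```

In a typical `while $hfind(table,mask,%n,w)` style loop, each call would then cost only the distance to the next match instead of a fresh scan of the whole table.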

Last edited by Protopia; 19/10/18 11:43 PM.