Bowl of petunias (OP) | Joined: Dec 2005 | Posts: 2
Heyyas.
I've discovered a rather nasty limitation of hash tables that has turned into an annoyance.
Currently I have a hash DB with about 300k entries.
It grows by roughly 5,000 to 10,000 entries per day.
When using /hmake, you're limited to a maximum of 10,000, which I figure is some sort of precache that makes searches faster.
Now, $hfind(NameOfDB,*SomeWord*,0,w) (a wildcard search through the item names, not the actual data) takes about 5 seconds on a slow machine (1.2 GHz, 512 MB RAM) and about 4 seconds on a 64-bit 3700+ with 2 GB RAM.

So what I've concluded is that the slow search times are not due to hardware limitations.

What I'm wondering is whether hacking/bypassing/raising the hardcoded cap of 10,000 could help speed up searches, or whether I'll have to convert my entire DB to an SQL database rather than a hash table (which I've been quite satisfied with so far). Any thoughts?

Hoopy frood | Joined: Feb 2004 | Posts: 2,019
The reason $hfind is slow with wildcard/regex matches is that it can't take advantage of the hashing algorithm, which is the main reason hash tables are fast. All it can do is a linear scan through every item (or its data) looking for a match, whereas $hget (without the .item property) uses the hashing algorithm to transform the key and find it quickly in the correct bucket. If your hash table grows as fast as you say it does, a hash table will soon become inadequate, and you should look into using a real database, perhaps through a (my)SQL DLL or by accessing MySQL via COM; a forum search here will turn up both approaches.
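
If you want to see the difference for yourself, a rough (untested) sketch along these lines would do, assuming your table is called DataBase, comparing many keyed $hget lookups against a single wildcard $hfind scan:

Code:
; Rough sketch: many keyed lookups vs. one wildcard scan.
; Assumes a table called DataBase is already loaded.
alias CompareLookups {
  ; grab an item name that definitely exists
  var %known = $hget(DataBase,1).item
  ; time 10,000 keyed lookups of that item
  var %i = 1, %ticks = $ticks
  while (%i <= 10000) {
    var %dummy = $hget(DataBase,%known)
    inc %i
  }
  echo -a 10,000 x $!hget took: $calc($ticks - %ticks) ms
  ; time a single wildcard scan through all item names
  var %ticks = $ticks
  var %count = $hfind(DataBase,*test*,0,w)
  echo -a 1 x $!hfind found %count matches in: $calc($ticks - %ticks) ms
}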

Btw, I loved "Only built for Cuban linx".


Gone.
Bowl of petunias (OP) | Joined: Dec 2005 | Posts: 2
I understand that I will, eventually at least, have to change, but in a way I find a certain, should I say, 'leetness' in creating everything possible with mIRC script, to spite the people who go 'tcl rules!!! mIRC scripting is useless!!!'..

I decided to test a bit this way:

With the database of about 300,000 entries, I've tried to determine what the <NUMBER> in /hmake DataBase <NUMBER> actually does.

From my tests, the difference in search times is damn near unnoticeable; however, the time /hload DataBase <FileToLoad> takes is prolonged by an extreme amount with 1,000 slots compared to the maximum of 10,000.

So, SQL is the next step. Damnit.
Oh, used a simple alias for testing.

Code:
alias TestSearch {
  ; time a wildcard search through the item names
  var %ticks = $ticks
  var %hfind = $hfind(DataBase,*test*,0,w)
  var %timetosearch = $calc($ticks - %ticks)
  echo -a Results: %hfind :: Time to search: %timetosearch $+ ms.
}
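
For the /hload timing, something along these lines would do (the file name here is just a placeholder):

Code:
; Sketch: re-create the table with a given slot count and time the /hload.
; Usage: /TestLoad 1000  or  /TestLoad 10000
alias TestLoad {
  if ($hget(DataBase)) hfree DataBase
  hmake DataBase $1
  var %ticks = $ticks
  hload DataBase database.txt
  var %timetoload = $calc($ticks - %ticks)
  echo -a Slots: $1 :: Items loaded: $hget(DataBase,0).item :: Time to load: %timetoload $+ ms.
}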

Hoopy frood | Joined: Feb 2004 | Posts: 2,019
It's logical that /hload goes quicker on your big hash table when you specify a bigger N, because the (optional) N you give /hmake stands for the number of slots. Based on this number, mIRC initializes the hash table expecting around 10xN items. So if you specify 1,000 slots, it expects around 10,000 items, but since you have 300k items it has to allocate additional memory, which takes more time. With 10,000 slots it expects 100k items, so it still has to allocate extra memory, but less than in the 1,000-slot case, which explains why it goes faster.

A hash table isn't suited to this amount of data. Imagine each hash table entry holding around 500 bytes of data with 500k entries: that's roughly 238 MB of RAM (500,000 x 500 bytes). That's a lot of data to hold in memory in a hash table, which wouldn't be the case when using a real database.


Gone.
Sat (Hoopy frood) | Joined: Apr 2004 | Posts: 871
Quote:
Based on this number, mIRC initializes the hash table expecting around 10xN items. So if you specify 1,000 slots, it expects around 10,000 items, but since you have 300k items it has to allocate additional memory, which takes more time. With 10,000 slots it expects 100k items, so it still has to allocate extra memory, but less than in the 1,000-slot case, which explains why it goes faster.

I seriously doubt that /hmake preallocates memory for items. A more logical explanation is that with a smaller number of hash table slots there are more items per slot, each of which has to be compared against every item read from the file (to make sure the same item name isn't added to the hashtable twice). With 300k entries over 1,000 slots that's about 300 items per slot, so roughly 150 string comparisons per added entry on average, which is definitely slower than the roughly 15 you get with 10,000 slots.


Saturn, QuakeNet staff
Hoopy frood | Joined: Feb 2004 | Posts: 2,019
Doesn't creating a bucket mean memory has to be allocated, since the bucket will have to hold items?

I'd be surprised/disappointed if it works the way you say, because that would mean mIRC doesn't dynamically create extra slots during /hload when it sees that the data being loaded into the hash table is far more than what it expected when /hmake was issued.

Wouldn't $hget be slower then if you loaded a big file into a small hash table (fewer buckets = more items to compare per bucket)? Since the average number of items to traverse in a bucket when looking for a match is half the bucket's length, $hget should take much longer on a big file loaded into a small hash table than on the same file loaded into a big one, whereas the original poster mentioned there's no significant difference in $hget/$hfind, only in /hload (I haven't benchmarked it myself).

I would have expected that when you load a big file into a small hash table, the already assigned buckets would overflow, making mIRC create extra buckets; that's what I meant by allocating more memory (creating a bucket = allocating memory).

Of course, all of this is purely my reasoning, I know you're a comp sci student, so you're in a much better position to judge on this.


Gone.
Sat (Hoopy frood) | Joined: Apr 2004 | Posts: 871
In general, you don't "create" individual buckets.

A hashtable is basically an array of N buckets (N being the size from /hmake) where each bucket is just a pointer (4 bytes) to a linked list of items. In principle you could enlarge the number of buckets later on by resizing the array, but that means the hash values for items change: the bucket number for an item, i.e. the index into the array of pointers, is determined by the formula "hashfunc(itemname) modulo N". Changing N on the fly essentially means you have to reassign most of the items to other buckets, which is a nightmare with large hashtables. As a result, hashtables typically have a fixed size, and I'm sure mIRC works that way too.
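
mIRC doesn't expose its internal hash function, but $hash gives a feel for the idea; conceptually the bucket index comes from something like this (illustration only, not mIRC's actual internals):

Code:
; Conceptual only: $hash stands in for mIRC's real, unexposed internal hash function.
; $1 = item name, $2 = number of slots (the N from /hmake)
alias BucketOf return $calc($hash($1,32) % $2)

Something like //echo -a $BucketOf(somename,1000) $BucketOf(somename,10000) will generally print two different numbers, which is exactly why changing N means reassigning the items.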

This means that $hget will indeed be slower if you take a smaller hashtable and fill it with the same amount of items as before, for exactly the reason you mention. A benchmark would show this, although you'd have to repeat $hget a large number of times of course, as the speed of individual $hgets is still negligible (150us vs 15us or whatever).
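
A self-contained way to see it (rough sketch; throwaway table names, numbers picked arbitrarily):

Code:
; Sketch: put the same 20,000 items in a 10-slot table and a 10,000-slot table,
; then time 10,000 lookups of the same key in each.
alias TestSlots {
  hmake small 10
  hmake big 10000
  var %n = 1
  while (%n <= 20000) {
    hadd small item $+ %n somedata
    hadd big item $+ %n somedata
    inc %n
  }
  var %i = 1, %ticks = $ticks
  while (%i <= 10000) {
    var %dummy = $hget(small,item12345)
    inc %i
  }
  echo -a small (10 slots): $calc($ticks - %ticks) ms
  var %i = 1, %ticks = $ticks
  while (%i <= 10000) {
    var %dummy = $hget(big,item12345)
    inc %i
  }
  echo -a big (10000 slots): $calc($ticks - %ticks) ms
  hfree small
  hfree big
}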

On the other hand, if your file contains M entries, /hload does M implicit lookups to ensure that item names won't be added twice. If the ratio M:N (i.e. items:slots) is 10:1, that costs on average about 5*M string comparisons in total; if it is 100:1, about 50*M. For large values of M this difference is noticeable, even with one single /hload. The moral of this story is that you should choose N very carefully before issuing a /hload..
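
To put rough numbers on the OP's case: 300,000 items into 1,000 slots is about 300 items per bucket, so each of the 300,000 implicit lookups during /hload compares roughly 150 names on average, around 45 million comparisons in total. With 10,000 slots it's about 30 items per bucket, roughly 15 comparisons per lookup, around 4.5 million in total; a tenfold difference in that part of the work.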

Finally, $hget().item and $hfind don't use the hashtable as a hashtable, but just as an array of linked lists of items, and as iterating over arrays and iterating over linked lists is about the same speed, the speed of those identifiers depends almost entirely on M (the total number of items) and hardly at all on N. This is why searching/iterating through hashtables is inherently slow, and why making the hashtable bigger won't solve the OP's problem. (you mentioned this already btw)

Note that there is no such thing as "overflow" in hashtables; buckets point to linked lists that can be of arbitrary length. If you make a hashtable of size 1, you are essentially using a single linked list of items, and everything will still work fine (but $hget would be just as slow as $hfind in that case).
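
You can try that yourself with a throwaway one-slot table; it behaves like any other table, just with everything in a single list:

Code:
//hmake tiny 1 | hadd tiny hello world | echo -a $hget(tiny,hello) | hfree tiny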

I hope this clears things up a bit.. otherwise please keep asking. ;)


Saturn, QuakeNet staff
Hoopy frood | Joined: Feb 2004 | Posts: 2,019
Thanks for the clarification, I'm sure it'll be helpful for anyone wanting to know a little bit more about the internals of hash tables, and why one should be cautious when choosing the N when creating a hash table.


Gone.
