Monday, April 11, 2011

Associative Arrays / Hash Tables / Collections in SCL

I've been using QUEST for just over five years now, and over time I've become more and more bothered by its lack of a hash table data structure, like I get out of the box in VBA (I know, VBA is pretty ancient stuff itself, but compared to SCL, VBA is a dream, in my opinion.)

A hash table is a data structure is basically a string array that instead of being indexed by numbers, is essentially indexed by strings. In VBA, a hash table is called a Collection, and provides methods for adding, removing, and iterating over its contents. Most implementations of hash tables allow you to put anything in the "bucket" of stuff, keyed by a string.

The purpose of a hash table, as far as I can tell, is to provide a somewhat faster mechanism for finding items in an array than by just looking at every item one by one. Call this the slowest possible case. The fastest possible case would be to just convert a string into a big long binary number, and have a gigantic string array with enough indexes to hold any length string we can throw at it. This is probably the fastest case (I think fetching an item from an array is done in constant time), but it's not really feasible, especially with QUEST SCL, because we don't have all the memory in the world to hold this huge array, which, by the way, would be pretty much empty.

So enter the hash table. According to the Algorithms in C++ book I picked up at the library book sale for 75 cents, we can instead create a small array, with each array space holding a linked list of items with a specific hash value. The linked lists are there because we know that by limiting the size of our array, we can have different keys with the same hash.

The idea then, for searching for a key value, is that you hash the key and get a number between 1 and 111, and get the base node in the linked list at that array index. Then, we just search through each of the nodes in that linked list until we find our exact key value, and return the original string.

This kind of thing seems pretty easy to pull off with SCL, as it's just a matter of providing the hashing function (nicely included in the book), and a way to hold an array of linked lists for each of our 111 possible hash values.

I've put together a hash table utility in SCL, though I called it a collection. It's available for download here with an example for using it here. I put the hash_table.scl file into my directory of always compiled utility routines.

To use the hash table, you have to include the SCL file (I had to specify the full path to the file, so you'll have to modify that to accommodate your system. To create a collection, call the new_collection routine, which initializes a collection structure. Use col_add to add an item, remove_item_from_collection to remove an item, and col_item_exists to check if there is an item in the collection already. The kill_collection procedure will deallocate all the memory for the collection, so make sure to call that when you're done with it, or you could end up eating a lot of memory, depending on how you use this.

Another note, you can loop through every item in the collection, starting (assuming you call your collection col) and looping through a linked list of every item in the collection.

I haven't tested it terribly well, but the example file seems to work as I intend, so unless I start using it and find it broken, this is how it'll stay.

I'm not sure how much faster this is, computationally, than just looping through each item in an array and checking it. It's probably a little slower for small collections, but I imagine as the thing grows the hash table becomes faster than just looping, but I don't know. I do think it'll make some lookups a bit easier, at least from a programmer's productivity perspective, but then again, what do I know? Either way, I'll have to try and implement more of these algorithms, especially the section on regular expressions.