One of my favorite data structures is the hash table because it’s simple & powerful.
You probably have used it before since it’s an efficient way to store key-value pairs.
There are some interesting computer science concepts behind the implementation of a hash table that are worth studying, and that’s exactly what we are going to do in this article!
Buckets & The Hash Function
The basic idea of a hash table is to allow us to efficiently (in O(1)
) retrieve data that is indexed by a key.
As a quick refresher, this is what using a hash table looks like in Ruby:
prices = { apple: 0.50, ice_cream: 3, steak: 10 }
To implement a hash table we need two components:
- A place to store the table entries
- A way to assign key/value pairs to a specific position (index) inside this data store
In other words, we need an array & a hash function.
Implementing a Simple Hash Function
The hash function is an important component of a hash table.
This function transforms the key into an index that can be used to lookup or update its associated value.
This is the big difference between plain arrays & hash tables.
In an array, you can only access values via their index and the index can only be a number. In a hash table, you access values via their key & the key can be anything (string, symbol, integer…), as long as you can write a hash function for it.
You can write a simple hash function for strings by converting all the letters into their ASCII values then adding them up.
Here is an example:
BUCKETS = 32 def hash(input) input.to_s.chars.inject(0) { |sum, ch| sum + ch.ord } % BUCKETS end
In this method we use to_s
to make sure we are working with a string.
This will help us avoid ‘undefined method’ errors. Then a combination of chars
(to convert the string into an Array
of its characters) & inject
for adding up the values.
Inside the block I used the ord
method to convert the characters into their ordinal values.
Finally, I used the modulo operator %
to make sure the resulting value fits into the array. We call each entry in this array a ‘bucket’.
Bucket Distribution
Ideally we want all our buckets to be filled evenly, this will give us the best performance when we need to retrieve a value.
Let’s look at what happens when we test our hash function using this code:
# Create an array of size BUCKETS with all elements set to 0 table = Array.new(BUCKETS) { 0 } letters = Array('a'..'z') 10_000.times do # Create a random string input = Array.new(5) { letters.sample }.join # Count hash distribution table[hash(input)] += 1 end
This produces the following results:
[302, 290, 299, 309, 321, 293, 316, 301, 296, 306, 340, 321, 313, 304, 318, 296, 331, 306, 348, 330, 310, 313, 298, 292, 304, 315, 337, 325, 325, 331, 319, 291]
It looks like our keys are evenly distributed…
…but what happens if we ramp up the number of buckets?
In this example I used a bucket size of 128 (it was 32 before):
[22, 24, 33, 36, 41, 58, 61, 66, 97, 77, 88, 110, 89, 82, 123, 121, 119, 111, 147, 178, 136, 176, 144, 180, 190, 193, 185, 192, 223, 209, 208, 196, 215, 251, 233, 226, 231, 236, 219, 218, 227, 221, 206, 220, 208, 213, 201, 191, 182, 165, 188, 141, 160, 135, 130, 117, 139, 106, 121, 85, 70, 93, 74, 61, 57, 54, 40, 46, 32, 36, 30, 21, 25, 17, 14, 16, 16, 14, 8, 11, 5, 5, 1, 1, 2, 1, 3, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 4, 3, 6, 0, 2, 9, 13, 11, 12, 14, 12, 23, 12, 22]
That doesn’t look like a great distribution anymore!
What’s going on?
The problem is that our hash function is not good enough because all the strings of the same size stay in a certain range. That’s why we have a lot of empty buckets in the middle.
A Better Hash Function
We need a better way to convert our string into an index. Let’s take a look at one possible implementation.
BUCKETS = 256 def hash(input) input.to_s.each_char.inject(0) do |sum, ch| (sum << 8) ^ (ch.ord) ^ (sum >> 4) end % BUCKETS end
What’s happening here is bit shifting (with the >>
& <<
operators). The values are combined using the “exclusive OR operator” (^
).
This bit shifting will mix things up, which will give us a better key distribution. Not perfect, but it’s better than our simple ASCII-based function 🙂
If you want a proper hash function you would be looking at something like MurmurHash, which I believe is what Ruby uses.
Handling Collisions
We don’t have a useful hash table yet.
Why?
Well, you may have noticed that when two keys hash to the same index they will overwrite the old value, and that’s not good!
This is called a hash collision & there are a few strategies to deal with these.
For example:
- Double Hashing
- Linear probing
- Separate chaining
Let’s take a look at separate chaining, where we use a linked-list to store the entries for a particular “bucket”.
So if we assume that :abc
& :ccc
hash to the same index, our hash table would look something like this:
3: [:abc, 100] -> [:ccc, 200] 4: nil 5: [:yx, 50]
Then we will need a linear search to find the correct key.
This will have an impact on our performance because our lookup time can slowly degrade towards O(n)
, instead of the expected O(1)
.
If you are not familiar with this
O(something)
notation that’s called “Big-O Notation“.
To avoid the linked list from growing too deep & therefore degrade the performance of the hash table, it’s necessary to recreate the hash table using a bigger number of buckets.
Ruby does this for you automatically, but still good to know.
Conclusion
The purpose of this article is not to have you writing a hash table implementation, but to help you understand how they actually work, so I hope that you found that interesting!
Don’t forget to share this post to keep the blog going 🙂