When faster is actually slower
This is a follow-up to my post on code optimization.
Today I did some experimenting with hashes and binary-search-trees, specifically playing with QHash and QMap in C++.
According to the Qt containers algorithmic complexity documentation, QHash provides “significantly faster lookups” than QMap. Your intuition may confirm this, because you’ve heard that QHash uses a constant-time algorithm for lookups, while QMap uses a logarithmic algorithm. However, when it comes to performance questions, intuition is not the best guide.
When considering the QMap vs. QHash speed question, we have to consider everything these classes do when you are looking up an entry. Otherwise, you may be led to assume that QHash lookups are faster than QMap lookups in all cases.
How does QMap work?
Like an associative array or dictionary, the QMap class allows you to store values identified by keys. You can insert key/value pairs into the QMap and retrieve values by key. You can use any type as the value, but the key type must implement the < operator. This is because QMap looks up keys using a binary search through a sorted list.
How does QHash work?
QHash also provides the ability to insert and lookup values like QMap, but its implementation is quite different. Keys used in a QHash must implement the == operator and there must exist a global qHash(key) function for the key type. The QHash class hashes each key on lookup to find the position for each key’s value. This is a constant time operation (amortized). This means that the time it takes to lookup a key does not depend on the number of keys already stored in the QHash.
When is QHash slower than QMap?
Consider the following example:
QHash<QString, int> myHash; QMap<QString, int> myMap;
To insert the key “my favorite number” with the value 42 into myHash, the QHash has to first generate a hash code for “my favorite number”. To do so, it uses the qHash() function, which reads every character of the string “my favorite number”. This generates an index that is used to locate the position for this entry. Notice that qHash() must read every character of the string. The QMap, on the other hand, uses the < operator to compare the string “my favorite number” with all other keys in the QMap.
Now let’s suppose we filled myHash and myMap with 1 million short string keys (like a few characters each), and then did a few lookups.
I measured this case using QTime, and found that QHash lookups are about 3 times faster than QMap lookups in this scenario.
This is what I would expect after reading the documentation, but what if the key strings are bigger, like 200 characters each, and all quite distinct from each other?
I measured this case and found that QHash lookups are about 3 times slower than QMap lookups.
How can this be? The documentation says that QHash lookups are supposed to be faster.
The reason is this: Because QMap uses the < operator, it does not need to inspect every character of each key string. Instead, it compares the key string to other key strings, one character at a time, and stops if it discovers a difference before reaching the end of the string. This results in much less computation when locating a key’s value. So much less computation in fact, that even though QMap’s lookup algorithm is logarithmic, it is actually faster than QHash’s constant-time lookup algorithm on the same dataset.
I should offer the disclaimer that QHash will likely be faster than QMap for larger quantities of entries with that key size, but I was unable to test larger quantities due to the limited amount of RAM in my laptop.
When optimizing for speed, always consider all factors. Just because an algorithm is constant time does not mean that it is always faster than a non-constant time algorithm. And always remember to measure execution speed before assuming that one approach is faster than another.
That’s a very good post.
Of course, QMap is faster that QHash on beg data sets, because if the strings are long but very different, let’s say the combination of the first 2 chars is different for every key, that you have a time for QHash like:
t = T(200)
and for QMap
t = T(2*log2(item_count))
T(cl) = cl * time needed to read 2 chars
But QHash will be faster on VERY huge data sets, because if you have
2^100 items or more (well, that should not be so often…) then QHash is faster again.
So you can see that as long as the key length is longer than a specific value and you know the longest common string in keys, than you can find out, when you will get this case:
item_count = key_length/(log2(item_count)*lcs)
where lcs is the longest common string in the keys.
Sorry, i need to correct something:
“item_count = key_length/(log2(item_count)*lcs)” is wrong of course and has to be
item_count = 2^(key_length/(log2(item_count)*lcs))
Interesting, but you should still use Qhash even if it is slower on these particular datasets. Computers will keep getting faster and memory will keep getting larger. On a computer made 10 years from now, you won’t be able to tell the difference between qhash and qmap for the data sets you use today, but for larger datasets, qhash is faster. This is the same rationale for using quick/merge/heap sort all the time even though insertion sort is probably faster for very small sets. We don’t that insertion sort is faster for sets < 10, because merge sort, even if it is 1000 times slower, is still fast enough. On data sets that are much much bigger, X*nlogn starts to be much smaller than Y*n^2 regardless of how much bigger X is than Y.
Wow, talk about premature optimization! ^^^
Just for the people landing here from google (like me), some corrections:
The assumption that a QMap does not need to read the entire key is false.
After all the binary splits in the search, when we end up with 1 possible match, the QMap needs to be sure that this candidate actually matches the search key. The source code shows this is done with another (2nd) call to the < operator. If the found key is not larger, and not smaller than the search key, then it must be equal. So in the end, when the binary search has found a match, it will at least have compared the full strings twice!
Also, the QHash lookup is not purely constant time. This will only be true for a perfect hash. In reality, there will be hash collisions. QHash falls back to a linear search in this case (linked list). This might explain (some of) the performance loss.
Still, the main point remains valid, computing the hash for long keys might be more costly than a binary search.
You have to consider just a simple thing : the logarithmic approach to QMap will always perform at worst at O(ln) while the constant approach of QHash will be of O(1) at best and O(n) at worst.
Hash tables were made for opportunistic situations but if you don’t know your data : choose safety and go logarithmic.
And this is not even taking into accounts the computation time for the hash function.
It seems that the qHash() function takes a const reference to the instance be being hashed, so once the instance is hashed, cache the uint value of qHash in your class for faster lookup next time. That may then optimize your qHash, or better yet, calculate the uint hash value in the constructor and just return the value when qHash(const YourClass& val) is called.