[lug] Hash vs list

Mon Apr 24 15:49:02 MDT 2006

>>>>> "Jeff" == Jeff Schroeder <jeff at zingstudios.net> writes:

Jeff> I'm curious whether searching for a single value in a hash is
Jeff> generally (always?) faster than searching for it in a list.  The
Jeff> programming I'm doing is in PHP, but I think it's a question
Jeff> with a general answer.

And the general answer is: "it depends".  It depends on key
distribution, access patterns (both temporal and key-distribution),
memory pressure (including cache size), language, processor, whether
ordering is important / relevant / free.

To do some research on this, realize that the data type you're looking
for is traditionally called a "set" (although some cultures call it a
"bag", but then again that latter term might indicate a set-with-
duplicates...).  A related data type "dictionary" which is a set that
carries additional data about its keys.  The two typically have nearly
identical implementation strategies (except that sets can in some
cases be represented purely by location, which can't carry any
additional information -- think bitmaps.)

Typical representations include:

 * unsorted dense sequences
   + array
   + singly-linked list
   + doubly-linked list
   + deque (dlist of array)
 * sorted dense sequences
   + (same as above)
 * unsorted sparse sequences
   + hash tables
     - open hashing
     - chained hashing
 * sorted trees
   + balanced or not
   + arity (binary, 2-3, b-trees)
   + dynamic (splay trees)
   + tries (n-way trees of strings from n-symbol alphabet)

And that's all still at a conceptual level.  Optimizing for particular
values of all the variables I mention above gets really ugly really
fast.

The Judy implementation that Sean mentioned is an attempt to code this
functionality very close to the hardware -- it keeps a fairly clean
interface, but under the hood it arranges things into b-trees such
that each node is exactly the size of a cache line; it embeds small
values into the tree node directly instead of allocating a new leaf
node; etc.  One could view this as a detail-vs-time tradeoff, but if
the library is solid and well-specified, then you can just use someone
else's effort "for free".

Even at the conceptual / whiteboard level, you can already make
tradeoffs.  Sure, you have 1M records -- but if you only search
through them once, you're obviously better off with a stupid linear
search (expected cost N/2), vs building up any sort of structure
(since you need to look at each element at least once, so your cost is
guaranteed to be N+change).

On the other hand, if you know you're going to be making many
comparisons, you're better off doing extra work up front so that your
lookups can be quick.  Especially if you have memory to burn, hash
tables are probably the winner here, as they should have just one or
two comparisions per lookup.  (Hash tables work by smashing the key
into a number, then looking at that "slot" in an array; the value is
either there, or it's empty, or the value is "chained" down from that
slot.  Either way, a properly implemented hash should have a very low
collision rate, and a low bucket size.)

But what if it's important to find the "bounds" of where a missing key
would go?  Hash maps are unsorted, so they're not suitable if there's
this additional requirement.  In this case, a balanced tree is
probably your best bet for dynamic collections, sorted dense array for
static collections.  (Each of which costs 20 comparisons per lookup
[20 being about the logarithm of 1M to base 2], but adding an element
to a balanced tree should only cost 20, while adding one to the array
likely costs N/2.)

But both hash tables and balanced trees use up more memory than the
dense array, sometimes by a factor of 3.  So if you're constrained for
memory, you might have to make the space/time tradeoff.

So some considerations when trying to determine the "best" represen-
tation to use for a given task:

How much setup time can/should you use?
How many lookups are you going to do?
How much memory can you throw at the problem?
How dynamic is the collection (inserts/deletes)?
How fast does the response have to be?
How much time can you spend writing your own?
What options does your language / libraries / system offer?
How dense are the keys?
How much data with each key?

And we can't decide any of this for you.  If it's a one-time deal,
just linear search the list.  If it's something that's going to be hit
multiple times a second, that's obviously a different story.

One thing that is hopefully clear: there is no such thing as "best",
not in the generic sense.  Different circumstances will lead to
different answers.

Guidance:

1. For the initial implementation, keep the code as clean and simple
   as possible.

2. Hide the mechanism / representation behind an interface, or at
   least behind a function call.  That way, if you want to change it
   to a higher-performance representation later, you should only have
   to change it in one place, and not throughout your code.

   This has the additional advantage of giving you an obvious test
   point; with a competent unit test, you can prove to your own
   satisfaction that *this* part of the program works.  If the program
   continues to have bugs, you know you can look elsewhere.

Good luck,
t.