[lug] Hash vs list

mohadib mohadib at openactive.org
Mon Apr 24 17:23:34 MDT 2006


On Mon, 2006-04-24 at 15:49 -0600, Tkil wrote:
> >>>>> "Jeff" == Jeff Schroeder <jeff at zingstudios.net> writes:
> 
> Jeff> I'm curious whether searching for a single value in a hash is
> Jeff> generally (always?) faster than searching for it in a list.  The
> Jeff> programming I'm doing is in PHP, but I think it's a question
> Jeff> with a general answer.
> 
> And the general answer is: "it depends".  It depends on key
> distribution, access patterns (both in time and across the key
> space), memory pressure (including cache size), language, processor,
> and whether ordering is important / relevant / free.
> 
> To do some research on this, realize that the data type you're looking
> for is traditionally called a "set" (although some cultures call it a
> "bag", though that latter term can also indicate a set-with-
> duplicates...).  A related data type is the "dictionary", which is a
> set that carries additional data about its keys.  The two typically
> have nearly identical implementation strategies (except that sets can
> in some cases be represented purely by location, and location alone
> can't carry any additional information -- think bitmaps.)
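>
> In PHP (Jeff's language), a rough sketch of the two -- the names and
> values here are made up for illustration:
>
>     <?php
>     // Dictionary: keys carry additional data (here, hypothetical ages).
>     $ages = array('alice' => 31, 'bob' => 27);
>
>     // Set: only membership matters, so store a throwaway value per key.
>     $seen = array('alice' => true, 'bob' => true);
>
>     // The membership test looks the same for both.
>     if (isset($seen['alice'])) {
>         echo "alice is in the set\n";
>     }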
> 
> Typical representations include:
> 
>  * unsorted dense sequences
>    + array
>    + singly-linked list
>    + doubly-linked list
>    + deque (dlist of array)
>  * sorted dense sequences
>    + (same as above)
>  * unsorted sparse sequences
>    + hash tables
>      - open hashing
>      - chained hashing
>  * sorted trees
>    + balanced or not
>    + arity (binary, 2-3, b-trees)
>    + dynamic (splay trees)
>    + tries (n-way trees of strings from n-symbol alphabet)
> 
> And that's all still at a conceptual level.  Optimizing for particular
> values of all the variables I mention above gets really ugly really
> fast.
> 
> The Judy implementation that Sean mentioned is an attempt to code this
> functionality very close to the hardware -- it keeps a fairly clean
> interface, but under the hood it arranges things into b-trees such
> that each node is exactly the size of a cache line; it embeds small
> values into the tree node directly instead of allocating a new leaf
> node; etc.  One could view this as a detail-vs-time tradeoff, but if
> the library is solid and well-specified, then you can just use someone
> else's effort "for free".
> 
> Even at the conceptual / whiteboard level, you can already make
> tradeoffs.  Sure, you have 1M records -- but if you only search
> through them once, you're obviously better off with a stupid linear
> search (expected cost N/2), vs building up any sort of structure
> (since you need to look at each element at least once, so your cost is
> guaranteed to be N+change).
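>
> In PHP that tradeoff might look like the following (the data is made
> up; array_flip() builds the hash in one pass):
>
>     <?php
>     // Hypothetical data set: one million integer keys.
>     $records = range(1, 1000000);
>
>     // One-off search: the stupid linear scan wins (expected ~N/2 tests).
>     $found = in_array(123456, $records);
>
>     // Many searches: pay ~N once to build a hash, then each lookup
>     // is more or less constant time.
>     $index = array_flip($records);      // value => original position
>     $found = isset($index[123456]);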
> 
> On the other hand, if you know you're going to be making many
> comparisons, you're better off doing extra work up front so that your
> lookups can be quick.  Especially if you have memory to burn, hash
> tables are probably the winner here, as they should have just one or
> two comparisons per lookup.  (Hash tables work by smashing the key
> into a number, then looking at that "slot" in an array; the value is
> either there, or it's empty, or the value is "chained" down from that
> slot.  In any case, a properly implemented hash should have a very low
> collision rate, and a low bucket size.)
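>
> Just to make that mechanism concrete, here's a toy chained hash in
> PHP.  (You'd never do this for real -- PHP arrays already *are* hash
> tables -- it's only meant to show the slot-and-chain idea, with
> crc32() standing in for a real hash function.)
>
>     <?php
>     $nbuckets = 8;
>     $buckets  = array_fill(0, $nbuckets, array());
>
>     function slot($key, $nbuckets) {
>         // "Smash the key into a number", then pick a slot.
>         return abs(crc32((string) $key)) % $nbuckets;
>     }
>
>     // Insert: append to the chain hanging off the key's slot.
>     $buckets[slot('alice', $nbuckets)][] = 'alice';
>
>     // Lookup: hash once, then scan the (hopefully short) chain.
>     $found = in_array('alice', $buckets[slot('alice', $nbuckets)]);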
> 
> But what if it's important to find the "bounds" of where a missing key
> would go?  Hash maps are unsorted, so they're not suitable if there's
> this additional requirement.  In this case, a balanced tree is
> probably your best bet for dynamic collections, sorted dense array for
> static collections.  (Each of which costs 20 comparisons per lookup
> [20 being about the logarithm of 1M to base 2], but adding an element
> to a balanced tree should only cost 20, while adding one to the array
> likely costs N/2.)
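>
> A minimal sketch of that sorted-dense-array lookup -- a lower-bound
> binary search, so a miss tells you exactly where the key would go
> (the function name is mine, not anything built in):
>
>     <?php
>     function lower_bound(array $sorted, $key) {
>         $lo = 0;
>         $hi = count($sorted);
>         while ($lo < $hi) {
>             $mid = ($lo + $hi) >> 1;
>             if ($sorted[$mid] < $key) {
>                 $lo = $mid + 1;
>             } else {
>                 $hi = $mid;
>             }
>         }
>         // Index of $key if present, else its insertion point.
>         // ~log2(N) comparisons: about 20 for 1M elements.
>         return $lo;
>     }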
> 
> But both hash tables and balanced trees use up more memory than the
> dense array, sometimes by a factor of 3.  So if you're constrained for
> memory, you might have to make the space/time tradeoff.
> 
> So some considerations when trying to determine the "best"
> representation to use for a given task:
> 
> How much setup time can/should you use?
> How many lookups are you going to do?
> How much memory can you throw at the problem?
> How dynamic is the collection (inserts/deletes)?
> How fast does the response have to be?
> How much time can you spend writing your own?
> What options does your language / libraries / system offer?
> How dense are the keys?
> How much data with each key?
> 
> And we can't decide any of this for you.  If it's a one-time deal,
> just linear search the list.  If it's something that's going to be hit
> multiple times a second, that's obviously a different story.
> 
> One thing that is hopefully clear: there is no such thing as "best",
> not in the generic sense.  Different circumstances will lead to
> different answers.
> 
> Guidance:
> 
> 1. For the initial implementation, keep the code as clean and simple
>    as possible.
> 
> 2. Hide the mechanism / representation behind an interface, or at
>    least behind a function call.  That way, if you want to change it
>    to a higher-performance representation later, you should only have
>    to change it in one place, and not throughout your code.
> 
>    This has the additional advantage of giving you an obvious test
>    point; with a competent unit test, you can prove to your own
>    satisfaction that *this* part of the program works.  If the program
>    continues to have bugs, you know you can look elsewhere.
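>
> Concretely, that can be as small as one function -- the name and the
> starting representation here are just placeholders:
>
>     <?php
>     // Single choke point for membership tests.  Today it's a linear
>     // scan; swap in a hash or tree later without touching call sites.
>     function collection_contains(array $collection, $value) {
>         return in_array($value, $collection, true);
>     }
>
>     // ...and it's the obvious spot to aim a unit test at:
>     var_dump(collection_contains(array('a', 'b', 'c'), 'b')); // bool(true)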
> 
> Good luck,
> t.


good stuff Tkil :)



