[lug] Hash vs list
mohadib
mohadib at openactive.org
Mon Apr 24 17:23:34 MDT 2006
On Mon, 2006-04-24 at 15:49 -0600, Tkil wrote:
> >>>>> "Jeff" == Jeff Schroeder <jeff at zingstudios.net> writes:
>
> Jeff> I'm curious whether searching for a single value in a hash is
> Jeff> generally (always?) faster than searching for it in a list. The
> Jeff> programming I'm doing is in PHP, but I think it's a question
> Jeff> with a general answer.
>
> And the general answer is: "it depends".  It depends on key
> distribution, access patterns (both over time and across the key
> space), memory pressure (including cache size), language, processor,
> and whether ordering is important / relevant / free.
>
> To do some research on this, realize that the data type you're looking
> for is traditionally called a "set" (although some cultures call it a
> "bag", but then again that latter term might indicate a set-with-
> duplicates...).  A related data type is the "dictionary", a set that
> carries additional data about its keys.  The two typically have nearly
> identical implementation strategies (except that sets can in some
> cases be represented purely by location, which can't carry any
> additional information -- think bitmaps.)
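>
> To make that concrete in PHP (Jeff's language): a plain array
> searched with in_array() behaves like a list, while an associative
> array keyed on the values themselves behaves like a set or
> dictionary.  A quick untested sketch, variable names made up:
>
>     <?php
>     // A "set": the keys carry membership; the values are throwaway.
>     $set = array('red' => true, 'green' => true, 'blue' => true);
>     $isMember = isset($set['green']);        // true
>
>     // A "dictionary": same shape, but the values carry data
>     // about each key.
>     $dict = array('red' => 0xFF0000, 'green' => 0x00FF00);
>     $rgb  = $dict['red'];                    // 16711680
>     ?>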
>
> Typical representations include:
>
> * unsorted dense sequences
>   + array
>   + singly-linked list
>   + doubly-linked list
>   + deque (doubly-linked list of arrays)
> * sorted dense sequences
>   + (same as above)
> * unsorted sparse sequences
>   + hash tables
>     - open addressing (probe within the table)
>     - separate chaining (linked lists per bucket)
> * sorted trees
>   + balanced or not
>   + arity (binary, 2-3, b-trees)
>   + dynamic (splay trees)
>   + tries (n-way trees over strings from an n-symbol alphabet)
>
> And that's all still at a conceptual level. Optimizing for particular
> values of all the variables I mention above gets really ugly really
> fast.
>
> The Judy implementation that Sean mentioned is an attempt to code this
> functionality very close to the hardware -- it keeps a fairly clean
> interface, but under the hood it arranges things into b-trees such
> that each node is exactly the size of a cache line; it embeds small
> values into the tree node directly instead of allocating a new leaf
> node; etc. One could view this as a detail-vs-time tradeoff, but if
> the library is solid and well-specified, then you can just use someone
> else's effort "for free".
>
> Even at the conceptual / whiteboard level, you can already make
> tradeoffs. Sure, you have 1M records -- but if you only search
> through them once, you're obviously better off with a stupid linear
> search (expected cost N/2), vs building up any sort of structure
> (since you need to look at each element at least once, so your cost is
> guaranteed to be N+change).
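>
> In PHP, that one-shot linear scan is just in_array(); something
> like this (a sketch with stand-in data):
>
>     <?php
>     // One-shot membership test: O(N) scan -- roughly N/2
>     // comparisons expected on a hit, N on a miss.  Perfectly
>     // fine if you only ever do it once.
>     $records = range(1, 1000000);      // stand-in for real data
>     $found   = in_array(42, $records);
>     ?>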
>
> On the other hand, if you know you're going to be making many
> comparisons, you're better off doing extra work up front so that your
> lookups can be quick. Especially if you have memory to burn, hash
> tables are probably the winner here, as they should have just one or
> two comparisons per lookup.  (Hash tables work by smashing the key
> into a number, then looking at that "slot" in an array; the value is
> either there, or it's empty, or the value is "chained" down from that
> slot. Either way, a properly implemented hash should have a very low
> collision rate, and a low bucket size.)
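>
> Just to make that mechanism concrete, here's a toy chained hash in
> PHP.  (Purely illustrative and untested -- PHP's own arrays already
> do exactly this under the hood, so you'd never write it for real.)
>
>     <?php
>     // Smash the key into a slot number, then scan the (hopefully
>     // tiny) chain hanging off that slot.
>     function toy_insert(&$slots, $nslots, $key, $val) {
>         $i = abs(crc32($key)) % $nslots;    // key -> slot number
>         $slots[$i][] = array($key, $val);   // chain on collision
>     }
>
>     function toy_lookup($slots, $nslots, $key) {
>         $i = abs(crc32($key)) % $nslots;
>         foreach ($slots[$i] as $pair) {     // usually 0 or 1 long
>             if ($pair[0] === $key) {
>                 return $pair[1];
>             }
>         }
>         return null;
>     }
>
>     $nslots = 1024;
>     $slots  = array_fill(0, $nslots, array());
>     toy_insert($slots, $nslots, 'apple', 42);
>     $v = toy_lookup($slots, $nslots, 'apple');   // 42
>     ?>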
>
> But what if it's important to find the "bounds" of where a missing key
> would go? Hash maps are unsorted, so they're not suitable if there's
> this additional requirement. In this case, a balanced tree is
> probably your best bet for dynamic collections, sorted dense array for
> static collections. (Each of which costs 20 comparisons per lookup
> [20 being about the logarithm of 1M to base 2], but adding an element
> to a balanced tree should only cost 20, while adding one to the array
> likely costs N/2.)
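>
> A sketch of that sorted-array lookup in PHP (untested; lower_bound
> is a name I'm borrowing from C++):
>
>     <?php
>     // Binary search over a sorted array: ~log2(N) comparisons,
>     // about 20 for 1M entries.  Returns the index of $key if
>     // present, else the index where it would be inserted --
>     // exactly the "bounds" information a hash can't give you.
>     function lower_bound($sorted, $key) {
>         $lo = 0;
>         $hi = count($sorted);
>         while ($lo < $hi) {
>             $mid = intval(($lo + $hi) / 2);
>             if ($sorted[$mid] < $key) {
>                 $lo = $mid + 1;
>             } else {
>                 $hi = $mid;
>             }
>         }
>         return $lo;
>     }
>     ?>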
>
> But both hash tables and balanced trees use up more memory than the
> dense array, sometimes by a factor of 3. So if you're constrained for
> memory, you might have to make the space/time tradeoff.
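>
> If you want to eyeball the cost in PHP, memory_get_usage() shows
> the difference before and after building the structure.  A rough
> sketch (the data is a stand-in, and this is no precise benchmark):
>
>     <?php
>     $before = memory_get_usage();
>     $big    = range(1, 1000000);       // build the structure
>     printf("structure cost: %d bytes\n",
>            memory_get_usage() - $before);
>     ?>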
>
> So some considerations when trying to determine the "best"
> representation to use for a given task:
>
> How much setup time can/should you use?
> How many lookups are you going to do?
> How much memory can you throw at the problem?
> How dynamic is the collection (inserts/deletes)?
> How fast does the response have to be?
> How much time can you spend writing your own?
> What options does your language / libraries / system offer?
> How dense are the keys?
> How much data with each key?
>
> And we can't decide any of this for you. If it's a one-time deal,
> just linear search the list. If it's something that's going to be hit
> multiple times a second, that's obviously a different story.
>
> One thing that is hopefully clear: there is no such thing as "best",
> not in the generic sense. Different circumstances will lead to
> different answers.
>
> Guidance:
>
> 1. For the initial implementation, keep the code as clean and simple
> as possible.
>
> 2. Hide the mechanism / representation behind an interface, or at
> least behind a function call. That way, if you want to change it
> to a higher-performance representation later, you should only have
> to change it in one place, and not throughout your code.
>
> This has the additional advantage of giving you an obvious test
> point; with a competent unit test, you can prove to your own
> satisfaction that *this* part of the program works. If the program
> continues to have bugs, you know you can look elsewhere.
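>
> In PHP that might look something like this (a sketch; the class
> name and its innards are invented):
>
>     <?php
>     // Every lookup in the program goes through this one class, so
>     // callers never see the representation.  Swap the guts later
>     // (sorted array, database, whatever) without touching call
>     // sites -- and unit-test the class in isolation.
>     class RecordIndex {
>         private $byKey = array();   // today: a plain PHP hash
>         function add($key, $val) { $this->byKey[$key] = $val; }
>         function has($key)       { return isset($this->byKey[$key]); }
>         function get($key)       { return $this->has($key)
>                                           ? $this->byKey[$key] : null; }
>     }
>     ?>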
>
> Good luck,
> t.
> _____________
good stuff Tkil :)