[lug] Threading WAS: Subversion user auth question

Sat Jul 24 16:48:15 MDT 2004

"John Hernandez" <John.Hernandez at noaa.gov> writes:
> This brings into question how "smart" these MUAs really are :)  Out of
> curiosity, which message header dictates threading operation?

As someone who's spent the last few months trying to get message
threading to behave properly, (mostly spent trying to work around
broken libraries and then eventually deciding that the best way to do
it right is to do it myself)... it seems like it should be easier than
it actually is.

The relevant headers are:
Message-ID, References, In-Reply-To, and Subject.

When you read the specs, RFC 822 and 2822, it *still* looks pretty
simple.  When you look at the wide variety of what MUAs *actually* do
when constructing replies, you become amazed that email actually
functions as well as it does.

Jamie Zawinski wrote the message threading algorithm that was used in
early Netscape Mail and News 2.0 and 3.0, put up a page on his site
describing his technique (http://www.jwz.org/doc/threading.html) which
is pretty straightforward once you can look at it without filling your
head with pointers.

I'm implementing his threading algorithm in Ruby, using its tmail
library for getting at information in the messages, and I made one
substantial change: instead of having containers have pointers for
parent, child, and next, as you would in C, I remembered that
sometimes, real lists really do make things simpler: so each container
just has an array of children.  This makes one operation harder
(figuring out the parent of a given message), but that turns out to
really not be used except to know whether a message has been assigned
to a parent - which I indicate with a boolean 'parented' flag on each
message instead: Problem solved, with no worries about pointers
becoming inconsistent with each other.

Anyway, a long story short, it took me about 35 hours to refactor and
implement this in Ruby.  It's high-level code, with no pointer
chasing.  It can thread about 1000 messages a second on my Pentium 3,
600Mhz laptop, and doesn't scale badly.

But while I'm discussing it, I should mention that JWZ's algorithm
does seem to make useful decisions, but I strongly disagree with his
criticisms about using databases to store threading information:

> Also bunk is the idea that databases are needed for ``scalability.''
> This code can thread 100,000 messages without a horrible delay, and
> the fact is, if you're looking at a 100,000 message folder

As far as I can tell, he's got some issues with the fact that the team
working on Netscape 4 (Communicator) threw away his code because they
didn't understand it, and implemented threading differently, using C++
and a database, and apparently (he claims) it never worked right.

I've not used communicator's MUA.  But what I need this for is the
website for my thesis project.  If I were *not* using some amount of
thread precomputing (that is, a database of message relationship
information), I'd have to load and thread the entire set of messages
(over 50,000) for each page request.  He's right that his algorithm
does an impressive job, but I can't afford to have each page load take
20 seconds for loading and threading, and then whatever else for
actually constructing the page.  As it was, I had some performance
issues related to an earlier design of table relations that would
cause it to take 12 seconds to bring up a page; it was completely
unbearable.

With a small redesign, I have it down to under 2 seconds now.  JWZ is
obviously a good programmer, but he blames C++ and the database for
being evil, when I expect the bigger issue was that the Netscape Mail
team ignored his code which really did work well enough for finding
most message relationships.

I'm using JWZ's algorithm, and representing its decisions in a
database so I can look it up later, faster, pretty easily.  Merging
new messages with existing threads already in the database also relies
on JWZ's algorithm: I feed it some messages from the database together
with the incoming messages, run it, see where each message is linked,
and insert the new messages in with appropriate relationships
indicated.  It's just not hard.

He says, "Just say no to databases!" but I say, "What does this have
to do with *threading*, again?"

My implementation of message threading in Ruby will be put up on RAA
under the LGPL when I'm done.

-- 
epistemological humility
   - Chris Riddoch -