[lug] Somewhat OT: Types of list email

Mon Nov 24 23:41:56 MST 2003

Ed Hill <ed at eh3.com> writes:
> Wow, good luck with what sounds like a fascinating (and difficult)
> task.  I think the older I get, the more I appreciate our human ability
> to parse English (or any human language).  Anyway, you have my
> permission to use any/all of my emails on the LUG lists (including
> future emails) for your study.

Thanks. I've thought hard about the copyright issues. I still need to
consult some people at the university to be sure, but here's what
I've come up with: I'm not republishing anybody's email. If someone
wants to argue that the summary is a derivative work, that will be
hard, because the summaries will quote less than 20% of the original
message, which probably falls within fair use. I'm also not
modifying the original message, or claiming that I (or my system)
wrote it.

I'm constructing information (posting <messageid> might be a question,
this other one is an answer, etc.) about email, and since this is
really metadata, it's probably got the same legal status as a
bibliography.  And since it isn't metadata someone else has made
(it's metadata that my program is trying to discover from textual
statistics, plus metadata that I create myself for the training set),
it really should be in the clear.

Legally, I think a *mirror* of a mailing list has more to be worried
about than a summarizer, and I'm not aware of many cases where
someone's, say, tried to sue their way to removing any traces of an
email they've personally sent to a public list.

It's not uncommon for the kernel traffic summaries to quote most, if
not all, of a message, and I'd be worried about doing that.  My
summarizer will almost certainly have a failure mode of showing too
much information about threads you *don't* care about and not enough
of the individual messages you *do* care about - and this is why I'm
not going to be supplanting Zack Brown (kernel traffic summary author)
anytime soon.  It'd be a lot of extra work to make a summarizer
biased in favor of its users' interests.  That might be a
doctorate, though. ;)

I emailed Zack a month or two ago, asking if I could use his data,
which he said was GPLed, and then after a lot of reading and thinking,
I realized that it won't be terribly useful since he does a different
*kind* of summary than what I'm intending to attempt.

In any case, thanks for the permission. I really hope (and expect,
from my research) that it's not legally necessary to get it from
everyone on LKML or BLUG. I'm more concerned with high-traffic
lists, though, and I'd rather not encourage people to suddenly develop
odd posting behaviors just to mess with the statistics. [Hi, Ed,
Sean, Kevin, Evelyn, etc.]

> As I'm sure lots of other people have noticed, the actual subject (not
> just the "Subject:" header) of many email-list threads tend to drift. 
> So while the Subject: header may stay more-or-less the same, the real
> topic(s) discussed can change dramatically over just a few emails.  Are
> you going to tackle this "drift" issue?

Yes, in fact I'm calling it "topic drift," and that's an important
part of another piece of the project: grouping threads (and portions
of threads) together by topic.  If the topic of, say, software suspend
comes up on LKML in multiple threads, you'd want them all grouped
together first and *then* summarized, in the context of each other.
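
A rough sketch of the grouping idea, not the actual system: it assumes
each thread is reduced to a set of words and that two threads share a
topic when those sets overlap enough (Jaccard similarity). The names
word_set, jaccard and group_by_topic, and the 0.3 threshold, are made
up for illustration.

```python
import re


def word_set(text):
    """Return the set of lowercase alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Size of the intersection divided by size of the union of two sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_by_topic(threads, threshold=0.3):
    """Greedy one-pass grouping.

    threads: dict mapping a thread id to that thread's text.
    Returns a list of groups, each a list of thread ids.
    A thread joins the first group whose seed (first) thread is similar
    enough; otherwise it starts a new group.
    """
    groups = []  # each entry: (list of thread ids, word set of the seed thread)
    for thread_id, text in threads.items():
        words = word_set(text)
        for ids, seed_words in groups:
            if jaccard(words, seed_words) >= threshold:
                ids.append(thread_id)
                break
        else:
            groups.append(([thread_id], words))
    return [ids for ids, _ in groups]


threads = {
    "t1": "software suspend broken on laptop",
    "t2": "scheduler latency patch",
    "t3": "laptop software suspend still broken",
}
groups = group_by_topic(threads)
```

Here the two software suspend threads land in one group even though
they are separate threads. Handling drift *within* a thread would mean
running the same comparison on portions of threads instead of whole
ones.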

That's step one.  Step two is classifying messages by discourse-level
type (thus, the list of types I posted to get comments on...) and
using that as a basis for the structure and material around the quotes.
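
To make step two concrete, here is a toy classifier trained on a tiny
hand-labeled set. It is only a sketch under assumptions: it uses naive
Bayes over word counts (one simple way to classify from textual
statistics, not necessarily the method the project will use), and the
two labels and the training messages are invented for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase words, plus "?" as its own token (a strong question cue)."""
    return re.findall(r"[a-z]+|\?", text.lower())


def train(labeled_messages):
    """labeled_messages: list of (text, label) pairs. Returns the model."""
    word_counts = defaultdict(Counter)  # label -> token -> count
    label_counts = Counter()            # label -> number of messages
    vocabulary = set()
    for text, label in labeled_messages:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocabulary.update(tokens)
    return word_counts, label_counts, vocabulary


def classify(model, text):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        score = math.log(label_counts[label] / total_messages)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            # Add-one smoothing so unseen words do not zero out a label.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-labeled training set.
training = [
    ("how do i enable software suspend ?", "question"),
    ("does anyone know why this fails ?", "question"),
    ("you need to apply the patch and rebuild", "answer"),
    ("try the latest kernel , it fixes this", "answer"),
]
model = train(training)
label = classify(model, "why does suspend fail ?")
```

A real training set would need far more messages and the full list of
types, but the shape is the same: labeled examples in, a type per
message out.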

Step three is sentence extraction: choosing a representative sentence,
phrase, or small chunk (< 20%) of the source message to quote, unless
it's sufficiently short that the classification by itself likely
expresses the point of the message.  This will be done by using a
message and sentence similarity metric to find the most centrally
meaningful (in a funky semantic vector space) piece of text.  (The
math in some of this is truly scary.)
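
A stripped-down illustration of step three, with the scary math left
out. It assumes plain word-count vectors and cosine similarity, and
picks the sentence closest to the vector of the whole message. The
function names are hypothetical, the sentence splitter is naive, and
the 20% quoting limit is not enforced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def split_sentences(message):
    """Naive splitter: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", message.strip())
    return [p for p in parts if p]


def most_central_sentence(message):
    """Return the sentence whose vector is closest to the whole message's."""
    sentences = split_sentences(message)
    if not sentences:
        return ""
    centroid = Counter(tokenize(message))
    return max(sentences, key=lambda s: cosine(Counter(tokenize(s)), centroid))


message = (
    "Software suspend fails on my laptop. "
    "The suspend code hangs when the laptop suspend path runs. "
    "Thanks."
)
quote = most_central_sentence(message)
```

With this message, the second sentence is chosen because it shares the
most weight with the message as a whole. Raw word counts favor long
sentences and common words, which is one reason a real system would
want a better vector space than this.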

And finally, step four: formulating the summary paragraphs, combining
the results of steps two and three, complete with quotes and links to
the quoted messages in a web archive of the list.

Whew. And all this by mid-March... oh, and a hundred-something page
paper to describe why this is new, different, and a contribution to
linguistics.

Back to work...

-- 
epistemological humility
   - Chris Riddoch -