[lug] Typesetting Programs

Tkil tkil at scrye.com
Thu Dec 12 15:47:01 MST 2002


>>>>> "David" == David Morris <lists at morris-clan.net> writes:

David> The issue is, though, gathering information that already
David> exists.  The documents that will be auto-generated will contain
David> no information that is not in the source...on the other hand,
David> the source is not a human-readable format in some cases, or
David> requires heavy processing to get into a different format for
David> some specific purpose.

If you're saying that there are already semantics embedded in your
data, then you're most of the way there.  You need something that can
transform those semantics into markup for your desired output
processor.

What's interesting is that this transform engine is in fact the
"documentation" you're really looking for, in a sense.  Admittedly,
it's not in a pretty format, but it is the [procedural] incarnation of
the actual goal -- turning diverse, potentially not-human-readable
information into a consistent set of semantically valid tags.

Add in a style sheet, and you're basically done.  :)

(Keep in mind that this stuff can go on at different levels, too:
DocBook and LaTeX are most fruitfully thought about on the structural
and semantic levels; leave the physical formatting and output details
to the style sheets.  *roff, TeX, etc. are more purely typesetters.)
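
For instance, a bare-bones LaTeX document already works this way:
every line in the sketch below says what the text *is*, and the class
file decides what that looks like on the page (swap in a different
class or package and the same source just reflows):

    \documentclass{article}  % the "style sheet": article picks fonts, layout
    \begin{document}
    \section{Gathering the data}  % structural: "a section", not "17pt bold"
    The source files are \emph{machine-generated}; do not edit them.
    % semantic emphasis; the class decides whether that means italics
    \end{document}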

>>>>> "Tkil" == Tkil <tkil at scrye.com> writes:

Tkil> (This is speaking from a position of some experience.  A few
Tkil> years back, I was working on a project that would go through PDF
Tkil> files and try to reconstruct a table of contents, based on font
Tkil> size and style, position on page, etc.  It worked, but not
Tkil> without a lot of struggle.)

David> This is one thing that I will most definitely NOT do....items
David> such as a table of contents should be generated by the base
David> package I am using, NOT by some kludge that I come up
David> with....and it looks like LaTeX will do this nicely.

I think you might have my anecdote a little backwards.  Most of these
files we received only in PDF form, but they had been generated by
(say) FrameMaker outputting PS, then distilled into PDF.  In this
case, everything was consistent; we just didn't have the original
(semantic) markup.  The TOC extractor that I described was a tool we
used to recover a limited subset of that semantic markup.  (Which we
then promptly re-embedded into the PDF so that it would show up as
bookmarks.)

In your case, you're correct that "the base package" you're using
should indeed generate the TOC (and the PDF bookmarks, too!), the
Index, the Bibliography, the Table of Figures, the Table of
Illustrations, the Table of Tables, and so on.  All of this can be
automatically generated, labelled, cross-referenced, and output to
multiple formats -- but only if you have good semantic markup to
begin with.
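
A rough sketch of what that looks like in LaTeX (the package choices,
refs.bib, and the lamport94 key are only placeholders for whatever
you actually use):

    \documentclass{report}
    \usepackage{makeidx}    % index machinery
    \usepackage{hyperref}   % PDF bookmarks and live cross-references
    \makeindex
    \begin{document}
    \tableofcontents
    \listoffigures
    \listoftables
    \chapter{Results}\label{ch:results}
    Results are summarized in Chapter~\ref{ch:results} and
    in~\cite{lamport94}.\index{results}
    \bibliographystyle{plain}
    \bibliography{refs}     % assumes a refs.bib containing lamport94
    \printindex
    \end{document}

Run it through pdflatex (plus bibtex and makeindex) a couple of times
and the TOC, bookmarks, page numbers, index, and bibliography all
stay in sync with the markup; nothing is maintained by hand.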

As you already know, text processing is the only sane way to produce
a large body of consistent output.  You can go a fair ways with MS
Word and styles, but there's no way to enforce their use.  The hard
part is getting people to stop thinking in WYSIWYG terms ("This
should be italicized") and start thinking in terms of semantic markup
("Why would you italicize it?  If it's a citation, say so with
<cite>.  Foreign phrase?  Say so: <span class="italian">.").  Once
you have meaningful markup in your content, you can do all sorts of
stuff with it.  (Also, text-based markup allows you to use any
revision control system you like, and to produce meaningful diffs
between releases.)
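
The same habit carries over to LaTeX: rather than sprinkling
\textit{...} through the text, wrap the meaning in a macro (the macro
names below are invented for illustration, not standard commands) and
restyle it in one place later:

    \documentclass{article}
    % semantic wrappers; redefine them as the house style changes
    \newcommand{\foreignphrase}[1]{\emph{#1}}   % Italian, Latin, ...
    \newcommand{\booktitle}[1]{\emph{#1}}       % titles of cited works
    \begin{document}
    He said \foreignphrase{arrivederci} and left
    \booktitle{The Mythical Man-Month} on the table.
    \end{document}

Decide later that foreign phrases should be set some other way, and
you change one definition instead of hunting through the document.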


