[lug] Combining pdf documents

rm at fabula.de
Tue Jan 14 09:51:02 MST 2003


On Tue, Jan 14, 2003 at 08:14:40AM -0700, Kenneth D. Weinert wrote:
> On Tue, 14 Jan 2003 15:30:56 +0100
> rm at fabula.de wrote:
> 
> > On Mon, Jan 13, 2003 at 11:17:05AM -0700, J. Wayde Allen wrote:
> > > On Mon, 13 Jan 2003 rm at fabula.de wrote:
> > 
> >            However: looking at your specific requirements 
> > (esp. modifying content like page numbers) i'd strongly advise against
> > using pdf as a submission format. PDF is a display format -- almost no
> > structural information is present. You'd have to request that all your 
> > authors use some special _visual_ markup to tag relevant bits of information
> > (like: use Adobe-Comic-Sans for page numbers so our program can identify them
> > during processing).
> 
> 	Not true, actually. There *can* be (note the emphasis) a lot
> of structural information in a PDF document. There aren't many tools
> that deal with a PDF on this level to be sure, but there is at least
> one other aside from Adobe, but none that are open source that I'm
> aware of and I've done a lot of looking (if someone knows about a
> library that deals with a PDF on the object level that is open source
> I'd appreciate hearing about it.)
> 
> 	Pages can contain Page Labels, so you could renumber all the
> pages by adding a PageLabels entry in the document catalog.

Hmm, as far as i understand 'Page Labels' are dictionary objects
used for page navigation -- modifying those will modify the navigation
tree, _not_ the numbering showing up on the pages (_unless_ your PDF
tool inserts pdf specials to use the page labels instead of hardcoded
string objects -- do any of the tools you know do this?).
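
To make the distinction concrete, here is a minimal sketch of what a
PageLabels entry looks like at the syntax level. The helper name
page_labels_entry and the sample ranges are made up for illustration; the
keys /Nums, /S, /P and /St are the ones the PDF reference defines for page
labels. The sketch only builds the catalog entry as text. Adding such an
entry changes what a viewer shows in its page navigation, not the numbers
drawn in the page content.

```python
# Hypothetical helper (not from any PDF library): builds the text of a
# /PageLabels catalog entry. Each range is a tuple
# (first_page_index, style, start, prefix), with 0-based page indices.
# Styles: D decimal, R/r upper/lower roman, A/a upper/lower letters.
# Assumption: prefixes contain no parentheses or backslashes, so no
# string escaping is done here.

def page_labels_entry(ranges):
    parts = []
    for first_index, style, start, prefix in sorted(ranges):
        label = "/S /%s" % style
        if prefix:
            label += " /P (%s)" % prefix
        if start != 1:  # /St defaults to 1, so only write it when needed
            label += " /St %d" % start
        parts.append("%d << %s >>" % (first_index, label))
    return "/PageLabels << /Nums [ %s ] >>" % " ".join(parts)

# Four front-matter pages labelled i..iv, then decimal labels from 1.
entry = page_labels_entry([(0, "r", 1, ""), (4, "D", 1, "")])
print(entry)  # /PageLabels << /Nums [ 0 << /S /r >> 4 << /S /D >> ] >>
```

For a paper that should start at page 17 of a combined volume, the range
would be (0, "D", 17, ""), which yields "0 << /S /D /St 17 >>".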


> 	Note that this is not necessarily an easy thing to do - it
> might require your processing of each of the page dictionaries to
> remove any existing page label, but most PDFs just rely on the page
> index as the page number if it's displayed.

For the index display, yes. But you would also need to search and identify
the string objects that represent the page numbers created by the word
processing application (it might be possible with some sort of
heuristic ...).
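
One possible heuristic, sketched under heavy assumptions: suppose the text
fragments of each page have already been extracted as (x, y, text) tuples
(that extraction step is not shown and is the hard part). A printed page
number then tends to be the one numeric string whose value minus the page
index stays constant across the document. The function name and the input
layout below are invented for this illustration.

```python
import re
from collections import Counter

NUMERIC = re.compile(r"[0-9]+")

def guess_page_numbers(pages):
    """pages: one list per page of (x, y, text) tuples.
    Returns, per page, the fragment guessed to be the printed page
    number, or None if no candidate fits."""
    # Count on how many pages each "printed value minus page index" occurs.
    offsets = Counter()
    for index, fragments in enumerate(pages):
        seen = set()
        for _x, _y, text in fragments:
            if NUMERIC.fullmatch(text.strip()):
                seen.add(int(text) - index)
        offsets.update(seen)
    if not offsets:
        return [None] * len(pages)
    best_offset = offsets.most_common(1)[0][0]

    # Pick, on each page, the first numeric fragment matching that offset.
    result = []
    for index, fragments in enumerate(pages):
        hit = None
        for x, y, text in fragments:
            if NUMERIC.fullmatch(text.strip()) and int(text) - index == best_offset:
                hit = (x, y, text)
                break
        result.append(hit)
    return result

sample = [
    [(72, 700, "Section 2"), (300, 30, "17"), (100, 500, "2003")],
    [(300, 30, "18"), (100, 400, "42")],
    [(300, 30, "19"), (90, 600, "2003")],
]
guessed = guess_page_numbers(sample)
print(guessed)
```

This would still be fooled by roman numerals, by numbers split across
several string objects, and by documents where some other number happens
to increase by one per page.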

> > > > In theory it should be possible to combine pdf documents by
> > > > reading their dictionaries (the last object in a file - the
> > > > toplevel/root object so to say) and adding all object trees
> > > > to a newly created root object (but you would need to renumber
> > > > all objects to avoid duplicated object IDs). Doable, but most
> > > > likely not fun ....
> 
> 	Definitely not fun, and a lot more work than you'd think :)

Oh, i think i agree with that (and, just to mention it: one would
have to deal with different page sizes and fonts as well ...)
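
Just to illustrate the renumbering step on its own, here is a toy sketch.
It assumes each document has already been parsed into a dict mapping
object number to the object's body text; the function names are invented.
It rewrites indirect references ("N G R") with a regular expression, which
a real tool could not do safely, since the same pattern can occur inside
strings and streams. Building the new root object and page tree, and
dealing with fonts and page sizes, is not shown.

```python
import re

# Matches an indirect reference such as "12 0 R".
REF = re.compile(r"\b(\d+) (\d+) R\b")

def renumber(objects, offset):
    """Shift every object number, and every reference to one, by offset."""
    def shift(match):
        return "%d %s R" % (int(match.group(1)) + offset, match.group(2))
    return {num + offset: REF.sub(shift, body) for num, body in objects.items()}

def merge(first, second):
    """Put both object sets into one numbering space without collisions."""
    offset = max(first) if first else 0
    merged = dict(first)
    merged.update(renumber(second, offset))
    return merged

doc = {
    1: "<< /Type /Catalog /Pages 2 0 R >>",
    2: "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    3: "<< /Type /Page /Parent 2 0 R >>",
}
merged = merge(doc, doc)
for number in sorted(merged):
    print(number, merged[number])
```

After this step the merged set still contains two catalogs and two page
trees; the object trees would then have to be attached to one newly
created root.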

Ralf Mattes

> 
> -- 
> /~\ The ASCII        Ken Weinert   Ken.Weinert at ihs.com 
> \ / Ribbon Campaign  303-858-6956 (V) 303-705-4258 (F)
>  X  Against HTML     GnuPG: 9274F1CE  GnuPG available at http://www.gnupg.org/
> / \ Email!           1D87 3720 BB77 4489 A928  79D6 F8EC DD76 9274 F1CE
> Black holes are where God is dividing by zero.


