[lug] deconstructing pdf of voted Boulder ballots

Fri Dec 9 13:33:08 MST 2011

That reminds me of an app I joked about back in the dark ages. I was writing
DOS code in x86 assembly language, and there's an assembler directive (ORG
in most DOS assemblers) that lets you specify the offset of the following
item. I don't recall using it for anything other than setting the starting
offset in .com files.

Somehow code obfuscation came up. (Undoubtedly targeted at the Boss From
Hell rather than the outside world.) I joked that we could do that to our
code with the right tool. It would look like:

  ; mov ax, bx section
  .off 231
  mov ax, bx
  .off 27a
  mov ax, bx
  .off 311
  mov ax, bx

  ; mov ax, cx section
  .off 188
  mov ax, cx
  .off 448
  mov ax, cx

You get the idea.

As Ken says, some PDF generators have had the same idea. I wonder if it /is/
obfuscation, since it's a lot harder to extract the content of a file that
has individually positioned letters than it is to extract it from a file
where paragraphs are intact.
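
To make the difference concrete, here is a minimal Python sketch of the two
layouts written as PDF content-stream operators. The function names, the
font resource name F1, and the fixed 6-point advance are made up for
illustration; a real converter would compute positions from font metrics.

```python
import re


def escape(text):
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def paragraph_stream(text, x, y, font="F1", size=12):
    """One text object and one Tj operator: the string stays intact."""
    return f"BT /{font} {size} Tf {x} {y} Td ({escape(text)}) Tj ET"


def per_glyph_stream(text, x, y, advance=6.0, font="F1", size=12):
    """Each character gets its own text matrix (Tm) and its own Tj."""
    ops = [f"BT /{font} {size} Tf"]
    for i, ch in enumerate(text):
        ops.append(f"1 0 0 1 {x + i * advance:.2f} {y} Tm ({escape(ch)}) Tj")
    ops.append("ET")
    return "\n".join(ops)


def tj_strings(stream):
    """Naive extraction: collect the operand of every Tj operator.

    This ignores escaped parentheses, which is fine for the demo text only.
    """
    return re.findall(r"\((.*?)\) Tj", stream)


if __name__ == "__main__":
    print(tj_strings(paragraph_stream("Hello world", 72, 700)))
    print(tj_strings(per_glyph_stream("Hello world", 72, 700)))
```

The naive extractor gets the whole sentence back from the first stream and
only a pile of single letters from the second, which it would then have to
reassemble by position.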

On Fri, Dec 9, 2011 at 1:19 PM, Ken Weinert <kenw at quarter-flash.com> wrote:

> On 12/9/2011 1:08 PM, Neal McBurnett wrote:
> > Thanks!  I assumed those were related to images, but hadn't succeeded in
> > seeing any bits, so wasn't sure.  I also am very surprised to see so many
> > images per page, since I'd expect just one.
> >
> > Then I found this explanation:
> >
> >   http://www.jpedal.org/PDFblog/2010/04/understanding-the-pdf-file-format-how-are-images-stored/
> >
> > and this tool:
> >
> >   http://en.wikipedia.org/wiki/Pdfimages
> >
> > and discovered via "pdfimages" that the pdf has the images sliced up
> > into strips each about 150 pixels high.  Bizarre.  Sad....
>
> Yeah, sorry - I guess I just sort of thought everyone knew that. Comes
> from being immersed in them for so many years, you kind of forget what's
> common knowledge and what's not.
>
> You might find that the 150-pixel-high images are related to the hardware
> that scans them. Might be related to a buffer size which was optimized
> for throughput. It's difficult to say why, but I'd be pretty sure it's
> so that there aren't any big data chunks that have to get moved which
> would make the process 'bursty' in nature. Again, pure speculation (and
> if you find out I'd be interested in knowing, if for no other reason
> than to satisfy my curiosity.)
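
If the goal is to get whole pages back, the strips can be stacked again.
Below is a minimal Python sketch that concatenates raw PBM (P4) images
vertically. It assumes the strips were extracted as PBM files, that they
all share one width, and that they are supplied in top-to-bottom order;
none of that is confirmed for these ballots, and the helper names are
hypothetical.

```python
def read_pbm(data):
    """Parse a raw (P4) PBM image and return (width, height, packed_rows)."""
    tokens = []
    pos = 0
    # The header is three whitespace-separated tokens: magic, width, height.
    while len(tokens) < 3:
        while data[pos:pos + 1].isspace():
            pos += 1
        if data[pos:pos + 1] == b"#":
            # Comment: skip to end of line.
            while data[pos:pos + 1] not in (b"\n", b""):
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos])
    pos += 1  # exactly one whitespace byte separates header from pixels
    if tokens[0] != b"P4":
        raise ValueError("not a raw PBM file")
    width, height = int(tokens[1]), int(tokens[2])
    row_bytes = (width + 7) // 8  # each row is padded to a whole byte
    body = data[pos:pos + row_bytes * height]
    if len(body) != row_bytes * height:
        raise ValueError("truncated pixel data")
    return width, height, body


def stack_strips(strips):
    """Stack equal-width P4 strips vertically into one P4 image."""
    if not strips:
        raise ValueError("no strips given")
    width = None
    total_height = 0
    bodies = []
    for data in strips:
        w, h, body = read_pbm(data)
        if width is None:
            width = w
        elif w != width:
            raise ValueError("strip widths differ")
        total_height += h
        bodies.append(body)
    return b"P4\n%d %d\n" % (width, total_height) + b"".join(bodies)
```

Because every PBM row is padded to a whole byte, equal-width strips can be
joined by simple concatenation of their pixel data; no bit shifting is
needed.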
>
> And that first article is correct. There are *a lot* of ways to produce
> PDFs that are identical in appearance but have vastly different internal
> structures. Some of the worst I've seen were from one of the Word to PDF
> converters (don't recall which one now) in which *every* *single*
> *character* is micro adjusted on the page.
>
>
>
> >
> > Thanks again,
> >
> > Neal McBurnett                 http://neal.mcburnett.org/
> >
> > On Fri, Dec 09, 2011 at 06:02:46AM -0700, Kenneth D. Weinert wrote:
> >> -----BEGIN PGP SIGNED MESSAGE-----
> >> Hash: SHA1
> >>
> >> On 12/09/2011 12:17 AM, Neal McBurnett wrote:
> >>
> >>> Can any of you pdf gurus help out?
> >> I don't know if I'd classify myself as a guru, but all of those
> >> CCITTFaxDecode objects are images.
> >>
> >>   3824 0 obj<<
> >>      /BitsPerComponent 1
> >>      /DecodeParms<<
> >>        /Columns 1481
> >>        /K -1
> >>      >>
> >>      /Filter/CCITTFaxDecode
> >>      /Height 150
> >>      /ImageMask true
> >>      /Length 413
> >>      /Subtype/Image
> >>      /Type/XObject
> >>      /Width 1481
> >>   >> stream
> >> There's an example of an image object. It's object #3824, generation
> >> 0. The << indicates the start of a dictionary.
> >> BitsPerComponent of 1 says it's a black and white image.
> >> Columns is self-explanatory and I don't recall what K is off the top
> >> of my head, although it might be an indicator of whether or not the
> >> image is reversed (white/black or black/white.)
> >>
> >> /Filter gives /CCITTFaxDecode, a very common encoding for B&W images.
> >> The /Subtype and /Type describe the type of object it is.
> >>
> >> Then the object dictionary ends (the last >>) and the data attached to
> >> the object begins (the stream) and there will be an endstream at the
> >> end of the data (which is 413 bytes long.)
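
For the record, the PDF reference describes /K as selecting the CCITT
encoding scheme (negative means Group 4), and the black/white reversal is
a separate /BlackIs1 entry. Here is a minimal Python sketch that pulls the
entries out of the dictionary quoted above and does the size arithmetic.
The regex is a toy that ignores nesting and is only good enough for this
one example; the function names are made up.

```python
import re

# The dictionary from object 3824, copied from the message above.
IMAGE_DICT = """\
/BitsPerComponent 1
/DecodeParms<<
  /Columns 1481
  /K -1
>>
/Filter/CCITTFaxDecode
/Height 150
/ImageMask true
/Length 413
/Subtype/Image
/Type/XObject
/Width 1481
"""


def flat_entries(text):
    """Collect /Key value pairs whose value is a name, integer or boolean."""
    pairs = re.findall(r"/(\w+)\s*(/\w+|-?\d+|true|false)", text)
    return dict(pairs)


def describe_k(k):
    """Map the /K parameter to the encoding scheme it selects."""
    if k < 0:
        return "pure two-dimensional encoding (Group 4)"
    if k == 0:
        return "pure one-dimensional encoding (Group 3, 1-D)"
    return "mixed one- and two-dimensional encoding (Group 3, 2-D)"


entries = flat_entries(IMAGE_DICT)
width = int(entries["Width"])
height = int(entries["Height"])

# At 1 bit per pixel, each row is padded up to a whole number of bytes.
row_bytes = (width + 7) // 8
raw_size = row_bytes * height
ratio = raw_size / int(entries["Length"])

if __name__ == "__main__":
    print(entries["Filter"], describe_k(int(entries["K"])))
    print(f"{raw_size} bytes uncompressed, about {ratio:.0f}x compression")
```

So this strip is 186 bytes per row times 150 rows, or 27900 bytes of
bitmap, squeezed into a 413-byte stream.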
> >>
> >> Does this help at all?
> >>
> >> Ken
> >> -----BEGIN PGP SIGNATURE-----
> >> Version: GnuPG v1.4.11 (GNU/Linux)
> >> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> >>
> >> iQEcBAEBAgAGBQJO4gbtAAoJELwlFgJPb4vs6mwH/jeKpw99OhS/beSlqPBeSJtZ
> >> E0KjBNjRmbuNAQAiGv2j9LuMW+/B6d1cXMHlxSdH8S/VBnabReTdMcvEWOVerQlH
> >> uf7O39LDZODWX43cpb8xxX5WZJuXPhaZDvmfsX+cprmc+65AVFLIcXzkr3mduipc
> >> 35MfJfTqQPu1/ZwLJIXa3WoYYzy57ipjje2uQ1cRgi2gqD1RHpG4WaQNLOa/ry1g
> >> Y5VN2WSV5a4WyVxGbHziUI5zxA/mnpBX28c66Y841wWhQk+6zBKkUe4ZB8Dgx5fT
> >> K6h1i6WLW1OrVuQTa7i3cjmJZXHsdsBzfRyMJxkxvCC1nGzTnQfnqEy5dyx9DO8=
> >> =Fico
> >> -----END PGP SIGNATURE-----
> >> _______________________________________________
> >> Web Page:  http://lug.boulder.co.us
> >> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
> >> Join us on IRC: irc.hackingsociety.org port=6667 channel=#hackingsociety