[lug] deconstructing pdf of voted Boulder ballots
Ken Weinert
kenw at quarter-flash.com
Fri Dec 9 13:19:19 MST 2011
On 12/9/2011 1:08 PM, Neal McBurnett wrote:
> Thanks! I assumed those were related to images, but hadn't succeeded in seeing any bits, so wasn't sure. I also am very surprised to see so many images per page, since I'd expect just one.
>
> Then I found this explanation:
>
> http://www.jpedal.org/PDFblog/2010/04/understanding-the-pdf-file-format-how-are-images-stored/
>
> and this tool:
>
> http://en.wikipedia.org/wiki/Pdfimages
>
> and discovered via "pdfimages" that the pdf has the images spliced up into strips each about 150 pixels high. Bizarre. Sad....
Yeah, sorry - I guess I just sort of thought everyone knew that. Comes
from being immersed in them for so many years, you kind of forget what's
common knowledge and what's not.
You might find that the 150-pixel-high images are related to the
hardware that scans them, perhaps to a buffer size that was optimized
for throughput. It's difficult to say why, but I'd be pretty sure it's
so that there aren't any big data chunks that have to get moved, which
would make the process 'bursty' in nature. Again, pure speculation (and
if you find out I'd be interested in knowing, if for no other reason
than to satisfy my curiosity.)
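If you want whole pages back, the strips can be stacked again after
extraction. Here's a minimal sketch, assuming pdfimages wrote each
1-bit strip as a raw PBM (P4) file, that the strips of a page come out
in top-to-bottom order, and that they all share one width. The function
names read_p4 and stitch_strips are made up for illustration; nothing
here comes from the ballot PDF itself.

```python
def read_p4(data: bytes):
    """Parse a raw (P4) PBM and return (width, height, packed_rows)."""
    tokens, pos = [], 0
    while len(tokens) < 3:
        while data[pos:pos + 1].isspace():       # skip whitespace
            pos += 1
        if data[pos:pos + 1] == b"#":            # skip a header comment
            while data[pos:pos + 1] not in (b"\n", b""):
                pos += 1
            continue
        start = pos
        while pos < len(data) and not data[pos:pos + 1].isspace():
            pos += 1
        tokens.append(data[start:pos])
    magic, width, height = tokens[0], int(tokens[1]), int(tokens[2])
    if magic != b"P4":
        raise ValueError("not a raw PBM (P4) file")
    pos += 1                                     # one whitespace byte ends the header
    row_bytes = (width + 7) // 8                 # each row is padded to a whole byte
    rows = data[pos:pos + row_bytes * height]
    if len(rows) != row_bytes * height:
        raise ValueError("truncated pixel data")
    return width, height, rows


def stitch_strips(strips):
    """Stack P4 strips (given top to bottom) into a single P4 image."""
    parsed = [read_p4(s) for s in strips]
    if not parsed:
        raise ValueError("no strips given")
    width = parsed[0][0]
    if any(w != width for w, _, _ in parsed):
        raise ValueError("strips differ in width")
    height = sum(h for _, h, _ in parsed)
    # Rows are byte-aligned, so stacking is plain concatenation.
    return b"P4\n%d %d\n" % (width, height) + b"".join(r for _, _, r in parsed)
```

To use it, read the strip files for one page as bytes (pdfimages names
them with the root you give it plus a running number), pass them in
order to stitch_strips, and write the result to a new .pbm file. You
still have to work out yourself which strips belong to which page.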
And that first article is correct. There are *a lot* of ways to produce
PDFs that are identical in appearance but have vastly different internal
structures. Some of the worst I've seen were from one of the Word-to-PDF
converters (don't recall which one now) in which *every* *single*
*character* is micro-adjusted on the page.
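Since the internals vary that much, it helps to look at the image
dictionaries directly. Here's a rough sketch that pulls the simple
entries out of a dictionary like the one quoted below. It is a regex
hack, not a PDF parser: it flattens the nested /DecodeParms into the
top level and would misread indirect references such as "12 0 R". The
names parse_image_dict and describe_k are made up for illustration. The
meaning of K is taken from the PDF reference as I understand it: a
negative K means Group 4 encoding, and the black/white polarity is a
separate entry, BlackIs1.

```python
import re

# Matches "/Key value" where the value is an integer, a boolean, or a name.
PAIR = re.compile(r"/(\w+)\s*(-?\d+|true|false|/\w+)")

SAMPLE = """3824 0 obj<<
/BitsPerComponent 1
/DecodeParms<<
/Columns 1481
/K -1
>>
/Filter/CCITTFaxDecode
/Height 150
/ImageMask true
/Length 413
/Subtype/Image
/Type/XObject
/Width 1481
>> stream"""


def parse_image_dict(text):
    """Collect simple /Key value pairs; nested dictionaries are flattened."""
    fields = {}
    for key, raw in PAIR.findall(text):
        if raw in ("true", "false"):
            fields[key] = raw == "true"
        elif raw.startswith("/"):
            fields[key] = raw            # keep names with their leading slash
        else:
            fields[key] = int(raw)
    return fields


def describe_k(k):
    """Map the CCITTFaxDecode K parameter to the encoding scheme."""
    if k < 0:
        return "Group 4 (pure two-dimensional)"
    if k == 0:
        return "Group 3 (pure one-dimensional)"
    return "Group 3 (mixed one- and two-dimensional)"


if __name__ == "__main__":
    fields = parse_image_dict(SAMPLE)
    print(fields["Width"], "x", fields["Height"], describe_k(fields["K"]))
```

Run over every image object in the file, this would show quickly
whether all the strips share the same width, height and encoding.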
>
> Thanks again,
>
> Neal McBurnett http://neal.mcburnett.org/
>
> On Fri, Dec 09, 2011 at 06:02:46AM -0700, Kenneth D. Weinert wrote:
>> -----BEGIN PGP SIGNED MESSAGE-----
>> Hash: SHA1
>>
>> On 12/09/2011 12:17 AM, Neal McBurnett wrote:
>>
>>> Can any of you pdf gurus help out?
>> I don't know if I'd classify myself as a guru, but all of those
>> CCITTFaxDecode objects are images.
>>
>> 3824 0 obj<<
>> /BitsPerComponent 1
>> /DecodeParms<<
>> /Columns 1481
>> /K -1
>> >>
>> /Filter/CCITTFaxDecode
>> /Height 150
>> /ImageMask true
>> /Length 413
>> /Subtype/Image
>> /Type/XObject
>> /Width 1481
>> >> stream
>> There's an example of an image object. It's object #3824, generation
>> 0. The << indicates the start of a dictionary.
>> BitsPerComponent of 1 says it's a black and white image.
>> Columns is self-explanatory and I don't recall what K is off the top
>> of my head, although it might be an indicator of whether or not the
>> image is reversed (white/black or black/white.)
>>
>> /Filter gives /CCITTFaxDecode, a very common encoding for B&W images.
>> The /Subtype and /Type describe the type of object it is.
>>
>> Then the object dictionary ends (the last >>) and the data attached to
>> the object begins (the stream) and there will be an endstream at the
>> end of the data (which is 413 bytes long.)
>>
>> Does this help at all?
>>
>> Ken
>> -----BEGIN PGP SIGNATURE-----
>> Version: GnuPG v1.4.11 (GNU/Linux)
>> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>>
>> iQEcBAEBAgAGBQJO4gbtAAoJELwlFgJPb4vs6mwH/jeKpw99OhS/beSlqPBeSJtZ
>> E0KjBNjRmbuNAQAiGv2j9LuMW+/B6d1cXMHlxSdH8S/VBnabReTdMcvEWOVerQlH
>> uf7O39LDZODWX43cpb8xxX5WZJuXPhaZDvmfsX+cprmc+65AVFLIcXzkr3mduipc
>> 35MfJfTqQPu1/ZwLJIXa3WoYYzy57ipjje2uQ1cRgi2gqD1RHpG4WaQNLOa/ry1g
>> Y5VN2WSV5a4WyVxGbHziUI5zxA/mnpBX28c66Y841wWhQk+6zBKkUe4ZB8Dgx5fT
>> K6h1i6WLW1OrVuQTa7i3cjmJZXHsdsBzfRyMJxkxvCC1nGzTnQfnqEy5dyx9DO8=
>> =Fico
>> -----END PGP SIGNATURE-----
>> _______________________________________________
>> Web Page: http://lug.boulder.co.us
>> Mailing List: http://lists.lug.boulder.co.us/mailman/listinfo/lug
>> Join us on IRC: irc.hackingsociety.org port=6667 channel=#hackingsociety