[lug] C(++) library to detect file type (a la 'file')

rm at fabula.de
Fri Nov 1 12:11:24 MST 2002


On Fri, Nov 01, 2002 at 11:29:22AM -0700, Tkil wrote:
> >>>>> "Ralf" == rm  <rm at fabula.de> writes:
> 
> Ralf> Or are you referring to perl's 'system("file $myfile")' function?
> Ralf> That's not really an option since it spawns a shell which then
> Ralf> spawns the 'file(1)' application ...  nothing one would want to
> Ralf> do when indexing a largish set of documents ;-) (currently
> Ralf> ~100,000 docs).
> 
> Well, 'file' will open the file anyway (which costs a bit).  It can
> accept multiple files on the command line, which drops the ratio of
> exec-to-file down to 0.01 or even lower.  Versions exist (or could be
> created) that could even take the list of files from another file,
> which would drop you down to one exec for all your files.
> 
> Sadly, there's no "right" way to do this.

Sadly, this is something I run into more often than I'd like :-/

> 1. Some 3-letter extensions are fairly reliable, but even they can
>    have subtypes (think .gif -- pretty unique, except there are two
>    GIF standards, and then there are animated vs. static gifs...)

No, extensions can't be used: the documents won't have any. That's one
of the reasons I need such a library -- the current indexer looks only
at extensions, and we don't want any.

> 2. Not all files have a unique signature.

Yeah, sometimes file is _really_ off :-) 
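
For the formats that do have a fixed signature, the check itself is cheap
and needs no exec at all. A minimal sketch in C of what I have in mind,
assuming the caller has already read the first few bytes of the document;
the table contents and the names magic_entry and sniff_type are made up
for illustration and not taken from any existing library:

```c
#include <stddef.h>
#include <string.h>

/* One entry per known signature (illustrative table, far from complete). */
struct magic_entry {
    const char *magic;   /* leading bytes of the file */
    size_t      len;     /* number of bytes in magic */
    const char *mime;    /* type reported on a match */
};

static const struct magic_entry magic_table[] = {
    { "GIF87a",            6, "image/gif" },        /* both GIF variants */
    { "GIF89a",            6, "image/gif" },
    { "\x89PNG\r\n\x1a\n", 8, "image/png" },
    { "\xff\xd8\xff",      3, "image/jpeg" },
    { "%PDF-",             5, "application/pdf" },
    { "PK\x03\x04",        4, "application/zip" },
};

/*
 * Compare the first n bytes of a document against the table.
 * Returns a MIME type string, or NULL if no signature matches.
 */
const char *sniff_type(const unsigned char *buf, size_t n)
{
    size_t i;

    for (i = 0; i < sizeof magic_table / sizeof magic_table[0]; i++) {
        const struct magic_entry *e = &magic_table[i];

        if (n >= e->len && memcmp(buf, e->magic, e->len) == 0)
            return e->mime;
    }
    return NULL;
}
```

One fread() of 8 bytes per document would be enough to feed this. It does
nothing for formats without a unique signature, of course; those simply
come back as NULL.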

> 3. Not all files are architecture-independent (take a look at what
>    'file' does for executables, for instance).
> 
> 4. Your application has to determine its degree of paranoia w.r.t. how
>    much it trusts anything it discovers for itself (e.g., the Outlook
>    worms that use "foo.txt.vbs" or whatever).

Well, luckily, in my case this is only used to choose the right parser/indexer.
Default fallback would be the no-indexer. But a well-designed API would give
the user a choice for the paranoia level.
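
To make the paranoia knob a bit more concrete, here is a rough sketch in C.
Everything in it is hypothetical: the enum, the three levels and the
function decide_type are just one way such an API could look, not
something an existing library provides.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical trust levels; not taken from any existing library. */
enum paranoia {
    PARANOIA_LOW,    /* accept whichever guess is available */
    PARANOIA_MEDIUM, /* accept only what the file content says */
    PARANOIA_HIGH    /* require name and content to agree */
};

/*
 * by_name and by_content are type guesses, or NULL for "unknown".
 * Returns the type to act on, or NULL, meaning: use the no-indexer.
 */
const char *decide_type(const char *by_name, const char *by_content,
                        enum paranoia level)
{
    switch (level) {
    case PARANOIA_LOW:
        return by_content ? by_content : by_name;
    case PARANOIA_MEDIUM:
        return by_content;
    case PARANOIA_HIGH:
        if (by_name && by_content && strcmp(by_name, by_content) == 0)
            return by_content;
        return NULL;
    }
    return NULL;
}
```

With documents that carry no extension, by_name is always NULL, so
PARANOIA_HIGH would index nothing; for my case PARANOIA_MEDIUM is the
level that makes sense.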

> The ideal world would have a metadata attribute for each file with the
> MIME type string in it.  I believe that BeFS had exactly this.  Macs
> have long had "application" and "type" fields, which were close but
> not quite there (meaning that, depending on which application created
> it, two GIFs might have different "application" values even though
> they were identical formats).  These fields were also limited to 4
> bytes each...

Hmm, Mac creator/filetype were both 32 bits, as far as I remember --
that made 4 characters for each ...

> Interestingly enough, it looks like Extended Attributes will be
> showing up in most Linux FSs in the next stable series.  This could
> have interesting possibilities.

Strange: Apple has recently been trying to convince everyone that file
extensions aren't that bad. Poor marketing guys: after running around for
two decades telling everyone that resources (read: metadata) in the
filesystem are the coolest thing since sliced bread, they now have to
market these funny '.doc' extensions :-)

> Either way, good luck.  I'd be curious to hear how this project works
> out.

Me too!

 thx ralf

> 
> t.