[lug] Duplicate Files (With a catch)

Tkil tkil at scrye.com
Fri Aug 25 17:58:33 MDT 2000


>>>>> "tu" == theta user <thetateht at hotmail.com> writes:

tu> I have a common directory where multiple (10-15) users all store
tu> their files.  There are many duplicate files with potentially
tu> different names and in different directories. Also I want to
tu> delete partial copies. 

tu> I have been able to use the fdupes and finddups packages to take
tu> care of the first problem.

never heard of ``fdupes'' or ``finddups''; i'll have to investigate them.

my own version of this situation had an incoming directory, where new
files showed up.  i would calculate the MD5 checksum of each file and
look it up in a list of ``already seen'' files; if it had already been
seen, i'd delete the new file.  if not, i'd add that checksum to the
list and move the file to an ``archived'' directory.  i used the MD5
and GDBM_File modules with perl; if this would be helpful to you, just
let me know and i will try to package up the stuff i've done for
public consumption (really just one script, but i have some supporting
scripts as well.)
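
for the curious, here's a minimal sketch of that loop, using
Digest::MD5 (the current name of the MD5 module) and GDBM_File; the
directory names and database path are just placeholders, not what my
script actually uses:

    use strict;
    use Digest::MD5;
    use GDBM_File;
    use File::Copy qw(move);

    # persistent hash of checksums we have already seen
    tie my %seen, 'GDBM_File', '/var/tmp/seen.gdbm', &GDBM_WRCREAT, 0640
        or die "can't tie gdbm file: $!";

    for my $file (glob '/var/tmp/incoming/*') {
        next unless -f $file;
        open my $fh, '<', $file or die "$file: $!";
        binmode $fh;
        my $sum = Digest::MD5->new->addfile($fh)->hexdigest;
        close $fh;
        if (exists $seen{$sum}) {
            unlink $file or die "unlink $file: $!";   # duplicate: toss it
        } else {
            $seen{$sum} = $file;                      # remember checksum
            move($file, '/var/tmp/archived/')
                or die "move $file: $!";
        }
    }
    untie %seen;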

as for ``partial files'', you will need to specify this in a little
more detail.  will a partial file only ever be a prefix of an existing
file?  or any arbitrary subset?  i suspect you can do some of this
with ``diff'', but you will have to do something more clever if you
want to avoid quadratic comparisons.  at least one trick is to
tabulate counts of sets of three or four characters (at every offset),
and then use that to try to identify ``similar'' files.  this seems
like overkill for this situation, however...
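
if it really is just the prefix case, a byte-for-byte prefix check is
cheap.  a rough sketch (the function name is mine, not from either
package):

    use strict;

    # true if the contents of $short are a byte-for-byte prefix
    # of $long.  (is_prefix_of is just an illustrative name.)
    sub is_prefix_of {
        my ($short, $long) = @_;
        return 0 if -s $short > -s $long;
        open my $sf, '<', $short or die "$short: $!";
        open my $lf, '<', $long  or die "$long: $!";
        binmode $sf;
        binmode $lf;
        my ($sbuf, $lbuf);
        while (my $n = read $sf, $sbuf, 65536) {
            read $lf, $lbuf, $n;
            return 0 unless $sbuf eq $lbuf;
        }
        return 1;
    }

you'd still want to group candidates somehow first (e.g. by a
checksum of the first block) so you aren't comparing every pair.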

t.
