Monday, 22 October 2012

App::Phlo: Reclaim space used by duplicate files

Once again I have to handle the same situation: no disk space left, and directories full of duplicates.

This is not the first time I have had to do it (last time it was to remove all the duplicate .mp3 files with different names after I merged my various 'Music' directories from different boxes and disks), so I already have a small script waiting in my repos.

But this time I chose to explore another way: instead of removing the duplicate files, I tried replacing them with hardlinks.

Using hardlinks is not always applicable, but my perl5/perlbrew directory seemed a good candidate (read-only duplicate data...).
And it was: after running my script on it, the size went from 765M to 670M (95M saved, about 12%), and all the test suites of the tested modules passed with all the Perl versions.
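
For readers who want to see the idea in code, here is a minimal sketch of hardlink substitution. It is written in Python for illustration and is not the App::Phlo code: the function names, the default SHA-1 digest and the `.phlo-tmp` suffix are all assumptions made for this example. It groups files by size, compares digests within each group, and replaces each duplicate with a hardlink to the first copy found.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, algo="sha1", chunk_size=1 << 16):
    """Return the hex digest of a file, read in chunks."""
    h = hashlib.new(algo)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root, algo="sha1"):
    """Replace duplicate files under root with hardlinks.

    Returns the number of bytes expected to be reclaimed.
    Hypothetical helper for illustration, not the App::Phlo API.
    """
    # Cheap first pass: only files of equal size can be duplicates.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            by_size[os.path.getsize(path)].append(path)

    saved = 0
    for size, paths in by_size.items():
        if size == 0 or len(paths) < 2:
            continue
        keepers = {}  # digest -> first path seen with that content
        for path in sorted(paths):
            keeper = keepers.setdefault(file_digest(path, algo), path)
            if keeper == path:
                continue
            k, p = os.stat(keeper), os.stat(path)
            if k.st_dev != p.st_dev:
                continue  # hardlinks cannot cross filesystems
            if k.st_ino == p.st_ino:
                continue  # already the same file
            # Link under a temporary name, then rename over the duplicate,
            # so the path never disappears if something fails midway.
            tmp = path + ".phlo-tmp"  # assumed suffix, may collide
            os.link(keeper, tmp)
            os.replace(tmp, path)
            if p.st_nlink == 1:
                saved += size  # the duplicate's data is now freed
    return saved
```

This is only safe on data that is never modified in place, since a write through one name changes every linked name, which is why a read-only tree is a good candidate.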

I first thought of releasing the script as a patch for perlbrew, but on reflection I realized that there is probably a need for a more generic tool.

That's why App::Phlo was created :-)

Not a killer module, but one that suits my needs and that will let me test some ideas (multiple digest algorithms, use with "unionfs-like" filesystems, Perl directory optimization...)

If you want to experiment with me, don't hesitate: ideas, patches and comments are always welcome...

5 comments:

Steven Haryanto said…

I made a similar utility a while ago; it's called App::UniqFiles. It only reports duplicate files, so you need to add the action yourself.

anonymous said…

fdupes, doubles, dupeguru, dupfinder, File::Find::Duplicates, "duplicate files finder"

Congratulations, 5 hours of programming saved you 5 minutes of research on the Web for prior art.

Arnaud Assad said…

@Steven Haryanto: I missed this one; I'll have a look when I have more time.
(Currently it seems to have an undeclared dependency, or I goofed somewhere.)

@Anonymous: It seems I was not clear enough about my objectives:

I was *not* trying to write a tool to handle (read: delete) duplicates.

I wanted to:

A) Explore a new way of optimizing space through hardlinking (especially for my perlbrew dirs)

B) Experiment with various things (automatic use of whichever digest algorithm is available)

C) Use Perl
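
As an illustration of point B, picking whichever digest algorithm is available could look roughly like the sketch below. It is Python rather than Perl, and the preference order and the `pick_digest` name are assumptions for this example, not what App::Phlo does.

```python
import hashlib

# Hypothetical preference order, strongest first.
PREFERRED = ("sha256", "sha1", "md5")


def pick_digest(preferred=PREFERRED):
    """Return the first algorithm name that hashlib can instantiate."""
    for name in preferred:
        try:
            hashlib.new(name)  # raises ValueError if unsupported here
        except ValueError:
            continue
        return name
    raise RuntimeError("no supported digest algorithm found")
```

Probing with `hashlib.new` rather than trusting a static list means an algorithm that is listed but disabled on a given system is skipped.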

None of the tools I looked at provided what I wanted.
I admit I considered using File::Find::Duplicates as a skeleton, but as I already had the File::Path recursion code from a previous experiment, I was quite reluctant to pay the dependency toll for a simple prototype.
(That, and the fact that File::Find::Duplicates uses file size and MD5 only.)
I might use it in the future, but allow me to evaluate the cost first.

And let me reassure you, it didn't take me hours to add the hardlinking code and option handling to existing recursive traversal code.

But if you prefer, I can also mention my Hubris and my Impatience as an excuse for my Laziness (I haven't searched long enough).

binaryman said…

Directory Report can replace duplicate files with links

Arnaud Assad said…

@Binaryman: Thanks, but no thanks!

Not free, closed source, Windows only (!!) so unlikely to be written in Perl...

Definitely not what I want for my *coding* experiments