Easily finding duplicate files
A few days ago my mother asked me whether there was an easy way to find duplicate photos on her computer. I thought about it and decided that the easiest approach is to compare a hash of each file's contents, which works fine as long as the images have not been modified. Then came the implementation, and since PHP is the language I know best, I thought: why not use it? I know that PHP doesn’t have much of a reputation as a command line scripting language, but bear with me.
The first step is to enumerate all the files we want to compare. For this we’ll need two parameters:
- The path from which to recursively get all files
- The pattern (in our case a Perl-compatible regular expression) that a file name must match to be included; if it is omitted, all files are included
if(count($argv) < 2)
    die("Usage:\n ".$argv[0]." path [regexp]\n");
$path = $argv[1];
$pattern = count($argv) < 3 ? "/.*/" : $argv[2];
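Since the pattern comes straight from the command line, it may be worth checking that it compiles before walking the whole directory tree. The following is only a minimal sketch: is_valid_pattern is a hypothetical helper name that is not part of the original script, and it relies on preg_match returning false when the pattern cannot be compiled.

```php
<?php
// Hypothetical helper (not part of the original script): returns true
// when $pattern is a regular expression that PCRE can compile.
function is_valid_pattern(string $pattern): bool
{
    // preg_match() returns false on error, for example a malformed pattern;
    // the @ suppresses the warning PHP would otherwise print.
    return @preg_match($pattern, '') !== false;
}
```

The script could call it right after $pattern is set and stop with the usage message when it returns false.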
The next step is to actually enumerate all files that are to be compared:
$files = preg_ls($path, true, $pattern);
function preg_ls ($path=".", $rec=false, $pat="/.*/") {
    // append the S modifier to slash-delimited patterns
    $pat=preg_replace ("|(/.*/[^S]*)|s", "\\1S", $pat);
    // strip trailing slashes from the path
    while (substr ($path,-1,1) =="/") $path=substr ($path,0,-1);
    if (!is_dir ($path) ) $path=dirname ($path);
    if ($rec!==true) $rec=false;
    $d=dir ($path);
    $ret=Array ();
    while (false!== ($e=$d->read () ) ) {
        if ( ($e==".") || ($e=="..") ) continue;
        // descend into subdirectories when recursion is enabled
        if ($rec && is_dir ($path."/".$e) ) {
            $ret=array_merge ($ret,preg_ls($path."/".$e,$rec,$pat));
            continue;
        }
        if (!preg_match ($pat,$e) ) continue;
        $ret[]=$path."/".$e;
    }
    return (empty ($ret) && preg_match ($pat,basename($path))) ? Array ($path."/") : $ret;
}
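As an alternative to a hand-written recursive function, the same listing can be sketched with PHP's built-in SPL iterators. This is an assumption-laden sketch, not the script's actual code: list_files is a hypothetical name, it matches the pattern against the file name only (as preg_ls does), and it returns regular files only.

```php
<?php
// Hypothetical alternative to preg_ls(): list the regular files below $path
// whose file name matches $pattern.
function list_files(string $path, string $pattern = '/.*/'): array
{
    $files = [];
    // SKIP_DOTS leaves out the "." and ".." entries.
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($path, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $entry) {
        // $entry is an SplFileInfo object; match on the file name only.
        if ($entry->isFile() && preg_match($pattern, $entry->getFilename())) {
            $files[] = $entry->getPathname();
        }
    }
    // Iteration order depends on the filesystem, so sort for stable output.
    sort($files);
    return $files;
}
```

Note that RecursiveDirectoryIterator throws an UnexpectedValueException when a directory cannot be opened, so a real script would probably want to catch that.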
The preg_ls function is borrowed from php.net. Next we calculate and collect the hashes, and at the same time check for collisions:
$hashes = array();
$duplicates = array();
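foreach($files as $file){
    // preg_ls can return a directory path (see its last line), so skip
    // anything that is not a regular file
    if(!is_file($file))
        continue;
    $hash = sha1_file($file);
    if(array_key_exists($hash, $hashes))
        $duplicates[] = array($hashes[$hash], $file);
    else
        $hashes[$hash] = $file;
}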
What this does is simply calculate the SHA-1 hash of each file and check whether we have encountered that hash before. If we already know the hash, the file is a duplicate and is remembered for later; if not, it’s a new file, so we add it to our memory. All that is missing now is the reporting functionality that reports the duplicates we found back to the user.
print("Duplicates:\n");
foreach($duplicates as $duplicate){
    print(" ".$duplicate[1]." is a duplicate of ".$duplicate[0]."\n");
}
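The hashing and reporting steps above work on pairs of files. When a photo exists in three or more copies it can be handier to see the whole set at once. Here is a minimal sketch of that variation, assuming a hypothetical helper named group_duplicates that is not part of the original script; it groups paths by their SHA-1 hash and keeps only the groups with more than one file.

```php
<?php
// Hypothetical helper: group file paths by the SHA-1 hash of their contents
// and keep only the groups that contain more than one file.
function group_duplicates(array $files): array
{
    $groups = [];
    foreach ($files as $file) {
        $hash = is_file($file) ? sha1_file($file) : false;
        if ($hash === false) {
            continue; // not a regular file, or it could not be read
        }
        $groups[$hash][] = $file;
    }
    // array_filter() preserves the keys, so the result is still indexed by hash.
    return array_filter($groups, function (array $group): bool {
        return count($group) > 1;
    });
}
```

Each value in the returned array is then a list of files with identical contents, which can be printed one group at a time.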
So there you go: a simple script that performs reasonably well and has found most of the duplicate pictures in my mother’s case. If you just want the script, get it here, then rename it and change the access rights:
$ mv ./duplicates.phps ./duplicates.php
$ chmod +x ./duplicates.php
And you’re ready to go:
$ ./duplicates.php ~
