Easily finding duplicate files

A few days ago my mother asked me whether there was an easy way to find duplicate photos on her computer. I thought about it and decided that the easiest approach is to compare a hash of each file’s contents, which works fine as long as the images have not been modified. Then came the implementation: since PHP is the language I know best, why not use it? I know that PHP doesn’t have much of a reputation as a command line scripting language, but bear with me.
The first step is to enumerate all the files we want to compare. For this we’ll need two parameters:
  • The path from which to recursively get all files
  • The pattern (in our case a Perl-compatible regular expression) that the file names must match
So here goes the first part of the script:
if(count($argv) < 2)
	die("Usage:\n  ".$argv[0]." path [regexp]\n");
$path = $argv[1];
$pattern = count($argv) < 3 ? "/.*/" : $argv[2];
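To give an idea of what a useful second argument might look like, here is a minimal sketch of filtering file names with a pattern. The helper filter_names and the sample names are made up for illustration and are not part of the script; it only mirrors the preg_match test that is applied to each file name.

```php
<?php
// Hypothetical helper (not part of the original script): keeps only the
// names that match the given Perl-compatible regular expression.
function filter_names(array $names, string $pattern): array
{
    return array_values(array_filter(
        $names,
        function ($name) use ($pattern) {
            return preg_match($pattern, $name) === 1;
        }
    ));
}

// Made-up file names for the demonstration.
$names = array('holiday.JPG', 'notes.txt', 'cat.jpeg', 'archive.jpg.bak');

// Case-insensitive match on common photo extensions, anchored at the end
// of the name so that "archive.jpg.bak" is not picked up.
$photos = filter_names($names, '/\.(jpe?g|png)$/i');

print(implode("\n", $photos) . "\n");
```

On the command line such a pattern would have to be quoted so the shell leaves it alone, for example '/\.(jpe?g|png)$/i' in single quotes.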
The next step is to actually enumerate all files that are to be compared:
$files = preg_ls($path, true, $pattern);
 
function preg_ls ($path=".", $rec=false, $pat="/.*/") {
    $pat=preg_replace ("|(/.*/[^S]*)|s", "\\1S", $pat);
    while (substr ($path,-1,1) =="/") $path=substr ($path,0,-1);
    if (!is_dir ($path) ) $path=dirname ($path);
    if ($rec!==true) $rec=false;
    $d=dir ($path);
    $ret=Array ();
    while (false!== ($e=$d->read () ) ) {
        if ( ($e==".") || ($e=="..") ) continue;
        if ($rec && is_dir ($path."/".$e) ) {
            $ret=array_merge ($ret,preg_ls($path."/".$e,$rec,$pat));
            continue;
        }
        if (!preg_match ($pat,$e) ) continue;
        $ret[]=$path."/".$e;
    }
    return (empty ($ret) && preg_match ($pat,basename($path))) ? Array ($path."/") : $ret;
}
The preg_ls function is borrowed from php.net. Next we calculate and collect the hashes, and at the same time check for collisions:
$hashes = array();
$duplicates = array();
 
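foreach($files as $file){
	// preg_ls can also return a directory (with a trailing slash) when
	// nothing inside it matches, so skip anything that is not a regular file
	if(!is_file($file))
		continue;
	$hash = sha1_file($file);
	if(array_key_exists($hash, $hashes))
		$duplicates[] = array($hashes[$hash], $file);
	else
		$hashes[$hash] = $file;
}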
What this does is simply calculate the SHA-1 hash of each file and check whether we have encountered it before. If we already know the hash, the file is a duplicate and is remembered for later; if not, it’s a new file, so we add it to our memory.
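To see the idea in isolation, here is a small sketch that wraps the same logic in a function and runs it on three throwaway files. The function name find_duplicates, the temporary directory and the file names are assumptions made for this demonstration only. It also shows the limitation mentioned at the start: a copy that differs by even a single byte gets a different hash and is not reported.

```php
<?php
// Hypothetical wrapper around the hashing loop: returns a list of
// array(first file seen, later file with the same SHA-1 hash) pairs.
function find_duplicates(array $files): array
{
    $hashes = array();
    $duplicates = array();
    foreach ($files as $file) {
        $hash = sha1_file($file);
        if ($hash === false) {
            continue; // the file could not be read, so skip it
        }
        if (array_key_exists($hash, $hashes)) {
            $duplicates[] = array($hashes[$hash], $file);
        } else {
            $hashes[$hash] = $file;
        }
    }
    return $duplicates;
}

// Create three throwaway files: two identical, one differing by one byte.
$dir = sys_get_temp_dir() . '/dupdemo_' . uniqid();
mkdir($dir);
file_put_contents("$dir/a.jpg", "same bytes");
file_put_contents("$dir/b.jpg", "same bytes");
file_put_contents("$dir/c.jpg", "same bytes!");

$found = find_duplicates(array("$dir/a.jpg", "$dir/b.jpg", "$dir/c.jpg"));
foreach ($found as $pair) {
    print(basename($pair[1]) . " is a duplicate of " . basename($pair[0]) . "\n");
}

// Remove the throwaway files again.
array_map('unlink', glob("$dir/*"));
rmdir($dir);
```

Only b.jpg is reported, as a duplicate of a.jpg; the slightly different c.jpg counts as a new file.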
All that is missing now is reporting the duplicates we found back to the user:
print("Duplicates:\n");
foreach($duplicates as $duplicate){
	print("    ".$duplicate[1]." is a duplicate of ".$duplicate[0]."\n");
}
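When a file exists in three or more copies, the report above names the first copy once per duplicate. As a possible variation, the pairs could be grouped by their first element before printing. This is only a sketch: group_duplicates is a hypothetical helper and the sample paths are invented.

```php
<?php
// Hypothetical helper: turns array(original, duplicate) pairs into a map
// from each original to the list of its duplicates.
function group_duplicates(array $duplicates): array
{
    $groups = array();
    foreach ($duplicates as $duplicate) {
        $groups[$duplicate[0]][] = $duplicate[1];
    }
    return $groups;
}

// Sample pairs in the same shape the script collects; the paths are made up.
$duplicates = array(
    array('/photos/a.jpg', '/photos/copy/a.jpg'),
    array('/photos/a.jpg', '/backup/a.jpg'),
    array('/photos/b.jpg', '/backup/b.jpg'),
);

foreach (group_duplicates($duplicates) as $original => $copies) {
    print($original . " has " . count($copies) . " duplicate(s):\n");
    foreach ($copies as $copy) {
        print("    " . $copy . "\n");
    }
}
```

This relies on the first element of every pair being the first file seen with a given hash, which is how the collection loop fills the list.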
So there you go: a simple script that performs reasonably well and found most of the duplicate pictures in my mother’s case :-)
If you just want the script, get it here, rename it and make it executable:
$ mv ./duplicates.phps ./duplicates.php
$ chmod +x ./duplicates.php
And you’re ready to go:
$ ./duplicates.php ~