Using md5 and Locate to Find Duplicate Files

h_wiedey

Nov 17, 2022

CPOL

This tip shows how md5 and locate can be used to find duplicate files.

Introduction

This is a follow-up on the usage of the locate Unix command. In my previous article, Using Locate Databases on MacOS Unix, I explained the internal workings of name databases and the locate command used to access them. This tip shows how to use them to identify duplicate files.

The examples below, like those in the previous article, are based on macOS High Sierra. I have noticed that newer macOS versions ship a newer locate command.

Background

The background to this article is that I recently found an old compact flash card. Years ago, I had backed up files from a computer's hard drive onto it before formatting the drive and selling it along with the computer. I was searching for a quick way to check whether there were files on this compact flash card that I had forgotten to copy to my new computer.

Using the Code

As explored in my previous article mentioned in the Introduction section above, the find command lies at the heart of the locate.updatedb command that is used to populate the names database which is then queried by the locate command.

Out of the box, the find command prints the file name with its full path, which is then stored in the names database. Interestingly, this output can be modified, so additional information can be stored in the names database and then searched for with locate.
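As a small illustration of how the find output can change, the sketch below prints one "fingerprint path" line per file under a directory, the same shape that md5 -r produces. The function names md5_line and list_md5 are made up for this example, and the md5sum fallback is an assumption for systems that lack the macOS md5 command. The modified script itself uses find -exec md5 -r rather than a loop.

```shell
# md5_line FILE: print "<md5> <path>", the shape of `md5 -r` on macOS.
# Assumption: where md5 is missing, md5sum is available; its two-space
# separator is rewritten to a single blank.
md5_line() {
    if command -v md5 >/dev/null 2>&1; then
        md5 -r "$1"
    else
        md5sum "$1" | sed 's/^\([0-9a-f]\{32\}\)  /\1 /'
    fi
}

# list_md5 DIR: one "<md5> <path>" line per regular file below DIR.
# Note: the read loop does not cope with newlines in file names.
list_md5() {
    find "$1" -type f -print | while IFS= read -r file; do
        md5_line "$file"
    done
}
```

Calling list_md5 "$HOME/Documents" should print lines in the same layout as the database listings shown later in this tip.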

To demonstrate this, I modified the enhanced version of the locate.updatedb command from my previous article and extended the output by the md5 fingerprint:

$ cat locate.updatedb.md5 
#!/bin/sh
#
# Copyright (c) September 1995 Wolfram Schneider <wosch@freebsd.org>. Berlin.
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions
# are met:
# 1. Redistributions of source code must retain the above copyright
#    notice, this list of conditions and the following disclaimer.
# 2. Redistributions in binary form must reproduce the above copyright
#    notice, this list of conditions and the following disclaimer in the
#    documentation and/or other materials provided with the distribution.
#
# THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
# IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
# ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
# FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
# DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
# OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
# HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
# LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
# OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE.
#
# updatedb - update locate database for local mounted filesystems
#
# $FreeBSD: src/usr.bin/locate/locate/updatedb.sh,v 1.20 2005/11/12 12:45:08 grog Exp $
#
# Modified h_wiedey Sept/Oct 2020 for CodeProject Articles

: ${LOCATE_CONFIG="/etc/locate.rc"}
if [ -f "$LOCATE_CONFIG" -a -r "$LOCATE_CONFIG" ]; then
       . $LOCATE_CONFIG
fi
: ${FCODES:=/var/db/locate.database}    # the database

if [ "$(id -u)" = "0" ]; then
	rc=0
	export TMP_FCODES=`sudo -u nobody mktemp -t updatedb`
	chown nobody $TMP_FCODES
	tmpdb=`su -fm nobody -c "$0"` || rc=1
	if [ $rc = 0 ]; then
		install -m 0444 -o nobody -g wheel $TMP_FCODES $FCODES
	fi
	rm $TMP_FCODES
	exit $rc
fi

# The directory containing locate subprograms
: ${LIBEXECDIR:=/usr/libexec}; export LIBEXECDIR
: ${TMPDIR:=/tmp}; export TMPDIR
if ! TMPDIR=`mktemp -d $TMPDIR/locateXXXXXXXXXX`; then
	exit 1
fi

PATH=$LIBEXECDIR:/bin:/usr/bin:$PATH; export PATH

# 6497475
set -o noglob

: ${mklocatedb:=locate.mklocatedb}      # make locate database program
: ${TMP_FCODES=$FCODES}                 # the database
: ${SEARCHPATHS:="/"}                   # directories to be put in the database
: ${PRUNEPATHS:="/private/tmp /private/var/folders /private/var/tmp */Backups.backupdb"} # unwanted directories
: ${FILESYSTEMS:="hfs ufs apfs"}        # allowed filesystems
: ${find:=find}

case X"$SEARCHPATHS" in 
	X) echo "$0: empty variable SEARCHPATHS"; exit 1;; esac
case X"$FILESYSTEMS" in 
	X) echo "$0: empty variable FILESYSTEMS"; exit 1;; esac

# Make a list a paths to exclude in the locate run
excludes="! (" or=""
for fstype in $FILESYSTEMS
do
       excludes="$excludes $or -fstype $fstype"
       or="-or"
done
excludes="$excludes ) -prune"

case X"$PRUNEPATHS" in
	X) ;;
	*) for path in $PRUNEPATHS
           do 
		excludes="$excludes -or -path $path -prune"
	   done;;
esac

tmp=$TMPDIR/_updatedb$$
trap 'rm -f $tmp; rmdir $TMPDIR; exit' 0 1 2 3 5 10 15
# search locally
# echo "$find $SEARCHPATHS $excludes -or -exec md5 -r {} \;" && exit
if $find -s $SEARCHPATHS $excludes -or -exec md5 -r {} \; 2> /dev/null |
        $mklocatedb -presort > $tmp
then
	case X"`$find $tmp -size -257c -print`" in
		X) cat $tmp > $TMP_FCODES;;
		*) echo "updatedb: locate database $tmp is empty"
		   exit 1
	esac
fi


Using diff, you can check the modifications. The relevant change is the hunk 92,93c95,96, where -exec md5 -r {} \; is used instead of -print.

$ diff /usr/libexec/locate.updatedb locate.updatedb.md5
29a30,37
> #
> # Modified h_wiedey Sept/Oct 2020 for CodeProject Articles
> 
> : ${LOCATE_CONFIG="/etc/locate.rc"}
> if [ -f "$LOCATE_CONFIG" -a -r "$LOCATE_CONFIG" ]; then
>        . $LOCATE_CONFIG
> fi
> : ${FCODES:=/var/db/locate.database}    # the database
33,34c41,42
< 	export FCODES=`sudo -u nobody mktemp -t updatedb`
< 	chown nobody $FCODES
---
> 	export TMP_FCODES=`sudo -u nobody mktemp -t updatedb`
> 	chown nobody $TMP_FCODES
37c45
< 		install -m 0444 -o nobody -g wheel $FCODES /var/db/locate.database
---
> 		install -m 0444 -o nobody -g wheel $TMP_FCODES $FCODES
39c47
< 	rm $FCODES
---
> 	rm $TMP_FCODES
42,45d49
< : ${LOCATE_CONFIG="/etc/locate.rc"}
< if [ -f "$LOCATE_CONFIG" -a -r "$LOCATE_CONFIG" ]; then
<        . $LOCATE_CONFIG
< fi
60c64
< : ${FCODES:=/var/db/locate.database}    # the database
---
> : ${TMP_FCODES=$FCODES}                 # the database
90d93
< 		
92,93c95,96
< # echo $find $SEARCHPATHS $excludes -or -print && exit
< if $find -s $SEARCHPATHS $excludes -or -print 2>/dev/null |
---
> # echo "$find $SEARCHPATHS $excludes -or -exec md5 -r {} \;" && exit
> if $find -s $SEARCHPATHS $excludes -or -exec md5 -r {} \; 2> /dev/null |
97c100
< 		X) cat $tmp > $FCODES;;
---
> 		X) cat $tmp > $TMP_FCODES;;

Along with the locate.updatedb.md5 script, I was using the two configuration files below (locate.Kingston.rc is the one for the compact flash drive mounted on /Volumes/Kingston and locate.Documents.rc is the one for my document repository $HOME/Documents on my Mac):

$ cat locate.Kingston.rc
#
# Configuration for user home directory search
#
# temp directory
TMPDIR="/tmp"

# the actual database
FCODES="locate.Kingston.database"

# directories to be put in the database
# Make sure that there is no space in the directory names
SEARCHPATHS="/Volumes/Kingston"

# directories unwanted in output
PRUNEPATHS="/Volumes/Kingston/.Spotlight-V100 
            /Volumes/Kingston/.Trashes /Volumes/Kingston/Ignore"

# filesystems allowed. Beware: a non-listed filesystem will be pruned
# and if the SEARCHPATHS starts in such a filesystem locate will build
# an empty database.
#
# be careful if you add 'nfs'
FILESYSTEMS="msdos"

$ cat locate.Documents.rc
#
# Configuration for user home directory search
#
# temp directory
TMPDIR="/tmp"

# the actual database
FCODES="locate.Documents.database"

# directories to be put in the database
SEARCHPATHS="$HOME/Documents"

# directories unwanted in output
# PRUNEPATHS="/tmp /var/tmp /Users /Volumes"

# filesystems allowed. Beware: a non-listed filesystem will be pruned
# and if the SEARCHPATHS starts in such a filesystem locate will build
# an empty database.
#
# be careful if you add 'nfs'
FILESYSTEMS="hfs ufs apfs"

The name databases for locate can then be created by running the enhanced locate.updatedb.md5 once per configuration file, selected through the LOCATE_CONFIG environment variable.

$ export LOCATE_CONFIG="./locate.Documents.rc";./locate.updatedb.md5
$ export LOCATE_CONFIG="./locate.Kingston.rc";./locate.updatedb.md5
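One thing to keep in mind: the script skips a LOCATE_CONFIG file it cannot read and carries on with its defaults (SEARCHPATHS="/" and FCODES=/var/db/locate.database). A small wrapper can guard against a mistyped file name. The function name build_db is hypothetical, and the sketch assumes the script can be started through sh, as its #!/bin/sh line suggests.

```shell
# build_db RCFILE [SCRIPT]: run the update script with RCFILE as its
# configuration, but refuse to start when RCFILE cannot be read.
# build_db is a hypothetical helper, not part of locate.
build_db() {
    rcfile=$1
    script=${2:-./locate.updatedb.md5}
    if [ ! -f "$rcfile" ] || [ ! -r "$rcfile" ]; then
        echo "build_db: cannot read $rcfile" >&2
        return 1
    fi
    # The assignment prefix exports LOCATE_CONFIG to this one command only.
    LOCATE_CONFIG=$rcfile sh "$script"
}
```

With this in place, build_db ./locate.Documents.rc && build_db ./locate.Kingston.rc would build both databases and stop at the first problem.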

If no error message is printed, the name databases were created successfully and you can browse their contents and structure as below:

$ locate -d locate.Documents.database "*"|less

d4c5320cf104b5629d490976c1d8059e /Users/(...)/Documents/program.jpg
cc2f48fbb82a8840a9fcb9ab0d94d0b4 /Users/(...)/Documents/program.vthought
38279e78475f437c5ebdb508b9524c72 /Users/(...)/Documents/setArray.BAK.vthought
58a2e7ffd785b72868bf28bbcacb49c3 /Users/(...)/Documents/setArray.vthought
4d42423c69db19795f8169324ecf579e /Users/(...)/Documents/structure.jpg
c63b631236fb415838cb4b57ddb1ed92 /Users/(...)/Documents/structure.vthought

$ locate -d locate.Kingston.database "*"|less
...
a3d301ed3cc4442ff2df681a9f6dd0e1 /Volumes/Kingston/Addresses/Mac/Cards2ascii.vcf
ea4177dbe0aacdd4f0b962cc3a6ffe57 /Volumes/Kingston/Addresses/Mac/vCards.vcf
bf10d18437911231b07c172729e5516c /Volumes/Kingston/Addresses/Mac/vCards2.vcf
c00ca5c3b168936ebdb4d81172e11071 /Volumes/Kingston/Addresses/Mac/vCards2utf.vcf
a3d301ed3cc4442ff2df681a9f6dd0e1 /Volumes/Kingston/Addresses/Mac/vCards2utf16.vcf
a3d301ed3cc4442ff2df681a9f6dd0e1 /Volumes/Kingston/Addresses/Mac/vCards2utf8.vcf
162b1606424bca82fa234c726e94eeab /Volumes/Kingston/Addresses/Mac/vCards2w.vcf
2aace6876876d77691fb23d69319bab0 /Volumes/Kingston/Addresses/Mac/vCards3w.vcf
...

As you can see, the names databases have the md5 fingerprint separated by a space character from the full filename.
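Because only the first blank separates the two parts, an entry can be split with plain shell parameter expansion even when the path itself contains blanks. The helper names entry_md5 and entry_path are invented for this sketch.

```shell
# entry_md5 LINE: everything before the first blank (the fingerprint).
entry_md5() {
    printf '%s\n' "${1%% *}"
}

# entry_path LINE: everything after the first blank (the full path,
# including any blanks it contains).
entry_path() {
    printf '%s\n' "${1#* }"
}
```

Both functions take one complete database line as their only argument, for example a line read from the output of locate -d locate.Kingston.database "*".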

Intermediate Result

So let's see what we have achieved so far. We now have two name databases, one containing the filenames and their md5 fingerprints for the files in the Documents folder on the local hard drive (locate.Documents.database) and one for the files on the compact flash card (locate.Kingston.database).

Either part of an entry (md5 fingerprint or file name) can be queried, as shown below:

$ locate -d locate.Kingston.database bf10d18437911231b07c172729e5516c
bf10d18437911231b07c172729e5516c /Volumes/Kingston/Addresses/Mac/vCards2.vcf
$ locate -d locate.Kingston.database vCards2.vcf
effbec4ddbd7fd21dac46e11cda782e8 /Volumes/Kingston/Addresses/Mac/._vCards2.vcf

The advantage of using the md5 value instead of the filename is that a file can be identified by its content, no matter where it is stored in the directory structure or what its current name is.

However, md5 values are not guaranteed to be unique. It is unlikely but still possible that two files produce the same md5 fingerprint even though they are not identical. To be sure that two files are identical, you need to run cmp on them. A systematic byte-by-byte verification is, however, not covered in this article.
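For a single pair of files, the cmp check mentioned above can be sketched as follows. The function name same_content is made up for this example.

```shell
# same_content FILE1 FILE2: succeed only if both files are byte-for-byte
# identical. cmp -s prints nothing and reports the result as exit status.
same_content() {
    cmp -s "$1" "$2"
}

# Usage (paths are placeholders):
#   if same_content "$file_on_card" "$file_on_disk"; then
#       echo "identical"
#   fi
```

Running this only on pairs that already share a fingerprint keeps the number of full comparisons small.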

You can now use the two databases to check for files that are only on the compact flash card as shown in the next section.

Finding Files Already Existing on the Target

You may now use the standard Unix tool comm to extract the md5 values that are only in the compact flash names database (locate.Kingston.database). The sort -u step ensures that a fingerprint occurring several times in one database is compared only once:

$ comm -23 <(locate -d locate.Kingston.database "*" |cut -f 1 -d " "|sort -u) \
           <(locate -d locate.Documents.database "*" |cut -f 1 -d " "|sort -u) > KingstonOnly.md5.srt
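The way comm -23 works is easiest to see on two tiny lists. The sketch below is a hypothetical helper, only_in_first, that takes two files with one fingerprint per line. It uses temporary files instead of the <(..) notation so that it also runs in a plain POSIX sh, and it removes repeated lines first so that a fingerprint listed several times is compared only once.

```shell
# only_in_first LIST1 LIST2: print the lines that occur in LIST1 but not
# in LIST2. comm needs sorted input; -2 and -3 suppress the columns for
# "only in LIST2" and "in both", leaving "only in LIST1".
only_in_first() {
    tmpdir=$(mktemp -d) || return 1
    LC_ALL=C sort -u "$1" > "$tmpdir/first"
    LC_ALL=C sort -u "$2" > "$tmpdir/second"
    LC_ALL=C comm -23 "$tmpdir/first" "$tmpdir/second"
    rm -rf "$tmpdir"
}
```

LC_ALL=C is set so that sort and comm agree on the ordering regardless of the locale in use.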

Using the standard Unix tool join, you can expand the md5 value to the full filename:

$ join <(cat KingstonOnly.md5.srt) \
       <(locate -d locate.Kingston.database "*"|sort)|grep vCards2.vcf
bf10d18437911231b07c172729e5516c /Volumes/Kingston/Addresses/Mac/vCards2.vcf

In the example above, a grep is done on a particular file. If you want to see the complete output, you may pipe to less instead of grep or redirect the output to some file.

You need to be careful, however, with filenames that contain two or more consecutive blanks: join uses blanks as its default delimiter and collapses consecutive ones in its output. It is therefore better to use the slash (/) as delimiter, which cannot be part of a filename (at least under Unix). To do so, include awk in your command to append a slash to each line of the KingstonOnly.md5.srt file. The output then looks like this:

$ join -t / <(cat KingstonOnly.md5.srt|awk '{print $1 " / "}') \
            <(locate -d locate.Kingston.database "*"|sort)|grep vCards2.vcf
bf10d18437911231b07c172729e5516c / /Volumes/Kingston/Addresses/Mac/vCards2.vcf
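To see the effect of the slash delimiter in isolation, here is a hedged sketch that works on plain text files rather than on a locate database. The function name expand_md5 and the file layout (one fingerprint per line in the first file, "fingerprint path" entries in the second) are assumptions for the example. It also strips the fingerprint again with cut, so that only the path remains.

```shell
# expand_md5 HASHES ENTRIES: print the full path of every entry in ENTRIES
# whose fingerprint is listed in HASHES. With "/" as delimiter the join
# field is "<md5> " (fingerprint plus blank), so blanks inside the path
# are never treated as separators.
expand_md5() {
    tmpdir=$(mktemp -d) || return 1
    awk '{print $1 " /"}' "$1" | LC_ALL=C sort -u > "$tmpdir/keys"
    LC_ALL=C sort "$2" > "$tmpdir/entries"
    # Joined lines look like "<md5> //path"; the path starts at column 35.
    LC_ALL=C join -t / "$tmpdir/keys" "$tmpdir/entries" | cut -c 35-
    rm -rf "$tmpdir"
}
```

A path with two consecutive blanks comes out of this function unchanged, which is the property the default blank delimiter cannot guarantee.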

From here, it is up to you to decide what to do with the files that were identified as being only on the compact flash card. The easiest option is to simply tar them into an archive on your local hard drive and leave them for later investigation:

$ join -t / <(cat KingstonOnly.md5.srt|awk '{print $1 " /"}') \
            <(locate -d locate.Kingston.database "*"|sort)|cut -c 35-|tar -cf KingstonOnly.tar -T -

The additional cut in the example above is necessary to remove the md5 fingerprint (32 characters), the blank, and the extra slash introduced by the join before piping the filenames to tar.

Once the archive has been written successfully, the files on the compact flash card are redundant and the card can be formatted and used for other purposes.

Finding Duplicate Files Within a Drive

People with some background in relational database management systems might already have become concerned at the point where the join command was used to expand the md5 value to the full filename: if there is a duplicate file on the compact flash card, the join no longer produces a 1:1 match.

This leads to another nice feature of name databases with md5 values: you can easily identify duplicate files within a database using the uniq command.

$ join <(locate -d locate.Kingston.database "*"|cut -f 1 -d " "|sort|uniq -d) \
       <(locate -d locate.Kingston.database "*"|sort)|less
$ join <(locate -d locate.Documents.database "*"|cut -f 1 -d " "|sort|uniq -d) \
       <(locate -d locate.Documents.database "*"|sort)|less

(The above example ignores the case of filenames with two or more consecutive blanks. In that case, you need to use the slash as delimiter in the join statement, as shown in the examples in the previous section.)
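As an alternative that sidesteps the delimiter question altogether, duplicates can also be picked out with a single awk pass over the sorted entries. This is not the uniq and join pipeline shown above but a swapped-in technique. The function name dup_entries is hypothetical, and the input is assumed to be a text file of "fingerprint path" lines, for example the saved output of locate -d locate.Kingston.database "*".

```shell
# dup_entries ENTRIES: print every entry whose fingerprint occurs more
# than once. Lines are printed unchanged, so blanks in paths survive.
dup_entries() {
    LC_ALL=C sort "$1" | awk '
        BEGIN { prev = "" }
        NF == 0 { next }
        {
            hash = $1 ""                     # force string comparison
            if (hash == prev) {
                if (!shown) print prevline   # first member of the group
                print
                shown = 1
            } else {
                shown = 0
            }
            prev = hash
            prevline = $0
        }
    '
}
```

Since the input is sorted, entries that share a fingerprint are adjacent, and comparing each line with its predecessor is enough to find every group.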

Points of Interest

The <(..) <(..) notation that feeds two pipes into one command (process substitution, available in shells such as bash and zsh) is taken from here: https://unix.stackexchange.com/questions/31653/two-pipes-to-one-command/31654

History

October 2020 - November 2022: Initial version