You know, sometimes you want to quickly find duplicate files on UNIX and UNIX-like systems, and there are many tools for this: CLI tools, GUI tools, with all sorts of features.
But what if you’re, say, SSHing onto a Linux server and you just want to quickly find duplicate files with your usual onboard tools?
You can, with this simple command (after cd’ing into the directory you want to scan, or replacing the dot with the path):
find . ! -empty -type f -exec sha512sum {} + | sort | uniq -w128 -dD
As you probably know, find is the tool you use to, well, find things. In this case we’re telling it to find all non-empty files of type “file” (i.e. not directories). This would usually just return a list of files, but here we’re also telling find to run sha512sum on each file, which prints the file’s SHA-512 hash followed by its path.
We then pipe this list to sort, which sorts the lines alphabetically; since each line starts with the hash, identical files end up right next to each other.
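If you’re curious what the pipeline looks like at this point, you can run just the first two stages on their own; each line is a 128-character hash followed by the file’s path, and duplicates now sit on adjacent lines (head is only there to peek at the first few results):
# inspect the intermediate output: hash, two spaces, then the path
find . ! -empty -type f -exec sha512sum {} + | sort | head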
Finally, we pipe this sorted list of hashes and files to uniq, a tool to report or omit repeated lines. We tell it to report repeated lines with -d, to print every duplicate line (not just one per group) with -D, and to compare only the first 128 characters of each line with -w128, which means it essentially compares just the hashes, not the filenames.
In the end, you will get a list of all duplicates, sorted by hashes.
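If you’d rather have just the paths without the hashes, you can tack a cut onto the end. This is an optional extra, assuming the default sha512sum output of a 128-character hash followed by a two-character separator before the name:
# keep only the file paths (the name starts at column 131)
find . ! -empty -type f -exec sha512sum {} + | sort | uniq -w128 -dD | cut -c 131-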
Note: This is meant for quickly finding a handful of duplicate files. There’s no technical limit on how many files you can compare, but past a certain point you probably won’t want to go through the results by hand; in that case, you might want a dedicated application to help you sort through them.
Note 2: I tested this on macOS as well, and it works fine, but note that macOS does not ship sha512sum by default. An alias like alias sha512sum='shasum -a 512' won’t help here, because find runs the command directly rather than through your shell; substitute the built-in shasum -a 512 into the command itself instead.
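For reference, here is a sketch of the macOS variant with shasum swapped in directly (assuming your uniq supports the -w and -D flags used above):
# same pipeline, using the shasum that ships with macOS
find . ! -empty -type f -exec shasum -a 512 {} + | sort | uniq -w128 -dD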
I run this blog in my spare time; if I helped you out, consider donating a cup of coffee.