I wrote a Bash script that finds duplicate files. It compares file sizes and checksums to decide whether files are (probably) duplicates:
Program
#!/usr/bin/env bash
## Summary: find duplicate files
## Meng Lu <lumeng.dev@gmail.com>
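DIR=${1:-$(pwd)} ## use provided path if available, otherwise the current path
FILENAME=$(basename "$0")
TMPFILE=$(mktemp "/tmp/${FILENAME}.XXXXXX") || exit 1
trap 'rm -f "$TMPFILE"' EXIT ## remove the temporary file when the script exits
cd "$DIR" || exit 1 ## search the requested directory; the pipeline below runs find on "."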
## one-line version
#find -P . -type f -exec cksum '{}' \; | sort | tee $TMPFILE | cut -f 1-2 -d ' ' | uniq -d | grep -if - $TMPFILE | sort -nr -t' ' -k2,2 | cut -f 3- -d ' ' | while read line; do ls -lhta "$line"; done
## multi-line version with comments
find -P . -type f -exec cksum '{}' \; | # find regular files and compute their checksums; -P: never follow symbolic links
sort | # sort the lines so that entries with the same checksum and file size become adjacent
tee "$TMPFILE" | # save a copy in a temporary file and pass the stream along
cut -f 1-2 -d ' ' | # keep only the checksum and file size
uniq -d | # print one copy of each repeated {checksum, file size} pair, dropping the unique ones
grep -if - "$TMPFILE" | # select from the saved list the lines whose checksum and file size are duplicated; "-f -" reads the patterns from stdin
sort -nr -t' ' -k2,2 | # sort by descending file size
cut -f 3- -d ' ' | # keep only the file name
while IFS= read -r line; do ls -lhta "$line"; done # print detailed ls output for every duplicate file found
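To try the pipeline without touching real data, the following sketch builds a throwaway directory with two identical files and one distinct file, then runs the same stages on it. The helper name list_dups, the directory layout, and the file names are made up for illustration, and the final ls stage is left out so that only the file names are printed.

```shell
#!/usr/bin/env bash
# Hypothetical demo, not part of the original script.

list_dups() {
  # Print the names of files under "$1" that share checksum and size
  # with at least one other file, using the same stages as the script.
  local dir=$1 tmp
  tmp=$(mktemp) || return 1
  ( cd "$dir" && find -P . -type f -exec cksum '{}' \; ) |
    LC_ALL=C sort |        # byte-wise sort keeps equal keys adjacent
    tee "$tmp" |           # keep the full listing for the later grep
    cut -f 1-2 -d ' ' |    # "checksum size" only
    uniq -d |              # one copy of each repeated key
    grep -if - "$tmp" |    # pull the full lines back out of the listing
    cut -f 3- -d ' '       # file names only
  rm -f "$tmp"
}

# Made-up test data: a.txt and b.txt are identical, c.txt differs.
demo_dir=$(mktemp -d) || exit 1
printf 'same content\n'   > "$demo_dir/a.txt"
printf 'same content\n'   > "$demo_dir/b.txt"
printf 'something else\n' > "$demo_dir/c.txt"

dups=$(list_dups "$demo_dir" | LC_ALL=C sort)
printf '%s\n' "$dups"
rm -rf "$demo_dir"
```

With this data the sketch should list ./a.txt and ./b.txt and leave out ./c.txt.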
Notes
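find -P . -type f -exec cksum '{}' \;
-P: never follow symbolic links. -type f: match regular files rather than directories (with -P, symbolic links themselves are also not matched). -exec cksum '{}' \;: compute the checksum of each file found ('{}'). cksum prints "checksum file-size file-name", where the file size is the number of octets (bytes), not an octal number.
sort: sorts the lines so that entries with the same checksum and file size end up next to each other, in preparation for uniq.
tee: saves a copy of the stdout stream in the temporary file $TMPFILE and at the same time passes it down the pipe.
cut -f 1-2 -d ' ': keeps only fields 1 and 2, with fields separated by a space.
uniq -d: prints one copy of each repeated line and drops the lines that occur only once.
grep -if -: "-f -" reads the search patterns from stdin. Each pattern is a duplicated {checksum, file size} pair, and grep looks them up in the saved file list. Note that the matching lines include the file names.
sort -nr -t' ' -k2,2: sorts the duplicate files found by descending file size.
cut -f 3- -d ' ': keeps only the file name.
while read ...; do ls -lhta ...; done: prints detailed information for each file.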
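The middle of the pipeline (cut, uniq -d, grep -f -) is the least obvious part. A small sketch on hand-written lines in cksum's "checksum size name" format shows what each stage emits. The checksums, sizes, and file names below are invented, and a plain file stands in for the copy that tee saves.

```shell
#!/usr/bin/env bash
# Invented sample in cksum's "checksum size name" format, already sorted.
listing=$(mktemp) || exit 1
cat > "$listing" <<'EOF'
1111111111 20 ./notes.txt
2222222222 35 ./copy of report.txt
2222222222 35 ./report.txt
3333333333 35 ./other.txt
EOF

# Stage 1: keep only "checksum size"; uniq -d prints one copy of each
# key that occurs more than once on adjacent lines.
keys=$(cut -f 1-2 -d ' ' "$listing" | uniq -d)

# Stage 2: feed the keys to grep as patterns ("-f -" reads them from
# stdin) to recover the full lines, file names included.
matches=$(printf '%s\n' "$keys" | grep -if - "$listing")

# Stage 3: drop the first two fields; "3-" keeps names containing spaces.
names=$(printf '%s\n' "$matches" | cut -f 3- -d ' ')

printf 'keys:\n%s\nnames:\n%s\n' "$keys" "$names"
rm -f "$listing"
```

Only the key "2222222222 35" is repeated, so only the two report files are selected; ./other.txt has the same size but a different checksum and is left out. Because grep matches the patterns anywhere in a line, a key could in principle also match inside an unrelated line, for example within a file name, so anchoring the patterns to the start of the line is one way to tighten this step.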