A Bash script for finding duplicate files

I wrote a Bash script that finds duplicate files. It compares file sizes and checksums to decide whether files are (probably) duplicates:

Script

#!/usr/bin/env bash

## Summary: find duplicate files
## Meng Lu <lumeng.dev@gmail.com>

DIR=${1:-$(pwd)} ## use the provided path if available, otherwise the current directory

FILENAME=$(basename "$0")

TMPFILE=$(mktemp "/tmp/${FILENAME}.XXXXXX") || exit 1
trap 'rm -f "$TMPFILE"' EXIT ## delete the temporary file when the script exits

## one-line version
#find -P "$DIR" -type f -exec cksum '{}' \; | sort | tee "$TMPFILE" | cut -f 1-2 -d ' ' | uniq -d | grep -if - "$TMPFILE" | sort -nr -t' ' -k2,2 | cut -f 3- -d ' ' | while IFS= read -r line; do ls -lhta "$line"; done

## multi-line version with comments
find -P "$DIR" -type f -exec cksum '{}' \; | # find regular files under $DIR and compute their checksums; -P: never follow symbolic links
sort | # sort the lines so that equal {checksum, file size} pairs become adjacent
tee "$TMPFILE" | # save a copy in a temporary file and pass it along
cut -f 1-2 -d ' ' | # keep only the checksum and file size
uniq -d | # print one copy of each {checksum, file size} pair that occurs more than once
grep -if - "$TMPFILE" | # select from the saved list the lines of files that share a size and checksum; "-f -" reads the patterns from stdin
sort -nr -t' ' -k2,2 | # sort by descending file size
cut -f 3- -d ' ' | # keep only the file name
while IFS= read -r line; do ls -lhta "$line"; done # run an informative ls on every duplicate found; -r and IFS= keep backslashes and surrounding spaces in file names intact

Archived on GitHub.
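A matching size and checksum only suggests that two files are identical, so a candidate pair can be confirmed byte by byte before anything is deleted. The following is a minimal sketch of that check using cmp. The helper name same_content and the scratch files are hypothetical and are not part of the script above.

```bash
#!/usr/bin/env bash
# same_content is a hypothetical helper, not part of the original script.
same_content() {
    # cmp -s prints nothing; it returns 0 only if both files are byte-for-byte identical.
    cmp -s "$1" "$2"
}

# Throwaway files for the demonstration (made-up names and contents).
scratch=$(mktemp -d) || exit 1
printf 'hello\n' > "$scratch/a.txt"
printf 'hello\n' > "$scratch/b.txt"   # same content as a.txt
printf 'world\n' > "$scratch/c.txt"   # same size as a.txt, different content

if same_content "$scratch/a.txt" "$scratch/b.txt"; then ab=identical; else ab=different; fi
if same_content "$scratch/a.txt" "$scratch/c.txt"; then ac=identical; else ac=different; fi

echo "a.txt vs b.txt: $ab"
echo "a.txt vs c.txt: $ac"

rm -rf "$scratch"
```

Running cmp only on the pairs that the script reports keeps the expensive full comparison limited to a few files.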

Notes

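find -P <dir> -type f -exec cksum '{}' \;
- <dir> is the directory to search;
- -P never follows symbolic links;
- -type f matches regular files rather than directories;
- -exec cksum '{}' \; computes a checksum for each file found ('{}'); cksum prints the checksum, the file size and the file name, where the file size is the number of octets (bytes);
sort sorts the lines in preparation for uniq;
tee $TMPFILE saves a copy of the stdout stream to the temporary file and also passes it down the pipe;
cut -f 1-2 -d ' ' keeps only fields 1 and 2, with the space as the field separator;
uniq -d drops the lines that occur only once and prints one copy of each repeated line;
grep -if - $TMPFILE reads its patterns from stdin (-f -) and finds, in the saved file list, the lines that contain the {checksum, file size} pairs of the duplicates; note that these lines include the file names;
sort -nr -t' ' -k2,2 sorts the duplicates found by file size in descending order;
cut -f 3- -d ' ' keeps only the file name;
the final while read loop runs ls -lhta on each file to print detailed information.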
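One detail of the lookup step is worth seeing on concrete data: grep -f - matches each {checksum, file size} pair anywhere in a line, so a line that merely contains the same characters is also selected. The sketch below compares that lookup with a stricter variant that anchors each pair to the start of the line. The sed step is my own addition, not part of the original script, and the list contents are made up for illustration.

```bash
#!/usr/bin/env bash
# A hand-made stand-in for the saved cksum list: "checksum size name" per line.
# All values and file names are invented for this example.
list_file=$(mktemp) || exit 1
printf '%s\n' '111 5 ./a.txt' '111 5 ./b.txt' '2111 50 ./c.txt' > "$list_file"

# Pairs that occur more than once, as in the cut | uniq -d stages.
dup_keys=$(cut -f 1-2 -d ' ' "$list_file" | uniq -d)

# Lookup as in the script: the pair may match anywhere in a line.
loose=$(printf '%s\n' "$dup_keys" | grep -f - "$list_file" | cut -f 3- -d ' ')

# Stricter lookup (assumption, not the original): turn "key" into the
# pattern "^key " so that it can only match the first two fields.
strict=$(printf '%s\n' "$dup_keys" | sed 's/^/^/; s/$/ /' | grep -f - "$list_file" | cut -f 3- -d ' ')

printf 'loose:\n%s\n' "$loose"
printf 'strict:\n%s\n' "$strict"

rm -f "$list_file"
```

Here the loose lookup also lists ./c.txt, because "2111 50" contains "111 5", while the strict lookup lists only ./a.txt and ./b.txt.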
