[Shell Scripting] Trying to create a script to find and delete duplicate files – failing because of spaces in file names

Hello everyone, I have no idea if this is the right spot for a question like this, but here goes:
I’m looking to create a little shell script that scans a directory for duplicate files (I’m going for image files).

So far, I've managed to get it to scan the directory and successfully find every duplicate file. I can have them printed out and then delete them manually. However, I would like the script to delete the files automatically, and this is where the trouble starts, because many of the files have names containing spaces, sometimes even multiple spaces, e.g. pic of me.jpg, pic of me under a tree.jpg, pic 2.jpg, etc.

My script, as it is now, can provide rm with a list of files to delete, but the shell splits the unquoted list on those spaces, so rm ends up looking for ./pic, of, and me.jpg as three distinct files that don't exist.

I just can’t figure out how to deal with this … Any help would be appreciated.

My script:

#! /bin/bash
#create a txt containing only the hashes of duplicate files
find . -type f \( -name "*.png" -o -name "*.jpg" \) -exec sha1sum '{}' \; | awk '{print $1}' | sort | uniq -d > dupes.txt

#create a txt containing hashes and filenames/locations of ALL files in the directory
find . -type f \( -name "*.png" -o -name "*.jpg" \) -exec sha1sum '{}' \; > allhashes.txt

#create a list of files to be deleted by grep'ing allhashes.txt for the dupes.txt and only outputting every even-numbered line
to=$(grep -f dupes.txt allhashes.txt | sort | awk '{for (i=2; i<NF; i++) printf $i " "; print $NF}' | sed -n 'n;p')

rm $to

#clean up the storage txts
rm dupes.txt
rm allhashes.txt

Edit: I know stuff like rdfind exists, but I was trying to make something myself. As you can see, I still ran into a wall …

One easy solution would be to wrap them in quotes in your awk statement.

I would probably use xargs and delete them directly instead of building a big list of files to delete with awk.
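
Roughly something like this, say (untested sketch; it assumes GNU xargs for -r and -d, and filenames containing newlines would still trip up sha1sum's line-based output):

#hash everything, keep each filename whose hash has already been seen
#(so the first copy survives), and hand the list straight to rm
find . -type f \( -name '*.png' -o -name '*.jpg' \) -exec sha1sum '{}' + |
    sort |
    awk 'seen[$1]++ { sub(/^[0-9a-f]+  /, ""); print }' |
    xargs -r -d '\n' rm --    #GNU xargs: split on newlines only, skip rm if the list is empty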

With shell scripts, there are about 100 ways to solve any problem.

rm "$to"

Should do it :smiley:

This has been proposed to me elsewhere and it doesn’t solve the problem, as this turns $to into a very long filename that also can’t be found.

Let me provide an example of what $to looks like in my test case, by echo'ing it:

$ ./test.sh 
./Pic of me 6.jpg ./Pic of me 9.jpg ./Pic of me 8.png ./Pic of me 7.jpg

Turning this over to rm instead returns:

$ ./test.sh 
rm: cannot remove './Pic': No such file or directory
rm: cannot remove 'of': No such file or directory
rm: cannot remove 'me': No such file or directory
rm: cannot remove '6.jpg': No such file or directory
rm: cannot remove './Pic': No such file or directory
rm: cannot remove 'of': No such file or directory
rm: cannot remove 'me': No such file or directory
rm: cannot remove '9.jpg': No such file or directory
rm: cannot remove './Pic': No such file or directory
rm: cannot remove 'of': No such file or directory
rm: cannot remove 'me': No such file or directory
rm: cannot remove '8.png': No such file or directory
rm: cannot remove './Pic': No such file or directory
rm: cannot remove 'of': No such file or directory
rm: cannot remove 'me': No such file or directory
rm: cannot remove '7.jpg': No such file or directory

This is expected and I can totally see what’s going on here.

Now, going by your suggestion, I can put inverted commas around the variable, like so: rm "$to". This then returns the following:

$ ./test.sh 
rm: cannot remove './Pic of me 6.jpg'$'\n''./Pic of me 9.jpg'$'\n''./Pic of me 8.png'$'\n''./Pic of me 7.jpg': No such file or directory

They need to be wrapped in quotes individually.

I'm pretty sure my attempt is far from elegant. I can't quite follow your suggestions, however.

How would I do that? No matter where in my awk statement I place the inverted commas, I get errors.
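
Edit: even if I escape the quotes inside awk, like so, it doesn't help: the quotes just become literal characters stored in $to, and when $to is expanded the shell still splits on the spaces and never re-parses quotes that come out of a variable, so rm goes looking for pieces like "pic and me.jpg" instead.

#wraps each reconstructed filename in literal double quotes (a dead end)
awk '{printf "\""; for (i=2; i<NF; i++) printf $i " "; print $NF "\""}'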

I’m going to have to look into xargs first, before I can say anything about this.

Well … yeah, but how?

A slightly different approach that seems to work here. Not well tested, though.
It only prints duplicates; it doesn't delete them.
Spaces in file names are supported.

#!/bin/bash

Main() {
    #hash all matching files and sort, so identical hashes end up on adjacent lines
    local data="$(find . -type f \( -name \*.png -o -name \*.jpg \) -exec sha1sum {} \; | sort)"

    local sums filenames
    local ix file sum prevsum="" to_delete=()

    #split the output into two parallel arrays: the hashes, and the filenames
    #with the hash prefix stripped (sha1sum separates the two with a double space)
    readarray -t sums      <<< "$(echo "$data" | awk '{print $1}')"
    readarray -t filenames <<< "$(echo "$data" | sed 's|^[0-9a-f]*  ||')"

    #every entry whose hash equals the previous entry's hash is a duplicate
    for ((ix=0; ix < ${#sums[@]}; ix++)) ; do
        sum=${sums[ix]}
        file="${filenames[ix]}"
        if [ "$sum" = "$prevsum" ] ; then
            to_delete+=("$file")
        fi
        prevsum="$sum"
    done

    #print the candidates, one per line; nothing is deleted here
    printf "'%s'\n" "${to_delete[@]}"
}

Main "$@"

Thank you for your reply, but I managed to do it in another way.

I turned my list of filenames into an array and handed that over to rm. The new working script also includes a loop, just in case any one file has several duplicates. Is it elegant or efficient? Absolutely not, but it works and I'm happy with that. I'm putting it here in case anybody else ever needs it:

This one just lists every duplicate file by their hashes for you to manually deal with:

#! /bin/bash
#create a txt containing only the hashes of duplicate files
find . -type f \( -name "*.png" -o -name "*.jpg" \) -exec sha1sum '{}' \; | awk '{print $1}' | sort | uniq -d > dupes.txt

#create a txt containing hashes and filenames/locations of ALL files in the directory
find . -type f \( -name "*.png" -o -name "*.jpg" \) -exec sha1sum '{}' \; > allhashes.txt

#list all duplicates by their hashes
grep -f dupes.txt allhashes.txt | sort

#clean up the storage txts
rm dupes.txt
rm allhashes.txt

This one removes all duplicates, keeping only one copy:

#! /bin/bash
#create a txt containing only the hashes of duplicate files
find . -type f \( -name "*.png" -o -name "*.jpg" \) -exec sha1sum '{}' \; | awk '{print $1}' | sort | uniq -d > dupes.txt

#create a txt containing hashes and filenames/locations of ALL files in the directory
find . -type f \( -name "*.png" -o -name "*.jpg" \) -exec sha1sum '{}' \; > allhashes.txt

#build the initial list so the while condition has something to test: strip the
#hash with sed (sha1sum separates hash and name with two spaces, so any spaces
#inside the name survive intact) and keep only every even-numbered line
mapfile -t r_array < <(grep -f dupes.txt allhashes.txt | sort | sed 's/^[0-9a-f]*  //' | sed -n 'n;p')

while (( ${#r_array[@]} > 0 ))
do
#recreate the list of files to be deleted by grep'ing allhashes.txt for the dupes.txt and only outputting every even-numbered line
mapfile -t r_array < <(grep -f dupes.txt allhashes.txt | sort | sed 's/^[0-9a-f]*  //' | sed -n 'n;p')

#delete the files in the array
for i in "${r_array[@]}"; do
  #printf "this will remove: %s\n" "${i}"
  rm -f "${i}"
done

#recreate the storage txts
find . -type f \( -name "*.png" -o -name "*.jpg" \) -exec sha1sum '{}' \; | awk '{print $1}' | sort | uniq -d > dupes.txt
find . -type f \( -name "*.png" -o -name "*.jpg" \) -exec sha1sum '{}' \; > allhashes.txt
done

#clean up the storage txts
rm dupes.txt
rm allhashes.txt
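
Edit: for completeness, here is a more compact single-pass variant of the same idea (barely tested; it keeps the first copy per hash and handles spaces anywhere in a name, though filenames containing newlines or backslashes would still trip up sha1sum's line-based output):

#!/bin/bash
#hash all images, sort so identical hashes sit on adjacent lines, then
#delete every file whose hash matches the previous line's hash
find . -type f \( -name "*.png" -o -name "*.jpg" \) -exec sha1sum '{}' + | sort |
while IFS= read -r line; do
    hash=${line%%  *}    #everything before the first double space
    file=${line#*  }     #everything after it, inner spaces preserved
    if [ "$hash" = "$prev" ]; then    #$prev is empty on the first line, so it never matches
        rm -f -- "$file"
    fi
    prev=$hash
done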
