Shearer Software

Andrew Shearer’s Drivel

99 and 44/100 percent pure.

Friday, January 14, 2005

Dupinator II

Bill Bumgarner’s useful Dupinator script, for removing duplicate files, recently hit Python-URL. However, it has a logic bug that ends up deleting too many files.

If you have several sets of duplicates that happen to share the same file size, all but one of the sets will be wiped out completely. The problem is that within each group of files of identical size, there’s at most a single generated “duplicates” list. The first file on the list is spared; the rest are deleted.
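To illustrate the fix, here is a minimal sketch of keeping one duplicates list per distinct content rather than one per file size. It is not the code from either version of the script: the function name find_duplicate_sets and the choice of SHA-256 over whole files are assumptions made for the example.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_sets(paths):
    """Return a list of lists; each inner list holds files with identical content."""
    # First pass: bucket files by size, since files of different sizes cannot match.
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)

    duplicate_sets = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue
        # Second pass: key by content hash, so unrelated sets that merely
        # share a file size end up in separate lists.
        by_hash = defaultdict(list)
        for path in same_size:
            with open(path, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            by_hash[digest].append(path)
        duplicate_sets.extend(g for g in by_hash.values() if len(g) > 1)
    return duplicate_sets
```

Given two pairs of identical files that all happen to be the same size, this returns two separate sets, so one file from each pair can be spared.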

The net effect, when I tested the script on a large corpus of text files, was that the program reported it would delete many files that were clearly not identical. (I had commented out the os.remove call for testing.)

There was an additional problem with iPhoto: the posted script follows symbolic links. iPhoto stores its albums as collections of symbolic links, so all photos in albums are flagged as duplicates of the original photos. An islink() test fixes this.
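A rough sketch of that test, assuming the files are collected with os.walk (the helper name regular_files is made up for this example and is not taken from the script):

```python
import os


def regular_files(root):
    """Yield paths of files under root, skipping symbolic links."""
    # os.walk does not descend into symlinked directories by default,
    # but symlinks to files still show up in filenames, so test each one.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue
            yield path
```

With this filter in place, an album entry that is only a link to an original photo never reaches the duplicate comparison.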

Here’s a modified version of the script. It has only been lightly tested, though the changes did successfully eliminate the false positives. Uncomment the os.remove() line only when you are satisfied with the list of redundant files generated.
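An alternative to commenting the call in and out is a simulate switch on the command line. The sketch below is only an illustration of that idea using argparse; the -n/--simulate flag and the function names are assumptions and are not part of the posted script.

```python
import argparse
import os


def build_parser():
    """Command line with a dry-run switch (hypothetical, not in the posted script)."""
    parser = argparse.ArgumentParser(description="Remove duplicate files.")
    parser.add_argument("folders", nargs="+", help="folders to scan")
    parser.add_argument("-n", "--simulate", action="store_true",
                        help="only list the files that would be removed")
    return parser


def remove_redundant(paths, simulate):
    """Report each redundant file; delete it only when simulate is False."""
    for path in paths:
        print(("Would delete " if simulate else "Deleting ") + path)
        if not simulate:
            os.remove(path)
```

Running with -n first and reading the list gives the same safety check, without editing the source between runs.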

Minor optimizations: all files <= 1024 bytes go directly into the dupes list, not potentialDupes, since the whole file has already been checked. Also, Mac OS X’s pesky .DS_Store files are skipped.
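The two shortcuts can be sketched roughly as follows. This assumes the first pass hashes only the first 1024 bytes of each file, which is what the size threshold above implies; the helper names and the use of SHA-256 are invented for the example.

```python
import hashlib
import os

PREFIX_BYTES = 1024  # assumed size of the cheap first-pass read


def prefix_digest(path):
    """Hash only the first PREFIX_BYTES bytes of a file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(PREFIX_BYTES)).hexdigest()


def needs_full_hash(path):
    """A file no larger than the prefix has already been hashed in full."""
    return os.path.getsize(path) > PREFIX_BYTES


def should_skip(path):
    """Ignore the Finder's .DS_Store metadata files."""
    return os.path.basename(path) == ".DS_Store"
```

Files for which needs_full_hash returns False can be treated as confirmed duplicates as soon as their prefix digests match; larger files still need the second, full comparison.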

(I haven’t heard back from Bill yet on incorporating the fixes into his code, so I’m posting here.)

View Source Code (dupinator.py)

   Mac OS X, Python, Software, General  Posted at 2:04 AM   

3 Responses to “Dupinator II”

  1. Ian Bicking Says:

    I try to give any script like this a -n and/or a --simulate option, instead of relying on commenting.

  2. Paul Says:

    I am not much of a programmer; I am just looking for a simple way to remove duplicate files from my Mac. I tried both your script and Bill Bumgarner’s, but neither one of them actually removes the files. Is there something that I am missing… an option that I have to pass to the command to actually make it remove the files? Please help.

    Paul

  3. Jack Says:

    Paul, it seems to require a folder to check. If you are already in the folder you want to check then run it like this:
    dupinator.py .

    That is dupinator.py (space) (dot)
