I’ve had issues in the past where a Kerio Connect mail server talking to Apple Mail can somehow end up duplicating hundreds or even thousands of emails in mailboxes. I never got to the bottom of what was causing it, but the symptoms were that people would see their mailbox sizes grow to crazy sizes and there would be many, many duplicates of many emails in these folders.
When the issue was occurring more frequently, I was using some command-line tools to clean them up, but having not encountered it in the wild again now for more than a year, I’ve totally forgotten which tools I used and how I invoked them.
In migrating a client’s email to Office 365 recently, one user had a massive mailbox, with one folder alone having over 35 GB of email in it. In searching for the tools I previously used to clean it up, I came across dupeGuru – an OS X GUI application that finds identical files and can trash them.
I haven’t been able to run it head-to-head against whatever I used to use (I think it was fslint but it may have been dupes, or it may have been something else altogether) but whatever, I only need to clean a couple of folders as a one-off task.
Anyway, dupeGuru seems to do the job and runs in a reasonable amount of time, so if you’re looking for an easy-to-use utility to find and nuke identical files, give it a spin.
If I’m ever back reading this post to remember what I used to use: dupeGuru didn’t cut it on a folder with ~500k files in it (a severely botched and highly duplicated email folder), and I’m now trying rdfind by Paul Dreik:
https://rdfind.pauldreik.se
It’s available in MacPorts, is pretty lightweight and has a fairly intelligent duplicate-finding algorithm.
Another tool I’ve used in the past, which claims to often be faster than rdfind, is rmlint:
http://rmlint.rtfd.org/
and on GitHub
https://github.com/sahib/rmlint
fdupes goes through each file multiple times: first checking bytes at the start of each file, then bytes at the end of each file, and finally the md5 checksum of each file.
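To make the multi-pass idea concrete, here is a rough Python sketch of that kind of staged filtering. It is not the code of any of the tools mentioned here, just an illustration: the function names, the 64-byte sample size and the initial grouping by file size are all my own assumptions. Each pass only looks at files that still have a possible twin after the previous pass.

```python
import hashlib
import os
from collections import defaultdict

EDGE_BYTES = 64  # assumption: how many bytes to sample at each end of a file


def _split(groups, key):
    """Split each group of paths by key(path); drop groups left with one file."""
    result = []
    for group in groups:
        buckets = defaultdict(list)
        for path in group:
            buckets[key(path)].append(path)
        result.extend(b for b in buckets.values() if len(b) > 1)
    return result


def _head(path):
    """Return the first EDGE_BYTES bytes of the file."""
    with open(path, "rb") as f:
        return f.read(EDGE_BYTES)


def _tail(path):
    """Return the last EDGE_BYTES bytes of the file."""
    with open(path, "rb") as f:
        f.seek(max(0, os.path.getsize(path) - EDGE_BYTES))
        return f.read(EDGE_BYTES)


def _md5(path):
    """Return the md5 hex digest of the whole file, read in 1 MB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return groups of identical files under root, each group sorted by path."""
    paths = [os.path.join(d, name) for d, _, names in os.walk(root) for name in names]
    groups = [paths]
    # Cheap checks first; the full checksum only runs on the survivors.
    for key in (os.path.getsize, _head, _tail, _md5):
        groups = _split(groups, key)
    return [sorted(g) for g in groups]
```

Calling `find_duplicates("/path/to/mail/folder")` (a placeholder path) returns a list of groups, and every file in a group after the first is a candidate for deletion. The sketch only reports duplicates and deletes nothing, and it does no special handling of symlinks or unreadable files.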
rmlint, on the other hand, first checks a fingerprint of each file (bytes at the start, middle and end of the file) and then calculates an md5 checksum on 2 MB blocks of the file until it finds a block that doesn’t match, rather than checksumming the entire file.
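Here is a rough Python sketch of the fingerprint-then-blocks idea, to show why it can save work on big files. It is not rmlint’s actual implementation: the function names and the 16-byte sample size are my own assumptions, and only the 2 MB block size comes from the description above. Files that share a fingerprint are hashed one block at a time, and a file is dropped as soon as one of its blocks matches nobody else’s.

```python
import hashlib
import os
from collections import defaultdict

SAMPLE = 16              # assumption: bytes sampled at start, middle and end
BLOCK = 2 * 1024 * 1024  # 2 MB blocks, as in the description above


def fingerprint(path):
    """Return (size, sampled bytes) as a cheap first-pass key for the file."""
    size = os.path.getsize(path)
    parts = []
    with open(path, "rb") as f:
        for offset in (0, max(0, size // 2 - SAMPLE // 2), max(0, size - SAMPLE)):
            f.seek(offset)
            parts.append(f.read(SAMPLE))
    return size, b"".join(parts)


def split_by_blocks(paths, block=BLOCK):
    """Compare same-sized files block by block, dropping files that end up alone."""
    size = os.path.getsize(paths[0])
    pending = [list(paths)]
    offset = 0
    while offset < size and pending:
        next_pending = []
        for group in pending:
            buckets = defaultdict(list)
            for path in group:
                # Reopening per block keeps the sketch simple, not fast.
                with open(path, "rb") as f:
                    f.seek(offset)
                    buckets[hashlib.md5(f.read(block)).digest()].append(path)
            # A file whose block matches no other file is not read again.
            next_pending.extend(b for b in buckets.values() if len(b) > 1)
        pending = next_pending
        offset += block
    return pending


def find_duplicates_blockwise(paths, block=BLOCK):
    """Return groups of identical files from paths, each group sorted by path."""
    candidates = defaultdict(list)
    for path in paths:
        candidates[fingerprint(path)].append(path)
    duplicates = []
    for group in candidates.values():
        if len(group) > 1:
            duplicates.extend(split_by_blocks(group, block))
    return [sorted(g) for g in duplicates]
```

The payoff is with large files that differ early on: two 1 GB files that diverge in the first block cost one 2 MB read each, instead of a full checksum of both.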