Thanks Sun, I know you were waiting to announce this once Apple oficially dropped support for ZFS like the proverbial hot potato.
You knew this day was coming: ZFS now has built-in deduplication.
If you already know what dedup is and why you want it, you can skip the next couple of sections. For everyone else, let’s start with a little background.
What is it?
Deduplication is the process of eliminating duplicate copies of data. Dedup is generally either file-level, block-level, or byte-level. Chunks of data — files, blocks, or byte ranges — are checksummed using some hash function that uniquely identifies data with very high probability. When using a secure hash like SHA256, the probability of a hash collision is about 2^-256 = 10^-77 or, in more familiar notation, 0.00000000000000000000000000000000000000000000000000000000000000000000000000001. For reference, this is 50 orders of magnitude less likely than an undetected, uncorrected ECC memory error on the most reliable hardware you can buy.
Chunks of data are remembered in a table of some sort that maps the data’s checksum to its storage location and reference count. When you store another copy of existing data, instead of allocating new space on disk, the dedup code just increments the reference count on the existing data. When data is highly replicated, which is typical of backup servers, virtual machine images, and source code repositories, deduplication can reduce space consumption not just by percentages, but by multiples.
From ZFS Deduplication : Jeff Bonwick’s Blog.