Cache a Small Check?

  • Posted On: Tuesday, August 31st, 2010
  • Post Categories: Blog

The de-duplication mechanisms used by all the most effective network acceleration technologies, rely on caching blocks of data and referring to them by “signatures” or “fingerprints” or “hashes” i.e. short identifiers generated algorithmically from the data blocks. Since the hashes are much smaller than the data blocks they represent, there will be many blocks which have the same hash. Very many indeed, but this is in a very, very, large number of possible blocks, so the possibility of a collision remains very low.

Having arrived at Replify after much of the initial design work, it was bothering me that I didn’t have a good handle on the math behind the de-duplication mechanisms so decided to amuse myself on a recent vacation, by figuring it out from first principles (sad I know).

Some interesting things emerged. Firstly these are big numbers – so big that they are hard to work with in a programming language that doesn’t have native support for arbitrary precision arithmetic. Consider that a 4000 byte data block has 2 to the power (4000*8) possible permutations.

If you represent that with a 128 bit hash then there are 2 to the power (4000*8 – 128) permutations that produce the same hash, but even so, picking thousands of pairs of hashes at random every second it would be many times the life time of the universe before you would expect to find a matching pair.

But now you have to factor in something else, often called the “Birthday Paradox”. In a large sample, the number of pairs to consider grows very rapidly indeed. In a cache of 20Gb this is approximately 26 trillion pairs for Replify, and so this combinatorial explosion of hashes in the cache can make the wildly improbable hash collision start to become something to think about.

One surprising finding– smaller block sizes actually increase the risk of a collision. On the face of it, the smaller the block size, the fewer the number of collisions for a given hash, but the combinatorial explosion of hashes in the cache races ahead of this and the risk goes up.

As it turns out, the designers did a great job – a combination of a short hash (for speed) and a longer hash (as a collision-check when a match is found) means that with Replify we’re still talking “life-times of the universe” before a collision becomes likely. Had we chosen a shorter hash, say 64 bits, and a shorter block size, then I would be right to lose some sleep, but exhausted from remembering my University math, I’m sleeping very well.

Paul Moorhead VP Engineering

Leave a Reply