Fuzzy Hashing and CTPH

So, in this case, we have one letter difference between our signatures, meaning our two file streams are largely similar!As you may have spotted there is an issue here; we have found a hash collision when using our algorithm.

In the prior example, the fourth block of each file is different; the first is mnop and the second is loop.

Since we are summing our file content to determine our signature value, we are bound to get an unhealthy amount of collisions.

These collisions may cause us to think files are more similar when they are not, and unfortunately are a product of summarizing file content without the use of a cryptographic hash algorithm.

For this reason, we have to find a better balance between summarizing file content and encountering hash collisions.

Our next example demonstrates what happens when an insertion occurs.

As you can see in the below image, the letter h was inserted after mn, adding 1 byte to the file and causing the entire content to shift right by one.

In this instance, our last 'block' will just contain the number 1, though some implementations may handle this differently.

Using our same formula, we calculate a hash of KaqyGUbx.

This hash is largely different than Kaq6KaU.

In fact, once we reach the block containing the change, the hash is completely different even though we have similar content in the latter half of the file.

This is one of the main reasons that fixed blocks are not the best approach when it comes to similarity analysis.

Any shift in content moves data across the boundaries and will cause us to calculate completely different hashes for similar content.

To address that, we need to be able to set these boundaries in another way.

Context triggered piecewise hashingAs you probably guessed, this is where CTPH comes into play.

Essentially we are aiming to calculate reset points with this technique.

Reset points, in this case, are similar to the 4-byte boundaries we used in the prior example, where we use these boundaries to determine the amount of a file we want to summarize.

The notable exception is that we instead pick the boundaries based on file content versus fixed windows.

What this means is we use a rolling hash, as employed by ssdeep and spamsum, to calculate values throughout the file and look for a specific value.

When this specific value is found, a boundary line is drawn and the content since the prior boundary is summarized.

In the example below, we are using a simplified calculation to determine if we have reached a reset point.

While both spamsum and ssdeep calculate the reset point number for each file, for our example, we will use 7.

As an additional note, this technique is meant for files with more than 28 bytes, so our hashes here will be really short and, therefore, less useful outside of illustrative purposes.

Before jumping into the example, let’s talk through what a rolling hash is.

Once again, we will use the same a-z01 file content we used in the prior examples.

We then use what is known as a rolling hash to calculate our value for each byte of the file.

A rolling hash works by calculating a hash value for all of the characters within a certain window of the file.

In our case, we will have a window size of three.

The window movement in our file would look like this across the first four iterations:['a', '', ''] = [97, 0, 0]['a', 'b', ''] = [97, 98, 0]['a', 'b', 'c'] = [97, 98, 99]['b', 'c', 'd'] = [98, 99, 100]As you can see, this rolling window would continue through the file, adding a new byte each iteration and removing the oldest byte, in a FIFO style.

To generate a hash of this window, we would then perform some series of further calculation against the values in the window.

The ssdeep and spamsum algorithms handle this rolling window calculation differently, though the product of the calculation is used in the same manner.

We have simplified the calculation to make this process easier to discuss.

For this example, as you likely guessed, we will sum the ASCII values to keep things simple.

This sum is shown in the first row of the below example.

To keep the numbers smaller though, we will then take our summed ASCII values (S) modulo 8 (S % 8) and use this integer to look for our boundaries in the file content.

This number is found in the second row of the below image.

If S % 8 == 7 we have reached a reset point and can create a summarization of the prior block.

Since our reset point is 7, as previously selected, we will define a chunk of a file any time our rolling hash calculation returns a 7.

This is represented visually in the below image with horizontal lines showing the blocks we’ve set within the file.

For each block, we will calculate our signature in the same way as before: summing up the ASCII integer values of the content within the entire block (as shown in the fourth row) and applying modulo 64 to get the character for the signature (as seen in the last row).

Please remember that the only relationship between rows 2 and 3 in this example is that row 2 tells us when to set the reset point and calculate the number shown in Row 3.

These two hashes are algorithmically independent from one another by design.

Row 3 is still the summation of the ASCII values for a + b + c + d + e + f and not the summation of our rolling hash output.

This produces the signature VUUD.

While much shorter, we now have context triggered hashes.

For our final example, let’s revisit what happens when we perform the same insertion of the letter h.

Using our rolling hash to calculate our context-based blocks (as shown in the first row), we can calculate the summarization of blocks using the same algorithm and generate the signature VUH1D.

As you can see, this technique is more resilient to insertions and allows us to more accurately compare differences in files than using the fixed blocks.

In this case, our signatures are showing that the two files are more different than they are, though this technique is more accurate than our fixed block calculation as it understands that the tail of our file is the same between our two versions.

Obviously, this technique requires files larger than 28 bytes in order to produce accurate results, though hopefully, this simplification can help explain how these fuzzy hashes are formed.

Additional resourcesFor more details about how ssdeep and spamsum work, please check out the below resources:While we primarily focused on ssdeep/spamsum in this post, please check out sdhash and the research that Vassil Roussev has conducted in the field of similarity matching.

Originally published at pythonicforensics.

com on November 4, 2018.


. More details

Leave a Reply