Context Triggered Piecewise Hashing
During the last year of my PhD, I worked extensively with hashing, particularly in the context of Prefix-Free Parsing (Boucher et al., 2019). This got me interested in the history of these kinds of algorithms. First, I stumbled upon rsync by Tridgell et al. (1996). Next, I began reading about spamsum by Tridgell (2002) and discovered Context Triggered Piecewise Hashing. This is what I will be talking about in this post. You can find a list of references at the bottom of the page.
Introduction#
Context Triggered Piecewise Hashing (CTPH), also known as Fuzzy Hashing or Similarity Hashing, is a technique for detecting data that is similar, but not identical, to other data. Unlike traditional cryptographic hash functions, which produce very different outputs even with small input changes (avalanche effect), CTPH generates similar hash values for similar inputs. The concept of CTPH can be traced back to A. Tridgell, who developed spamsum, a spam mail detection tool. Building on this foundation, J. Kornblum adapted and formalized the technique, developing ssdeep (Kornblum, 2006).
Hash function#
A hash function is a series of mathematical operations that map data of arbitrary size to a fixed-size output. Popular hash functions include SHA-256 and MD5, which are still widely used to verify data integrity. However, these hash functions fail to identify similar data, as even a single-byte modification can completely change the resulting hash. Fuzzy hashing was developed to address this limitation: two similar inputs produce two similar hash values.
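To see the avalanche effect in practice, here is a small Python snippet (standard library only) that hashes two inputs differing by a single byte; the resulting digests share nothing recognizable:

```python
import hashlib

# Two inputs that differ in a single byte ("dog" vs "cog")
a = b"The quick brown fox jumps over the lazy dog"
b = b"The quick brown fox jumps over the lazy cog"

# The SHA-256 digests are completely different despite the tiny change
print(hashlib.sha256(a).hexdigest())
print(hashlib.sha256(b).hexdigest())
```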
The “piecewise” aspect of the algorithm traces back to N. Harbour’s (2002) work on dcfldd (a forensic imaging tool). In this work, Harbour introduced Piecewise Hashing, which divides the file into multiple fixed-size segments (512-byte blocks) and calculates a hash for each piece. This was introduced to mitigate corruption problems: if part of the image were to be corrupted, only the segments covering that part would be affected. Piecewise hashing can use either cryptographic hash functions (e.g., MD5) or faster non-cryptographic ones (e.g., Fowler/Noll/Vo, FNV). CTPH keeps the idea of calculating a hash for separate blocks of data, but these blocks are not of fixed size. Instead, the size of each block is determined by context-triggered boundaries. The resulting fuzzy hashes can then be compared to measure how similar two files are.
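For illustration, here is a minimal Python sketch of fixed-size piecewise hashing in the spirit of dcfldd (not its actual implementation): the input is split into 512-byte blocks and each block is hashed independently with MD5.

```python
import hashlib

def piecewise_md5(data: bytes, block_size: int = 512) -> list[str]:
    """Hash each fixed-size segment of the input independently."""
    return [
        hashlib.md5(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

# Flipping one byte only changes the hash of the block containing it,
# but inserting a byte shifts every later block and changes all of their hashes.
```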
Rolling hash#
A rolling hash is a specific type of hash function designed for efficient, incremental computation over a moving window of data. Instead of recalculating the hash from scratch for each segment (window), a rolling hash quickly updates the hash as the window slides, removing the effect of the oldest byte and adding the effect of the newest.
Spamsum and ssdeep#
As introduced earlier, spamsum is a fuzzy hashing algorithm. Its origins trace back to the rsync algorithm developed by A. Tridgell and P. Mackerras in 1996. Rsync was designed to efficiently synchronize files across networks by identifying and transmitting only the changed portions of files rather than entire files. You can find a deeper explanation of rsync and its later optimizations and applications in Tridgell’s PhD thesis (1999).
The original spamsum implementation was placed in the “junkcode” directory on the Samba project website. Spamsum adapted a rolling hash technique similar to the one employed by rsync to identify emails that were similar to known spam messages, allowing spam filters to catch variants of known spam even when spammers made minor modifications to evade exact hash matching.
The algorithm gained wider recognition when J. Kornblum published a paper at the Digital Forensic Research Workshop (DFRWS) in 2006. Here, Kornblum introduced the potential of such an algorithm for digital forensic applications and provided an implementation called ssdeep.
How spamsum works#
The spamsum algorithm operates in five phases:
- Determine optimal block size(s) (based on file size)
- Stream through the file, computing rolling hashes at each byte position
- Identify reset points based on a specific criterion
- Hash each block between reset points using an FNV-like hash
- Encode and compare the resulting signatures
Determining block size#
The block size determines how frequently reset points occur. Spamsum calculates the size to have approximately 64 blocks per file. The initial block size is calculated as follows:
$$ b_{init} = b_{min} \cdot 2 ^{\lceil \log _2 (N/(S \cdot b_{min})) \rceil} $$

where $N$ is the total file size in bytes, $b_{min} = 3$, and $S$ is the target number of blocks (64).
The algorithm also computes a secondary block size of $2 \cdot b_{init}$ for later comparison of the signatures. This is implemented to enable comparison of files even when their sizes differ slightly.
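As a rough sketch (assuming the constants above, $b_{min} = 3$ and $S = 64$), the initial block size can be obtained by repeated doubling, which is equivalent to the formula:

```python
MIN_BLOCK_SIZE = 3    # b_min
TARGET_BLOCKS = 64    # S, the target number of blocks per file

def initial_block_size(file_size: int) -> int:
    """Smallest power-of-two multiple of b_min that yields roughly 64 blocks."""
    block_size = MIN_BLOCK_SIZE
    while block_size * TARGET_BLOCKS < file_size:
        block_size *= 2
    return block_size

# For a 10 kB file: initial_block_size(10_000) == 192,
# and the secondary block size would be 2 * 192 = 384.
```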
The rolling hash#
As the algorithm streams through the file, it maintains a rolling hash computed over a 7-byte sliding window. This hash consists of three components:
- h1: sum of all bytes in the 7-byte window
- h2: weighted positional sum (each byte multiplied by its index in the window)
- h3: shift/XOR hash (shift left 5 bits and XOR with incoming byte)
These three hashes are then summed to get the final rolling hash:
$$ h = h_1 + h_2 + h_3 $$

To roll the hash, i.e., to drop the oldest byte $c_1$ of the window and take in the new byte $c_w$ (with window size $w = 7$), we perform the following operations:
- $h_1 ^{\prime} = h_1 - c_1 + c_w$
- $h_2 ^{\prime} = (h_2 - h_1) + w \cdot c_w$
- $h_3 ^{\prime} = (h_3 << 5) \oplus c_w$
- $h ^{\prime} = h_1 ^{\prime} + h_2 ^{\prime} + h_3 ^{\prime}$
Note: In the implementation, the order of the first two operations is reversed.
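Below is a minimal Python sketch of this rolling hash, written directly from the description above (it uses Python's unbounded integers, whereas the real implementation works with fixed-width values):

```python
WINDOW_SIZE = 7  # w, the size of the sliding window

class RollingHash:
    """Three-part rolling hash as described above (a sketch, not the spamsum source)."""

    def __init__(self):
        self.h1 = 0                          # plain sum of the window bytes
        self.h2 = 0                          # position-weighted sum
        self.h3 = 0                          # shift/XOR accumulator
        self.window = [0] * WINDOW_SIZE      # the last w bytes seen
        self.n = 0                           # number of bytes processed

    def update(self, c: int) -> int:
        # h2 is updated first, using the old h1 (the reversed order noted above)
        self.h2 = self.h2 - self.h1 + WINDOW_SIZE * c
        # h1 drops the outgoing byte and adds the incoming one
        self.h1 = self.h1 - self.window[self.n % WINDOW_SIZE] + c
        self.window[self.n % WINDOW_SIZE] = c
        self.n += 1
        # h3 shifts left by 5 bits and XORs in the incoming byte
        self.h3 = (self.h3 << 5) ^ c
        return self.h1 + self.h2 + self.h3
```

Feeding the bytes of “library” through update() one by one yields $h_1 = 757$ and $h_2 = 3084$, matching the worked example below ($h_3$ differs because the example keeps it in a simplified 8-bit form).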
Example: rolling hash
In this example, we will use the 8-byte string “libraryb”. This means that our first window will be “library”, and we will then roll the hash onto the window “ibraryb”. For readability, $h_3$ is kept to 8 bits in this example. Here are the ASCII values (both decimal and binary) for our alphabet:
| char | dec | binary |
|---|---|---|
| l | 108 | 01101100 |
| i | 105 | 01101001 |
| b | 98 | 01100010 |
| r | 114 | 01110010 |
| a | 97 | 01100001 |
| y | 121 | 01111001 |
First window
window of size 7: library
$$ h_1 = 108 + 105 + 98 + 114 + 97 + 114 + 121 = 757 $$

$$ h_2 = 108 + 2 \cdot 105 + 3 \cdot 98 + 4 \cdot 114 + 5 \cdot 97 + 6 \cdot 114 + 7 \cdot 121 = 3084 $$

$$ h_3 = (00000000 << 5) \oplus 01111001 = 01111001 = 121 $$

$$ h = 757 + 3084 + 121 = 3962 $$

Second window
window of size 7: ibraryb
$$ h_1 ^{\prime} = 757 - 108 + 98 = 747 $$

$$ h_2 ^{\prime} = (3084 - 757) + 7 \cdot 98 = 3013 $$

$$ h_3 ^{\prime} = (01111001 << 5) \oplus 01100010 = 01000010 = 66 $$

$$ h ^{\prime} = 747 + 3013 + 66 = 3826 $$

Reset points#
A reset point occurs when the rolling hash satisfies the following condition (with $b = $ block size):
$$ h \bmod b = b - 1 $$

This occurs, on average, every $b$ bytes. Note that the algorithm checks for reset points at both block sizes simultaneously. Every reset point for block size $2b$ is also a reset point for block size $b$, but not vice versa.
Reset points are content-based: if bytes are inserted or deleted, the chunks before the change remain identical, and the chunks after it realign as soon as the next reset points are encountered. This is why fixed-size piecewise hashing fails under insertions and deletions while CTPH does not.
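Putting the two ideas together, a sketch of the boundary detection (reusing the hypothetical RollingHash class from the rolling-hash section) could look like this:

```python
def reset_points(data: bytes, block_size: int) -> list[int]:
    """Return the byte positions at which the rolling hash triggers a reset."""
    rh = RollingHash()   # sketch class from the rolling-hash section
    points = []
    for i, byte in enumerate(data):
        if rh.update(byte) % block_size == block_size - 1:
            points.append(i)
    return points

# Blocks are the spans between consecutive reset points (plus the leading
# and trailing spans); on average a reset fires every block_size bytes.
```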
Hashing blocks#
Each block lies between two consecutive reset points (with the first block starting at the beginning of the file and the last one ending at the end of the file). Once a reset point is reached, or the file ends, the block is hashed using an FNV-like hash.
This hash is initialised with an offset:
$$ h_{FNV} = 2166136261 $$

Then, for each byte $c$ in the block we calculate:
$$ h_{FNV} ^{\prime} = (h_{FNV} \times 16777619) \oplus c $$

where $\times$ is multiplication modulo $2^{32}$. This calculation is performed simultaneously with the rolling window hashing. However, the value of the hash is saved only when we encounter a reset point. After that, the FNV-like hash is reset.
Spamsum stores only the six least significant bits of the FNV-like hash:
$$ h_{\text{6-bit}} = h_{FNV} \bmod 64 $$

This reduces the hash from 32 bits to 6 bits per block.
The 6-bit value is then converted to a Base64 character from the set [A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z,a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z,0,1,2,3,4,5,6,7,8,9,+,/].
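A small Python sketch of this per-block hashing, following the constants above (the initialisation value, the FNV prime, and the Base64 alphabet), might look like this:

```python
BASE64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

FNV_INIT = 2166136261
FNV_PRIME = 16777619

def block_char(block: bytes) -> str:
    """Hash one block with the FNV-like hash and keep only the six
    least significant bits, mapped to a Base64 character."""
    h = FNV_INIT
    for c in block:
        h = (h * FNV_PRIME) % 2**32   # multiplication modulo 2^32
        h ^= c
    return BASE64[h % 64]

# The full signature at a given block size is the concatenation of one
# such character per block.
```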
Encoding#
The complete spamsum signature has the following format:
<block_size>:<signature_at_b>:<signature_at_2b>,<filename>
If a file has low entropy (e.g., repeated patterns), the algorithm might generate fewer blocks than desired. If the signature at block size $b$ is shorter than 32 Base64 characters, the algorithm halves the block size and reprocesses the entire file.
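Tying the pieces together, a sketch of this outer loop could look as follows (compute_signatures() is a hypothetical helper returning the two signatures at $b$ and $2b$ for a given block size; initial_block_size() is the sketch from the block-size section):

```python
MIN_BLOCK_SIZE = 3

def spamsum_signature(data: bytes, filename: str) -> str:
    """Sketch of the outer loop: halve the block size and reprocess the file
    while the primary signature is shorter than 32 characters."""
    block_size = initial_block_size(len(data))                 # sketch from above
    while True:
        sig_b, sig_2b = compute_signatures(data, block_size)   # hypothetical helper
        if len(sig_b) >= 32 or block_size <= MIN_BLOCK_SIZE:
            break
        block_size //= 2
    return f"{block_size}:{sig_b}:{sig_2b},{filename}"
```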
Comparison#
Once the two files have been hashed, their signatures can be compared to produce a similarity score from 0 to 100 ($0 \leq s(x,y) \leq 100$). Because the reset points depend on the block size, only signature strings computed at the same block size can be meaningfully compared. Since every spamsum signature contains hashes at both $b$ and $2b$, two signatures are comparable whenever their block sizes are within a factor of two of each other (i.e., if $b_x = b_y$, $2b_x = b_y$, or $b_x = 2b_y$).
The comparison is performed as follows:
- Remove runs of the same character longer than three from the signatures
- Calculate a weighted edit distance (0-64)
- Convert to similarity score (0-100)
Weighted edit distance#
Given two strings $t_1$ and $t_2$, the edit distance between them is defined as “the minimum number of point mutations required to change $t_1$ into $t_2$”, where a point mutation means either changing, inserting, or deleting a character. The spamsum algorithm uses a weighted version where each insertion or deletion is weighted as a difference of one, each change is weighted at three, and each swap (i.e., two adjacent characters in reversed order) is weighted at five.
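A dynamic-programming sketch of such a weighted edit distance (using the weights stated above; the details of the original spamsum implementation may differ) is shown below:

```python
INS_COST, DEL_COST, CHANGE_COST, SWAP_COST = 1, 1, 3, 5

def weighted_edit_distance(s1: str, s2: str) -> int:
    """Weighted edit distance: insert/delete = 1, change = 3, adjacent swap = 5."""
    n, m = len(s1), len(s2)
    # dist[i][j] = cost of turning s1[:i] into s2[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * DEL_COST
    for j in range(1, m + 1):
        dist[0][j] = j * INS_COST
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            change = 0 if s1[i - 1] == s2[j - 1] else CHANGE_COST
            dist[i][j] = min(
                dist[i - 1][j] + DEL_COST,        # delete from s1
                dist[i][j - 1] + INS_COST,        # insert into s1
                dist[i - 1][j - 1] + change,      # keep or change a character
            )
            # swap: the same two characters appear in reverse order
            if i > 1 and j > 1 and s1[i - 1] == s2[j - 2] and s1[i - 2] == s2[j - 1]:
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + SWAP_COST)
    return dist[n][m]

# weighted_edit_distance("AB", "BA") == 2, because two indels (cost 1 + 1)
# are cheaper than one swap (cost 5) under these weights.
```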
Running time (J. Kornblum, 2006)#
A single CTPH pass over an input of size $n$ takes $O(n)$ time. Since the block size may need to be adjusted up to $O(\log n)$ times, the overall running time is $O(n \log n)$.
When comparing sets of CTPH signatures, the algorithm must match the signature of an unknown file against all known signatures. CTPH signatures require a direct similarity comparison. Each comparison involves calculating the edit distance between the two signatures. Since computing the edit distance takes $O(l^2)$ time (where $l$ is the signature length, which is small), the time needed to compare one CTPH signature against $n$ known signatures is $O(n)$.
Applications of spamsum in ssdeep#
This section provides a brief overview of the applications of the spamsum algorithm, as implemented in the tool ssdeep. For a deeper explanation of these applications, you can refer to J. Kornblum, “Identifying almost identical files using context triggered piecewise hashing” (2006).
Altered document matching#
To demonstrate how CTPH can identify highly similar documents, J. Kornblum reported the following experiment. He modified a Word document version of Lincoln’s Gettysburg Address (1863) by altering font size, inserting new paragraphs, changing colors, substituting and deleting words, and rearranging sentences. He also appended the phrase “I AM THE LIZARD KING!” a few dozen times at the end of the document. Ssdeep was still able to match the modified document to the original.
Partial file matching#
The tool can also be used to perform partial file matching, allowing fragments of a file to be linked back to the original source. When a file was truncated to include only the first or last third of its data, ssdeep could still recognize the relationship between the partial and complete versions. This is useful in forensic scenarios such as file carving, where only portions of files can be recovered. The caveat is that ssdeep requires the original file to be at least around 0.3 MB long to provide an accurate match; below this threshold, block size differences make the comparison unreliable.
Conclusion#
Discovering CTPH has been a fascinating detour into a field I had never dipped my toes into before. What began as a curiosity about rsync and then spamsum quickly turned into an appreciation for how elegantly CTPH approaches signature-matching problems. Despite being nearly two decades old, the ideas behind this algorithm remain both clever and practical; proof that sometimes even the simplest heuristics can achieve surprisingly good results. I wanted to share it here because I feel it is not particularly well-known, yet it really sparked my interest.
References#
C. Boucher, T. Gagie, A. Kuhnle, B. Langmead, G. Manzini, T. Mun: Prefix-free parsing for building big BWTs. Algorithms for Molecular Biology (2019). https://doi.org/10.1186/s13015-019-0148-5
Context triggered piecewise hashing: https://forensics.wiki/context_triggered_piecewise_hashing/ (accessed 11.4.2025).
Fuzzy hashing: https://en.wikipedia.org/wiki/Fuzzy_hashing (accessed 11.4.2025).
N. Harbour: dcfldd. Defense Computer Forensics Lab (2002). https://dcfldd.sourceforge.net/ (accessed 11.5.2025).
J. Kornblum: Identifying almost identical files using context triggered piecewise hashing. Proceedings of Digital Forensic Research Workshop (2006). https://doi.org/10.1016/j.diin.2006.06.015
A. Tridgell: Efficient Algorithms for Sorting and Synchronization. PhD thesis (1999). https://maths-people.anu.edu.au/brent/pd/Tridgell-thesis.pdf
A. Tridgell: spamsum source code. https://www.samba.org/ftp/unpacked/junkcode/spamsum/ (accessed 11.4.2025).
A. Tridgell, P. Mackerras: The rsync algorithm. Technical report (1996). https://openresearch-repository.anu.edu.au/items/15a1c428-0ad3-49d6-bb54-9238250cbbf0