Bloom Filters and Cryptographic Hashes
In this article, I'm going to describe some novel (to me, anyway) uses of Bloom filters. I'll also discuss how to construct Bloom filters, and the practical impacts.
Bloom filter review
First off, if you're not familiar with Bloom filters, let's take a quick step back and look at them in general (as they are commonly portrayed in the literature). If you're already an expert, just skip this section.
What a Bloom filter gives you is an answer to the question, “have I seen this data set before or not?” The data set can be anything: a directory pathname, an MP3 file, a picture; doesn't matter. The Bloom filter has the characteristic that it can give false positives (that is, it can tell you that you've seen a particular data set, even though you haven't), but it will never give you a false negative (that is, it will never say that it hasn't seen a given data set, even though it has).
In order to understand how and why this is the case, we need to understand how the Bloom filter works.
The Bloom filter operates by first converting your data set into several hashes. These hashes are then used as indices into a large array, which we'll call the “bitspace.” Since the Bloom filter converts your data set into hashes, it doesn't really matter what the content or format of the data set is (whether it's the MP3 file mentioned above, or a JPEG image or some number).
As a practical example, let's say you use three different hashes in your Bloom filter implementation. For a first data set, these hashes might produce the numbers 7, 303, and 9152. This means that you would set bits 7, 303, and 9152 in your bitspace to a one. If another data set came along whose three hashes gave you the numbers 7, 55, and 611, then you'd set those bits to a one as well. (It's fine that bit 7 is already set by the first data set.)
To test if a new data set is already present in the Bloom filter, you compute its hashes, and if all of the bits are already set, then you have some probability that the particular data set has already been encountered before. (We'll discuss the value of that probability below.) However, if zero or only some (that is, not all) of the bits are set, then you're guaranteed that the data set has not yet been seen.
The reason we get false positives (that is, the Bloom filter claims that a data set has already been seen, when in fact it has not) is because multiple different data sets' hashes may have contributed to the bits being set. Consider the case where a third data set produces the hash values 7, 303 and 611. All three of those bits are “on” in the bitset (7 was present in both of the previous data sets, 303 was seen in the first data set, and 611 was seen in the second data set). Because all three bits are on, the output of the Bloom filter is an indication that the data set has already been seen, even though there has not yet been a data set that uniquely sets those three bits (except for the third data set being discussed).
This is the false positive.
If you're familiar with the available literature about Bloom filters, you'll see that there are three things that I'm going to do differently. First off, the literature suggests that you require multiple hashes (I'll show you a neat trick to use just one), secondly it goes on to say that cryptographically secure hashes, such as sha256, aren't really suitable or required (depending which article you read — I'm going to argue that they are in fact awesome), and finally most implementations use just one bitspace (I'm going to show a two-dimensional extension).
In my use-case, I have a lot of data that I've already hashed. I happen to use the sha256 cryptographically secure hash on the data, for a number of reasons (collision resistance, obfuscation, and distributability being chief amongst them).
So I'm going to assume that your data sets have already been reduced to nice, long, cryptographically secure hashes, such as the sha256 hash. You could, in reality, use any one of a number of different hashes; this will become clear throughout the article. This means that you can now handle any type of data (pathname, MP3, JPEG) by simply converting it to a sha256 value.
A key property of cryptographically secure hashes, such as the sha256 hash, is that they have good distribution; that is, for a sufficiently large set of samples (say, hashes of all the filenames on your system), there's an even distribution of hashes throughout the hash space (which for sha256 is ridiculously large: 2^{256}).
Using a Single Hash
Because a sha256 is evenly distributed, and because it's big (256 bits long, to be exact), we can take a shortcut. We can consider the sha256 as a “parallel computed hash,” that is, we can pretend that the sha256 is actually a bunch of smaller hashes that we happened to have computed in parallel.
As a simple example, consider breaking the sha256 hash up into 16 "buckets" of 16 bits each (256 / 16 = 16). (We'll refer to these as 16-bit buckets, meaning that each bucket is 16 bits wide.) Effectively, we've now computed 16 16-bit hashes in parallel by doing just one hash function.
Setting up the bitspace to hold 2^{16} values is easy — create an array that's 2^{16} bits long (which takes up 2^{(16-3)} = 2^{13} = 8kB). Then, consider the 16 hashes individually, and set the entry at the index corresponding to each 16-bit hash value to a one.
Let me demonstrate what I mean. If our sha256 gave us the value
050c9dc96f6bcdf2458c0e48e866b233f6bd4081f18abd2f356751f5e283ebe2,
then we'd break that up into 16 16-bit buckets (050c, 9dc9, 6f6b, cdf2, 458c, 0e48, e866, b233, f6bd, 4081, f18a, bd2f, 3567, 51f5, e283, and ebe2).
We would then set each of the bitspace elements to a one for the corresponding bucket (consider the bitspace as an array of bits, like “bool bitspace [65536]” in C++, for example):
bitspace [0x050c] = 1; bitspace [0x9dc9] = 1; bitspace [0x6f6b] = 1; bitspace [0xcdf2] = 1; ... bitspace [0x3567] = 1; bitspace [0x51f5] = 1; bitspace [0xe283] = 1; bitspace [0xebe2] = 1;
The process to test if a data set has already been seen is analogous. Take the hash of the data set, and break it up into 16 16-bit buckets. Then, examine each entry at the index corresponding to each 16-bit hash value to see if it holds the value one. If (and only if) all entries contain the value one, then you may have seen this data set before. If any of the entries contain a zero, then you have (guaranteed) not seen this data set before. Finally, add the data set to the Bloom filter's bitspace by setting the corresponding elements to one, as described above.
Multiple Bitspaces
Instead of having all of the hash values set bits in a common bitspace, we could give each hash its own bitspace. Consider again the example from above, where we had the hash:
050c9dc96f6bcdf2458c0e48e866b233f6bd4081f18abd2f356751f5e283ebe2.
This time, let's implement a two-dimensional Bloom filter, with the first dimension being the hash number, and the second dimension being the hash value. This is simply 16 instances of the bitspace we had above (that is, “bool bitspace [16][65536]”). We'd fill this filter in a manner similar to the example:
bitspace [0][0x050c] = 1; bitspace [1][0x9dc9] = 1; bitspace [2][0x6f6b] = 1; bitspace [3][0xcdf2] = 1; ... bitspace [12][0x3567] = 1; bitspace [13][0x51f5] = 1; bitspace [14][0xe283] = 1; bitspace [15][0xebe2] = 1;
Yes, this takes up 16 times more space than the single Bloom filter bitspace, but the probability of a false positive is also lower. How much lower, you ask?
Practically Speaking...
Theory is all well and good; what about the real world? What kind of numbers do we actually see?
My data set consists of 371,874 pathnames on my system. I'd like to know if there's a duplicate in the data set (I use the list of files between multiple machines for controlling backups, so I need to know if the file is already present or not).
Recall that I said the characteristic of the data set (MP3, JPEG, pathname) doesn't matter — it all gets converted to sha256.
So, in my case, I generated a text file that has 371,874 lines of text, each line consisting of an ASCII representation of a sha256 value (which is the hash of the filename).
You can generate (rather inefficiently, I'll admit) a nice test file like this on your system too:
find / | while read i; do echo $i | sha256sum | sed -e "s/ *.*//"; done | sort > output
Here are the first few and last few lines of the file:
000086b94a7d532b94f7e6fe4f90d949276aad3d25b7c6fda0d6d8e1dfa62d6c 00008ea7916784fd3298cc89683f6aa47c0c4f81e5a48eb29d74f2d40cbe7f02 0000c14342b42b3ef5c511f06b2055a8a68f3d5450c80e5dcbef303fceac8f2c 0000d096169f0f3d3b4a72323b493f242e671553f38ef6a26a5f8758544477a9 ... ffff851ceb36158775b8b62f2137c14fb58cea53549118e2f03ccf55999d13e0 ffff86ac8406507154e2ac02862bff93dc1c0970fd78818a930a3e06317f236c ffffc6c0ad150da140c27a8cd5c502cde5479c9aa28c4c0017bba0342d0c3720 fffffcfda40c7f13db6562750f69de410201ea0dee848bd16ada19de42af171e
The fact that the file is in ASCII and is line oriented isn't important; I prefer that format because it's easy to generate, edit, and consume by my Bloom filter test program. The hash values could have been delivered by any means, such as a binary blob. The 371,874 sha256 hashes (of 32 bytes each) would consume 11,899,968 bytes if represented in binary.
Running a multiple-Bloom filter with 19-bit buckets (917kB bitspace size) gives me 5 collisions; a multiple-Bloom filter with 20-bit buckets (1.7MB bitspace size) gives me no collisions. Running a single-Bloom filter with 23-bit buckets (1MB bitspace size) gives me 3 collisions; a single-Bloom filter with 24-bit buckets (2MB bitspace size) gives me no collisions. 2MB of bitspace size versus 12MB of input file gives me a “compression factor” of 6. Awesome!
Here's a table of the 371,874 hashes, ordered by bitspace size. Some notation is in order. CTC is the Count To Collision, the computed average number of entries processed before a collision (data set size divided by number of collisions). The N_{b} # bits indicates how many bits I'm using for each bucket. The N_{h} # hashes indicates how many hashes I can get, given N_{b}. For example, if I set N_{b} to 19 (meaning I'm using 19-bit buckets), I can get 256/19 = 13.473... buckets. Obviously, I need an integral number of buckets, so in this case I round down to just 13. (I indicate that I rounded down by using the notation “> 13,” meaning that there really should be ever so slightly more than 13 buckets, but that I'm just sticking with 13.)
Type | Table size (kBytes) | # bits (N_{b}) | # buckets (N_{h}) | Collisions (actual) | CTC | Compression Ratio | Bits / Entry |
---|---|---|---|---|---|---|---|
Multiple | 917 | 19 | > 13 | 5 | 74,374 | 19.738 | |
Single | 1,048 | 23 | > 11 | 3 | 123,958 | 22.557 | |
Multiple | 1,704 | 20 | < 13 | 0 | 6.98 | 36.656 | |
Single | 2,097 | 24 | < 11 | 0 | 5.67 | 45.115 |
This gives you an idea of the space versus probability of collision tradeoff on a practical data set. For example, I had to move to 20 bits in a multi-Bloom filter in order to get no collisions, or 24 bits in a single-Bloom filter. The “Compression Ratio” column gives a ratio of the total data set size (the 11,899,968 bytes derived above) versus the bitspace size; effectively, comparing how much memory you saved by using a Bloom filter rather than storing the actual data. The “Bits / Entry” is an indication of how many bits each entry takes — the bitspace size divided by the number of entries in the data set.
Overnight, I generated a much bigger list of all of the files on my computer, and hashed them. This resulted in a much bigger data set: 1,637,207 files in total (@ 32 bytes per hash, this represents 52,390,624 bytes of data). Here's a table of those entries, ordered by bitspace size:
Type | Table size (kBytes) | # bits (N_{b}) | # buckets (N_{h}) | Collisions (actual) | CTC | Compression Ratio | Bits / Entry |
---|---|---|---|---|---|---|---|
Multiple | 3,407 | 21 | > 12 | 136 | 12,038 | 16.652 | |
Single | 4,194 | 25 | < 11 | 9 | 181,911 | 20.495 | |
Multiple | 6,291 | 22 | < 12 | 0 | 8.33 | 30.742 | |
Single | 8,388 | 26 | < 10 | 0 | 6.25 | 40.990 |
Derived Statistics
If we examine the CTC vs the cost (bits/entry), we come up with another interesting measure. Here I've included both the ones that had collisions and ones in which there were no collisions (indicated by an = sign in the CTC column)
Type | CTC | Bits/Entry | Capacity (2^{Bits/Entry}) | Occupancy (CTC / Capacity) |
---|---|---|---|---|
Multiple | 12,038 | 16.652 | 102,980 | 11.690% |
Multiple | 74,374 | 19.738 | 874,440 | 8.505% |
Single | 123,958 | 22.557 | 6,170,688 | 2.009% |
Single | 181,911 | 20.495 | 1,477,780 | 12.310% |
Multiple | =371,874 | 36.656 | 1.0828e+11 | 0.000343% |
Single | =371,874 | 45.115 | 3.8104e+13 | 0.000000975% |
Multiple | =1,637,207 | 30.742 | 1.7958e+09 | 0.0912% |
Single | =1,637,207 | 40.990 | 2.1838e+12 | 0.000075% |
This leads to an interesting research topic; on unsorted input, when does the first collision happen given arbitrary bit sizes? So, I modified my Bloom filter program to reset the filter tables every time there was a collision, and report the number of entries processed thus far. Here are the results (MCTC is the Mean Count To Collison, Bits/Entry represents the number of bits required to represent the values up to MCTC):
Type | Table size (bytes) | N_{b} | Collisions | MCTC | Bits/Entry | Occupancy (MCTC/2^{Bits/Entry}) |
---|---|---|---|---|---|---|
Single | 8k | 16 | 366 | 4,467 | 14.67 | 17.136% |
Single | 16k | 17 | 193 | 8,479 | 15.46 | 18.811% |
Single | 32k | 18 | 102 | 15,935 | 16.45 | 17.800% |
Single | 64k | 19 | 55 | 29,294 | 17.90 | 11.977% |
Single | 128k | 20 | 29 | 54,463 | 19.25 | 8.735% |
Multiple | 128k | 16 | 29 | 56,003 | 18.72 | 12.970% |
Multiple | 240k | 17 | 17 | 94,859 | 20.73 | 5.454% |
Single | 256k | 21 | 16 | 99,953 | 20.98 | 4.833% |
Multiple | 448k | 18 | 9 | 177,258 | 20.70 | 10.406% |
Single | 512k | 22 | 8 | 184,202 | 22.77 | 2.575% |
Multiple | 832k | 19 | 5 | 307,005 | 22.20 | 6.372% |
Single | 1M | 23 | 4 | 338,065 | 24.81 | 1.149% |
Multiple | 1.625M | 20 | 3 | 508,151 | 26.82 | 0.429% |
Single | 2M | 24 | 2 | 637,194 | 26.33 | 0.755% |
Multiple | 3M | 21 | 1 | 940,794 | 26.75 | 0.834% |
Single | 4M | 25 | 1 | 1,167,320 | 28.74 | 0.260% |
This is an amazing compression factor, when you think about it — we've taken a 256 bit entry (the sha256 hash) and used 16-25 bits to represent it (compression factors between 10 to 16 times)!
Tradeoffs
As we've seen above, there are a number of adjustments that we can make to the Bloom filter to trade off between the size of the bitspace required versus the probability of a false positive.
You can never cause the probability to go to zero, but you can get arbitrarily close. For example, assuming that you use the entire sha256 as your index, you can cause your probability of a collision to map directly to the sha256 probability of a collision — that is, 2^{-256}. But who has that much storage? [Spoiler: Not Google, not even the NSA]
Conclusion
Using a sha256 as the basis for the input to a Bloom filter is an excellent way to both compute multiple virtual hash functions and to diffuse values within the Bloom bitspace. Using single-Bloom versus multiple-Bloom filters means that you can tailor the amount of memory required.