Sorting Salted + Hashed Data

Published 2024-04-17

And why you might want to

I had an idle thought about sorting hashed data, and then almost immediately discarded it. Why would you want to sort hashed data? While the hash values are directly related to whatever it is you're hashing, they don't bear any useful sortable correlation.

In fact ... they're effectively "random".

And that's why I didn't throw out that idea! I think it's an interesting way of randomizing data, with a few different variations and properties worth exploring.

In this article, I'm going to be using a cryptographically secure hash, like SHA256. It has the following characteristics that we're interested in:

So, computing a hash over each member of a given input set of data (say a playlist consisting of artists and song titles) produces a second set of data as output. If the input data didn't have any duplicates, then the output data set won't either (due to the 2nd property above). Here's an example, consisting of a set of first values (the hash, shortened for clarity) and their corresponding inputs (artist and title):

6ad24df8 ("abba sos")
f31861ea ("acdc back in black")
343dcf61 ("bach toccata and fugue in d minor")
b7b17a9e ("led zeppelin the rain song")
d22cb1d8 ("olivia newton john and elo xanadu")

Notice that this input set is sorted by artist and title. What happens if you sort based on the output values? As you'd expect, you now have a sorted list of hashes. But take a look at the original input (artist and title) — it's effectively randomized!

343dcf61 ("bach toccata and fugue in d minor")
6ad24df8 ("abba sos")
b7b17a9e ("led zeppelin the rain song")
d22cb1d8 ("olivia newton john and elo xanadu")
f31861ea ("acdc back in black")

(Obviously, a bigger data set would be more impressive, but this is just a simple article).

So what? You've just sorted a hash!

Why yes. Yes I did just sort a hash. But the reason this is interesting is because of the first property mentioned above — the "even distribution". Notice how the hash values are reasonably "equidistant" (clearly not exactly equidistant, but certainly well spread out).

This means that I can compute relative distance — "acdc" is farther from "bach" than "olivia newton john and elo" is from "led zeppelin" (on so many levels!).

This is certainly an interesting property, no?

Repeatability and Robust Ordering

The nice thing about using a hash as a randomization is that it's completely repeatable. Given the same input set, you will get the same output set.

But there's another property that may not be as obvious — modifying the input set (by adding or deleting entries) results in the same ordering of the unmodified entries!

AND new entries are seamlessly inserted into the output.

That's another interesting property! Let me explain why.

Most algorithms that "randomize a playlist" do not have this property. A commonly used algorithm for playlist randomization takes the input set, and picks a random value and swaps it with the last value. The algorithm then picks another random value (not including the last value), and swaps it with the second last value. And so on. This is a traditional "shuffle" algorithm; it's similar to shuffling cards in a deck.

It's repeatable so long as you don't adjust the input set. The instant you add or delete even one entry, the ordering of the results is not the same as in the previous shuffle. This is what I'd call a "fragile ordering" — any perturbation in the input results in significant changes in the output. The algorithm I describe above, using hashes, is what I'd call "robust ordering" — perturbations in the input result in changes limited only to those perturbations themselves, not surrounding data.

Robust input ordering too

Another aspect is that we get robust input ordering too! That is, the input set can be in any order, because we completely ignore its order and rely on the hash value for the output order. This is a significant difference from the "shuffle" algorithm as well, in that it would produce a different shuffle with the input ordered differently.

Ok, now give me a different ordering

So the next logical question is, how do I get a different order? The next time this playlist comes around, I don't want to hear "abba" after "bach" again; I want a new shuffle.

The answer here is simple, and is a common trick with hashes: Add a salt.

A salt is like a "seed" that changes the result of the hash. You can add a salt easily to the example above; let's add a salt, literally the string DELETE ME I'M A SEED, to the data set, and hash it again:

18c1605f ("DELETE ME I'M A SEED acdc back in black")
1a1b8469 ("DELETE ME I'M A SEED olivia newton john and elo xanadu")
550ae26c ("DELETE ME I'M A SEED led zeppelin the rain song")
9a9100bb ("DELETE ME I'M A SEED bach toccata and fugue in d minor")
ff5d01ed ("DELETE ME I'M A SEED abba sos")

Here the salt is just prepended to the other string data — the hash function doesn't care what it's hashing, it's just data.

Notice now that the ordering is different. (Unfortunately, the universe seems to insist that "abba" follows "bach". Who am I to argue?)

Picking salts and managing the data is left as an exercise to the reader:

01842503 ("DELETE THE BAD SEED bach toccata and fugue in d minor")
0b8c15d5 ("DELETE THE BAD SEED olivia newton john and elo xanadu")
627d8b71 ("DELETE THE BAD SEED acdc back in black")
cf1c8ab2 ("DELETE THE BAD SEED led zeppelin the rain song")
e7034bf6 ("DELETE THE BAD SEED abba sos")

The universe is perverse; "abba" still follows "bach", but only after a deep cleansing of disco, blues, and solid rock.

I'm on a low sodium diet, no salt please!

Now, if we're being honest, you don't even need the salt. If you're using a hash function with a lot of bits (like the SHA256 we talked about above), all you need to do to get a different ordering is choose a different sort key!

By default, we humans want to sort as if all 256 bits of the hash represented some big number, and we want to order all the values. Why?

The cool thing about cryptographically secure hashes is that the bits are all independent. So, instead of using the first bit as the most significant bit, and the next bit as the next most significant bit, and so on, you can view them in any order. I'm not a math whiz, but I'll bet there are 256! (the "!" isn't me getting excited; it's factorial notation) ways to order the "significance" of the bits for sorting. That's one hell of a "salt", isn't it?

Put another way; if you're not comfortable writing sorting code that looks at different bits to determine sort order, then just swizzle the bits yourself. That is, create a transformation function that moves the bits in the hash around to whatever "order" you want, and then sort numerically like you would normally.

The advantage in either case, of the no salt approach, is that you aren't re-computing the hash each time; you're just doing a mechanical transformation on it and coming up with a different ordering.

But I want fragile input ordering!

Ok, if you really want the output to vary with the input ordering, then the solution is simple — run a single hash over the entire input, and then use that hash as the (or one of the) salt(s)! This way, you've entangled the input ordering with the hash computation. Voila!