Is This A Thing? Explicitly Colliding Hashes?

tl;dr⌗

Is there a way, besides brute force calculation, to solve the following where you know the function inputs, don’t know the function, and don’t care about the output besides needing certain inputs to have the same output and preferably to be within a certain range? Similar to a perfect hash but perfect in regard to the relation between input and output rather than no collisions in the output.

f(a0) = x
f(a1) = x
f(a2) = x
f(b0) = y
f(b1) = y
...

The Situation⌗

I’m working on an emulator. An important consideration in an interpreter is the instruction dispatch performance. It can need to happen millions of times a second so any cost savings possible can have large impacts.

In modern CPU design you have a lot of performance uncertainty. Are the segments of hot code going to fit into the L1 cache? Will the hot data sets? How’s that branch prediction going? Etc. When building an emulator these can have significant impacts.

On an ARM60 there are about 100 instructions. Well… if we consider different styles of operands. For example: most of the arithmetic opcodes (ADD,CMP,etc.) have 3 different shift operand styles. To differentiate between those ~100 instructions there are 13 bits in total used across the 32bit instruction. The number of relevant bits differs per instruction as does their location.

Now… granted… branch prediction would be a possible problem here but what if you could find a function which given those masked 13bits would spit out a number in the range [0,100) which could be used as an index into a jump table or switch statement. Or maybe fewer and force the specific instruction interpretation do a bit more work?

You can use a lookup table to simplify this but that expands the hot data set which can impact performance. If you do a more traditional bit inspection you’re increasing the amount of conditional code and branch prediction failures.

The really unfortunate thing is that you probably can’t predict the impact in practice of any one of these things. I’m considering trying a number of strategies to compare but having this hash method would be neat to have to compare against and I don’t think I’ve seen such a thing attempted (probably due to the cost of finding f).

I wrote a simple app to brute force the problem but no luck. I found the occasional function which would give the same output but only for small numbers of instruction masks and not across more than one instruction type. The pseudo code would be something like

foreach setOfMaskedInstructions
  foreach hashFunction
    if setOfMaskedInstructions hash to same value: 
      continue
    else: 
      exit program