Benchmarking memcmp() for timing attacks
We were interested in timing attacks against memcmp() for various reasons (CVE-2017-2624) and decided to explore the topic a bit further.
Timing attacks against
memcmp() have been successful in the past against light bulbs or even the Xbox 360.
The basic idea behind these kinds of attacks is, that
memcmp() takes longer if the bytes of the memory areas match, since an additional byte will be compared. If this difference could be measured, an attacker could brute force secrets much more efficient by testing all values for the first byte, taking the one which took the longest to compare and then go on with testing all values for the second byte. The difference in timing is pretty small since only a few instructions differ in contrast to other widely abused timing attacks, like the timing attack against OpenSSH which allowed to enumerate the users. In the OpenSSH case, the password send to the daemon was only verified against the locally stored hash, if the user exists, if the user did not exists, there was no hash to compare against and the login attempt was rejected instantly. In order to exploit this issue efficiently a very long password used, which takes a long time to hash. This increased the time between rejection of a existing user and a non existing one enough to be easily measurable over the network.
In order to protect against this, especially in cryptographically relevant code, there are several implementations of a timing constant memory comparisons.
How big is the timing difference when comparing “AA” to “BA”? This highly depends on the implementation of
memcmp(). Some implementations use a block size compare to speed things up, others use modern instructions like SSE for faster comparisons. Take a look at the glibc implementation which is quite lengthy to achieve the best speed for all use cases. We decided to do a quick benchmark.
This of course highly depends on the system and the load on the machine. In order to reduce side influences we used nice to give the process the highest priority and pinned it to a single CPU using taskset. Furthermore you want to disable powersaving during the test to make sure that the CPU speed is not changing.
Lets take a look of two runs, one comparing two and the other comparing 64 bytes. We run each comparison a 10000000 times and add the timings. To reduce the impact of other tasks even further, we interleave the tests:
How to measure the timing? We have opted to use rdtsc, which returns the processors time stamp counter, which is usually considered a high-resolution timer with low overhead. This is another reason why we bind the program to a single CPU using
taskset, since the counters on various CPUs could differ.
Disclaimer: In our benchmark, not all functions have the exactly the same
behaviour, some just return that there is a difference, others return
the difference between the first non-matching bytes.
What are the results of a run when only two bytes are compared? Note: our source can be found here
How does this differ for 64 bytes?
What we can easily see, even for the naiive implementation, we do not get a time that is 32 times longer when comparing equal strings. This means there is a high calling/setup overhead.
The second view reveals, that for some implementations, the equal memory areas are compared faster than the differing ones. This is for example the case for the consttime and openssl implementation in the 64 Byte run. The other timing safe implementations flicker. If several runs are performed sometimes the equal and sometimes the unequal cases take longer, depending on context switches and other load happening on the box. Furthermore it is clearly visible, that it is possible to perform timing attacks against
memcmp() if the implementation is not fully optimised but not on all systems.
In conclusion, it highly depends on the circumstances if a
memcmp() is subject to a timing attack.
Update: We were asked by @PaulM, if we did disable hyperthreading. In our experiments, HT did not make a difference, we assume because we bound the process to a single CPU.