Pushing Keccak to its limits on ARMv7-M
During spring 2023, I worked on optimizing software implementations of Keccak on ARMv7-M processors as a side project. I have already published a detailed note on eprint, but this blog post provides a concise summary.

Setting the stage
About Keccak
Keccak is a cryptographic hash function that was selected as the winner of the NIST hash function competition in 2012. Its selection led to the standardization of the SHA-3 hash functions and the SHAKE extendable-output functions (XOF). Beyond being a cryptographic primitive in its own right, Keccak also serves as a fundamental building block in other cryptographic schemes, notably in many Post-Quantum Cryptography (PQC) algorithms such as the future NIST standards Kyber and Dilithium.
Implementations in the wild
Developing optimized implementations of a cryptographic algorithm on many processor architectures is a laborious and time-consuming endeavor. This also holds for Keccak, even though it was meticulously designed to facilitate straightforward, constant-time implementations directly from its specification. When looking for an open-source implementation of Keccak, the eXtended Keccak Code Package (XKCP) is the ideal resource: it is a comprehensive software package that contains reference code and various optimized implementations for many platforms, including ARMv7-M.
Keccak on ARMv7-M
ARMv7-M is a widely adopted 32-bit architecture primarily designed for microcontrollers. Many recent research papers on cryptographic implementations target the ARMv7-M architecture, particularly the Cortex-M4 core, as NIST considers it the reference platform for benchmarking embedded implementations in its PQC standardization process. Since the architecture provides only 13 general-purpose registers, at most 13 × 32 = 416 bits can be held in registers at once. The internal state of Keccak-f[1600] being 1600 bits long, register spilling (i.e. loads and stores on the stack) is inevitable, and it actually accounts for around half of the overall running time on the Cortex-M3 and M4 cores. This is not only suboptimal for Keccak itself: it also affects other cryptographic primitives that rely extensively on it, and can even constitute a major bottleneck when looking for further performance improvements. For instance, the following paper, which introduces arithmetic optimizations for Dilithium on the M4, states that:
That was reason enough for me to take a look at Keccak on this architecture. Since I have all the ARMv7-M cores at my disposal (i.e. Cortex-M3, M4 and M7), I was able to play with all of them for this task 😁
Architecture specific optimizations
Pipelining memory accesses
When I studied the ARMv7-M implementation proposed in XKCP, the first thing that caught my eye was a macro where memory loads were not optimally pipelined, in the sense that they were not consecutive.
Indeed, on the M3 and M4 cores, load and store instructions typically require 2 and 1 cycles, respectively, but $n$ consecutive $\mathtt{ldr}$ instructions can be pipelined together and executed in $n + 1$ cycles.
Reorganizing the code was not entirely trivial, since I had to come up with a different register allocation to relax the register pressure, but it remained manageable.
Also, because an $\mathtt{str}$ immediately following an $\mathtt{ldr}$ takes 0 cycles on those cores, I moved store instructions after consecutive loads as much as possible.
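To make this concrete, here is a minimal scheduling sketch for the M3/M4. The registers, offsets and surrounding logic are made up for illustration; this is not the actual XKCP macro.

```asm
@ Before: loads separated by ALU work, so each ldr pays its full 2-cycle cost.
    ldr  r1, [r0, #0]        @ 2 cycles
    eor  r4, r4, r1
    ldr  r2, [r0, #4]        @ 2 cycles
    eor  r5, r5, r2
    str  r6, [r0, #8]        @ 1 cycle

@ After: the two loads are grouped so they pipeline (n consecutive ldr take
@ n+1 cycles), and the store is placed right behind a load, where it is
@ effectively free on the M3/M4.
    ldr  r1, [r0, #0]        @ \
    ldr  r2, [r0, #4]        @ / pipelined: 3 cycles for both loads
    str  r6, [r0, #8]        @ follows an ldr: effectively 0 cycles
    eor  r4, r4, r1
    eor  r5, r5, r2
```

Both versions compute the same thing; only the schedule changes.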
In the end, this led to a performance improvement of about 12% on the M3 and M4.
On the M7, the story is different because its pipeline architecture differs significantly: there, it is actually recommended to interleave memory accesses with arithmetic/logic operations to make the most of its dual-issue pipeline.
It turns out that the macro which initially caught my attention was in fact optimal on the M7, so I just made sure to interleave memory accesses elsewhere in the code.
An annoying thing with this platform is the lack of public information regarding its performance.
For instance, ARM did not publish which pairs of instructions can be dual-issued, so it is up to developers to figure it out.
Notably, the benchmarking efforts from Mark Owen and jnk0le were really helpful to me.
A useful observation is that rotating the result of the immediately preceding instruction incurs a one-cycle delay, so I made sure this pattern did not occur when rescheduling the instructions.
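As an illustration, an M7-friendly schedule could look like the sketch below. The registers, offsets and pairings are my own assumptions (ARM does not document the exact dual-issue rules), so treat it as a pattern rather than a guaranteed pairing.

```asm
@ Memory accesses interleaved with independent ALU work, so both issue slots
@ can be fed on the M7.
    ldr  r1, [r0, #0]
    eor  r4, r4, r8          @ independent ALU op, candidate for dual-issue
    ldr  r2, [r0, #4]
    eor  r5, r5, r9
@ The rotated operand r1 was produced two instructions earlier, not by the
@ instruction immediately before, which avoids the one-cycle delay.
    eor  r6, r6, r1, ror #31
```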
In the end, I managed to get a 10% performance improvement compared to the original implementation.
Removing explicit bitwise rotations
Another optimization I wanted to apply is the lazy rotation trick, initially introduced by Becker and Kannwischer to boost Keccak performance on ARMv8-A.
The idea is to take advantage of the inline barrel shifter on ARMv7-M (which allows the second operand of some instructions to be shifted or rotated for free) in order to discard explicit rotations in the linear layer.
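A minimal sketch of the underlying mechanism, with hypothetical registers and a hypothetical rotation amount:

```asm
@ Explicit rotation: two data-processing instructions.
    ror  r1, r1, #14
    eor  r2, r2, r1

@ Folded into the consumer via the barrel shifter: one instruction, same
@ result for r2. Note that r1 itself is left un-rotated, so every later use
@ of r1 must also compensate with ror #14; this bookkeeping, propagated
@ across the whole permutation, is what the lazy rotation trick amounts to.
    eor  r2, r2, r1, ror #14
```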
While this technique is rather straightforward to implement on ARMv8 thanks to its 64-bit architecture, I have to confess it was very tedious to adapt it to a 32-bit architecture when combined with bit-interleaving, as keeping track of all delayed rotations gave me some headaches 🤕
It was worth the pain since it resulted in a ~10% improvement on the M3 and M4.
Unfortunately, this trick does not pay off on Cortex-M7 processors, since instructions with a shifted/rotated second operand cannot be dual-issued.
Because delaying rotations requires extensive use of the barrel shifter, it had a significant negative impact on overall performance on this platform.
At least somebody did the dirty work to give it a try!
Results
The table below summarizes the best performance I could reach for the Keccak-f[1600] permutation, expressed in clock cycles. For more details on all the possible trade-offs regarding code size, please refer to the eprint note. All in all, the performance improvement ranges from 10% to 21%.
Ref | Cortex-M3 | Cortex-M4 | Cortex-M7
---|---|---|---
XKCP | 11707 | 11749 | 6162
Ours | 9206 | 9242 | 5578