Something From Nothing - Breaking AES encrypted firmwares

Firmware encryption is becoming a common feature in modern devices. From a security standpoint, that’s welcome news. However, for anyone reverse engineering or testing device security, dumping firmware is often one of the first tasks — and encryption makes that task extremely challenging, if not impossible. So, why are we seeing more encryption? There are several reasons. New regulations, such as the EU’s Radio Equipment Directive, mandate secure storage, and manufacturers are increasingly focused on protecting their devices from competitors and thwarting potential attackers.

As many experienced cryptographers will tell you, cryptography is hard — even for those who know what they’re doing. That’s why you should never roll your own encryption algorithm; very few people can get it right.

Fortunately, we now have a wide range of libraries that correctly implement various cryptographic algorithms. But even with these robust tools at your disposal, using them properly is key to ensuring your cryptography remains as strong as it should be.

This post is a story about what can happen if you don’t know how to use the libraries.

Breaking AES Encrypted firmwares

Introduction

You’ve got a firmware in your hands — be it from the internet, an OTA package, or a flash dump. No matter how you obtained it, when you load it into a hex editor, more and more often a familiar pattern emerges.

Extremely high (0.99) entropy stats, indicating an encrypted firmware image

When the entropy is nearly 1 (with 1 indicating completely random bytes), it’s a strong sign that the firmware is encrypted — often making any further analysis a dead end. My usual approach is simple: open the firmware in a hex editor, give it a quick scroll to get a feel for the content, and if no readable strings or patterns are found, check the entropy. If it’s extremely high, I archive the firmware and move on.
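If you want to reproduce this kind of check, a minimal byte-entropy sketch in Python might look like this (the file name is a placeholder; the value is normalized so that 1.0 means uniformly random bytes):

```python
import math
from collections import Counter

def shannon_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, normalized to 0..1 (1.0 = uniformly random bytes)."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / 8.0  # 8 bits per byte is the maximum

with open("firmware.bin", "rb") as f:  # placeholder file name
    print(f"entropy = {shannon_entropy(f.read()):.5f}")
```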

Sometimes, I load multiple firmware versions into a hex editor and switch to diff view, hoping that comparing them might reveal subtle clues. More often than not, the result is the same: the encryption is robust, and there’s really nothing to work with.

Then, one day, while revisiting a pair of encrypted firmware images in diff view, I noticed something unusual:

Strange patterns when diffing two firmwares with 0.99995 entropy

This prompted a more in-depth investigation, which led me down a rabbit hole — exploring cryptography, performing instruction-set probability analysis, and eventually turning to deep learning models. But more on those later.

Cryptography detour

Everything so far has been fairly easy to follow, but don’t worry, it will go downhill from here. Before diving deeper, here’s a quick detour to explain some key terms that will keep popping up.

Stream Cipher (e.g., AES-CTR)

A stream cipher encrypts data one small piece at a time — often byte by byte. In AES-CTR mode, the AES block cipher is used with a counter that changes for every block, and its output is combined with the data. It’s fast and efficient, and seemingly very widely used, but using the same key and counter twice can lead to serious vulnerabilities. More on that later.

One-Time Pad

One-time pad is an encryption method where the key is as long as the message, completely random, and used only once. When used correctly, it’s theoretically unbreakable. However, managing such keys makes it impractical for most everyday applications. One-time pad is the ultimate theoretically unbreakable stream cipher.

Many-Time Pad

When a one-time pad is reused for multiple messages, it becomes a many-time pad, a practice that undermines its security. Reusing the key allows attackers to compare different messages, revealing patterns that can break the encryption; this reuse is the basis of many-time pad attacks.

Nonce, Counter, IV

These terms refer to numbers or values used once during encryption to ensure that identical messages don’t produce identical ciphertexts.

  • A nonce is a number used just one time.
  • A counter is an incrementing value, often used in CTR mode.
  • An IV (initialization vector) is a value that adds randomness to the first encryption block.

Although the names differ, their role is to keep each encryption unique even with the same key.

Entropy

In cryptography, entropy is a measure of randomness. Higher entropy means more unpredictable keys and data, which leads to stronger encryption. Low entropy can make it easier for attackers to guess or reproduce the keys or the content.

Depth

In this post, “depth” refers to the additional clues that accumulate when encryption is implemented incorrectly. Gather enough examples, ciphertexts, or other information from a badly designed system, and the extra information creates layers of redundancy — making patterns more evident and the encryption easier to break.

Stream ciphers, why?

Stream ciphers are popular for a few key reasons. First, they require relatively low computational resources, making them a natural fit for devices with limited processing power. They also offer flexibility: since they operate on individual bytes (or small groups of bytes) rather than fixed-size blocks, you can decrypt random portions of data on the fly without needing to process the entire stream in order. This property is especially useful in real-time applications or when working with non-sequential data like firmware segments.

Let’s take AES-CTR as an example. In AES-CTR mode, the encryption process works in two stages:

  [Nonce + Counter]
          |
          V
 +-----------------+
 |  AES Encryption |   <-- Encrypts (Nonce + Counter) with the Key
 +-----------------+
          |
          V
  [Keystream Block]    <-- Pseudo-random block generated by AES
          |
          V
 +-----------------+
 |  XOR Operation  |   <-- Plaintext XOR Keystream
 +-----------------+
          |
          V
 [Ciphertext Block]

  1. Keystream Generation:
    A block cipher (AES) is used to generate a pseudo-random keystream. This is done by encrypting a combination of a unique nonce (or initial counter value) and a counter that increments with each block. The encryption operation here isn’t applied to the data itself but is used to create a keystream that is as secure as the underlying AES encryption.

  2. XOR Operation:
    The plaintext is then combined with this keystream using the XOR operation. The XOR process is simple but powerful — if the same keystream is used more than once, it can lead to vulnerabilities. However, when used correctly (i.e., with a unique nonce and counter for every encryption), it’s nearly impossible to recover the key. The encryption of the nonce/counter pair with the key means that even if someone intercepts the ciphertext, they cannot reverse-engineer the key without solving a computationally infeasible problem.

This two-step process — secure keystream generation followed by a simple XOR — provides a strong encryption, as long as the nonce and counter values are never reused.
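As a concrete illustration, here is a minimal AES-CTR round trip using the PyCryptodome library; the key and nonce values are placeholders, not anything taken from a real device:

```python
from Crypto.Cipher import AES  # pip install pycryptodome

key = bytes.fromhex("000102030405060708090a0b0c0d0e0f")  # placeholder key
nonce = b"\x00" * 8                                       # placeholder 8-byte nonce

plaintext = b"firmware bytes go here"

# Encrypt: AES turns (nonce + counter) into a keystream, which is XORed with the data
enc = AES.new(key, AES.MODE_CTR, nonce=nonce)
ciphertext = enc.encrypt(plaintext)

# Decrypt: the same keystream is regenerated and XORed again
dec = AES.new(key, AES.MODE_CTR, nonce=nonce)
assert dec.decrypt(ciphertext) == plaintext

# Encrypting a second message with the same key and nonce would reuse the keystream,
# which is exactly the many-time pad mistake discussed below
```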

Increasing the “depth”

After noticing the patterns in the diffing view, one thought immediately struck me: I need more data. The next logical step was to gather as many firmware versions as possible. In this case, I managed to collect 34 different firmware images. A quick analysis revealed that, despite the device using proper encryption (likely AES-CTR), the key and the nonce — or counter — values were identical across all versions. This observation points us toward a neat property of XOR.

If two files are encrypted with the same keystream (K), we have:

$$ C_1 = P_1 \oplus K \quad \text{and} \quad C_2 = P_2 \oplus K $$

where $P_1$ and $P_2$ are the original plaintext files, and $C_1$ and $C_2$ are the corresponding ciphertexts.

Now, if we XOR the two ciphertexts, we get:

$$ C_1 \oplus C_2 = (P_1 \oplus K) \oplus (P_2 \oplus K) $$

Using the property that $K \oplus K = 0$ and the associativity of XOR, this simplifies to:

$$ C_1 \oplus C_2 = P_1 \oplus P_2 $$

In other words, the keystream cancels out, leaving you with the XOR of the original files.
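This cancellation is easy to demonstrate with a few lines of Python (the keystream and plaintexts below are purely illustrative):

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

keystream = bytes.fromhex("aa55aa55aa55aa55aa55aa55")  # illustrative reused keystream
p1 = b"firmware A.."
p2 = b"firmware B.."

c1 = xor_bytes(p1, keystream)
c2 = xor_bytes(p2, keystream)

# The keystream cancels out: C1 ^ C2 == P1 ^ P2
assert xor_bytes(c1, c2) == xor_bytes(p1, p2)
```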

Armed with this knowledge, I wrote a simple Python script to calculate the “coverage” — that is, the percentage of the file that, when XORed across all samples, does not simply cancel out to 0x00. The output looked something like this:

[*] Found 34 files. Largest file size = 521732 bytes.
[*] final_xor has 20430 bytes == 0x00.
[*] coverage (nonzero) = 501302 bytes, 96.08% of largest file size.

96% coverage: that’s a very promising starting point. While some data may be irretrievably lost, I at least have the theoretical possibility of recovering 96% of the original content. The only problem now is that my “plaintext” is actually the XOR of two different firmware files.
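For reference, a minimal sketch of that kind of coverage script (the directory layout is an assumption, and shorter images are padded with zeros up to the largest size):

```python
import glob
import os

files = sorted(glob.glob("firmwares/*.bin"))  # hypothetical folder of encrypted images
largest = max(os.path.getsize(f) for f in files)
print(f"[*] Found {len(files)} files. Largest file size = {largest} bytes.")

final_xor = bytearray(largest)
for path in files:
    data = open(path, "rb").read().ljust(largest, b"\x00")  # pad shorter images with zeros
    for i, b in enumerate(data):
        final_xor[i] ^= b

zeros = final_xor.count(0)
nonzero = largest - zeros
print(f"[*] final_xor has {zeros} bytes == 0x00.")
print(f"[*] coverage (nonzero) = {nonzero} bytes, {100 * nonzero / largest:.2f}% of largest file size.")
```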

First fails

Armed with a solid understanding (or so I thought) of XOR properties, and after reading several articles on how straightforward it can be to decrypt a many-time pad, I was extremely confident that this task would be a breeze.

I grabbed some existing code from these solutions and ran it against two XORed files, expecting to quickly crack the encryption. Instead, the results were, well, not great.

De-XORed file, aka FAIL

I tried a variety of other ready-made scripts and tools from GitHub, but each attempt met with similar disappointing outcomes. This forced me back to the drawing board.

Digging deeper into the topic — especially through scientific papers — it quickly became evident that most of the prior work had been conducted against text. Analyzing text is relatively straightforward due to the properties of ASCII.

The only work I came across that ventured beyond plain ASCII was this paper, which mainly focused on documents and files containing predominantly text, where the structure is well defined (such as HTML or Microsoft Office documents).

This called for a deeper exploration into why the same techniques fall short when applied to firmware and other non-text data.

Redefining the problem

In my case, the problem was much broader. I had two firmware files that were XORed together at unknown offsets, and the underlying bytes could represent anything — op codes, memory addresses, integers, ASCII text, or any other data. This meant there were no reliable anchors on which to base my analysis. Normally, when breaking a many-time pad, you make byte-by-byte guesses and expand from a promising starting point. But here, with such heterogeneous data, that approach falls apart.

One of the most effective techniques in many-time pad attacks is “Crib Drag”. You take a piece of text you suspect might be present in one of the plaintexts and slide it along until you hit a spot where legible text appears on both sides. Of course, this method — like most others — assumes that both sides of the XOR are predominantly ASCII text.
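For illustration, a minimal crib-drag sketch over the XOR of two plaintexts; the input file and the crib string are hypothetical:

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def crib_drag(xored: bytes, crib: bytes) -> None:
    """Slide a suspected plaintext fragment along P1^P2 and print what the other side would be."""
    for offset in range(len(xored) - len(crib) + 1):
        candidate = xor_bytes(xored[offset:offset + len(crib)], crib)
        if all(32 <= c < 127 for c in candidate):  # keep only printable ASCII candidates
            print(f"offset {offset:#06x}: {candidate.decode('ascii')}")

xored = open("p1_xor_p2.bin", "rb").read()  # hypothetical XOR of two plaintexts
crib_drag(xored, b"Copyright ")             # illustrative crib
```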

I tried using Crib Drag and several other techniques on the XORed firmware files, but the results were no better than before. The core issue is that these firmwares contain very little ASCII text, and the small fragments that do exist rarely overlap; instead, you often find ASCII on one side of the XOR and op codes or other binary data on the other, making it difficult to tell whether both sides “make sense”.

I also implemented a Hidden Markov Model (HMM), following the approach described in this paper, but I encountered the same problem: there were no well-defined structures to latch onto. You might think, “What about the IVT (Interrupt Vector Table) or other common structures?” Unfortunately, those elements turned out to be nearly identical across all the firmwares, so when XORed together, they simply canceled out to 0x00.

With all the existing methods exhausted, it was time to think outside the box.

Gathering more source material

To make any headway, I quickly realized I needed more data. Since the device is based on a very common microcontroller platform, I turned to GitHub and collected hundreds of firmware files built for the same architecture. With this additional data, I could finally put some of my favorite statistical tools to work — tools I’ve discussed extensively in past work.

The first step was to visualize the data. I plotted the byte distribution from a known firmware file against a folder of XORed firmwares from my device. This served as a crucial sanity check, confirming that I was indeed dealing with firmware for the target device rather than some unrelated binary.

Byte distribution, same architecture vs. different architecture

As you can see from the 2D plots above, the byte distribution is markedly different when comparing files from the same architecture versus those from a different one. For the same architecture, the most common values align with expected op codes. When you analyze a large number of firmware files, patterns emerge — certain instructions, like MOV, show up more frequently.

Given the uniformity across these firmware files, I decided to try a straightforward approach. I took my folder of known firmware files, XORed them together at all possible offsets, and built a large 2D probability distribution. The idea was that if you XOR two bytes from known firmware files, the output should fall within a predictable range. Unfortunately, as with my earlier attempts, applying the method to the unknown XORed files just yielded garbage.
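For the curious, here is a rough sketch of how such a byte-pair distribution can be built; the reference folder is an assumption, and for brevity this version only pairs files at a single alignment rather than at all offsets:

```python
import glob
import itertools
import numpy as np

ref_files = [open(p, "rb").read() for p in glob.glob("reference_firmwares/*.bin")]

# 256x256 histogram: how often does byte value a in one firmware line up with byte value b in another?
hist = np.zeros((256, 256), dtype=np.uint64)
for f1, f2 in itertools.combinations(ref_files, 2):
    n = min(len(f1), len(f2))
    a = np.frombuffer(f1[:n], dtype=np.uint8)
    b = np.frombuffer(f2[:n], dtype=np.uint8)
    np.add.at(hist, (a, b), 1)

# Normalize into a probability distribution over byte pairs; hist[a, b] then estimates
# how plausible it is that values a and b sit at the same position in two valid firmwares
prob = hist / hist.sum()
```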

Still, the probability distributions confirmed that there is structure hidden within these files. Moreover, the fact that firmware typically has only a finite set of valid instructions — meaning there are only so many bytes that can logically follow a given byte — suggests that it should be possible to build a more effective method to guess the correct values based on the context.

Machine Learning to the Rescue

Given the complexity and ambiguity of inverting XOR in this context — and the undeniable patterns hidden within the data — it quickly became clear that a more sophisticated model was needed to capture these relationships and structures. Rather than relying solely on manual techniques, I decided to let machine learning do the heavy lifting.

I could have phrased it like, “It was empirically discovered that…” if this were a scientific paper. But since this is a blog post, I’ll simply say that after months of trial and error — testing various algorithms and experimenting with countless parameter configurations — I found that an LSTM (Long Short-Term Memory) neural network was the best bet. LSTMs can learn long-range dependencies in input sequences, which is why they’re a go-to in natural language processing, speech recognition, and similar fields.

Instead of painstakingly trying to decipher how each byte depends on its neighbors, I fed all the firmware data into the neural network. The goal was to have the model learn the underlying patterns and, given a specific context, predict the most likely sequence of bytes.
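To make that concrete, here is a minimal PyTorch sketch of a bidirectional LSTM that maps a 16-byte XORed window to per-position byte predictions. It is a baseline in the same spirit as the training script linked at the end of this post, not the tuned architecture, and the layer sizes are placeholders:

```python
import torch
import torch.nn as nn

class XorInverter(nn.Module):
    """Bidirectional LSTM that predicts one plaintext byte per position of a 16-byte XORed window."""

    def __init__(self, hidden_dim: int = 256, num_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(256, 64)           # one embedding per byte value
        self.lstm = nn.LSTM(64, hidden_dim, num_layers,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(hidden_dim * 2, 256)   # *2 because of bidirectionality

    def forward(self, x):                            # x: (batch, 16) byte values 0..255
        h, _ = self.lstm(self.embed(x))              # h: (batch, 16, hidden_dim * 2)
        return self.head(h)                          # logits: (batch, 16, 256)

model = XorInverter()
xored = torch.randint(0, 256, (8, 16))    # a dummy batch of XORed windows
target = torch.randint(0, 256, (8, 16))   # a dummy batch of plaintext windows
logits = model(xored)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 256), target.reshape(-1))
```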

A little math puts the challenge into perspective:

Given a single byte, the probability of guessing it correctly is 0.39%.

For a sequence of 16 bytes, the chance of getting every byte right is 2.9e-37%.
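For reference, the arithmetic behind those figures:

$$ \frac{1}{256} \approx 0.39\%, \qquad \left(\frac{1}{256}\right)^{16} = 2^{-128} \approx 2.9 \times 10^{-39} = 2.9 \times 10^{-37}\,\% $$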

These numbers illustrate just how daunting the problem is. Early experiments with the model weren’t very encouraging — the predictions were not much better than random guesses. In fact, during the process of tuning, I even encountered a set of parameters where the model’s predictions were almost three times worse than random guessing. That was an achievement of its own kind.

I initially trained the model on a cluster of two M4 Mac Minis. While memory was never an issue, the raw computing power soon became a limiting factor.

Expensive Toys

Firmware is a complex beast — even on simple microcontroller architectures. It quickly became clear that I needed to gather more training data and massively ramp up the complexity of my model to capture the dependencies between instructions and the overall structure of a firmware. This slightly changed my training time: what earlier took 5 minutes per epoch (an epoch being one round of training) all of a sudden started taking 6.5 hours. In my experiments, hitting peak accuracy required anywhere from 10 to 30 epochs, which meant that continuing to train on my M4 cluster was no longer practical.

After some research, I discovered services like Lambda Labs, Cudo Compute, and, of course, the big players such as AWS and Azure, which offer “spot instances.” These are spare computing resources that can be used at a steep discount. With a bit of shopping around, I managed to secure a spot instance featuring 22 vCPUs, 88GB of RAM, and — most importantly — an Nvidia A100 GPU with 40GB of VRAM. The price? A mere 0.15€/hour!

Nvidia A100

From my initial trials, it was evident why this particular GPU carries a five-figure price tag. I doubled both the model complexity and the amount of training data, and still, the training data loading time dropped from 6 hours to just 2 minutes. Each epoch then took around 20 minutes — a massive improvement. This meant that training the model would only cost me a couple of Euros.

The Model and Its Performance

Thanks to the power of the expensive toys and countless hours of tuning, I finally managed to get my model to correctly predict over 35% of any XORed 16-byte sequence in previously unseen material.

Over 35% accuracy on any XORed 16-byte sequence — an astronomical improvement over the random-guess probability of 2.9e-37% (a decimal point followed by 36 zeros before the 29)!

The model’s breakthrough came from employing reinforcement learning, which rewarded correct sequences and helped the network learn long-range dependencies. Early on, I realized that while the model could predict individual bytes, scattered correct predictions weren’t enough. It was crucial to nail down contiguous sequences — just like in the “Crib Drag” technique — to use them as anchors for further decryption in areas the model couldn’t get right.

Even though the model only got around 35% of the bytes right (roughly 4 to 9 bytes out of 16 in any given window), I wasn’t ready to throw in the towel. I implemented several strategies at the inference stage to push the predictions further:

  1. Language Model Validation:
    For any prediction that resembled ASCII text, I used a standard natural language model to check whether the output formed real words or coherent text. If the result wasn’t valid, the model was penalized, nudging it toward more accurate predictions.

  2. Instruction Set Validation:
    I encoded the microcontroller’s instruction set into Python arrays (see the sketch after this list). This allowed me to verify whether a predicted opcode — say, a memory move instruction — was valid. If the prediction didn’t match any known instruction, the model would be guided to adjust its output.

  3. Cross-Validation Between Firmware Files:
    Since both files that produced the XORed output are valid firmware, any prediction made on one should yield valid content in the other. I ran the model in reverse to create a feedback loop, ensuring that the predictions remained consistent across both files.

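As an example of the second strategy, here is a minimal sketch of an instruction-set check. The mask/value patterns are made up for illustration and do not describe any particular microcontroller:

```python
# Each entry is (mask, value): a 16-bit word w belongs to that instruction class
# if (w & mask) == value. These patterns are illustrative, not a real ISA.
VALID_PATTERNS = [
    (0xF000, 0x1000),  # e.g. a "move register" class
    (0xF800, 0x4800),  # e.g. a "load literal" class
    (0xFF00, 0xDF00),  # e.g. a "software interrupt" class
]

def is_valid_instruction(word: int) -> bool:
    return any((word & mask) == value for mask, value in VALID_PATTERNS)

def instruction_score(predicted: bytes) -> float:
    """Fraction of 16-bit words in a predicted window that decode to valid instructions."""
    words = [int.from_bytes(predicted[i:i + 2], "little") for i in range(0, len(predicted) - 1, 2)]
    return sum(is_valid_instruction(w) for w in words) / max(len(words), 1)
```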
Together, these techniques significantly improved the model’s reliability.

The Results

You made it this far! Congratulations — or if you skipped straight here, I don’t blame you. Here’s a summary of what was achieved:

Method                                      Cumulative Correctness
The base model                              ~35%
ASCII validation (inference)                ~38%
Instruction set validation (inference)      ~52%
Correctness validation against $P_2$        ~65%

After all the validation steps, I’m now recovering ~65% of the original firmware from the XORed files. That still leaves about 35% of unknown data, which may or may not be problematic depending on your end goal.

I tested the results by importing several of the known firmwares — processed through the XOR inversion pipeline — into Ghidra. In most cases, Ghidra was able to piece together the high-level structure of the firmware, including various functions, which makes understanding the overall design much more approachable.

The main shortfall of the algorithm lies in handling addresses and their references. Even though the firmwares share a fixed memory map, firmware A isn’t guaranteed to use it in exactly the same way as firmware B. However, since the pipeline can extract contextual clues from instructions that use addresses, it’s often possible to make educated guesses and manually patch these instructions in Ghidra to restore the correct flow.

This method works particularly well for simpler firmware architectures with smaller memory spaces, though it naturally becomes more challenging with larger memories. That’s where your reverse engineering expertise comes into play — you will need to do some manual patching. Nevertheless, the pipeline provides a solid foundation by extracting enough context around the instructions.

Summary

I set out to break AES firmware encryption — a challenge that’s become all too common as manufacturers turn to secure storage driven by regulations and competitive pressures. At first glance, the high entropy in these files signaled a dead end, but a deeper look revealed hidden patterns waiting to be exploited.

I began by exploring the properties of XOR in many-time pad scenarios, only to discover that traditional text-based techniques like crib dragging fall flat when faced with a mix of op codes, addresses, and various data types. This forced me to redefine the problem: instead of having clear anchors, I had to contend with firmware that offered few reliable hints.

To tackle this, I gathered a vast amount of additional data from GitHub and analyzed byte distributions to confirm that familiar architectural patterns were indeed present. With this extra source material in hand, I turned to machine learning. By training an LSTM model with reinforcement learning and employing several validation strategies — ranging from ASCII checks and instruction set verification to cross-validation between firmware images — I steadily improved my predictions.

Leveraging powerful spot instance hardware (an Nvidia A100 at a fraction of the usual cost), I scaled up the model to handle the complexity of firmware. The result? I managed to recover roughly 65% of the original firmware from XORed files — a massive leap from the nearly zero chance of random guessing.

This project shows that even when starting from what appears to be “nothing,” a blend of some statistical analysis, machine learning, and “try harder” can turn that nothing into something.

The source code for experimentation and some notes

Python script using PyTorch for training a bidirectional LSTM model to invert XOR.

Another script to perform inference using the saved model on an unknown binary.

These scripts are a bare-bones implementation of the techniques described in this post, but they should provide a good baseline model with roughly 30% accuracy. The natural language model, instruction set, and cross-file validations all depend heavily on the architecture in question and are left to the reader to implement, as are the collection and pre-processing of the training data.

In my own experiments, I collected a couple of hundred different firmware files from GitHub and turned them into a synthetic data set of 2 million sample pairs of valid and XORed data. The data was pre-processed into 16-byte windows, as in my experiments this proved to be the best compromise between complexity and context. The training was also done using a 16-byte window.
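A rough sketch of that kind of pre-processing (the folder path and the pairing strategy are assumptions, not the exact pipeline):

```python
import glob
import random
import numpy as np

WINDOW = 16

def windows(data: bytes):
    """Split a firmware image into non-overlapping 16-byte windows."""
    return [data[i:i + WINDOW] for i in range(0, len(data) - WINDOW + 1, WINDOW)]

# Hypothetical folder of reference firmwares collected from GitHub
firmware_windows = [windows(open(p, "rb").read())
                    for p in glob.glob("training_firmwares/*.bin")]

xs, ys = [], []
for _ in range(2_000_000):                            # size of the synthetic data set
    wa_list, wb_list = random.sample(firmware_windows, 2)
    wa, wb = random.choice(wa_list), random.choice(wb_list)
    xored = bytes(a ^ b for a, b in zip(wa, wb))
    xs.append(np.frombuffer(xored, dtype=np.uint8))   # model input: XOR of two windows
    ys.append(np.frombuffer(wa, dtype=np.uint8))      # model target: one of the plaintext windows

np.savez_compressed("xor_dataset.npz", x=np.stack(xs), y=np.stack(ys))
```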

The hyperparameters used in the training script are the result of trial and error, so they might need large adjustments depending on the training data, the architecture, and/or the hardware you are using.

When using a bidirectional LSTM, the hidden dimension doubles, so it is important to keep this in mind when doing inference, as otherwise the model dimensions will be incorrect.
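In PyTorch terms, the point is simply that any layer consuming the LSTM output, and any model you rebuild before loading saved weights for inference, must be sized for both directions. A small hedged example:

```python
import torch.nn as nn

hidden_dim = 256
lstm = nn.LSTM(input_size=64, hidden_size=hidden_dim, num_layers=2,
               batch_first=True, bidirectional=True)

# The output feature size is hidden_dim * 2 (forward + backward directions);
# size the head accordingly, or loading the trained weights for inference will fail.
head = nn.Linear(hidden_dim * 2, 256)
```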

About training times and resource use.

In my experiments, using a training dataset of roughly 2 million samples, a bidirectional LSTM, and the RL optimizations, training consistently pulled 15-20GB of GPU RAM, so to be on the safe side the GPU needs to have at least 24GB of RAM available. The training time on a single A100 GPU was around 22 minutes per epoch, and the model would start to converge around 30 epochs, so expect a training time of around 11 hours when using similar parameters. Based on my experiments, I wouldn’t go for a GPU of lower spec than an Nvidia A100, as the training times will increase considerably, making the training take weeks.

I also ran some benchmark tests on Nvidia H100 and H200 machines. The performance increases were marginal while the price increase per hour is substantial: the H100 ran the same training loop in 16 minutes per epoch and the H200 in 10 minutes, but at roughly 10x the hourly price of the A100.