All your piezo are belong to us

Microphones are everywhere: smartphones, IoT devices, and so forth. Speakers can be used as microphones too. But what is even more ubiquitous than either of those? Piezo buzzers. They are literally everywhere: smoke detectors, thermostats, electric kettles, fridges, washing machines, computers, and the list goes on. This post is about finding out whether piezos can be used as microphones.

Idea

This story starts, as every good story should, with a photo of a PCB and a conversation about that photo. The photo, or rather the peculiar-looking piezo canister on the PCB, stuck in my head.

I pushed the idea away multiple times with “these emit only a single frequency, so they cannot be used as a microphone”, but the idea always came back with “what if…”.

It would be simple enough to try, so one late evening I decided that I would just hook up a piezo to an oscilloscope and run it through an FFT to see if there was anything to this.

Piezo connected to an oscilloscope running FFT

Once it became clear that you could pick up formant-looking features in the FFT, I knew I had to build a minimal PoC just to put this to bed and get it out of my head.

Minimal PoC

The minimal PoC was just an RPi Pico 2 with some random piezo buzzer connected directly to an ADC pin on the Pico. Then some 80 lines of Python to record raw ADC values as PCM mono audio. Then the rabbit hole opened.
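
A minimal sketch of what that capture script might look like, assuming MicroPython on the Pico 2 with the piezo on ADC0 (GPIO26) and a made-up 8 kHz sample rate (the actual script may differ, and the busy-wait timing is only approximate):

```python
# Minimal capture sketch (assumes MicroPython, piezo on ADC0/GPIO26, 8 kHz rate;
# the timing loop is approximate and the real script may differ).
from machine import ADC
import struct
import time

adc = ADC(26)                  # GPIO26 = ADC0 on the Pico 2
SAMPLE_RATE = 8000             # Hz, assumed
SECONDS = 5

with open("capture.pcm", "wb") as f:
    period_us = 1_000_000 // SAMPLE_RATE
    next_tick = time.ticks_us()
    for _ in range(SAMPLE_RATE * SECONDS):
        # read_u16() gives 0..65535; re-center to signed 16-bit mono PCM
        f.write(struct.pack("<h", adc.read_u16() - 32768))
        next_tick = time.ticks_add(next_tick, period_us)
        while time.ticks_diff(next_tick, time.ticks_us()) > 0:
            pass
```

The raw file can then be pulled off the Pico and imported into an audio editor as signed 16-bit mono PCM.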

Spectrograms

After recording a few short clips and realizing there might be more to this than I initially thought, I put the LLMs to work creating C firmware that sampled the ADC and pushed those samples straight to my host over USB. Then I made a Python tool that I could run on the host to receive the raw samples and plot a spectrogram in “real time”. This allowed me to capture longer samples without being tied to the small flash on the Pico.
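
A simplified sketch of that host-side tool, assuming the firmware streams raw 16-bit little-endian samples over USB CDC at 16 kHz on /dev/ttyACM0 (the real protocol and sample rate may differ):

```python
# Host-side spectrogram sketch (assumes pyserial, numpy, matplotlib installed and
# raw little-endian int16 samples arriving over USB CDC).
import numpy as np
import serial
import matplotlib.pyplot as plt

PORT = "/dev/ttyACM0"
SAMPLE_RATE = 16000
CHUNK = 4096  # samples per FFT column

with serial.Serial(PORT, 115200, timeout=1) as ser:
    plt.ion()
    fig, ax = plt.subplots()
    columns = []
    while plt.fignum_exists(fig.number):
        raw = ser.read(CHUNK * 2)
        if len(raw) < CHUNK * 2:
            continue
        samples = np.frombuffer(raw, dtype="<i2").astype(np.float32)
        # Window, FFT, convert to dB, and append as one spectrogram column
        spectrum = np.abs(np.fft.rfft(samples * np.hanning(len(samples))))
        columns.append(20 * np.log10(spectrum + 1e-6))
        ax.clear()
        ax.imshow(np.array(columns).T, origin="lower", aspect="auto",
                  extent=[0, len(columns), 0, SAMPLE_RATE / 2])
        ax.set_xlabel("frame")
        ax.set_ylabel("frequency (Hz)")
        plt.pause(0.01)
```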

Listening to some of the first longer captures put some reality back into the whole thing. This thing ain’t no Shure SM58. It is more like something the SETI Institute would record with a radio telescope and then play back through a speaker in a tin can. The recordings are pretty much unintelligible. I tried various signal-processing tricks, filtering, and so on, but there is really no way of pulling any clearly intelligible audio out of the files.

Spectrogram flipped 90 degrees: horizontal axis is frequency (DC–7 kHz), vertical axis is time

Now, if you look at the spectrogram closely, and you have done anything related to speech recognition in the past, there are one or two things you can notice. Firstly, different words look different in the spectrogram, and secondly, you can see some formant-type action going on there too. Both can be seen on either side of the mouse cursor.

Speech recognition to the rescue

The fact that there is no human-legible audio in the captures does not mean you could not build something to recognize what is being said. You can train a speech recognition model using the aforementioned Shure SM58, but you should equally well be able to do the same using this makeshift piezo microphone.

The process works exactly the same way, but here we would just be training the model on a different representation of the words being said. The model does not really care as long as there are some reasonably reliable features it can latch onto, and there seem to be clear differences between the words.

“Hello” sample: spectrogram flipped 90 degrees, horizontal axis is frequency (DC–7 kHz), vertical axis is time

For the speech recognition PoC I created a very small dataset. Using the USB sample-streaming pipeline, I recorded 15 samples of the word “Hello” being said from various angles and distances.

Then, to have the other necessary data for model training, I recorded 30 samples of random background noise and another 30 samples of other words being said, again from various angles and distances.

The dataset is not large, and intentionally so. The point was just to get some proof of whether it is possible to train such a model and make it work. I tried a bunch of different modern speech-recognition techniques and ultimately settled on MFCC (Mel Frequency Cepstral Coefficients), originally developed in the 1980s, which can be used for both speech and speaker recognition.
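
As a rough illustration of the MFCC-plus-classifier idea (not the exact pipeline used here; the dataset paths and the SVM classifier are stand-ins), something like this would do:

```python
# MFCC + simple classifier sketch; file layout and classifier choice are assumptions,
# not the actual pipeline from this project.
import glob
import librosa
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def mfcc_features(path, sr=16000, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    # Mean + std over time gives one fixed-length feature vector per clip
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

X, y = [], []
for path in glob.glob("dataset/hello/*.wav"):
    X.append(mfcc_features(path)); y.append(1)
for path in glob.glob("dataset/other/*.wav") + glob.glob("dataset/noise/*.wav"):
    X.append(mfcc_features(path)); y.append(0)

clf = SVC(kernel="rbf")
print(cross_val_score(clf, np.array(X), np.array(y), cv=5).mean())
```

Averaging the MFCCs over time throws away the temporal structure, but for a single-word detector over a tiny vocabulary it is often enough.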

MFCC detection: horizontal axis is time, vertical axis is frequency (DC–7 kHz)

And oh boy, it did work. To my amazement, even with the very small dataset it was hitting 80–90% accuracy, properly picking up “hello” from other words and noises.

At this stage I was still cheating, meaning I was only doing the sampling on the Pico while the model inference was running on my host laptop. The next obvious question was: can this run on an MCU?

Will it run?

Way back in the 1970s, DARPA was already running a project with the goal of doing speech recognition with a vocabulary size of 1000 words. There are also various examples dating from the late 1970s to the early 1980s where speech recognition was running on mainframes such as the PDP-10.

Those machines had significantly less processing power than a modern microcontroller such as the Cortex-M33 in the Pico 2. So surely it should be possible to do this on a modern MCU platform.

The first test was to run this on the Pico 2. This is the ideal scenario, where there is nothing else in the way and the piezo is connected directly to the ADC.

The hello model is 57kB in size, easily fits in flash and RAM, and the ARM Cortex-M33 is fast enough to run it, with inference taking around 1–2 seconds. That is not super fast, but the model was not optimized for the ARM platform in any way and made no use of the fancy DSP hardware modern chips have.

Disassembling devices

The next step was to disassemble some common household devices that have piezos to see, first of all, whether any of these were running an MCU capable of doing something like this.

There needs to be a bit of luck involved to make this work on a real device. There are some preconditions: the piezo needs to be the kind that does not have a driver built into it, the device’s driving circuit needs to be directly connected to the MCU and ideally use only passives like protection diodes, and last but not least, the PWM pins driving the piezo must also be reconfigurable as ADC pins via pin multiplexing.

Fortunately, many devices seemed to have all three preconditions met, and with that the random selection ended up being a previously broken washing machine controller board (chosen because it did not require disassembling anything), a popular (?) “smart” kettle, and an IoT smoke detector.

OK, the selection was not entirely random, as I had already previously disassembled the devices and knew the firmware was easy to dump from the unprotected external flash. I had also at least partially reverse engineered the firmware.

These are also all devices that sit idle most of the time and only do their actual job for limited periods. Meaning, it should be possible to patch the firmware to do speech recognition while the device is idle and then switch back to “normal operating mode” once some actual function is activated, such as heating in the kettle.

All of the selected devices had 512KB–2MB of free space in their flash. A float32 model of 100 words would be around 1.25MB in size. With some quantization this can be reduced to int16 at around 624KB and int8 at around 312KB. So two of the devices would be able to hold the whole model plus the extra code (132KB). On one of the devices I would have to go to int8 quantization and probably squeeze some more stuff out of the code too.
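
A quick back-of-the-envelope check of those numbers (assuming the model size scales linearly with weight width):

```python
# Sanity check of the quantized model sizes quoted above
float32_bytes = 1.25e6                    # ~1.25MB for the 100-word float32 model
weights = float32_bytes / 4               # ~312,500 float32 weights
print(weights * 2 / 1e3, "kB as int16")   # ~625kB
print(weights * 1 / 1e3, "kB as int8")    # ~312kB
```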

Kettle it is

With some gracious help from an LLM, I patched the kettle firmware to run speech recognition while dormant. I chose the kettle because it had an easy “dormant” loop where it was just waiting for one of the buttons to be pressed before jumping into the PID logic and so on. After that it would always return to the same dormant loop. So I modified that dormant loop to run the inference code while still functioning as a normal kettle.

Kettle mic capture: horizontal axis is time, vertical axis is frequency (DC–7 kHz)

Above is a spectrogram sample recorded with the kettle. Words are clearly visible, and they were also detected by the speech recognition algorithm with a 50–60% success rate.

I think the plastic and stainless steel body of the kettle distorts the captures, meaning the model should probably be retrained with the kettle fully assembled to improve accuracy. But hey, it works. In the recording above, the words were spoken from around 1–1.5m away from the kettle in a quiet room.

As a side-note, I have no idea what the periodic features in the spectrogram are or where they were coming from. But they look interesting.

Round-up

So where does this leave us? This obviously ticks all the boxes for conspiracy theories about “my kettle is listening to me”. But there are also things you could genuinely do with this.

If you ever wanted simple voice control for your kettle (not sure why anyone would), it seems it would be possible without changing anything but the firmware on the device.

I think the biggest takeaway for me was just how surprisingly well piezos double as microphones.

Resources

Code available on GitHub