By the Outspoken Team · March 20, 2026
Wake Word Detection in Python with ONNX Runtime
ONNX Runtime makes it straightforward to run a trained wake word model in Python with minimal dependencies. No cloud calls, no latency from a remote API, no licensing per-device. The model runs entirely on the local machine -- including on a Raspberry Pi or a headless Linux server.
This guide covers two approaches: using the openWakeWord Python package as a convenient wrapper, and running the three-model ONNX pipeline directly with onnxruntime for full control. Both work with models trained on Outspoken.
Prerequisites
You need Python 3.9 or later. Install the required packages:
pip install onnxruntime sounddevice numpy
sounddevice gives you clean access to the system microphone through PortAudio. If you're on Linux and don't have PortAudio installed:
sudo apt-get install libportaudio2
You also need your trained ONNX model. Download it from the Outspoken dashboard after training completes. The file will be named after your wake word, for example hey_jarvis.onnx.
Don't have a model yet?
You can train a custom wake word for €1 at Outspoken -- the first model is free. Enter your wake word, configure training parameters, and download the ONNX file when the job completes (~45 minutes on GPU). See the training guide for parameter advice.
How the Inference Pipeline Works
Before jumping into code, it helps to understand what actually runs at inference time. The openWakeWord pipeline uses three models in sequence:
- melspectrogram.onnx -- converts a chunk of raw audio (1280 samples at 16kHz, ~80ms) into a mel spectrogram
- embedding_model.onnx -- produces a compact audio embedding from that spectrogram
- your_wake_word.onnx -- classifies the embedding: is this the wake word?
The classifier doesn't look at just one chunk. It maintains a sliding window of the last 16 embeddings (~1.3 seconds of audio) and scores the entire window on each step. This is what lets it detect the wake word even if the phrase spans multiple chunks.
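The window mechanics can be sketched with a plain deque. Scalar step numbers stand in for embedding vectors here; the point is that a fixed maxlen drops the oldest entry as each new one arrives, so the classifier always sees the most recent 16.

```python
from collections import deque

WINDOW_SIZE = 16

# A fixed-length deque discards the oldest entry automatically.
window = deque([0.0] * WINDOW_SIZE, maxlen=WINDOW_SIZE)

for step in range(20):
    window.append(float(step))  # one new embedding per ~80ms chunk

# After 20 steps the window holds steps 4..19 -- the last ~1.3s of context.
print(list(window)[0], list(window)[-1])  # prints: 4.0 19.0
```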
The audio requirements are strict:
- Sample rate: 16kHz
- Format: 16-bit PCM or float32
- Chunk size: exactly 1280 samples per inference step
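If your capture pipeline produces something else (48kHz, stereo, int16), convert it before inference. A minimal NumPy sketch of that conversion follows -- the linear-interpolation resampler is a rough stand-in, and for production audio you'd want a proper filter such as scipy.signal.resample_poly:

```python
import numpy as np

def prepare_audio(samples, src_rate, target_rate=16000):
    """Convert PCM audio to the 16kHz mono float32 the pipeline expects."""
    samples = np.asarray(samples)
    if samples.dtype == np.int16:
        samples = samples.astype(np.float32) / 32768.0  # int16 -> [-1, 1]
    samples = samples.astype(np.float32)
    if samples.ndim == 2:  # (frames, channels) -> mono by averaging
        samples = samples.mean(axis=1)
    if src_rate != target_rate:
        # Crude linear-interpolation resample; fine for a sanity check.
        n_out = int(round(len(samples) * target_rate / src_rate))
        x_old = np.linspace(0.0, 1.0, num=len(samples), endpoint=False)
        x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
        samples = np.interp(x_new, x_old, samples).astype(np.float32)
    return samples
```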
Option 1: Using the openWakeWord Python Package
The openWakeWord package handles the three-model pipeline internally. If you don't need to customize the inference loop, this is the quickest path to a working detector.
pip install openwakeword
The package downloads the shared melspectrogram and embedding models automatically on first use.
import openwakeword
import sounddevice as sd
import numpy as np
# Download shared models on first run
openwakeword.utils.download_models()
# Load your custom model
model = openwakeword.Model(
    wakeword_models=["hey_jarvis.onnx"],
    inference_framework="onnx",
)

SAMPLE_RATE = 16000
CHUNK_SIZE = 1280  # ~80ms

print("Listening... say your wake word.")

def audio_callback(indata, frames, time, status):
    # indata shape: (1280, 1) -- take the single channel as (1280,)
    audio_chunk = indata[:, 0]
    prediction = model.predict(audio_chunk)
    for wake_word, score in prediction.items():
        if score > 0.5:
            print(f"Wake word detected! [{wake_word}] score={score:.3f}")

with sd.InputStream(
    samplerate=SAMPLE_RATE,
    channels=1,
    dtype="int16",  # openWakeWord's predict() expects 16-bit PCM samples
    blocksize=CHUNK_SIZE,
    callback=audio_callback,
):
    input("Press Enter to stop...\n")

The model.predict() call returns a dict mapping each loaded wake word to its current detection score (0.0 -- 1.0). A threshold of 0.5 works for most models, but you can adjust it up to reduce false activations or down to catch more real ones.
Adjusting the threshold at runtime
You don't need to retrain to tune sensitivity. Raise the threshold (0.6--0.8) if the model is firing on background speech. Lower it (0.3--0.4) if it's missing real activations. This is a tradeoff: higher threshold means fewer false positives but also fewer true positives.
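Beyond a single-frame threshold, a common robustness trick is to require several consecutive high-scoring chunks before firing, then suppress re-triggering for a short cooldown. A small illustrative helper -- the class name and defaults are my own, not part of openWakeWord:

```python
class WakeWordDebouncer:
    """Fire only after `consecutive` chunks score above `threshold`,
    then ignore activations for `cooldown` chunks (~80ms each)."""

    def __init__(self, threshold=0.5, consecutive=2, cooldown=25):
        self.threshold = threshold
        self.consecutive = consecutive
        self.cooldown = cooldown
        self._streak = 0
        self._cooldown_left = 0

    def update(self, score):
        # During cooldown, swallow everything.
        if self._cooldown_left > 0:
            self._cooldown_left -= 1
            return False
        self._streak = self._streak + 1 if score > self.threshold else 0
        if self._streak >= self.consecutive:
            self._streak = 0
            self._cooldown_left = self.cooldown
            return True
        return False
```

Call debouncer.update(score) once per chunk in the audio callback and trigger only when it returns True.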
Option 2: Raw ONNX Runtime Pipeline
If you need full control -- custom preprocessing, integration with an existing audio pipeline, or you want to avoid the openWakeWord dependency -- you can run the three models directly with onnxruntime.
You'll need to download two additional shared model files from the openWakeWord repository: melspectrogram.onnx and embedding_model.onnx.
Or let the openWakeWord package download them and grab them from its install directory:
import openwakeword, os
model_dir = os.path.dirname(openwakeword.__file__)
print(os.path.join(model_dir, "resources", "models"))
The full inference script
import numpy as np
import onnxruntime as ort
import sounddevice as sd
from collections import deque
# --- Load all three models ---
mel_session = ort.InferenceSession("melspectrogram.onnx")
embed_session = ort.InferenceSession("embedding_model.onnx")
ww_session = ort.InferenceSession("hey_jarvis.onnx")
# Sliding window: 16 embeddings (~1.3 seconds of context)
WINDOW_SIZE = 16
embedding_window = deque(maxlen=WINDOW_SIZE)
# Pre-fill the window with zeros so the first predictions see a full window
embedding_dim = ww_session.get_inputs()[0].shape[-1]
for _ in range(WINDOW_SIZE):
    embedding_window.append(np.zeros(embedding_dim, dtype=np.float32))
SAMPLE_RATE = 16000
CHUNK_SIZE = 1280 # exactly 1280 samples per step
THRESHOLD = 0.5
def run_inference(audio_chunk: np.ndarray) -> float:
    """
    Run one inference step.
    audio_chunk: float32 array of shape (1280,), values in [-1.0, 1.0]
    Returns the wake word detection score.
    """
    # Step 1: mel spectrogram
    # Input shape expected by melspectrogram.onnx: (1, 1280)
    # The shared mel model operates on int16-range values, so scale
    # float32 [-1, 1] audio up before feeding it in.
    mel_input = (audio_chunk * 32767).reshape(1, -1).astype(np.float32)
    mel_output = mel_session.run(None, {mel_session.get_inputs()[0].name: mel_input})
    # mel_output[0] shape: (1, 32, 96) -- one spectrogram frame

    # Step 2: audio embedding
    # embedding_model.onnx expects (1, 1, 32, 96)
    spec = mel_output[0]              # (1, 32, 96)
    spec = spec[:, np.newaxis, :, :]  # (1, 1, 32, 96)
    embed_output = embed_session.run(
        None, {embed_session.get_inputs()[0].name: spec}
    )
    embedding = embed_output[0].squeeze()  # 1D embedding vector

    # Step 3: update sliding window and classify
    embedding_window.append(embedding)
    window_array = np.stack(list(embedding_window), axis=0)  # (16, embed_dim)
    window_input = window_array[np.newaxis, :, :]            # (1, 16, embed_dim)
    ww_output = ww_session.run(
        None, {ww_session.get_inputs()[0].name: window_input.astype(np.float32)}
    )
    score = float(ww_output[0].squeeze())
    return score
def audio_callback(indata, frames, time, status):
    if status:
        print(f"Audio status: {status}")
    audio_chunk = indata[:, 0]  # (1280,) float32
    score = run_inference(audio_chunk)
    if score > THRESHOLD:
        print(f"Wake word detected! score={score:.3f}")
print("Listening... say your wake word. Ctrl+C to stop.")
try:
    with sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=1,
        dtype="float32",
        blocksize=CHUNK_SIZE,
        callback=audio_callback,
    ):
        while True:
            sd.sleep(100)
except KeyboardInterrupt:
    print("Stopped.")
Input shapes may vary
The exact input shapes for melspectrogram.onnx and embedding_model.onnx can differ slightly between openWakeWord releases. If you get shape errors, inspect the model inputs with session.get_inputs() and adjust the reshape calls accordingly. The wake word classifier's expected input shape is available as ww_session.get_inputs()[0].shape.
Inspecting model I/O
If you're unsure about a model's expected inputs and outputs, this snippet prints them:
import onnxruntime as ort
for path in ["melspectrogram.onnx", "embedding_model.onnx", "hey_jarvis.onnx"]:
    session = ort.InferenceSession(path)
    print(f"\n--- {path} ---")
    for inp in session.get_inputs():
        print(f"  input:  {inp.name} shape={inp.shape} type={inp.type}")
    for out in session.get_outputs():
        print(f"  output: {out.name} shape={out.shape} type={out.type}")
Running It
Save either script as wake_word_detector.py and run it:
python wake_word_detector.py
If you have multiple audio input devices, list them first and pick the right one:
import sounddevice as sd
print(sd.query_devices())
Then pass device=<index> to sd.InputStream. On a Raspberry Pi with a USB microphone, the device index is typically 1 or 2.
Testing without a microphone
You can test the inference pipeline with a pre-recorded WAV file instead of live microphone input. Load the file with scipy.io.wavfile.read() or soundfile.read(), ensure it's 16kHz mono float32, then feed it in 1280-sample chunks through run_inference() in a loop.
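A sketch of that file-based harness using only the standard-library wave module plus NumPy -- score_fn stands in for the run_inference function from the Option 2 script:

```python
import wave

import numpy as np

CHUNK_SIZE = 1280  # samples per inference step at 16kHz

def read_wav_float32(path):
    """Load a 16-bit mono 16kHz WAV as float32 in [-1, 1]."""
    with wave.open(path, "rb") as wf:
        if wf.getframerate() != 16000 or wf.getnchannels() != 1:
            raise ValueError("expected 16kHz mono audio")
        raw = wf.readframes(wf.getnframes())
    return np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0

def iter_chunks(samples, chunk_size=CHUNK_SIZE):
    """Yield fixed-size chunks, zero-padding the final partial one."""
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        if len(chunk) < chunk_size:
            chunk = np.pad(chunk, (0, chunk_size - len(chunk)))
        yield chunk

def score_file(path, score_fn, threshold=0.5):
    """Feed every chunk to score_fn (e.g. run_inference) and return
    the best score plus whether the threshold was crossed."""
    best = 0.0
    for chunk in iter_chunks(read_wav_float32(path)):
        best = max(best, score_fn(chunk))
    return best, best > threshold
```

Record yourself saying the wake word, then call score_file("test.wav", run_inference) to verify detection offline.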
Integration Patterns
Once detection is working, the most common next step is triggering something when the wake word fires. A few patterns that work well:
Triggering a shell command or subprocess
import subprocess
def on_wake_word_detected():
    subprocess.Popen(["python", "my_assistant.py"])
Sending an HTTP webhook
import urllib.request, json
def on_wake_word_detected():
    payload = json.dumps({"event": "wake_word"}).encode()
    req = urllib.request.Request(
        "http://localhost:8080/wake",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)
Piping into a speech-to-text pipeline
The most common pattern is to start recording audio immediately after detection, pass it to an STT model (Whisper, for example), and act on the transcription. The basic flow:
import queue, threading
audio_queue = queue.Queue()
recording = False
def audio_callback(indata, frames, time, status):
    global recording
    score = run_inference(indata[:, 0])
    if score > THRESHOLD and not recording:
        print("Wake word detected -- starting STT recording")
        recording = True
    if recording:
        audio_queue.put(indata.copy())
        # Stop recording after ~3 seconds of audio
        if audio_queue.qsize() > int(3 * SAMPLE_RATE / CHUNK_SIZE):
            recording = False
            process_speech(audio_queue)
For a production voice assistant, consider running the wake word detector and the STT model in separate threads (or separate processes) to avoid blocking the audio callback.
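A minimal sketch of that separation using only the standard library -- the transcribe callable and the sentinel objects are placeholders, with a real STT call (Whisper, for example) going where the lambda is:

```python
import queue
import threading

END_UTTERANCE = object()  # sentinel: the utterance is complete
SHUTDOWN = object()       # sentinel: stop the worker thread

def stt_worker(audio_queue, transcribe, results):
    """Buffer audio chunks off the main thread; when an utterance
    ends, hand the buffered audio to the STT callable."""
    buffered = []
    while True:
        item = audio_queue.get()
        if item is SHUTDOWN:
            return
        if item is END_UTTERANCE:
            if buffered:
                results.append(transcribe(buffered))
            buffered = []
        else:
            buffered.append(item)

# Wiring: the audio callback only ever calls audio_queue.put(),
# so it never blocks on STT work.
audio_queue = queue.Queue()
results = []
worker = threading.Thread(
    target=stt_worker,
    args=(audio_queue, lambda chunks: f"{len(chunks)} chunks", results),
    daemon=True,
)
worker.start()
```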
Running as a background service
On Linux, you can run the detector as a systemd service so it starts automatically on boot:
# /etc/systemd/system/wake-word.service
[Unit]
Description=Wake Word Detector
After=sound.target
[Service]
ExecStart=/usr/bin/python3 /home/pi/wake_word_detector.py
Restart=always
User=pi
[Install]
WantedBy=multi-user.target
Enable it with:
sudo systemctl enable wake-word.service
sudo systemctl start wake-word.service
Performance Notes
On a standard laptop CPU, ONNX Runtime runs the three-model pipeline well under the 80ms chunk duration, leaving plenty of headroom. On a Raspberry Pi 4 it comfortably keeps up in real time. On older single-core hardware, watch CPU usage; you can pin sessions to providers=["CPUExecutionProvider"] and cap thread counts through ort.SessionOptions (for example its intra_op_num_threads setting).
If you have a machine with a compatible NVIDIA GPU, onnxruntime-gpu can cut inference time dramatically:
pip install onnxruntime-gpu
Then instantiate sessions with providers=["CUDAExecutionProvider", "CPUExecutionProvider"] so inference falls back to CPU when the GPU is unavailable.
Running this on a Raspberry Pi? The Raspberry Pi guide covers hardware selection, audio device setup, and running the detector as a systemd service.
Ready to get your own model? Sign up for Outspoken -- train your first custom wake word for free, download the ONNX file, and drop it straight into the code above.