
By the Outspoken Team · March 20, 2026

Wake Word Detection in Python with ONNX Runtime

ONNX Runtime makes it straightforward to run a trained wake word model in Python with minimal dependencies. No cloud calls, no latency from a remote API, no licensing per-device. The model runs entirely on the local machine -- including on a Raspberry Pi or a headless Linux server.

This guide covers two approaches: using the openWakeWord Python package as a convenient wrapper, and running the three-model ONNX pipeline directly with onnxruntime for full control. Both work with models trained on Outspoken.

Prerequisites

You need Python 3.9 or later. Install the required packages:

pip install onnxruntime sounddevice numpy

sounddevice gives you clean access to the system microphone through PortAudio. If you're on Linux and don't have PortAudio installed:

sudo apt-get install libportaudio2

You also need your trained ONNX model. Download it from the Outspoken dashboard after training completes. The file will be named after your wake word, for example hey_jarvis.onnx.

Don't have a model yet?

You can train a custom wake word for €1 at Outspoken -- the first model is free. Enter your wake word, configure training parameters, and download the ONNX file when the job completes (~45 minutes on GPU). See the training guide for parameter advice.

How the Inference Pipeline Works

Before jumping into code, it helps to understand what actually runs at inference time. The openWakeWord pipeline uses three models in sequence:

  1. melspectrogram.onnx -- Converts a chunk of raw audio (1280 samples at 16kHz, ~80ms) into a mel spectrogram
  2. embedding_model.onnx -- Produces a compact audio embedding from that spectrogram
  3. your_wake_word.onnx -- Classifies the embedding: is this the wake word?

The classifier doesn't look at just one chunk. It maintains a sliding window of the last 16 embeddings (~1.3 seconds of audio) and scores the entire window on each step. This is what lets it detect the wake word even if the phrase spans multiple chunks.
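To make the window mechanics concrete, here is a minimal sketch of the sliding buffer in isolation. The embedding size of 96 is a placeholder; the real dimension comes from your model:

```python
from collections import deque

import numpy as np

WINDOW_SIZE = 16   # ~1.3 seconds of context at one embedding per ~80ms chunk
EMBED_DIM = 96     # placeholder -- the real size comes from your model

# Pre-fill with zeros so the classifier always sees a full window
window = deque(
    (np.zeros(EMBED_DIM, dtype=np.float32) for _ in range(WINDOW_SIZE)),
    maxlen=WINDOW_SIZE,
)

# Each ~80ms step appends one new embedding; the oldest falls off automatically
for step in range(20):
    window.append(np.full(EMBED_DIM, step, dtype=np.float32))

window_array = np.stack(list(window))  # shape (16, 96) -- one classifier input
print(window_array.shape)                        # (16, 96)
print(window_array[0, 0], window_array[-1, 0])   # 4.0 19.0 -- oldest, newest
```

Because the deque has a fixed maxlen, appending the 17th embedding silently evicts the 1st; the classifier always scores the most recent ~1.3 seconds.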

The audio requirements are strict:

  - Sample rate: exactly 16kHz
  - Channels: mono
  - Chunk size: 1280 samples (~80ms) per inference step

Option 1: Using the openWakeWord Python Package

The openWakeWord package handles the three-model pipeline internally. If you don't need to customize the inference loop, this is the quickest path to a working detector.

pip install openwakeword

The package downloads the shared melspectrogram and embedding models automatically on first use.

import openwakeword
import sounddevice as sd
import numpy as np
 
# Download shared models on first run
openwakeword.utils.download_models()
 
# Load your custom model
model = openwakeword.Model(
    wakeword_models=["hey_jarvis.onnx"],
    inference_framework="onnx",
)
 
SAMPLE_RATE = 16000
CHUNK_SIZE = 1280  # ~80ms
 
print("Listening... say your wake word.")
 
def audio_callback(indata, frames, time, status):
    # indata shape: (1280, 1) int16 -- openWakeWord expects 16-bit PCM samples
    audio_chunk = indata[:, 0]

    prediction = model.predict(audio_chunk)

    for wake_word, score in prediction.items():
        if score > 0.5:
            print(f"Wake word detected! [{wake_word}] score={score:.3f}")

with sd.InputStream(
    samplerate=SAMPLE_RATE,
    channels=1,
    dtype="int16",
    blocksize=CHUNK_SIZE,
    callback=audio_callback,
):
    input("Press Enter to stop...\n")

The model.predict() call returns a dict mapping each loaded wake word to its current detection score (0.0 -- 1.0). A threshold of 0.5 works for most models, but you can adjust it up to reduce false activations or down to catch more real ones.

Adjusting the threshold at runtime

You don't need to retrain to tune sensitivity. Raise the threshold (0.6--0.8) if the model is firing on background speech. Lower it (0.3--0.4) if it's missing real activations. This is a tradeoff: higher threshold means fewer false positives but also fewer true positives.
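One utterance usually spans several consecutive chunks, so every chunk in that stretch can cross the threshold and fire your handler repeatedly. A simple refractory counter suppresses the repeats; this is a sketch, not part of openWakeWord:

```python
class Debouncer:
    """Fire once per utterance, then ignore triggers for a cooldown period."""

    def __init__(self, threshold: float = 0.5, refractory_chunks: int = 25):
        self.threshold = threshold
        self.refractory_chunks = refractory_chunks  # ~2 seconds at 80ms/chunk
        self._cooldown = 0

    def update(self, score: float) -> bool:
        """Return True only on the first chunk that crosses the threshold."""
        if self._cooldown > 0:
            self._cooldown -= 1
            return False
        if score > self.threshold:
            self._cooldown = self.refractory_chunks
            return True
        return False

deb = Debouncer(threshold=0.5)
scores = [0.1, 0.9, 0.95, 0.8, 0.1]   # one utterance spanning several chunks
fires = [deb.update(s) for s in scores]
print(fires)  # [False, True, False, False, False]
```

Call deb.update(score) instead of comparing against the threshold directly in the audio callback.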

Option 2: Raw ONNX Runtime Pipeline

If you need full control -- custom preprocessing, integration with an existing audio pipeline, or you want to avoid the openWakeWord dependency -- you can run the three models directly with onnxruntime.

You'll need to download two additional shared model files from the openWakeWord repository:

  - melspectrogram.onnx
  - embedding_model.onnx

Or let the openWakeWord package download them and grab them from its install directory:

import openwakeword, os
model_dir = os.path.dirname(openwakeword.__file__)
print(os.path.join(model_dir, "resources", "models"))

The full inference script

import numpy as np
import onnxruntime as ort
import sounddevice as sd
from collections import deque
 
# --- Load all three models ---
mel_session = ort.InferenceSession("melspectrogram.onnx")
embed_session = ort.InferenceSession("embedding_model.onnx")
ww_session = ort.InferenceSession("hey_jarvis.onnx")
 
# Sliding window: 16 embeddings (~1.3 seconds of context)
WINDOW_SIZE = 16
embedding_window = deque(maxlen=WINDOW_SIZE)
 
# Pre-fill the window with zeros so the classifier always sees a full window
embedding_dim = ww_session.get_inputs()[0].shape[-1]
for _ in range(WINDOW_SIZE):
    embedding_window.append(np.zeros(embedding_dim, dtype=np.float32))
 
SAMPLE_RATE = 16000
CHUNK_SIZE = 1280  # exactly 1280 samples per step
THRESHOLD = 0.5
 
def run_inference(audio_chunk: np.ndarray) -> float:
    """
    Run one inference step.
    audio_chunk: float32 array of shape (1280,), values in [-1.0, 1.0]
    Returns the wake word detection score.
    """
    # Step 1: mel spectrogram
    # Input shape expected by melspectrogram.onnx: (1, 1280)
    mel_input = audio_chunk.reshape(1, -1).astype(np.float32)
    mel_output = mel_session.run(None, {mel_session.get_inputs()[0].name: mel_input})
    # mel_output[0] shape: (1, 32, 96) -- one spectrogram frame
 
    # Step 2: audio embedding
    # embedding_model.onnx expects (1, 1, 32, 96)
    spec = mel_output[0]  # (1, 32, 96)
    spec = spec[:, np.newaxis, :, :]  # (1, 1, 32, 96)
    embed_output = embed_session.run(
        None, {embed_session.get_inputs()[0].name: spec}
    )
    embedding = embed_output[0].squeeze()  # 1D embedding vector
 
    # Step 3: update sliding window and classify
    embedding_window.append(embedding)
    window_array = np.stack(list(embedding_window), axis=0)  # (16, embed_dim)
    window_input = window_array[np.newaxis, :, :]  # (1, 16, embed_dim)
 
    ww_output = ww_session.run(
        None, {ww_session.get_inputs()[0].name: window_input.astype(np.float32)}
    )
    score = float(ww_output[0].squeeze())
    return score
 
 
def audio_callback(indata, frames, time, status):
    if status:
        print(f"Audio status: {status}")
    audio_chunk = indata[:, 0]  # (1280,) float32
    score = run_inference(audio_chunk)
    if score > THRESHOLD:
        print(f"Wake word detected! score={score:.3f}")
 
 
print("Listening... say your wake word. Ctrl+C to stop.")
 
try:
    with sd.InputStream(
        samplerate=SAMPLE_RATE,
        channels=1,
        dtype="float32",
        blocksize=CHUNK_SIZE,
        callback=audio_callback,
    ):
        while True:
            sd.sleep(100)
except KeyboardInterrupt:
    print("Stopped.")

Input shapes may vary

The exact input shapes for melspectrogram.onnx and embedding_model.onnx can differ slightly between openWakeWord releases. If you get shape errors, inspect the model inputs with session.get_inputs() and adjust the reshape calls accordingly. The wake word classifier's expected input shape is printed by ww_session.get_inputs()[0].shape.

Inspecting model I/O

If you're unsure about a model's expected inputs and outputs, this snippet prints them:

import onnxruntime as ort
 
for path in ["melspectrogram.onnx", "embedding_model.onnx", "hey_jarvis.onnx"]:
    session = ort.InferenceSession(path)
    print(f"\n--- {path} ---")
    for inp in session.get_inputs():
        print(f"  input:  {inp.name}  shape={inp.shape}  type={inp.type}")
    for out in session.get_outputs():
        print(f"  output: {out.name}  shape={out.shape}  type={out.type}")

Running It

Save either script as wake_word_detector.py and run it:

python wake_word_detector.py

If you have multiple audio input devices, list them first and pick the right one:

import sounddevice as sd
print(sd.query_devices())

Then pass device=<index> to sd.InputStream. On a Raspberry Pi with a USB microphone, the device index is typically 1 or 2.

Testing without a microphone

You can test the inference pipeline with a pre-recorded WAV file instead of live microphone input. Load the file with scipy.io.wavfile.read() or soundfile.read(), ensure it's 16kHz mono float32, then feed it in 1280-sample chunks through run_inference() in a loop.
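A minimal sketch of that loop using only the standard library's wave module; the file name and the run_inference call (from the script above) are assumptions:

```python
import wave

import numpy as np

SAMPLE_RATE = 16000
CHUNK_SIZE = 1280

def wav_chunks(path):
    """Yield consecutive float32 chunks of shape (1280,) from a 16kHz mono WAV."""
    with wave.open(path, "rb") as wf:
        assert wf.getframerate() == SAMPLE_RATE and wf.getnchannels() == 1
        pcm = np.frombuffer(wf.readframes(wf.getnframes()), dtype=np.int16)
    audio = pcm.astype(np.float32) / 32768.0  # scale int16 PCM to [-1.0, 1.0]
    for start in range(0, len(audio) - CHUNK_SIZE + 1, CHUNK_SIZE):
        yield audio[start:start + CHUNK_SIZE]

# Feed each chunk through the detector exactly as the live callback would:
# for chunk in wav_chunks("test_clip.wav"):
#     score = run_inference(chunk)
```

A trailing partial chunk shorter than 1280 samples is dropped, since the models require exact chunk sizes.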

Integration Patterns

Once detection is working, the most common next step is triggering something when the wake word fires. A few patterns that work well:

Triggering a shell command or subprocess

import subprocess
 
def on_wake_word_detected():
    subprocess.Popen(["python", "my_assistant.py"])

Sending an HTTP webhook

import urllib.request, json
 
def on_wake_word_detected():
    payload = json.dumps({"event": "wake_word"}).encode()
    req = urllib.request.Request(
        "http://localhost:8080/wake",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)

Piping into a speech-to-text pipeline

The most common pattern is to start recording audio immediately after detection, pass it to a STT model (Whisper, for example), and act on the transcription. The basic flow:

import queue, threading
 
audio_queue = queue.Queue()
recording = False
 
def audio_callback(indata, frames, time, status):
    global recording
    score = run_inference(indata[:, 0])
 
    if score > THRESHOLD and not recording:
        print("Wake word detected -- starting STT recording")
        recording = True
 
    if recording:
        audio_queue.put(indata.copy())
        # Stop recording after ~3 seconds of audio
        if audio_queue.qsize() > int(3 * SAMPLE_RATE / CHUNK_SIZE):
            recording = False
            process_speech(audio_queue)  # your STT handler -- defined elsewhere

For a production voice assistant, consider running the wake word detector and the STT model in separate threads (or separate processes) to avoid blocking the audio callback.
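One way to sketch that split: the audio callback only enqueues finished recordings, and a daemon thread drains the queue and calls the STT model. Here transcribe is a placeholder for a real STT call:

```python
import queue
import threading

speech_queue = queue.Queue()
results = []

def transcribe(audio) -> str:
    # Placeholder for a real STT call (e.g. a Whisper model)
    return f"<transcript of {len(audio)} samples>"

def stt_worker():
    """Drain recordings off the queue so STT never blocks the audio callback."""
    while True:
        utterance = speech_queue.get()
        if utterance is None:  # sentinel: shut down cleanly
            break
        results.append(transcribe(utterance))

worker = threading.Thread(target=stt_worker, daemon=True)
worker.start()

# The audio callback hands off recordings instead of transcribing inline:
speech_queue.put([0.0] * 1280)   # stand-in for a recorded utterance
speech_queue.put(None)           # stop the worker
worker.join(timeout=2)
print(results)  # ['<transcript of 1280 samples>']
```

In a real detector the callback would call speech_queue.put(recorded_audio) once a recording completes, and the worker would run for the lifetime of the process.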

Running as a background service

On Linux, you can run the detector as a systemd service so it starts automatically on boot:

# /etc/systemd/system/wake-word.service
[Unit]
Description=Wake Word Detector
After=sound.target
 
[Service]
ExecStart=/usr/bin/python3 /home/pi/wake_word_detector.py
Restart=always
User=pi
 
[Install]
WantedBy=multi-user.target

Enable it with:

sudo systemctl enable wake-word.service
sudo systemctl start wake-word.service

Performance Notes

On a standard laptop CPU, ONNX Runtime runs the three-model pipeline well under the 80ms chunk duration, leaving plenty of headroom. On a Raspberry Pi 4, it comfortably keeps up in real time. On older single-core hardware, check CPU usage and consider using ort.InferenceSession with the providers=["CPUExecutionProvider"] argument and thread count options.

If you have a machine with a compatible NVIDIA GPU, onnxruntime-gpu can cut inference time dramatically:

pip install onnxruntime-gpu

Then instantiate sessions with providers=["CUDAExecutionProvider", "CPUExecutionProvider"] to fall back to CPU when GPU is unavailable.


Running this on a Raspberry Pi? The Raspberry Pi guide covers hardware selection, audio device setup, and running the detector as a systemd service.

Ready to get your own model? Sign up for Outspoken -- train your first custom wake word for free, download the ONNX file, and drop it straight into the code above.