By the Outspoken Team · March 21, 2026
Custom Wake Word Detection in React Native
Wake word detection lets your app respond to a spoken phrase without any button press. The user's hands are full, they're holding a tool, or they simply walk into a room -- and your app is already listening. On a phone, the use cases range from accessibility tools that need voice activation to smart home remotes, voice-driven fitness apps, and anything else where reaching for the screen is inconvenient.
The standard approach is to send audio to a cloud API. That means network latency, a dependency on connectivity, and sending a continuous audio stream to a third-party server -- none of which is acceptable for a feature that needs to feel instant and private. On-device inference solves all three: it runs locally, works offline, and the audio never leaves the device.
This guide walks through running a full wake word detection pipeline in a React Native app using onnxruntime-react-native and the Expo Audio API. By the end you'll have a working TypeScript implementation that detects a custom wake word in real time.
How the Pipeline Works
Wake word detection doesn't run directly on raw audio. The model trained by Outspoken expects pre-processed audio features, which means you need to run three ONNX models in sequence on every chunk of audio:
- melspectrogram.onnx -- Converts raw 16kHz float32 audio into a mel spectrogram. This is a compact frequency-over-time representation that captures what matters acoustically while discarding irrelevant variation.
- embedding_model.onnx -- Converts the mel spectrogram into a dense embedding vector. This model is shared across all wake words -- it's a general-purpose audio feature extractor trained on a large corpus.
- your_wake_word.onnx -- Your custom model. Takes the embedding as input and outputs an activation score between 0 and 1. A score above a threshold (typically 0.5) means the wake word was detected.
You need all three models. The first two are provided by openWakeWord and don't change between wake words -- only the third model is specific to your phrase.
Audio requirements:
- 16kHz sample rate
- float32 samples normalized to the range [-1, 1]
- 1280 samples per chunk (approximately 80ms of audio)
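A chunk that violates these constraints -- wrong length, or un-normalized int16 values -- will silently produce garbage scores rather than fail loudly. A small guard function (a hypothetical helper, not part of any library) catches that early:

```typescript
const SAMPLE_RATE = 16000;
const CHUNK_SIZE = 1280; // 1280 / 16000 = 80ms of audio per chunk

/** Throws if a chunk doesn't match the pipeline's input contract. */
function assertValidChunk(chunk: Float32Array): void {
  if (chunk.length !== CHUNK_SIZE) {
    throw new Error(`Expected ${CHUNK_SIZE} samples, got ${chunk.length}`);
  }
  for (let i = 0; i < chunk.length; i++) {
    const s = chunk[i];
    // Un-normalized int16 values (e.g., 32767) show up here immediately
    if (s < -1 || s > 1 || Number.isNaN(s)) {
      throw new Error(`Sample ${i} out of range [-1, 1]: ${s}`);
    }
  }
}
```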
Setup
Install Dependencies
npm install onnxruntime-react-native

For Expo projects, install the audio recording package:
npx expo install expo-av

For bare React Native projects, react-native-audio-record is a common choice. The audio preprocessing code in this guide is the same either way -- only the recording setup differs.
Check for latest versions
Package APIs evolve quickly. Always check the onnxruntime-react-native npm page for the latest release before installing. As of early 2026, version 1.20.x is current.
Permissions
Android -- add to android/app/src/main/AndroidManifest.xml:
<uses-permission android:name="android.permission.RECORD_AUDIO" />

iOS -- add to ios/YourApp/Info.plist:
<key>NSMicrophoneUsageDescription</key>
<string>This app uses the microphone to detect the wake word.</string>

At runtime, request the permission before starting recording:
import { Audio } from "expo-av";
const { status } = await Audio.requestPermissionsAsync();
if (status !== "granted") {
throw new Error("Microphone permission not granted");
}

Bundle the ONNX Models
You need to ship three .onnx files with your app: melspectrogram.onnx, embedding_model.onnx, and your custom wake word model downloaded from the Outspoken dashboard.
The melspectrogram.onnx and embedding_model.onnx files are available in the openWakeWord GitHub repository.
Place all three files in your project's assets/models/ directory, then configure Metro to treat .onnx files as assets:
// metro.config.js
const { getDefaultConfig } = require("expo/metro-config");
const config = getDefaultConfig(__dirname);
config.resolver.assetExts.push("onnx");
module.exports = config;

Then reference the files with require():
const MELSPECTROGRAM_MODEL = require("../assets/models/melspectrogram.onnx");
const EMBEDDING_MODEL = require("../assets/models/embedding_model.onnx");
const WAKE_WORD_MODEL = require("../assets/models/hey_jarvis.onnx");

Model file sizes
The melspectrogram and embedding models are shared infrastructure -- expect them to be a few hundred KB each. Your custom wake word model will be between ~50KB and ~400KB depending on the layer size you chose during training. A layer size of 96 (the default) produces a model around 200KB. See the training guide for details on how layer size affects model size and accuracy.
Implementation
Loading the ONNX Sessions
Load all three models once when the component mounts. ONNX Runtime sessions are expensive to initialize, so hold them in a ref and reuse them across audio chunks.
import { InferenceSession, Tensor } from "onnxruntime-react-native";
import { useRef, useEffect } from "react";
interface WakeWordSessions {
melspectrogram: InferenceSession;
embedding: InferenceSession;
wakeWord: InferenceSession;
}
function useWakeWordSessions(): React.MutableRefObject<WakeWordSessions | null> {
const sessions = useRef<WakeWordSessions | null>(null);
useEffect(() => {
let cancelled = false;
async function load() {
const [melspectrogram, embedding, wakeWord] = await Promise.all([
InferenceSession.create(MELSPECTROGRAM_MODEL),
InferenceSession.create(EMBEDDING_MODEL),
InferenceSession.create(WAKE_WORD_MODEL),
]);
if (!cancelled) {
sessions.current = { melspectrogram, embedding, wakeWord };
}
}
load().catch(console.error);
return () => {
cancelled = true;
};
}, []);
return sessions;
}

Setting Up Microphone Recording at 16kHz
Expo's Audio.Recording API doesn't expose a live PCM stream -- it records to a file at a configured sample rate. You'll need to configure the recording options carefully and resample in JavaScript if the device doesn't support 16kHz natively.
import { Audio } from "expo-av";
async function startRecording(): Promise<Audio.Recording> {
await Audio.setAudioModeAsync({
allowsRecordingIOS: true,
playsInSilentModeIOS: true,
});
const recording = new Audio.Recording();
await recording.prepareToRecordAsync({
android: {
extension: ".wav",
outputFormat: Audio.AndroidOutputFormat.DEFAULT,
audioEncoder: Audio.AndroidAudioEncoder.DEFAULT,
sampleRate: 16000,
numberOfChannels: 1,
bitRate: 256000,
},
ios: {
extension: ".wav",
outputFormat: Audio.IOSOutputFormat.LINEARPCM,
audioQuality: Audio.IOSAudioQuality.HIGH,
sampleRate: 16000,
numberOfChannels: 1,
bitDepthHint: 16,
linearPCMBitDepth: 16,
linearPCMIsBigEndian: false,
linearPCMIsFloat: false,
},
web: {},
});
await recording.startAsync();
return recording;
}

Sample rate support varies by device
Not all Android devices support 16kHz recording directly. If you get errors on specific devices, record at 44.1kHz or 48kHz and downsample in JavaScript before running inference. A simple decimation approach works well enough for wake word detection -- you don't need a high-quality resampler.
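Decimation works directly for 48kHz (an integer multiple of 16kHz); 44.1kHz needs interpolation. A minimal sketch of both -- note that plain decimation skips the anti-aliasing filter a proper resampler would apply, which is usually acceptable for wake word detection but not for playback-quality audio:

```typescript
/**
 * Downsamples float32 PCM by keeping every `factor`-th sample.
 * E.g., 48000 Hz -> 16000 Hz with factor = 3.
 */
function decimate(samples: Float32Array, factor: number): Float32Array {
  const out = new Float32Array(Math.floor(samples.length / factor));
  for (let i = 0; i < out.length; i++) {
    out[i] = samples[i * factor];
  }
  return out;
}

/**
 * Linear-interpolation resampler for non-integer ratios,
 * e.g., 44100 Hz -> 16000 Hz.
 */
function resampleLinear(
  samples: Float32Array,
  fromRate: number,
  toRate: number
): Float32Array {
  const outLength = Math.floor((samples.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  const ratio = fromRate / toRate;
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, samples.length - 1);
    const frac = pos - i0;
    // Blend the two nearest source samples
    out[i] = samples[i0] * (1 - frac) + samples[i1] * frac;
  }
  return out;
}
```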
Audio Preprocessing
The inference pipeline expects raw float32 PCM samples. If you're recording to a WAV file, you need to read the file, strip the WAV header, and convert the 16-bit integer samples to float32.
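A fixed 44-byte header offset holds for canonical WAV files, but some encoders insert extra chunks (e.g., LIST metadata) before the audio data. A more robust approach walks the RIFF chunk list to find the data chunk -- the sketch below is an illustration with a hypothetical helper name, not a full WAV parser:

```typescript
/**
 * Finds the byte offset and length of the "data" chunk in a WAV file
 * by walking the RIFF chunk list after the 12-byte RIFF header.
 */
function findWavDataChunk(buf: ArrayBuffer): { offset: number; length: number } {
  const view = new DataView(buf);
  // Reads a 4-character ASCII chunk tag at the given byte offset
  const tag = (o: number) =>
    String.fromCharCode(
      view.getUint8(o), view.getUint8(o + 1),
      view.getUint8(o + 2), view.getUint8(o + 3)
    );
  if (tag(0) !== "RIFF" || tag(8) !== "WAVE") {
    throw new Error("Not a WAV file");
  }
  let pos = 12; // skip "RIFF" + file size + "WAVE"
  while (pos + 8 <= buf.byteLength) {
    const id = tag(pos);
    const size = view.getUint32(pos + 4, true); // sizes are little-endian
    if (id === "data") return { offset: pos + 8, length: size };
    pos += 8 + size + (size % 2); // chunks are word-aligned
  }
  throw new Error("No data chunk found");
}
```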
In practice, for real-time detection you'll want to read audio in short segments. Here's a preprocessing utility:
/**
* Converts a buffer of 16-bit PCM integers to float32 samples
* normalized to the range [-1, 1].
*/
function pcm16ToFloat32(buffer: ArrayBuffer): Float32Array {
const int16 = new Int16Array(buffer);
const float32 = new Float32Array(int16.length);
for (let i = 0; i < int16.length; i++) {
float32[i] = int16[i] / 32768.0;
}
return float32;
}
/**
* Splits a float32 audio buffer into chunks of the required size.
* Drops any incomplete trailing chunk.
*/
function chunkAudio(samples: Float32Array, chunkSize = 1280): Float32Array[] {
const chunks: Float32Array[] = [];
for (let i = 0; i + chunkSize <= samples.length; i += chunkSize) {
chunks.push(samples.slice(i, i + chunkSize));
}
return chunks;
}

Running the Three-Model Pipeline
With the sessions loaded and a chunk of 1280 float32 samples ready, inference runs in three steps:
async function runInference(
sessions: WakeWordSessions,
audioChunk: Float32Array
): Promise<number> {
// Step 1: raw audio -> mel spectrogram
const audioTensor = new Tensor("float32", audioChunk, [1, audioChunk.length]);
const melResult = await sessions.melspectrogram.run({ input: audioTensor });
const melSpectrogram = melResult["output"];
// Step 2: mel spectrogram -> embedding
const embeddingResult = await sessions.embedding.run({
input: melSpectrogram,
});
const embedding = embeddingResult["output"];
// Step 3: embedding -> activation score
const wakeWordResult = await sessions.wakeWord.run({ input: embedding });
const scores = wakeWordResult["output"].data as Float32Array;
// Return the activation score (single float between 0 and 1)
return scores[0];
}

Check tensor names for your models
The input and output tensor names used above ("input" and "output") match the openWakeWord model exports. If you're working with a different model version, inspect the actual tensor names using a tool like Netron -- just drag and drop your .onnx file to see the graph.
Threshold-Based Activation Detection
Wrap the inference call in a detection loop with a configurable threshold. Scores above the threshold trigger the wake word callback.
const ACTIVATION_THRESHOLD = 0.5;
// Debounce: minimum ms between activations to avoid repeated triggers
const DEBOUNCE_MS = 2000;
interface WakeWordDetectorOptions {
threshold?: number;
onActivation: () => void;
}
function useWakeWordDetector({ threshold = ACTIVATION_THRESHOLD, onActivation }: WakeWordDetectorOptions) {
const sessions = useWakeWordSessions();
const lastActivationRef = useRef<number>(0);
const isRunningRef = useRef(false);
const processChunk = async (audioChunk: Float32Array) => {
if (!sessions.current || isRunningRef.current) return;
isRunningRef.current = true;
try {
const score = await runInference(sessions.current, audioChunk);
if (score >= threshold) {
const now = Date.now();
if (now - lastActivationRef.current > DEBOUNCE_MS) {
lastActivationRef.current = now;
onActivation();
}
}
} catch (err) {
console.error("Wake word inference error:", err);
} finally {
isRunningRef.current = false;
}
};
  // Caveat: reading a ref here won't trigger a re-render when the sessions
  // finish loading, so consumers may see a stale `false`. For a reactive
  // flag, mirror readiness into component state instead.
  return { processChunk, sessionsReady: sessions.current !== null };
}

Putting It Together
Here's a minimal but complete component that records audio in a polling loop and runs wake word detection:
import React, { useEffect, useRef, useState, useCallback } from "react";
import { View, Text, Button } from "react-native";
import { Audio } from "expo-av";
import * as FileSystem from "expo-file-system";
// Buffer isn't built into React Native -- install the "buffer" polyfill
import { Buffer } from "buffer";
const CHUNK_SIZE = 1280;
const POLL_INTERVAL_MS = 200;
export function WakeWordListener() {
const [listening, setListening] = useState(false);
const [activated, setActivated] = useState(false);
const recordingRef = useRef<Audio.Recording | null>(null);
const intervalRef = useRef<ReturnType<typeof setInterval> | null>(null);
const { processChunk, sessionsReady } = useWakeWordDetector({
threshold: 0.5,
onActivation: useCallback(() => {
setActivated(true);
setTimeout(() => setActivated(false), 2000);
}, []),
});
const handleAudioSegment = useCallback(async (uri: string) => {
try {
const base64 = await FileSystem.readAsStringAsync(uri, {
encoding: FileSystem.EncodingType.Base64,
});
const raw = Buffer.from(base64, "base64");
// Strip the 44-byte canonical WAV header. Slice relative to the Buffer's
// byteOffset -- its underlying ArrayBuffer may be shared or offset.
const pcmBuffer = raw.buffer.slice(raw.byteOffset + 44, raw.byteOffset + raw.byteLength);
const float32Samples = pcm16ToFloat32(pcmBuffer);
const chunks = chunkAudio(float32Samples, CHUNK_SIZE);
for (const chunk of chunks) {
await processChunk(chunk);
}
} catch (err) {
console.error("Audio processing error:", err);
}
}, [processChunk]);
const startListening = useCallback(async () => {
if (!sessionsReady) return;
const recording = await startRecording();
recordingRef.current = recording;
setListening(true);
// Poll for new audio data at a fixed interval
intervalRef.current = setInterval(async () => {
if (!recordingRef.current) return;
const status = await recordingRef.current.getStatusAsync();
if (status.isRecording && status.uri) {
await handleAudioSegment(status.uri);
}
}, POLL_INTERVAL_MS);
}, [sessionsReady, handleAudioSegment]);
const stopListening = useCallback(async () => {
if (intervalRef.current) clearInterval(intervalRef.current);
if (recordingRef.current) {
await recordingRef.current.stopAndUnloadAsync();
recordingRef.current = null;
}
setListening(false);
}, []);
useEffect(() => {
return () => {
stopListening();
};
}, [stopListening]);
return (
<View>
<Button
title={listening ? "Stop" : "Start Listening"}
onPress={listening ? stopListening : startListening}
disabled={!sessionsReady}
/>
{activated && <Text>Wake word detected!</Text>}
</View>
);
}

Real-time streaming requires a different approach
The polling approach above works as a starting point but isn't ideal for production. Reading and re-parsing the growing WAV file on every poll cycle is inefficient. For a production implementation, consider using react-native-audio-record (bare RN) or a native module that provides a direct PCM buffer callback. That lets you process audio chunks as they arrive rather than re-reading from disk.
Performance Considerations
Run Inference Off the Main Thread
onnxruntime-react-native executes the actual inference in native code, but each call still serializes tensor data across the bridge, and the JavaScript-side preprocessing (base64 decoding, PCM conversion, chunking) runs on the JS thread. For wake word detection, where you're processing audio every 80-200ms, this can cause dropped frames in your UI.
Keep the per-call JS work small, and schedule non-urgent processing with InteractionManager.runAfterInteractions so it doesn't compete with user interactions. If your threading requirements are strict, move the preprocessing into a native module so the JS thread only receives detection results.
For most use cases, running inference at 5 Hz (every 200ms) produces acceptable latency without significant UI impact. The three models together take 5-20ms on a modern phone, leaving ample headroom.
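The 5 Hz cadence can be enforced with a simple timestamp gate. A hypothetical `throttle` helper (with an injectable clock for testability) drops calls that arrive too soon:

```typescript
/**
 * Wraps `fn` so it runs at most once per `intervalMs`,
 * silently dropping calls that arrive too soon after the last run.
 */
function throttle<T>(
  fn: (arg: T) => void,
  intervalMs: number,
  now: () => number = Date.now
): (arg: T) => void {
  let lastRun = -Infinity;
  return (arg: T) => {
    const t = now();
    if (t - lastRun >= intervalMs) {
      lastRun = t;
      fn(arg);
    }
  };
}
```

Usage: wrap the detector's chunk handler once, e.g. `const throttledProcess = throttle(processChunk, 200);`, and call the wrapper from the audio loop.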
Battery Impact
Continuous microphone recording and inference will drain the battery faster than a passive app. Practical mitigation strategies:
- Voice activity detection (VAD) as a gate. Run a lightweight energy threshold check before running the full pipeline. If the RMS amplitude of the chunk is below a floor (e.g., silence or near-silence), skip inference entirely.
- Pause detection while the app is backgrounded. Register an AppState listener and stop recording when the app leaves the foreground, if your use case permits.
- Reduce chunk overlap. Processing every 1280-sample chunk sequentially gives 80ms resolution. You can process every other chunk (160ms resolution) and most wake words will still be detected reliably.
In informal testing, continuous inference at 5 Hz increases battery drain by roughly 5-10% compared to a passive app on a mid-range Android device. Your mileage will vary.
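The energy gate from the first bullet can be sketched as an RMS check. The 0.01 floor (roughly -40 dBFS) is an assumption to tune per device and environment:

```typescript
/** Root-mean-square amplitude of a float32 chunk, in [0, 1]. */
function rms(chunk: Float32Array): number {
  let sumSquares = 0;
  for (let i = 0; i < chunk.length; i++) {
    sumSquares += chunk[i] * chunk[i];
  }
  return Math.sqrt(sumSquares / chunk.length);
}

// Silence floor: assumed starting point (~-40 dBFS); tune per device
const SILENCE_FLOOR = 0.01;

/** Gate: only run the ONNX pipeline when the chunk has audible energy. */
function shouldRunInference(chunk: Float32Array): boolean {
  return rms(chunk) >= SILENCE_FLOOR;
}
```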
Model Size on Disk
| Layer Size | Approximate Model File Size |
|---|---|
| 32 | ~50 KB |
| 64 | ~100 KB |
| 96 | ~200 KB |
| 128 | ~400 KB |
The shared melspectrogram.onnx and embedding_model.onnx models add roughly 300-500 KB combined. Total on-disk footprint for the full pipeline is typically under 1 MB, which is negligible for a mobile app.
Getting Your Custom Model
Train a custom wake word model at Outspoken. Enter your wake word, configure the training parameters, and a GPU worker will produce a trained ONNX model in about 45 minutes. Download it from your dashboard and drop it into assets/models/.
If you want to test a model before integrating it into your app, use the Playground -- it runs the same three-model pipeline in your browser using your microphone, so you can validate detection accuracy without writing any code.
For guidance on choosing training parameters and picking a phonetically strong wake word, see the training guide.
First model is free
Your first model is free. After that, models are €1 each. There's no subscription -- pay per model, download it, use it wherever you want.
Deploying to other platforms? See the Home Assistant integration guide for a turnkey voice assistant setup, or the Raspberry Pi guide for always-on detection on low-power hardware.
Ready to add wake word detection to your app? Create an account and train your first model.