
By the Outspoken Team · February 20, 2026

How to Train a Better Wake Word Model: Parameters, Phonetics, and Practical Tips

You've picked a wake word, hit "Train", and gotten a model back. It works — mostly. But sometimes it misses the phrase, or it activates when nobody said anything. The difference between a mediocre model and a great one comes down to two things: the parameters you choose for training, and the phonetic properties of the wake word itself.

This guide covers both. By the end, you'll understand what every training parameter does, why some wake words are inherently easier to detect than others, and how to get the best possible results.

Part 1: Training Parameters

When you train a wake word model on Outspoken, you're configuring an openWakeWord training pipeline. Under the hood, it has three phases:

  1. Generate — A text-to-speech engine (Piper TTS) synthesizes thousands of audio clips of your wake word, each with slightly different voice characteristics
  2. Augment — Those clips are mixed with real-world background noise, room reverberations, and volume variations to simulate messy real-world conditions
  3. Train — A small neural network learns to distinguish your wake word from everything else

Each parameter you set influences one or more of these phases. Here's what they do.
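As a mental model, the mapping between parameters and phases can be sketched in a few lines of Python (the labels follow the Outspoken UI described below, not the underlying openWakeWord config keys):

```python
# Rough mapping of Outspoken's training parameters to the three
# pipeline phases above. Labels follow the UI, not literal config keys.
PHASES = {
    "Generate": ["Examples", "Languages"],  # how many TTS clips, which voices
    "Augment": ["Examples"],                # every generated clip is augmented
    "Train": ["Steps", "False Activation Penalty",
              "Target False Positives per Hour",
              "Target Recall", "Layer Size"],
}

def phases_for(parameter: str) -> list[str]:
    """Return the pipeline phases a given parameter influences."""
    return [phase for phase, params in PHASES.items() if parameter in params]
```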

Examples (100 – 50,000)

What it controls: How many synthetic speech clips the TTS engine generates.

This is the size of your training dataset. More examples means the model hears more variations of the wake word — different speeds, intonations, and vocal qualities. After generation, these clips get augmented with noise and reverb, so the effective training set is even larger.

| Range | Effect |
| --- | --- |
| 100 – 1,000 | Fast training, but the model may miss unusual pronunciations |
| 2,000 – 10,000 | Good balance for most wake words |
| 10,000 – 50,000 | Maximum variation; useful for short or phonetically ambiguous words |

Rule of thumb: Start with 10,000 (the default). If your wake word is short (one syllable) or contains sounds that could be confused with common words, increase it. If you're iterating quickly and want faster training, drop to 2,000–5,000.

More examples won't fix a bad wake word

If a wake word is too short or phonetically similar to common speech, no amount of training data will eliminate false activations. Fix the word first (see Part 2), then add more examples.

Steps (1,000 – 50,000)

What it controls: How many training iterations the neural network goes through.

Each step feeds a batch of audio features through the model and updates its weights. More steps means the model has more opportunities to refine its understanding of the pattern. But there's a point of diminishing returns — past a certain number, the model stops improving and starts memorizing the training data (overfitting).
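The diminishing-returns shape is easy to visualize with synthetic loss curves (the numbers below are invented purely to illustrate the U-shaped validation loss, not measured from a real run):

```python
# Synthetic illustration of overfitting: training loss keeps falling
# with more steps, while validation loss bottoms out and climbs back
# up as the model starts memorizing. The curves are made up; only the
# shape matters.
def training_loss(step: int) -> float:
    return 1.0 / (1.0 + step / 1000)  # monotonically decreasing

def validation_loss(step: int) -> float:
    # U-shaped: improves early, then rises with memorization.
    return 1.0 / (1.0 + step / 1000) + (step / 50_000) ** 2

steps = range(0, 50_001, 1000)
best = min(steps, key=validation_loss)  # the sweet spot before overfitting
```

Training past `best` keeps lowering training loss while real-world performance quietly degrades, which is why more steps is not always better.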

| Range | Effect |
| --- | --- |
| 1,000 – 5,000 | Quick training; may underfit on complex phrases |
| 5,000 – 15,000 | Good for most wake words |
| 15,000 – 50,000 | Useful for difficult or short phrases that need extra refinement |

Rule of thumb: Keep steps roughly equal to your example count. The default of 10,000 works well for most cases. If you notice the model activates on similar-sounding words, more steps (with a higher penalty) can help it learn finer distinctions.

False Activation Penalty (100 – 5,000)

What it controls: How harshly the model is punished during training when it incorrectly fires on non-wake-word audio.

This is technically the max_negative_weight parameter in the openWakeWord training config. A higher value tells the optimizer: "false activations are really bad — prioritize avoiding them even if it means missing some real activations."

| Range | Effect |
| --- | --- |
| 100 – 500 | Lenient — the model will catch more real activations but fire more on noise |
| 500 – 1,500 | Balanced (default: 1,500) |
| 1,500 – 5,000 | Strict — fewer false activations, but may miss quieter or faster pronunciations |

Rule of thumb: If you're getting false activations in testing, increase the penalty. If the model is missing real wake words, decrease it. The default of 1,500 is deliberately on the conservative side — most users prefer a model that rarely fires by mistake over one that catches every utterance.

The recall vs. false positive tradeoff

There's no free lunch here. A model that catches every single wake word will also fire on noise more often. A model that never false-fires will miss some real activations. The penalty, target recall, and target FP rate together define where your model sits on this spectrum.
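To make the tradeoff concrete, here is a small sketch that scores a labeled test set at two thresholds; the score values are hypothetical, but the arithmetic is exactly how recall and false activations per hour are computed:

```python
def evaluate(threshold, positive_scores, negative_scores, negative_audio_hours):
    """Recall and false activations/hour at a given detection threshold."""
    hits = sum(s >= threshold for s in positive_scores)
    recall = hits / len(positive_scores)
    false_fires = sum(s >= threshold for s in negative_scores)
    fp_per_hour = false_fires / negative_audio_hours
    return recall, fp_per_hour

# Hypothetical scores: real wake-word clips vs. general background audio.
positives = [0.92, 0.85, 0.71, 0.64, 0.55, 0.40]
negatives = [0.81, 0.45, 0.30, 0.22, 0.10]  # scored over 2 hours of audio

strict = evaluate(0.9, positives, negatives, 2.0)   # high threshold
lenient = evaluate(0.5, positives, negatives, 2.0)  # low threshold
# Raising the threshold lowers both recall and false activations;
# lowering it raises both. No setting improves one for free.
```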

Target False Positives per Hour (0.1 – 1.0)

What it controls: The maximum number of times the model should incorrectly activate per hour of non-wake-word audio during training validation.

During training, the model is periodically tested against a validation set of general audio (speech, music, silence). This parameter sets the acceptable false activation rate on that data.

| Value | Effect |
| --- | --- |
| 0.1 – 0.3 | Very strict — almost no false activations, but harder to detect the real wake word |
| 0.5 – 0.7 | Balanced (default: 0.7) |
| 0.8 – 1.0 | Permissive — will catch more real activations at the cost of occasional noise triggers |

Rule of thumb: For quiet environments (office, bedroom), you can afford 0.5–0.7. For noisy environments (kitchen, living room with a TV), drop to 0.2–0.4 so background chatter doesn't trigger the model.

Target Recall (0.5 – 0.9)

What it controls: The minimum percentage of real wake word utterances the model should detect during training validation.

A target recall of 0.7 means the model should correctly fire on at least 70% of the positive test clips. Higher values push the model to be more sensitive, which typically increases false activations as well.

| Value | Effect |
| --- | --- |
| 0.5 – 0.6 | Conservative — misses more real activations, but very few false positives |
| 0.7 | Balanced (default) |
| 0.8 – 0.9 | Aggressive — catches more real activations, more false positives |

Rule of thumb: Leave this at 0.7 unless you have a specific reason to change it. After training, you can also adjust the detection threshold at runtime (in your application code) to fine-tune sensitivity without retraining.
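A runtime threshold adjustment can be as simple as the sketch below. It assumes your detector exposes a per-frame score between 0 and 1 (openWakeWord's predict() returns such scores); the class and function names here are illustrative, not part of any library:

```python
class WakeWordGate:
    """Apply an adjustable runtime threshold to per-frame model scores.

    `score_fn` stands in for whatever your detector exposes (for
    example, a wrapper around openWakeWord's Model.predict()).
    """
    def __init__(self, score_fn, threshold=0.5):
        self.score_fn = score_fn
        self.threshold = threshold  # tweak sensitivity without retraining

    def detect(self, frame) -> bool:
        return self.score_fn(frame) >= self.threshold

# Simulated score stream: only the third frame is the wake word.
scores = iter([0.12, 0.48, 0.93])
gate = WakeWordGate(lambda frame: next(scores), threshold=0.6)
results = [gate.detect(None) for _ in range(3)]
```

Because the threshold lives in application code, you can expose it as a user setting and avoid a retrain for small sensitivity tweaks.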

Layer Size (32, 64, 96, 128)

What it controls: The width of the neural network's hidden layers.

A larger layer size gives the model more capacity to learn complex patterns. This matters for wake words with unusual phonetic patterns or for distinguishing between acoustically similar phrases.

| Size | Model File Size | Best For |
| --- | --- | --- |
| 32 | ~50 KB | Very simple, distinct phrases; resource-constrained devices |
| 64 | ~100 KB | Short phrases with clear consonants |
| 96 | ~200 KB | Most wake words (default) |
| 128 | ~400 KB | Complex or ambiguous phrases that need maximum capacity |

Rule of thumb: Use 96 (the default) unless the model file needs to be tiny (embedded devices) or you're struggling with a difficult phrase. Bumping to 128 adds minimal latency but can help with tricky words.

Bigger isn't always better

A model with layer size 128, 50,000 examples, and 50,000 training steps won't help if the wake word itself is phonetically weak. Overparameterized models can overfit to training data and actually perform worse in the real world. Match your parameters to the complexity of the phrase.

Languages

What it controls: Which text-to-speech voice models are used to generate training clips.

When you select multiple languages, the training pipeline generates synthetic speech using each language's Piper TTS model separately, then combines all the clips into a single training set. The result is one model that can detect the wake word across different accents and pronunciations.

For example, training with English, Dutch, German, and French means the model hears the wake word spoken with American English intonation, Dutch vowel patterns, German consonant emphasis, and French rhythm — all mixed together.

| Selection | Effect |
| --- | --- |
| Single language | Best per-language accuracy; fastest training |
| 2–3 languages | Good multilingual coverage with modest accuracy tradeoff |
| All 4 languages | Maximum accent coverage; examples are split across voices |

Rule of thumb: If your device will only be used in one language, select just that language. If you need multilingual support, select all relevant languages — the training pipeline splits examples evenly, so consider increasing the total example count to compensate.
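The even split, and the compensating increase in total examples, can be sketched as follows (language codes here are illustrative):

```python
def examples_per_language(total_examples: int, languages: list[str]) -> dict[str, int]:
    """Split the example budget evenly across the selected languages."""
    per_lang = total_examples // len(languages)
    return {lang: per_lang for lang in languages}

def scaled_total(per_language_target: int, languages: list[str]) -> int:
    """Total examples needed so each language still gets its full share."""
    return per_language_target * len(languages)

# With all four languages selected, a 10,000-example budget gives each
# voice only 2,500 clips; keeping 10,000 per language means 40,000 total.
split = examples_per_language(10_000, ["en", "nl", "de", "fr"])
needed = scaled_total(10_000, ["en", "nl", "de", "fr"])
```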


Part 2: Why Your Wake Word Choice Matters More Than Parameters

You can perfect every training parameter, but if the wake word itself has weak phonetic properties, the model will struggle. Here's why, and what makes a good wake word.

How Wake Word Detection Actually Works

The model doesn't understand language. It listens for an acoustic pattern — a specific sequence of sound frequencies over time. When you say "hey Jarvis", the model sees a mel spectrogram: a visual fingerprint of the sound. It's trained to recognize that specific fingerprint and ignore everything else.

This means the quality of a wake word depends entirely on how acoustically distinctive it is. A phrase that sounds unique and different from everyday speech is easy to detect. A phrase that sounds like common words or background noise is not.

Syllable Count: More Is Better

The most important factor is length. Each syllable adds another chunk of acoustic information that the model can use to distinguish your phrase from noise.

| Syllables | Detection Difficulty | Examples |
| --- | --- | --- |
| 1 | Hard — easily confused with background | "Go", "Hey", "Start" |
| 2 | Moderate — workable with good consonants | "Jarvis", "Alexa" |
| 3+ | Easy — highly distinctive | "Hey Jarvis", "OK Google", "Computer" |

One-syllable wake words are the most common mistake. The word "Go" is acoustically almost identical to dozens of sounds that occur in normal speech, coughing, or TV audio. The model can't reliably distinguish it no matter how much you train.

The two-word trick

Adding a prefix like "Hey", "OK", or "Hi" is the single most effective way to improve a wake word. "Jarvis" alone is decent (2 syllables, strong consonants), but "Hey Jarvis" is significantly better (3 syllables, 2 words, more phonetic diversity).

Strong Consonants: The Backbone of Detection

Not all sounds are created equal. Plosive consonants — sounds produced by briefly stopping airflow and then releasing it — create sharp, high-energy spikes in the audio spectrum that are easy for a model to detect.

The strongest consonants for wake word detection:

| Sound | IPA | Example | Why It's Good |
| --- | --- | --- | --- |
| /k/ | k | kite, wake | Sharp burst, high frequency energy |
| /t/ | t | top, start | Clean onset, distinct from vowels |
| /b/ | b | box, Jarvis | Low-frequency burst, cuts through noise |
| /d/ | d | dog, red | Similar to /t/ but voiced — easy to detect |
| /p/ | p | pop, stop | Strong release burst |
| /g/ | ɡ | go, big | Back-of-throat burst, distinct spectral shape |

Weak consonants like /h/, /l/, /w/, and /r/ are smooth and gradual — they blend into surrounding sounds and don't create distinctive spikes. A wake word made entirely of soft sounds (like "Hello") is much harder to detect than one with hard consonants (like "OK Computer").

Sound Diversity: Mix It Up

A good wake word uses sounds from multiple phonetic categories: plosives, fricatives (like /s/, /f/), nasals (like /m/, /n/), and vowels. This creates a rich, varied acoustic fingerprint that's harder to accidentally match.

Consider the wake word "Bumble": a plosive (/b/), a vowel, a nasal (/m/), a second plosive (/b/ again), and a liquid (/l/).

That's 4 different sound categories in one word, but notice the repeated /b/ — the word is somewhat internally repetitive. Compare it to "Biscuit": a plosive (/b/), a vowel, a fricative (/s/), a plosive (/k/), a vowel, and a plosive (/t/).

More diverse, with three different plosives and a fricative. "Biscuit" would likely outperform "Bumble" in detection accuracy.
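For a rough self-check before training, the category idea can be approximated with a spelling-based heuristic. This is a toy sketch: real phonetic analysis needs a pronunciation dictionary, and the letter-to-sound buckets below are deliberately crude (for example, 'c' is treated as the /k/ plosive):

```python
# Toy heuristic: bucket letters into rough sound categories and count
# how many distinct categories a wake word spans. Spelling-based, so
# only an approximation of true phonetics ('c' assumed to sound as /k/).
CATEGORIES = {
    "plosive": set("ptkbdgc"),
    "fricative": set("fvsz"),
    "nasal": set("mn"),
    "liquid": set("lr"),
    "vowel": set("aeiou"),
}

def sound_diversity(word: str) -> int:
    """Number of distinct sound categories the word touches."""
    letters = set(word.lower())
    return sum(1 for bucket in CATEGORIES.values() if letters & bucket)

def distinct_plosives(word: str) -> int:
    """How many different plosive sounds appear (repeats count once)."""
    return len(set(word.lower()) & CATEGORIES["plosive"])
```

On this crude measure "bumble" touches four categories but has only one distinct plosive, while "biscuit" packs in three different plosives, matching the comparison above.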

Multi-Word Phrases Have Lower False Activation

Single words, even good ones, will occasionally match against random speech. Multi-word phrases are dramatically better because the model needs to match a longer, more complex pattern.

False activation rates from our testing:

| Phrase Type | Typical False Activations / Hour |
| --- | --- |
| Single short word ("Go") | 5–20+ |
| Single long word ("Computer") | 1–3 |
| Two-word phrase ("Hey Jarvis") | 0.1–0.5 |
| Three-word phrase ("OK Hey Jarvis") | < 0.1 |

The improvement from one word to two words is massive — often a 10x reduction in false activations.

What Makes a Bad Wake Word

Here are real patterns that cause problems:

Too short: "Go", "Play", "Yes" — one syllable, too many common sounds match.

Too soft: "Hello", "Hi Leah" — dominated by /h/ and /l/, which are acoustically weak and blend with ambient noise.

Too common: "Hey" by itself — appears constantly in everyday speech. (But "Hey Jarvis" is fine because "Jarvis" adds distinctiveness.)

Rhymes with common words: "Okay Rex" sounds similar to "OK, next" or "OK, let's" — the model can't reliably tell them apart.

Repetitive sounds: "Mama" or "Papa" — the same sounds repeat, giving the model less information to work with.


Part 3: Putting It Together

Here's a practical workflow for getting the best results.

Step 1: Choose a Good Wake Word

Before you touch any parameters, make sure your wake word passes these checks:

  - At least 2–3 syllables, ideally a two-word phrase like "Hey Jarvis"
  - At least one strong plosive consonant: /k/, /t/, /b/, /d/, /p/, or /g/
  - Sounds from multiple phonetic categories, with little internal repetition
  - No close rhymes with common words or everyday phrases

When you type your wake word in Outspoken, the phonetic quality analysis gives you real-time feedback on these factors. Aim for "Good" or "Excellent" before training.

Step 2: Start With Defaults

The default parameters are tuned for a good balance across most wake words:

| Parameter | Default | Why |
| --- | --- | --- |
| Examples | 10,000 | Enough variation without excessive training time |
| Steps | 10,000 | Matches example count for balanced learning |
| Penalty | 1,500 | Conservative — prioritizes low false activations |
| Target FP/Hour | 0.7 | Allows reasonable detection sensitivity |
| Target Recall | 0.7 | Balanced detection rate |
| Layer Size | 96 | Good capacity without being wasteful |

Train once with these defaults and test the result.
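If you are scripting training runs, the defaults can live in one place. The dictionary keys below mirror the UI labels, not the literal openWakeWord config field names (only the penalty is confirmed to map to max_negative_weight):

```python
# Outspoken's default training parameters. Keys follow the UI labels;
# under the hood the penalty corresponds to openWakeWord's
# max_negative_weight.
DEFAULTS = {
    "examples": 10_000,
    "steps": 10_000,
    "false_activation_penalty": 1_500,
    "target_fp_per_hour": 0.7,
    "target_recall": 0.7,
    "layer_size": 96,
}

def with_overrides(**overrides) -> dict:
    """Start from the defaults and tweak only what this run needs."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}

# Example: a stricter run for a noisy environment.
strict_config = with_overrides(false_activation_penalty=3_000,
                               target_fp_per_hour=0.3)
```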

Step 3: Diagnose and Adjust

After testing, you'll likely see one of three situations:

Too many false activations: The model fires when nobody said the wake word. Increase the false activation penalty (for example, 1,500 to 3,000) and lower the target FP/hour (for example, 0.7 to 0.3), then retrain.

Missing real activations: You say the wake word and nothing happens. Lower the penalty, raise the target recall (for example, 0.7 to 0.8), or increase the example count so the model hears more pronunciation variants.

Works well but you want it even better: Increase examples and steps together (for example, 10,000 to 20,000 each), or bump the layer size to 128 for extra capacity.

Step 4: Iterate

Wake word training is empirical. The interaction between your specific phrase, the TTS voices, the background noise, and the model architecture means there's no single formula that works for every case. Train, test, adjust, repeat.

Most wake words converge on good performance within 2–3 training iterations. If you're still struggling after that, the wake word itself is usually the bottleneck — revisit Part 2 and consider changing the phrase.

Use the playground for testing

After each training run, test your model in Outspoken's Playground. It runs the ONNX model directly in your browser with your microphone — no setup required. Try saying the wake word at different distances, speeds, and volumes to stress-test it.

Quick Reference

| Goal | What to Adjust |
| --- | --- |
| Fewer false activations | Increase penalty, lower target FP/hour |
| Catch more real activations | Increase target recall, lower penalty, more examples |
| Faster training | Fewer examples and steps (2,000–5,000 each) |
| Better accuracy on short words | More examples, larger layer size, more steps |
| Multilingual detection | Select multiple languages, increase examples |
| Smallest model file | Layer size 32 or 64 |

And the most important rule of all: a phonetically strong wake word with default parameters will outperform a weak wake word with perfect parameters, every time.


Once you have a trained model, see the Home Assistant guide to set it up as a voice assistant wake word, or the wake word tools comparison for context on how Outspoken stacks up against other options.


Ready to train? Sign up for Outspoken and try it yourself. Type your wake word, check the phonetic feedback, adjust parameters if needed, and have a trained model in minutes.