
Voice Assistants.

Status
Draft 1 · under review

Voice assistance is the AI capability that most directly changes how a grower interacts with their operation. Hands dirty with soil, holding a hose, moving trays between benches, walking through a greenhouse — the phone interface is awkward and sometimes impractical. A voice command to start a zone's irrigation cycle, check current temperature, or log an observation solves a real ergonomic problem. Home Assistant supports entirely local voice pipelines — speech recognition through Whisper, intent handling through Home Assistant's Conversation system, speech synthesis through Piper — which means voice commands work without internet, without cloud dependencies, and without sending audio to external services. The hardware can be anything from a laptop with a microphone to a purpose-built satellite device mounted in a greenhouse. This page covers the voice pipeline architecture, the choices for each component, hardware options including the ESP32-based Home Assistant Voice satellite, practical patterns for agricultural voice commands, the challenges of voice in noisy outdoor environments, and the failure modes that affect production voice deployments. Voice works well for specific use cases and is frustrating when deployed outside its fit; this page aims to help growers pick the right fit from the start.

Before adding voice.

Prerequisites and realistic expectations.

Home Assistant is operational. Voice layers on top of a working Home Assistant. Automations, scripts, and scenes that voice commands invoke need to exist and work correctly. A voice command to "start Zone 1 irrigation" depends on an irrigation script existing; voice does not substitute for the automation work.

Capable hardware for local voice. A graybox host (per [Choosing Your Hardware](/home-assistant/hardware/choosing)) handles local voice pipelines comfortably. A Raspberry Pi can run voice but with trade-offs — smaller Whisper models, slower responses, limited capacity for concurrent satellites. Local voice on a business-class repurposed desktop is significantly better than on a Pi.

Realistic use case in mind. Voice works well in reasonably quiet indoor environments — a greenhouse propagation room, a packing shed, a farm office. Voice works poorly in noisy outdoor environments — a field with machinery running, a greenhouse with loud ventilation, outdoor spaces with wind. Before investing in voice, be clear about where it will actually be used.

Tolerance for imperfect recognition. Even a well-tuned voice pipeline misrecognizes commands occasionally. A design that assumes voice will work perfectly produces frustration when it does not. Confirmation prompts, visual feedback, and fallback paths (phones, dashboards) keep voice usable when it stumbles.

The voice pipeline architecture.

A voice interaction passes through a series of stages. Understanding the architecture makes configuration and debugging clearer.

1. Wake word detection. The assistant listens continuously for a specific wake word (the modern equivalent of "Alexa" or "Hey Google"). Wake word detection runs locally on the satellite device or the host, is lightweight, and is always on. Only audio following the wake word is processed further.

2. Audio capture. Once the wake word fires, the microphone captures the following command audio. Some implementations capture until the user stops speaking (silence detection); others capture for a fixed duration.

3. Speech-to-text. The captured audio is transcribed to text. Home Assistant's typical local choice is Whisper, running on the host. Cloud alternatives (Google Cloud Speech, OpenAI's Whisper API) exist but bring cloud dependencies.

4. Intent recognition. The text is parsed to understand what the user wants. Home Assistant's built-in intent engine handles a growing set of intents natively — turning things on and off, setting values, querying states, running scripts. Intents that the built-in engine cannot handle can route to a large language model for broader interpretation.

5. Action execution. The recognized intent triggers the appropriate service call. "Turn on the Zone 1 irrigation" becomes a call to the script that starts irrigation.

6. Response generation. A text response is generated — a confirmation, an answer to a query, or a clarification request.

7. Text-to-speech. The response text is synthesized to audio. Home Assistant's typical local choice is Piper. Cloud alternatives exist.

8. Audio playback. The synthesized audio plays through the satellite's speaker (or through whatever device the response was routed to).

The pipeline is modular — each stage can use a local or cloud component. An all-local pipeline requires no internet; a hybrid pipeline uses cloud components for specific stages (typically for more capable LLM-based intent handling).
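The modularity described above can be sketched as a chain of swappable callables. This is a toy model under stated assumptions, not Home Assistant's implementation: the `Pipeline` class and the `fake_*` stand-in stages are hypothetical names invented for illustration, and wake word detection and audio capture are assumed to have already happened.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pipeline:
    """Toy model of the stages that follow wake word detection and capture."""
    stt: Callable[[bytes], str]      # speech-to-text, e.g. a local Whisper
    intent: Callable[[str], str]     # intent recognition + action, returns reply text
    tts: Callable[[str], bytes]      # text-to-speech, e.g. a local Piper

    def run(self, audio: bytes) -> bytes:
        text = self.stt(audio)
        reply = self.intent(text)
        return self.tts(reply)


# Stand-in stages so the sketch runs without any voice software installed.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")     # pretend the audio is already text


def fake_intent(text: str) -> str:
    if text.lower() == "turn on zone 1 irrigation":
        return "Zone 1 irrigation started"
    return "Sorry, I did not understand"


def fake_tts(reply: str) -> bytes:
    return reply.encode("utf-8")     # pretend these bytes are synthesized audio


local = Pipeline(stt=fake_stt, intent=fake_intent, tts=fake_tts)
print(local.run(b"Turn on Zone 1 irrigation"))
```

Swapping one stage (for example, passing a cloud-backed function as `intent`) turns the all-local pipeline into a hybrid one without touching the other stages.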

Speech-to-text choices.

Whisper, local. An OpenAI-released open-source speech-to-text model. Runs on the grower's hardware. Several model sizes exist — tiny, base, small, medium, large — with larger models producing better recognition at the cost of more resources. A graybox host runs the small or medium models comfortably; medium is a good balance for most operations. Larger models benefit from GPU acceleration.

Whisper, faster-whisper variant. A community reimplementation of Whisper optimized for inference speed, including on CPU. Faster than the original on equivalent hardware; widely used in Home Assistant voice deployments. Local speech-to-text services that speak the Wyoming protocol (Home Assistant's standard for local voice components) typically run faster-whisper under the hood.

Cloud alternatives. Google Cloud Speech, OpenAI Whisper API, Azure Speech. Generally more capable than small-to-medium local Whisper models; require internet; involve sending audio to the cloud provider. For most agricultural operations, the local Whisper models are sufficient; cloud STT is more useful when recognition accuracy is a persistent problem and local capacity is limited.

Model selection. For English, the medium Whisper model produces good recognition on most agricultural vocabulary — zone names, metric names, common control verbs. For other languages, model choice matters more; multilingual models exist. For operations with specialized vocabulary (cannabis cultivar names, specific integration names), custom vocabulary hints can improve recognition.
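For experimenting with model size and vocabulary hints outside Home Assistant, a minimal sketch using the faster-whisper Python package might look like the following. It assumes the package is installed and that a recorded command exists as an audio file; the helper names, the example terms, and the choice of `int8` compute type are illustrative assumptions. Whether a given Home Assistant speech-to-text add-on exposes an equivalent hint setting is not assumed here.

```python
from typing import List


def build_vocabulary_prompt(terms: List[str]) -> str:
    """Join operation-specific terms into a short hint sentence."""
    unique = list(dict.fromkeys(t.strip() for t in terms if t.strip()))
    return "Vocabulary: " + ", ".join(unique) + "."


def transcribe(path: str, terms: List[str], model_size: str = "medium") -> str:
    """Transcribe one audio file, biasing recognition toward the given terms."""
    # Imported lazily so the helper above works without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cpu", compute_type="int8")
    segments, _info = model.transcribe(
        path,
        language="en",
        initial_prompt=build_vocabulary_prompt(terms),
    )
    return " ".join(segment.text.strip() for segment in segments)


print(build_vocabulary_prompt(["Zone 1", "DLI", "fertigation", "DLI"]))
```

Running `transcribe` on the same recording with `model_size="small"` and `model_size="medium"` is a quick way to compare recognition quality against response time on the actual host.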

Intent handling.

Home Assistant built-in intents. Home Assistant ships with intent recognition for common operations: turning things on and off, setting values, getting states, running scripts. The built-in engine is fast, local, and deterministic. For commands that fit the standard patterns, this works well.

Exposing entities and scripts to voice. Entities and scripts must be explicitly exposed to voice to be callable. This prevents unrelated entities from being accessible through voice — a grower probably does not want the voice assistant to be able to disable every switch in the operation. Exposure is configured per-entity in Home Assistant's voice settings.

Aliases for entities. Voice commands refer to entities by name. Long technical names ("switch.greenhouse_zone_1_irrigation_valve_main") are awkward to say aloud. Aliases ("Zone 1 irrigation," "main valve") make voice commands natural. Aliases can be set per-entity in Home Assistant's configuration.
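The combined effect of exposure and aliases can be modeled as a lookup table: only exposed entities appear in it, and each spoken alias maps to exactly one entity. The sketch below is a conceptual model, not how Home Assistant stores this internally; the entity IDs and alias lists are hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical exposure list: only entities named here are reachable by voice.
EXPOSED: Dict[str, List[str]] = {
    "switch.zone_1_irrigation": ["zone 1 irrigation", "main valve"],
    "sensor.propagation_temperature": ["propagation temperature"],
}


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so spoken variants compare equal."""
    return " ".join(phrase.lower().split())


def build_alias_index(exposed: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each alias to its entity, rejecting aliases claimed by two entities."""
    index: Dict[str, str] = {}
    for entity_id, aliases in exposed.items():
        for alias in aliases:
            key = normalize(alias)
            if key in index and index[key] != entity_id:
                raise ValueError(f"alias {alias!r} is ambiguous")
            index[key] = entity_id
    return index


def resolve(spoken_name: str, index: Dict[str, str]) -> Optional[str]:
    """Return the entity for a spoken name, or None if it is not exposed."""
    return index.get(normalize(spoken_name))


index = build_alias_index(EXPOSED)
print(resolve("Zone 1  Irrigation", index))
print(resolve("office lights", index))
```

Rejecting duplicate aliases at build time is a deliberate choice in this sketch: an alias that points at two entities is exactly the kind of ambiguity that produces a wrong action later.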

Custom sentences. Home Assistant supports custom sentence patterns. A grower who wants "start the morning routine" to run a specific script can define that sentence. Custom sentences extend the built-in intent engine for operation-specific commands.

LLM-based intent handling. When the built-in engine does not recognize a command, the pipeline can fall back to an LLM that interprets the command more flexibly. "What was the highest temperature in Zone 2 yesterday" is harder for a rule-based engine than for an LLM. Hybrid intent handling (built-in first, LLM for the rest) produces good results for agricultural operations with both routine commands and ad-hoc questions.
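The "built-in first, LLM for the rest" ordering can be sketched as a rule matcher with a fallback. This is an illustration of the control flow only: the regular-expression rules stand in for Home Assistant's sentence matching, and `fake_llm` stands in for whatever conversation agent is configured. All names here are hypothetical.

```python
import re
from typing import Callable, Optional

# Hypothetical rule table standing in for the built-in sentence patterns.
RULES = [
    (re.compile(r"^turn (?P<state>on|off) (?:the )?(?P<name>.+)$"), "set_power"),
    (re.compile(r"^(?:run|start) (?:the )?(?P<name>.+) routine$"), "run_script"),
]


def match_builtin(text: str) -> Optional[dict]:
    """Try the deterministic rules; return None when nothing matches."""
    cleaned = text.lower().strip().rstrip(".?!")
    for pattern, intent in RULES:
        found = pattern.match(cleaned)
        if found:
            return {"intent": intent, "slots": found.groupdict(), "handler": "builtin"}
    return None


def handle(text: str, llm_fallback: Callable[[str], str]) -> dict:
    """Use the rule engine when it matches, otherwise defer to the fallback."""
    result = match_builtin(text)
    if result is not None:
        return result
    return {"intent": "freeform", "reply": llm_fallback(text), "handler": "llm"}


def fake_llm(text: str) -> str:
    return f"(an LLM would answer: {text})"


print(handle("Turn on the Zone 1 irrigation", fake_llm))
print(handle("What was the highest temperature in Zone 2 yesterday", fake_llm))
```

Keeping the deterministic path first means routine commands stay fast and predictable, and only the questions the rules cannot parse pay the cost of the LLM.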

Agricultural intent examples.

- Zone control: "Turn on Zone 1 irrigation," "Open the roof vents," "Start the cooling cycle."
- Queries: "What is the temperature in the propagation room?" "How much DLI has Zone 3 accumulated?" "Is the main pump running?"
- Scripts: "Run the morning routine," "Start fertigation," "Apply day mode."
- Environmental: "Set Zone 1 target to 75 degrees," "Increase humidity in propagation," "Turn off supplemental lighting."
- Reporting: "Give me the morning summary," "What alerts fired overnight?"

The specific commands reflect the operation's vocabulary and the aliases configured.
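Commands like these can be tried as text before any microphone is involved, which separates intent problems from audio problems. The sketch below builds a request for Home Assistant's conversation endpoint using only the standard library. The base URL and token are placeholders, and the endpoint path and payload fields reflect the REST conversation API as commonly documented; verify both against the installed Home Assistant version before relying on them.

```python
import json
import urllib.request


def build_conversation_request(base_url: str, token: str, text: str,
                               language: str = "en") -> urllib.request.Request:
    """Build (but do not send) a text command for the conversation endpoint."""
    payload = json.dumps({"text": text, "language": language}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/conversation/process",
        data=payload,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def ask(base_url: str, token: str, text: str) -> dict:
    """Send the command and return the parsed JSON response."""
    request = build_conversation_request(base_url, token, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


# Placeholder host and token; nothing is sent until ask() is called.
request = build_conversation_request(
    "http://homeassistant.local:8123", "YOUR_TOKEN", "Is the main pump running?"
)
print(request.full_url)
```

If a sentence fails here as text, no amount of microphone tuning will fix it; the alias or custom sentence needs work first.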

Text-to-speech choices.

Piper, local. An open-source neural text-to-speech system. Produces natural-sounding speech, runs on CPU, lightweight enough for nearly any capable hardware. Many voices available — different accents, different genders, different tones. Runs entirely locally. The standard local TTS in Home Assistant's voice pipelines.

Cloud alternatives. Google Cloud TTS, Amazon Polly, Microsoft Azure TTS, and others. Often produce slightly more natural speech than Piper at the cost of cloud dependency. For most agricultural use cases, Piper is sufficient.

Voice choice. Piper provides a library of voices. Picking a voice affects the experience — a clear, neutral voice is usually better for operational use than a distinctive-character voice. Test several; the voice that sounds natural in brief tests may become grating with many daily uses.

Response phrasing. The text-to-speech output reflects the response text. Short, clear responses work better than long, flowery ones. "Zone 1 irrigation started" is better than "I've initiated the irrigation sequence for zone 1, please let me know if you need anything else." Voice responses should respect the user's time.

Hardware options.

Microphones and speakers on the Home Assistant host. The simplest approach — a USB microphone and speaker connected to the graybox host. Works for a single location near the host. Not mobile; not useful if the host is in a farm office and the grower wants voice in a greenhouse.

Home Assistant Voice satellite (ESP-based). A purpose-built ESP32-based device with microphone array, speaker, and voice-pipeline integration. Small, plug-in power, WiFi connection to Home Assistant. Multiple satellites can run from one Home Assistant; each can have its own wake word and its own location context. Good fit for agricultural operations with multiple greenhouse zones or rooms.

DIY ESPHome voice satellites. ESP32-based devices built using ESPHome's voice-pipeline components. Custom hardware — specific microphone, specific speaker, specific enclosure for the environment (water-resistant for greenhouses, robust for outdoor use). More work to build; more customization possible. Particularly useful for operations with specific environmental requirements (wet areas, dusty areas, outdoor use where off-the-shelf devices fail).

Smartphone or tablet as a satellite. The Home Assistant mobile app supports voice control. Useful for mobile voice — the grower uses their phone's microphone wherever they are. No dedicated hardware needed. Battery life and phone accessibility are limitations.

Nabu Casa's voice hardware. Home Assistant's commercial tier offers a packaged voice satellite with integrated support. For operations preferring commercial support, this is an option. Open-source DIY remains the primary pattern in the OpenAgTechnology collective's voice.

Placement considerations. Where the satellite sits affects how well it works. Place it near the working areas where voice will be used and, if possible, away from loud equipment (fans, pumps, machinery). Mount it on a wall or shelf rather than leaving it on the floor. Keep it within range of the WiFi network and on reliable power.

Agricultural voice patterns.

Voice fits some agricultural operations better than others. The patterns that work:

The propagation room voice assistant. A single satellite in the propagation area. Commands to check conditions, start specific routines, set growth-stage-specific setpoints. The environment is quiet (propagation rooms typically are), the grower's hands are often full, and the pace is deliberate enough that voice feels natural.

The packing shed assistant. A satellite in the packing area for cold-storage queries, operations coordination, and brief logging. Good fit for facilities where staff move between tasks and hands-full interaction with Home Assistant matters.

The mobile phone in the greenhouse. Using the Home Assistant mobile app's voice for field commands. Works when the grower's phone is accessible. Less reliable when the phone is in a pocket; a satellite device is often more useful for hands-free use.

The workshop or farm office assistant. A satellite in the main working area. Used for morning briefings, query answers, and starting scheduled operations. General-purpose voice for the operator's primary location.

The outdoor assistant (with reservations). Outdoor voice is hard: wind, machinery, distance from the microphone, and general noise produce poor recognition. A satellite in a covered outdoor area (under an eave, inside an equipment shed) works better than one exposed to the elements. Honesty about the limits is important.

Per-satellite context. Home Assistant supports per-satellite area context — a satellite in the propagation area is associated with the propagation area, and commands like "turn on the fans" default to that area's fans. This makes commands feel natural without requiring the grower to specify the zone every time.
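The area-defaulting behavior can be illustrated with a small lookup. This is a conceptual sketch rather than Home Assistant's resolver: the entity records, satellite IDs, and area names are hypothetical, and the choice to return nothing when no area is known is an assumption made here to avoid acting on every matching device.

```python
from typing import List, Optional

# Hypothetical registry: which area each entity and each satellite belongs to.
ENTITIES = [
    {"entity_id": "fan.propagation_circulation", "area": "propagation", "kind": "fan"},
    {"entity_id": "fan.packing_shed_exhaust", "area": "packing shed", "kind": "fan"},
    {"entity_id": "light.propagation_bench", "area": "propagation", "kind": "light"},
]
SATELLITE_AREAS = {"satellite_prop_1": "propagation", "satellite_shed_1": "packing shed"}


def targets_for(kind: str, satellite_id: str,
                spoken_area: Optional[str] = None) -> List[str]:
    """Pick entities of one kind, preferring a spoken area over the satellite's."""
    area = spoken_area or SATELLITE_AREAS.get(satellite_id)
    if area is None:
        return []  # unknown context: do nothing rather than act everywhere
    return [
        entity["entity_id"]
        for entity in ENTITIES
        if entity["kind"] == kind and entity["area"] == area
    ]


# "Turn on the fans" spoken to the propagation satellite.
print(targets_for("fan", "satellite_prop_1"))
# "Turn on the fans in the packing shed" spoken to the same satellite.
print(targets_for("fan", "satellite_prop_1", spoken_area="packing shed"))
```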

What voice does not do well.

Critical decisions. Voice commands can be misheard. "Open the vents" misheard as "open the tents" does nothing; "turn on irrigation for Zone 1" misheard as "turn on irrigation for Zone 4" does the wrong thing. Voice is a suggestion layer; for commands with real consequences, confirmation or visual verification should be in the loop.
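One way to keep confirmation in the loop is to hold consequential actions until the grower hears them read back and says yes. The sketch below shows that logic in isolation; the `ConfirmationGate` class, the script IDs, and the `run:` return convention are hypothetical, and in a real deployment the equivalent would live in scripts or the conversation agent.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical script IDs mapped to the question read back to the grower.
CONFIRM_FIRST = {
    "script.zone_1_irrigation": "Start Zone 1 irrigation?",
    "script.zone_4_irrigation": "Start Zone 4 irrigation?",
}


@dataclass
class ConfirmationGate:
    # satellite_id -> action waiting for a spoken "yes"
    pending: Dict[str, str] = field(default_factory=dict)

    def request(self, satellite_id: str, action: str) -> str:
        """Run harmless actions at once; park consequential ones for confirmation."""
        if action in CONFIRM_FIRST:
            self.pending[satellite_id] = action
            return CONFIRM_FIRST[action]
        return f"run:{action}"

    def answer(self, satellite_id: str, reply: str) -> str:
        """Resolve a parked action based on the grower's reply."""
        action = self.pending.pop(satellite_id, None)
        if action is None:
            return "Nothing to confirm"
        if reply.strip().lower() in {"yes", "confirm"}:
            return f"run:{action}"
        return "Cancelled"


gate = ConfirmationGate()
# "Zone 1" was misheard as "Zone 4"; the read-back exposes the mistake.
print(gate.request("propagation", "script.zone_4_irrigation"))
print(gate.answer("propagation", "no"))
```

Reading the zone name back in the question is what makes the gate useful: the grower hears "Zone 4" and can refuse before any water moves.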

Complex numeric input. "Set the target to one point two five" works but is slower and less reliable than tapping an input on a dashboard. For anything more complex than setting a round number, the dashboard is more efficient.

Noisy environments. Voice outdoors, near running machinery, or in ventilation-dominated greenhouses is unreliable. The misrecognition rate climbs until voice is more frustrating than helpful. Accept the boundary; use dashboards where voice does not fit.

Non-verbal context. Voice cannot convey "I am pointing at this specific tray." For operations that involve specific physical locations in the moment, tapping on a dashboard often makes more sense than describing the location in words.

Detailed configuration. Configuring an automation by voice is painful. Voice for running existing things works well; voice for building new things works poorly.

Multi-user voice.

Operations with multiple users have additional considerations.

User identification. Home Assistant's voice pipelines can identify users by voice if configured. This enables user-specific behavior — "my schedule" means something different depending on who is asking. For most operations, identification is not essential; for operations where per-user permissions matter, it can be useful.

Language support. Home Assistant's voice pipelines support multiple languages. An operation with multilingual staff can configure each satellite or each user for the appropriate language. Mixed-language environments are possible; commands work in the speaker's language.

Permissions. Different users can have different voice permissions. Staff might be able to ask queries and run routine scripts but not change setpoints or disable automations. This is configured through Home Assistant's user-management system rather than through the voice layer directly, but the effects show up in voice: a staff member asking to disable an automation is declined.

Shared satellites. A satellite in a common area serves whoever is nearby. The voice pipeline identifies the commanding user based on voice (if configured) or treats commands as anonymous-but-authorized. For simple operations, anonymous-authorized is sufficient; for operations with complex permissions, voice identification matters.
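The permission outcomes described in this section reduce to a check of role against command type. The sketch below models only that decision; the role names, command categories, and reply text are hypothetical, and it does not represent how Home Assistant's user management stores permissions.

```python
# Hypothetical mapping of roles to the kinds of voice command they may issue.
ROLE_ALLOWED = {
    "owner": {"query", "run_script", "set_setpoint", "disable_automation"},
    "staff": {"query", "run_script"},
    # A shared satellite with no speaker identification.
    "anonymous": {"query", "run_script"},
}


def respond(role: str, command_kind: str) -> str:
    """Return 'ok' for permitted commands and a short decline otherwise."""
    allowed = ROLE_ALLOWED.get(role, set())
    if command_kind in allowed:
        return "ok"
    return "Sorry, that is not allowed here"


print(respond("staff", "run_script"))
print(respond("staff", "disable_automation"))
```

In the anonymous-but-authorized case, the set granted to `anonymous` is effectively the permission level of the room, so it should contain only what anyone standing there may do.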

Privacy considerations.

Voice is where privacy questions get concrete.

What gets recorded. Local voice pipelines process audio on the grower's hardware. Audio is used for wake word detection and speech recognition, then discarded; it is not stored by default. Cloud STT services may store audio per their terms of service; what happens to stored audio varies by provider.

The wake word issue. The wake word detection is always on — the microphone is always listening. This is how wake-word systems work. Local detection means the audio is not sent anywhere; only the command following a recognized wake word is further processed.

Operational privacy. Employees and visitors who enter a space with a voice satellite are in the range of an always-listening microphone. Disclosure is often appropriate — a sign or verbal notice that the space has voice assistance running. Laws vary by jurisdiction; check local requirements.

The difference between local and cloud. Local voice means no audio leaves the operation. Cloud voice means audio of commands (and incidental audio that happens to trigger the wake word) is sent to the cloud provider. For operations where privacy matters, local is the right default.

Common failure modes.

Specific voice-pipeline problems from real operations.

The wake word that never triggered. The grower said the wake word; nothing happened. Investigation: the microphone was too far, the environment was too loud, or the wake word volume threshold was set too high. Fix: reposition the satellite; tune wake word sensitivity; use a microphone with better pickup pattern.

The wake word that triggered constantly. The satellite kept activating from background audio — a radio, nearby conversations, machinery noise. Fix: reduce sensitivity; relocate the satellite; consider a different wake word with less ambiguity.

The command that was misrecognized. "Start Zone 1" was heard as "Start Zone 2" or "Start zones." The wrong zone ran irrigation. Fix: use zone aliases that are phonetically distinct; use confirmation for destructive commands; use the trace viewer to see what was actually recognized.
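A rough screen for confusable aliases can be run before deployment. The sketch below compares aliases pairwise with the standard library's `difflib.SequenceMatcher`. Spelling similarity is only a crude proxy for phonetic similarity, so treat the output as a list of names worth saying aloud and testing, not as a verdict. The threshold value and alias list are assumptions.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Tuple


def confusable_pairs(aliases: List[str],
                     threshold: float = 0.8) -> List[Tuple[str, str, float]]:
    """Return alias pairs whose spelling similarity meets the threshold."""
    pairs = []
    for first, second in combinations(aliases, 2):
        ratio = SequenceMatcher(None, first.lower(), second.lower()).ratio()
        if ratio >= threshold:
            pairs.append((first, second, round(ratio, 2)))
    return pairs


for first, second, ratio in confusable_pairs(["zone 1", "zone 2", "propagation"]):
    print(f"{first!r} vs {second!r}: similarity {ratio}")
```

Names that differ only in a trailing digit are flagged; renaming zones to something like "north bench" and "propagation" gives the recognizer more to distinguish.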

The response that was slow. The grower said the command; ten seconds later, nothing had happened. By then the grower had tapped the dashboard. Fix: use a faster Whisper model (or more capable hardware), improve the network path between the satellite and the host, or accept that local voice has a response latency floor.

The voice pipeline that stopped working after an update. Home Assistant updated; the voice configuration broke. Fix: test voice on staging before production updates; keep dashboards available as fallback; voice is a convenience layer and should never be a single point of dependency.

The satellite that disconnected from WiFi frequently. ESP32 satellites sometimes have unreliable WiFi connection; dropped connections interrupt voice service. Fix: WiFi coverage in the satellite's location matters; for unreliable networks, consider wired Ethernet ESP32 boards or move the satellite to better coverage.

The voice feedback loop. The satellite's TTS output was loud enough that its own microphone heard it; the response triggered interpretation; a loop formed. Fix: mute the microphone during TTS playback; reduce speaker volume relative to microphone sensitivity; use a satellite with acoustic echo cancellation.
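The "mute the microphone during TTS playback" fix amounts to ignoring captured audio until playback has finished, plus a short tail for room echo. The sketch below shows that timing logic with explicit timestamps so it can be reasoned about without hardware; the `MicGate` class and the tail length are hypothetical, and real satellites typically handle this in firmware.

```python
class MicGate:
    """Ignore microphone input while the satellite's own speech is playing."""

    def __init__(self, tail_seconds: float = 0.5) -> None:
        self.tail_seconds = tail_seconds   # extra quiet time for room echo
        self.muted_until = 0.0

    def playback_started(self, now: float, duration: float) -> None:
        """Extend the mute window to cover this playback and its tail."""
        end = now + duration + self.tail_seconds
        self.muted_until = max(self.muted_until, end)

    def accept_audio(self, now: float) -> bool:
        """True when captured audio should be passed on to wake word detection."""
        return now >= self.muted_until


gate = MicGate(tail_seconds=0.5)
gate.playback_started(now=10.0, duration=2.0)   # a two-second spoken reply
print(gate.accept_audio(now=11.0))   # during playback
print(gate.accept_audio(now=12.5))   # after playback and tail
```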

The voice command that worked for the owner but not for a staff member. Different voice, different accent, different phrasing. The Whisper model recognized one well and the other poorly. Fix: train or fine-tune per-user if the issue is severe; simpler alternatives include adjusting phrasing, using shorter commands, and providing dashboard fallback for staff who find voice unreliable.

The voice assistant that leaked context. An LLM-based intent handler, asked about operational details, responded with information that should not have been shared with whoever was in the room. Fix: scope the LLM's context carefully; do not include sensitive data in the prompt context; consider local LLMs for operations with sensitive data.

The outdoor voice that was nearly unusable. A satellite mounted under an overhang at the greenhouse entrance worked poorly; wind, distant machinery, and ambient noise produced frequent misrecognitions. Fix: be honest about the use case; outdoor voice is hard. Move the satellite inside; use a phone-based alternative for outdoor commands; accept that the right answer is sometimes "voice does not fit here."

What not to do.

Patterns to avoid.

Don't assume voice will work perfectly. Design with the misrecognition case in mind. Confirmation prompts, visual feedback, and fallback interfaces keep the system usable when voice stumbles.

Don't expose every entity to voice. Exposing everything means a misheard command can do anything. Expose only the entities and scripts that make sense for voice control; keep configuration and sensitive operations off voice.

Don't route safety-critical operations through voice alone. A misheard command should not lead to catastrophic consequences. For commands that matter, confirmation is appropriate; for truly critical operations, voice is not the right channel.

Don't over-engineer responses. A voice assistant that speaks in long formal sentences is exhausting. Responses should be brief and clear. "Done" is often better than "I have successfully executed the command you requested."

Don't forget to test recognition with actual users. The owner's voice and vocabulary are not the staff's voice and vocabulary. Recognition that works well for one person may fail for another. Test broadly.

Don't deploy voice in environments where it cannot work. Outdoor, noisy, or acoustically difficult spaces produce frustrating voice experiences. Dashboards work everywhere; voice should be deployed where it has a fighting chance.

Don't skip the privacy disclosure. Workers and visitors have reasonable expectations about always-listening microphones. Disclosure is a small gesture that prevents a lot of friction.

Don't make voice the only way to do something. Every voice-invokable action should also be accessible through a dashboard or other interface. When voice fails, the operation should continue.