Monitoring Home Assistant Itself — Home Assistant on OpenAgTechnology

A Home Assistant installation monitors everything in the operation — zone temperatures, soil moisture, equipment state, crop conditions. Something has to monitor Home Assistant. A host whose disk has silently filled, whose memory is exhausted, whose database is corrupted, or whose critical integrations have been failing for days is a Home Assistant that cannot do its job. For agricultural operations, the failure mode is specific: the grower trusts Home Assistant to watch over the operation, and when Home Assistant stops watching, crops suffer. The monitoring does not need to be elaborate — a handful of template sensors and well-scoped automations catch the most common problems. It does need to exist, and it needs to alert someone who will act. This page covers what to monitor (host health, Home Assistant's own health, database size, integration status, automation execution, sensor freshness), what to alert on and what to watch silently, how to avoid alert fatigue, and the specific failure modes that a monitoring layer catches versus misses. The goal is not surveillance of every statistic; it is detection of the conditions that would otherwise lead to crop-affecting outages.

Before setting up monitoring.

Prerequisites and framing.

A working notification channel. Monitoring is only useful if alerts reach someone. [Notification Services](/home-assistant/integrations/notifications) covers the delivery mechanisms — mobile app, SMS, email, Telegram, and others. Before setting up monitoring, ensure the notification path works end to end.

A clear recipient. Alerts need to go to someone who will act on them. For most agricultural operations, that is the grower's phone. For operations with multiple people, it may be a rotating on-call arrangement or a group channel. Whatever the pattern, alerts should not go into a channel nobody reads.

Awareness of false-alarm cost. Every false alarm wastes the recipient's attention and slightly erodes trust in future alerts. Monitoring should be tuned to avoid crying wolf. Starting conservative (fewer alerts, higher confidence) and adding coverage as patterns become clear is usually better than starting aggressive and dialing back.

Understanding what monitoring cannot catch. No monitoring catches everything. A subtle configuration bug that produces wrong automation behavior may not show up as "broken" — everything technically works, just not right. A sensor reading that is drifting out of calibration may continue reporting values within expected ranges. Monitoring catches the obvious failure modes; operational judgment catches the rest.

What to monitor.

The categories that deserve attention.

Host health. CPU usage, memory usage, disk space, disk I/O, temperature (for systems in hot environments). The underlying hardware has to be healthy for Home Assistant to work.

Home Assistant's own health. Whether the Home Assistant process is running and the database is healthy. Integration status: integrations that have errored out or lost connection. Log errors and warnings. Performance: if startup times are growing or service calls are slow, something is worth investigating.

Sensor freshness. Sensors that have not updated recently are suspect. A temperature sensor that last reported six hours ago points to either a sensor failure or a network issue. Monitoring for stale sensors catches silent failures that do not produce explicit errors.

Automation execution. Whether automations that should run on a schedule or in response to conditions are actually running. A cooling automation that has not fired in days during summer heat is suspicious; a scheduled morning briefing that did not arrive is suspicious. Silent failure of automations is worse than explicit failure.

Network connectivity. The Home Assistant host's connection to the rest of the network. Internet connectivity for operations that need it. Specific service availability (MQTT broker, database, any external services).

Backup execution. Scheduled backups completing successfully. An absent backup is a concerning signal — it means recovery options are narrowing silently.

Certificate expiration. For operations using HTTPS with certificates that expire, monitoring expiration dates prevents surprise outages when certificates lapse.

Host monitoring.

The hardware layer.

CPU usage. Sustained high CPU indicates something is working hard — often a specific automation firing continuously, a template that is expensive to evaluate, or an integration polling too aggressively. Monitoring for sustained high CPU (say, above 80% for more than 10 minutes) catches these patterns. Short spikes are normal; sustained load is not.
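
The distinction between a short spike and sustained load can be sketched in a few lines of Python. This is a minimal illustration, not a Home Assistant API: the function name and the `(timestamp, percent)` sample format are assumptions, and the defaults simply mirror the 80% and 10-minute figures above.

```python
from datetime import datetime, timedelta

def sustained_high_cpu(samples, threshold=80.0, window=timedelta(minutes=10)):
    """Return True if the most recent unbroken run of samples above
    `threshold` spans at least `window`.

    `samples` is a list of (timestamp, cpu_percent) tuples in ascending
    time order (a hypothetical format chosen for this sketch).
    """
    if not samples:
        return False
    run_start = None
    # Walk backwards from the newest sample until the load drops.
    for ts, pct in reversed(samples):
        if pct <= threshold:
            break
        run_start = ts
    if run_start is None:
        return False  # newest sample is already below the threshold
    return samples[-1][0] - run_start >= window
```

A single dip below the threshold resets the run, so a brief spike or an interrupted burst does not count as sustained load.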

Memory usage. Home Assistant's memory footprint grows with the installation — more integrations, more entities, more history. Monitoring free memory catches memory leaks (rare but real) and capacity issues (adding too much to a constrained host). Alert when free memory drops below a threshold.

Disk space. The most common host failure mode in production. Home Assistant's database grows; logs accumulate; add-on data (Frigate recordings, InfluxDB) expands. A full disk can stop Home Assistant from writing state changes, corrupt the database, and cause cascading failures. Alert when disk usage exceeds 80%; act when it exceeds 90%.
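
As a rough sketch of the two-tier threshold, the following Python uses the standard library's `shutil.disk_usage` to compute the used percentage and classify it. The function names and the `"ok"`, `"alert"`, `"act"` labels are illustrative assumptions; the 80% and 90% defaults come from the guidance above.

```python
import shutil

def disk_used_percent(path="/"):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total

def classify_disk(used_pct, alert_pct=80.0, act_pct=90.0):
    """Map a used-percentage to a status label (labels are hypothetical)."""
    if used_pct > act_pct:
        return "act"     # free space now: prune recordings, purge history
    if used_pct > alert_pct:
        return "alert"   # notify the grower; there is still time to plan
    return "ok"
```

Calling `classify_disk(disk_used_percent("/"))` from a scheduled script would give one status value per check.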

Disk I/O. Not usually alert-worthy on its own, but high sustained I/O can indicate database issues, heavy logging, or runaway processes. Visibility is useful for investigation.

System temperature. For hosts in warm environments (farm office in summer, greenhouse equipment area), CPU temperature matters. Systems running hot for extended periods degrade faster and may thermally throttle (reducing performance) or shut down (causing outages). Alert when temperatures exceed the manufacturer's recommended limits.

The System Monitor integration. Home Assistant's built-in System Monitor integration exposes CPU, memory, disk, and related values as entities. Enabling it is the starting point for host monitoring. Additional custom sensors can be added for specifics the built-in integration does not cover.

Home Assistant application monitoring.

The software layer.

Home Assistant uptime. The Uptime sensor records when Home Assistant last started. A restart that the grower did not initiate is worth knowing about: it means Home Assistant crashed, was killed by the OS, or restarted for an unexpected reason. An automation that notifies on unexpected restarts catches this.

Recorder database size. Home Assistant's recorder stores state history. The database grows with every state change. An out-of-control database (tens of gigabytes) affects performance and backup size. Monitoring the database file size over time catches growth patterns before they become problems.

Integration status. Integrations can fail silently — a cloud integration whose authentication expired, a local integration whose device is unreachable, a custom integration that errored out after an update. Home Assistant logs these errors but does not necessarily surface them prominently. A monitoring pattern that watches the system log for integration errors and surfaces them as notifications catches silent failures.

Log warnings and errors. The Home Assistant log accumulates warnings and errors. A sudden increase in error frequency is a signal that something has changed and is worth investigating. Monitoring log error rates (rather than individual errors) catches patterns without alerting on every minor issue.

Entity unavailable counts. An entity in the "unavailable" state is a sensor or integration that is not reporting. A count of unavailable entities over time reveals ongoing issues. A single unavailable entity is often normal (a device that is temporarily offline); a growing count suggests a broader problem.
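
One way to turn this into a number is to count unavailable entities per domain. The sketch below assumes a list of state dictionaries with `entity_id` and `state` keys, the shape returned by Home Assistant's REST API at `/api/states`; fetching that list (URL, access token) is left out, and the function name is hypothetical.

```python
from collections import Counter

def unavailable_by_domain(states):
    """Count entities whose state is "unavailable", grouped by domain.

    `states` is a list of dicts with at least "entity_id" and "state".
    """
    counts = Counter()
    for s in states:
        if s.get("state") == "unavailable":
            domain = s["entity_id"].split(".", 1)[0]
            counts[domain] += 1
    return counts
```

Grouping by domain makes it easier to tell one flaky device apart from a whole integration going offline.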

Automation failure counts. Automations that error during execution show up in the automation trace. Typical causes are template errors, service call failures, and references to unavailable entities. Monitoring automation error counts catches automations that are no longer working correctly.

Sensor freshness monitoring.

Catching silent sensor failures.

The dead-man switch pattern. For each critical sensor, an automation checks whether the sensor has updated within an expected interval. If not, an alert fires. This catches sensors that silently stop reporting — battery dead, network dropped, device failed — without producing explicit errors.

Implementation options.

- Template-based. A template sensor compares the sensor's `last_updated` time to the current time. If the gap exceeds a threshold, the template sensor goes to a "stale" state.
- Age-based automations. An automation triggers when an entity's `last_updated` is older than a threshold. The automation sends an alert and potentially takes other action.
- Dedicated integrations. Some community integrations provide dead-man-switch functionality directly, reducing configuration overhead.

Which sensors to cover. Every critical sensor. Temperature and humidity sensors in each zone. Soil moisture sensors. Flow sensors. CO2 sensors. Battery-powered sensors (which often fail silently due to dead batteries). The dead-man pattern is cheap per sensor; erring toward more coverage is usually the right call.

Threshold tuning. A sensor reporting every 10 seconds can reasonably be treated as stale at 5 minutes; a sensor reporting every 5 minutes, at 20. The threshold should be a small multiple of the expected reporting interval, with a floor of a few minutes for very frequent reporters: enough slack for occasional missed updates without false alarms, not so much that a real failure goes unnoticed for hours.
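
The staleness check and its threshold can be sketched as follows. The multiple of 4 and the 5-minute floor are assumptions, picked only because together they reproduce the two examples above (10 seconds gives 5 minutes, 5 minutes gives 20 minutes); the function names are hypothetical.

```python
from datetime import datetime, timedelta

def stale_threshold(expected_interval, multiple=4, floor=timedelta(minutes=5)):
    """Staleness threshold: a small multiple of the reporting interval,
    but never shorter than `floor` (both defaults are assumptions)."""
    return max(expected_interval * multiple, floor)

def is_stale(last_updated, now, expected_interval):
    """True if the sensor has been silent longer than its threshold."""
    return now - last_updated > stale_threshold(expected_interval)
```

For example, `is_stale(last_seen, datetime.now(), timedelta(minutes=5))` flags a five-minute sensor only after more than 20 minutes of silence.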

Per-sensor versus aggregate. Some operations want individual alerts per sensor; others want aggregate "X sensors offline" alerts to avoid multiple notifications during a common-cause failure (network outage, for example). Both patterns are legitimate; choose based on how the operation wants to respond.

Automation execution monitoring.

Making sure automations are doing their job.

The missed-schedule pattern. For automations that run on a known schedule, monitor that they actually ran. A morning briefing automation that did not fire at 06:00 is a problem; if it did not fire today and also did not fire yesterday, the problem is probably not transient. Having the automation update an `input_datetime` on each run, with a separate check that the stored time is recent, catches this.
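
The logic of that check, independent of how the last-run time is stored, might look like the sketch below. The function name, the 15-minute grace period, and the use of naive local datetimes are assumptions made for illustration.

```python
from datetime import datetime, time, timedelta

def missed_schedule(last_run, now, scheduled=time(6, 0),
                    grace=timedelta(minutes=15)):
    """True if the most recent due run has no recorded execution.

    `last_run` is the time the automation last recorded (or None if it
    never ran); `scheduled` is the daily run time.
    """
    due = datetime.combine(now.date(), scheduled)
    if now < due + grace:
        # Today's run is not overdue yet, so check yesterday's instead.
        due -= timedelta(days=1)
    return last_run is None or last_run < due
```

The grace period keeps the check from firing while the automation is still legitimately running or slightly late.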

Expected-trigger-count monitoring. Some automations should fire multiple times per day under normal conditions. A "soil moisture irrigation" automation that has fired zero times in a dry week is suspicious. A counter that tracks how often the automation fires, with an alert on counts below the expected range, catches patterns that monitoring individual runs would miss.

Trace review. Not an automation pattern but an operational practice. Periodic review of the trace history for critical automations reveals subtle behavior problems — an automation that fires but whose actions fail, a trigger that fires on wrong conditions, a condition that is blocking execution more than expected.

Automation disabled monitoring. An automation that gets disabled and forgotten is a silent failure of operational intent. A template sensor that counts disabled automations, or specifically flags critical automations when disabled, catches the "someone turned this off and forgot" case.

Alert design and fatigue.

The discipline of sending alerts that matter.

Every alert should be actionable. An alert that the grower receives but cannot act on is noise. If the monitoring detects a condition that does not require action, it should be logged but not alerted. Reserve alerts for conditions that require attention now.

Severity tiering. Not every alert is urgent. A cooling automation that failed during a heat wave is urgent. A backup that failed last night is important but not an emergency. A slightly elevated log warning rate is informational. Tiering (critical, warning, info) lets the notification channel reflect the severity. Critical goes to SMS or push with a distinct sound; informational goes to an email digest.
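
A tiering scheme like this reduces to a small lookup table. In the sketch below the channel names are placeholders, and routing warnings to an ordinary push notification is an assumption, since the text only specifies the critical and informational tiers.

```python
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"
    WARNING = "warning"
    INFO = "info"

# Channel names are hypothetical placeholders for real notify targets.
CHANNELS = {
    Severity.CRITICAL: ["sms", "push_loud"],  # needs attention now
    Severity.WARNING: ["push"],               # assumed middle tier
    Severity.INFO: ["email_digest"],          # can wait for review
}

def route(severity):
    """Return the list of channels an alert of this severity goes to."""
    return list(CHANNELS[severity])
```

Keeping the mapping in one place makes it easy to audit that nothing informational is reaching the critical channel.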

Aggregation for common-cause failures. A network outage that takes out many sensors at once should not produce fifty alerts. Monitoring that detects multiple related failures and aggregates them into one alert ("20 sensors offline — possible network issue") is more useful than fifty individual "sensor offline" alerts.
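
A minimal sketch of that aggregation, assuming the set of offline sensors has already been computed. The function name and the cut-over point of five sensors are illustrative choices, not recommendations from a specific deployment.

```python
def build_alerts(offline_sensors, aggregate_at=5):
    """Return alert messages for the given offline sensors.

    Below `aggregate_at`, each sensor gets its own message; at or above
    it, a single summary message replaces them.
    """
    if not offline_sensors:
        return []
    if len(offline_sensors) >= aggregate_at:
        n = len(offline_sensors)
        return [f"{n} sensors offline, possible network issue"]
    return [f"Sensor offline: {name}" for name in sorted(offline_sensors)]
```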

Time-of-day awareness. Some alerts that are urgent during work hours are less urgent overnight — or vice versa. A 3 AM alert for a situation that can wait until morning erodes trust; a next-morning alert for a situation that should have been handled at 3 AM misses the point. Severity and timing should align with actual operational priorities.

Deduplication. An alert condition that persists should not produce an alert every minute. Once notified, the grower does not need repeat notifications while actively responding. Deduplication — "this alert has fired; do not fire again for 30 minutes unless it resolves and re-fires" — keeps the channel usable.
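
The cooldown rule can be sketched as a small class that remembers when each alert was last sent. The class and method names are hypothetical; the 30-minute default follows the example above.

```python
from datetime import datetime, timedelta

class AlertDeduplicator:
    """Suppress repeats of the same alert within a cooldown window."""

    def __init__(self, cooldown=timedelta(minutes=30)):
        self.cooldown = cooldown
        self._last_sent = {}  # alert key -> time of last notification

    def should_notify(self, key, now):
        """True if an alert with this key should be sent at `now`."""
        last = self._last_sent.get(key)
        if last is not None and now - last < self.cooldown:
            return False
        self._last_sent[key] = now
        return True

    def resolve(self, key):
        """Mark the condition resolved so a re-fire notifies immediately."""
        self._last_sent.pop(key, None)
```

Calling `resolve` when the condition clears is what allows a genuine recurrence to notify again before the cooldown would otherwise have expired.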

Clear recovery signals. When a problem resolves, a recovery notification (not always needed, but often useful) tells the grower they can stop worrying. An alert that fires and then goes silent leaves the grower uncertain whether it resolved or the monitoring itself stopped working.

Regular review. Review which alerts have fired every quarter, or at whatever cadence fits. Are there patterns of false positives? Are there alerts that never fire (possibly because their conditions are wrong)? Are there conditions that happened but did not alert? The review catches miscalibrated monitoring before it becomes a chronic problem.

Dashboards for operational awareness.

Monitoring does not have to be only alerts.

A system-health dashboard. A dashboard showing CPU, memory, disk, Home Assistant uptime, recent error rate, and integration status. Checked occasionally — not every time the grower opens Home Assistant, but when curiosity or troubleshooting calls for it. See [Dashboard Design for Growers](/home-assistant/dashboards/design) for broader dashboard principles.

Key status indicators on the primary dashboard. A single "system OK" indicator on the main dashboard that turns yellow or red if something is wrong. Catches things at a glance without dedicating the primary dashboard to monitoring data.

Historical trends through Grafana. For operations running Grafana (covered in [Grafana Integration](/home-assistant/dashboards/grafana)), long-term trends of CPU, memory, disk usage, and other system metrics reveal growth patterns that short-term monitoring does not.

Graybox service monitoring. Operations running other services alongside Home Assistant (InfluxDB, Grafana, Frigate) benefit from monitoring those services too. The graybox principle — one host, multiple services — means one host's monitoring covers multiple components.

External monitoring.

Monitoring from outside the operation.

The limit of self-monitoring. A Home Assistant installation monitoring itself cannot detect that Home Assistant is down. If the whole thing is offline, there is no active monitoring to send an alert. For operations where complete outages matter, external monitoring fills this gap.

Simple external uptime monitoring. A service outside the operation (UptimeRobot, Better Stack, Healthchecks.io, a simple script on a VPS, or a second Home Assistant instance at another site) periodically polls the Home Assistant installation to confirm it responds. If it does not respond, the external service sends an alert.

The ping-back pattern. Home Assistant periodically pings an external service (a URL, an MQTT broker, a third-party monitoring service); the external service alerts if pings stop arriving. Less direct than the external-poll pattern but does not require the Home Assistant installation to expose a reachable endpoint to the internet.
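
Both halves of the ping-back pattern are small. The sketch below shows a sender that requests a URL and a receiver-side check for overdue pings. The function names, the 5-minute interval, and the 2-minute grace period are assumptions; the URL would be whatever check-in address the chosen monitoring service issues.

```python
import urllib.request
from datetime import datetime, timedelta

def send_heartbeat(url, timeout=10):
    """Sender side: request the check-in URL; True on a 2xx response."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except OSError:
        # Covers connection failures, timeouts, and HTTP error statuses.
        return False

def heartbeat_overdue(last_ping, now, interval=timedelta(minutes=5),
                      grace=timedelta(minutes=2)):
    """Receiver side: True if no ping arrived within interval + grace."""
    return now - last_ping > interval + grace
```

The grace period absorbs ordinary jitter, so one slightly late ping does not raise an outage alert.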

Nabu Casa status. For operations using Nabu Casa, the service provides some visibility into whether Home Assistant is reachable. Not a full monitoring solution but a useful additional signal.

Cellular-based alerts for critical operations. For operations where a complete outage can cost a crop, a separate cellular-connected device that can alert independently of the operation's network is worth considering. Overkill for most; essential for high-value operations.

Common failure modes.

Specific monitoring failures from real deployments.

The monitoring that only covered problems already visible elsewhere. Every condition the monitoring caught was something the grower would have noticed anyway. The real issues, such as silent sensor failures and gradual disk fill, were not covered. Fix: monitor the things that are actually silent; redundant alerts for already-visible problems are not useful.

The disk that filled because nothing monitored it. Home Assistant's database grew steadily; the disk reached 100%; the database corrupted; hours of recovery work. Fix: disk space monitoring with alerts at 80%; retention policies that prevent unbounded growth.

The alert channel that nobody was watching. Monitoring alerts went to an email inbox the grower had stopped checking. Real alerts sat unread. Fix: alerts go to channels actively monitored (phone push, SMS); verify periodically that the recipient is actually seeing the alerts.

The dead-man alert on a sensor that was deliberately offline. A seasonal zone was shut down; the dead-man alert on its sensor fired repeatedly. The grower disabled the alert; forgot to re-enable it when the zone came back. Fix: disable-with-expiration patterns; monitoring that is aware of operational state (zone mode) and suppresses expected silence.
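
A disable-with-expiration check combined with zone awareness might look like the following sketch. The `zone_mode` values and the function name are hypothetical; the point is that a snooze carries an end time instead of being a permanent off switch.

```python
from datetime import datetime, timedelta

def alert_suppressed(now, zone_mode="active", snoozed_until=None):
    """True if a dead-man alert should stay quiet right now.

    An alert is suppressed while its zone is not in production
    (any mode other than "active") or while a snooze has not expired.
    """
    if zone_mode != "active":
        return True
    return snoozed_until is not None and now < snoozed_until
```

Once the snooze time passes and the zone is active again, the alert resumes without anyone having to remember to re-enable it.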

The alert storm during a network outage. A five-minute internet outage produced dozens of cloud-integration-unavailable alerts. Fix: aggregation; monitor for network state first and suppress downstream alerts during known outages.

The monitoring that stopped working after an update. A Home Assistant update changed behavior; the monitoring's expectations were wrong; alerts stopped firing. The grower noticed only when a real problem appeared. Fix: periodic testing of the monitoring itself; monitoring that has not fired in a long time deserves verification, not assumption.

The certificate that expired during a vacation. HTTPS certificate expired while the grower was away; remote access broke; the grower did not notice until trying to connect. Fix: monitor certificate expiration; alert well in advance (30 days, 7 days, 1 day).
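
The staged warnings can be sketched with the standard library's `ssl.cert_time_to_seconds`, which parses the `notAfter` string format found in the dictionary returned by `ssl.SSLSocket.getpeercert()`. Retrieving the certificate is omitted here, and the function names are hypothetical.

```python
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after, now):
    """Whole days between `now` (timezone-aware) and the certificate's
    notAfter time, e.g. "Jan  5 09:34:43 2030 GMT"."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days

def expiry_warning(days_left, thresholds=(30, 7, 1)):
    """Return the tightest warning threshold already crossed, or None."""
    crossed = [t for t in thresholds if days_left <= t]
    return min(crossed) if crossed else None
```

Alerting only when the returned threshold changes (from 30 to 7, then 7 to 1) gives three notifications in total, not one per day.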

The host that rebooted repeatedly. A hardware issue caused the host to crash and restart repeatedly. Each restart was brief enough that individual outage alerts were suppressed by their own deduplication. The pattern was not visible. Fix: monitoring for restart frequency, not just individual restarts.
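
Restart frequency is a count over a sliding window. In the sketch below, the function name, the one-hour window, and the limit of three restarts are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta

def restart_storm(restart_times, now, window=timedelta(hours=1),
                  max_restarts=3):
    """True if at least `max_restarts` restarts fall inside the window
    ending at `now`."""
    recent = [t for t in restart_times if now - window <= t <= now]
    return len(recent) >= max_restarts
```

Because this looks at the count, it still fires when each individual restart alert was deduplicated away.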

The alert that arrived after the crop was lost. The monitoring caught the condition; the alert went out; the grower was asleep; the acknowledgment did not happen for hours. Fix: for truly critical conditions, add escalation. If the primary recipient does not acknowledge within a time window, escalate to a secondary channel or recipient (SMS, phone call, second person).
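
An escalation chain can be expressed as a list that widens by one entry for each unacknowledged step. The function name, the 10-minute step, and the recipient labels in the usage note are hypothetical.

```python
from datetime import datetime, timedelta

def recipients_to_notify(fired_at, now, acknowledged, chain,
                         step=timedelta(minutes=10)):
    """Return the recipients who should have been notified by `now`.

    The first entry in `chain` is notified immediately; each further
    entry is added after another `step` passes without acknowledgment.
    """
    if acknowledged:
        return []
    levels = int((now - fired_at) / step) + 1
    return chain[:levels]
```

With a chain such as `["grower_push", "grower_sms", "second_person_call"]`, an unacknowledged alert reaches the second person 20 minutes after it first fired.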

The alert that was misinterpreted. An alert said "low soil moisture" but the context was a zone that had just been harvested and was deliberately dry. The grower spent time investigating before realizing the zone state. Fix: alerts that carry operational context ("low soil moisture in Zone 3, current production mode: active") prevent the wrong conclusion.

What not to do.

Patterns to avoid.

Don't alert on everything. Alert fatigue is real. Reserve alerts for actionable conditions. Monitor everything (visibility is free); alert selectively.

Don't set alert thresholds at the edge of normal. A threshold that fires every time conditions are marginal produces false positives. Set thresholds with margin beyond normal operating range.

Don't rely only on self-monitoring for critical operations. Complete host failure silences self-monitoring. External monitoring catches the cases self-monitoring cannot.

Don't forget to verify the monitoring works. Alerts that never fire might be perfectly calibrated — or they might be broken. Periodic test-firing confirms the monitoring path works end to end.

Don't ignore repeated alerts. An alert firing repeatedly means the condition persists. Acknowledging without acting means the alert was wasted. Each alert should either be investigated or have its threshold tuned so it stops firing.

Don't send non-critical alerts to critical channels. SMS is for things that need attention now. Email is for things that can wait. Routing everything to the critical channel trains the grower to ignore it.

Don't treat monitoring as a one-time setup. The operation changes; the installation changes; what was adequate monitoring last year may not be now. Periodic review keeps monitoring aligned with current needs.

Don't monitor the monitoring endlessly. At some point, additional layers of monitoring-the-monitoring become overhead without value. A reasonable self-monitoring layer plus external uptime monitoring is usually enough; deeper stacks are a sign of over-engineering.