Wearable Validity Atlas

How accurate are wearables, really?

An auditable, living reference synthesizing the independent validation literature into a single computed grade per device × claim. The pattern it surfaces: wearables are best validated for the metrics nobody buys them for — and least validated for the ones they're marketed on.

Click any cell for its full audit trail. A Grade D means the claim is marketed but has no independent study in the current seed corpus — not that none exists anywhere.

How grades work

Every grade is computed, not assigned. Each validation study is reduced to a common form: bias (trueness) and precision (random error). Correlation-only evidence is rejected, because a device can correlate perfectly with the criterion and still be systematically wrong.
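To make the correlation point concrete, here is a minimal sketch with invented numbers: a hypothetical device that reads exactly 10 bpm above the criterion. It assumes bias is the mean of the paired differences and precision is their standard deviation, a common Bland-Altman-style convention; the Atlas's actual reduction may differ. The function names are illustrative only.

```python
from statistics import mean, stdev


def bias_and_precision(device, criterion):
    """Return (bias, precision) from paired readings.

    Assumed convention: bias is the mean of (device - criterion),
    precision is the sample SD of those differences.
    """
    diffs = [d - c for d, c in zip(device, criterion)]
    return mean(diffs), stdev(diffs)


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Invented data: criterion heart rate (bpm) and a device with a constant offset.
criterion = [60.0, 70.0, 80.0, 90.0, 100.0]
device = [c + 10.0 for c in criterion]

bias, precision = bias_and_precision(device, criterion)
r = pearson_r(device, criterion)

print(f"correlation = {r:.3f}, bias = {bias:.1f} bpm, precision = {precision:.1f} bpm")
```

The correlation comes out at essentially 1 even though every reading is 10 bpm off, which is why a correlation coefficient alone says nothing about trueness.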

Errors are then made comparable across metrics with different units via the Resolution Ratio:

R = device precision ÷ smallest worthwhile change

R < 1 means the device can resolve a change worth acting on · R ≥ 3 means the noise is at least three times that change and swamps the signal. A claim resting only on a lossy statistic (MAPE, “% within”) is capped below A, because one conflated number can’t prove both trueness and precision.
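As a rough illustration of the Resolution Ratio, the sketch below computes R and applies the two thresholds given in the text. The function names and the example figures (2 bpm precision, 5 bpm smallest worthwhile change) are hypothetical, and the band between 1 and 3 is labelled only as intermediate because the text does not characterize it.

```python
def resolution_ratio(precision, smallest_worthwhile_change):
    """R = device precision / smallest worthwhile change (same units)."""
    if smallest_worthwhile_change <= 0:
        raise ValueError("smallest worthwhile change must be positive")
    if precision < 0:
        raise ValueError("precision (random error) cannot be negative")
    return precision / smallest_worthwhile_change


def interpret(r):
    """Map R onto the two thresholds stated in the text."""
    if r < 1:
        return "resolves a change worth acting on"
    if r >= 3:
        return "noise exceeds the signal"
    # Assumption: the text does not describe 1 <= R < 3.
    return "intermediate"


# Hypothetical example: 2 bpm random error against a 5 bpm worthwhile change.
r_example = resolution_ratio(2.0, 5.0)
print(f"R = {r_example:.2f}: {interpret(r_example)}")
```

Because both inputs share a unit, R is dimensionless, which is what lets a heart-rate claim and a sleep-duration claim be compared on one scale.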

Two scores sit under every grade, because one letter can’t answer two different questions. Accuracy (0–100) is how close the device gets to the criterion when measured; Confidence (0–100) is how much independent evidence backs that number. This mirrors the clinical GRADE split between an effect and the certainty of evidence. A marketed-but-unstudied claim (Grade D) has no accuracy score and near-zero confidence — an empty cell of knowledge, not a bad measurement. Cell shading reflects confidence: faint cells are weakly evidenced.
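One way to picture the two-score split is as a small record where accuracy is optional and confidence is always present. This is a sketch under assumptions: the class name, field names, and the linear opacity mapping with a 0.15 floor are all hypothetical and not taken from the Atlas itself. The only rules encoded from the text are the 0–100 ranges and that a Grade D cell carries no accuracy score.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CellScores:
    """Hypothetical record for one device x claim cell."""
    grade: str
    accuracy: Optional[float]  # 0-100, or None when the claim is unstudied
    confidence: float          # 0-100

    def __post_init__(self):
        if not 0 <= self.confidence <= 100:
            raise ValueError("confidence must be in 0-100")
        if self.accuracy is not None and not 0 <= self.accuracy <= 100:
            raise ValueError("accuracy must be in 0-100")
        if self.grade == "D" and self.accuracy is not None:
            raise ValueError("a Grade D cell has no accuracy score")


def shading_opacity(cell, floor=0.15):
    """Assumed mapping: opacity rises linearly with confidence.

    The floor keeps weakly evidenced cells faint but still visible.
    """
    return floor + (1 - floor) * cell.confidence / 100


# Invented cells: one unstudied claim and one well-evidenced claim.
unstudied = CellScores(grade="D", accuracy=None, confidence=0.0)
well_studied = CellScores(grade="A", accuracy=92.0, confidence=100.0)

print(shading_opacity(unstudied), shading_opacity(well_studied))
```

Keeping accuracy as None rather than 0 preserves the distinction the text draws: an unstudied claim is missing knowledge, not a poor measurement.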

Out of scope: proprietary composite scores (Readiness, Recovery, Strain, Body Battery). They have no external criterion in physical units — nothing measures “readiness” — so there is nothing to validate them against. The Atlas grades only claims that can be checked against a gold standard.