Every grade is computed, not assigned. Each validation study is reduced to a common form: bias (trueness) and precision (random error). Correlation-only evidence is rejected, because a device can correlate perfectly with the criterion and still be systematically wrong.
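To make the point about correlation concrete, here is a minimal sketch with made-up paired readings. It assumes bias is the mean device-minus-criterion difference and precision is the standard deviation of those differences; that is one common convention, not necessarily the exact reduction the Atlas applies. The function names and the data are hypothetical.

```python
from statistics import mean, stdev


def bias_and_precision(device, criterion):
    """Return (bias, precision) for paired readings.

    Assumed convention: bias is the mean of the differences,
    precision is the sample standard deviation of the differences.
    """
    diffs = [d - c for d, c in zip(device, criterion)]
    return mean(diffs), stdev(diffs)


def pearson_r(x, y):
    """Plain Pearson correlation coefficient, written out by hand."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


# Illustrative data only: a device that always reads 8 units too high.
criterion = [50.0, 60.0, 70.0, 80.0, 90.0]
device = [c + 8.0 for c in criterion]

bias, precision = bias_and_precision(device, criterion)
r = pearson_r(device, criterion)
print(f"bias={bias:.1f}  precision={precision:.1f}  r={r:.2f}")
```

In this toy case the correlation is perfect while every reading is off by 8 units, which is why a correlation coefficient alone says nothing about trueness.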
Errors are then made comparable across metrics with different units via the Resolution Ratio:
R = device precision ÷ smallest worthwhile change
When R < 1, the device can resolve a change worth acting on; when R ≥ 3, the noise exceeds the signal. A claim resting only on a lossy statistic (MAPE, “% within”) is capped below A, because one conflated number can’t prove both trueness and precision.
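The Resolution Ratio can be sketched in a few lines. The function names, the input values, and the "unlabelled" return value for the band between the two stated thresholds are assumptions for illustration; the text above names only the R < 1 and R ≥ 3 cases.

```python
def resolution_ratio(precision, smallest_worthwhile_change):
    """R = device precision / smallest worthwhile change.

    Both inputs must be in the same units, so R is unitless and
    comparable across metrics.
    """
    if smallest_worthwhile_change <= 0:
        raise ValueError("smallest worthwhile change must be positive")
    return precision / smallest_worthwhile_change


def interpret(r):
    """Map R onto the two thresholds stated in the text."""
    if r < 1:
        return "resolves a worthwhile change"
    if r >= 3:
        return "noise exceeds signal"
    # The text does not name the 1 <= R < 3 band; this label is a placeholder.
    return "unlabelled"


# Made-up example: precision of 2.0 units against a worthwhile change of 4.0 units.
r = resolution_ratio(2.0, 4.0)
print(r, interpret(r))
```

Because both inputs share a unit, R is a pure number, and a ratio computed for one metric can be set beside a ratio computed for another measured in entirely different units.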
Two scores sit under every grade, because one letter can’t answer two different questions. Accuracy (0–100) is how close the device gets to the criterion when measured; Confidence (0–100) is how much independent evidence backs that number. This mirrors the clinical GRADE split between an effect and the certainty of evidence. A marketed-but-unstudied claim (Grade D) has no accuracy score and near-zero confidence — an empty cell of knowledge, not a bad measurement. Cell shading reflects confidence: faint cells are weakly evidenced.
Out of scope: proprietary composite scores (Readiness, Recovery, Strain, Body Battery). They have no external criterion in physical units — nothing measures “readiness” — so there is nothing to validate them against. The Atlas grades only claims that can be checked against a gold standard.