GPS watches and metrics
Evidence: strong
Read every watch number as a trend within yourself over time, not as a precise absolute or a way to compare with others. Calories and sleep staging are the least trustworthy; the branded “readiness” scores carry false precision.
Keep it in proportion
Reading watch data well is worth the last few per cent, at most. It pays off only once the basics are already in place: consistent volume, sleep and adequate fuelling. Most runners gain far more from those than from anything on this page.
A running watch is both a genuine training tool and a generator of vanity metrics, and the value lies in telling the two apart. The governing idea, endorsed by a cardiology consensus review, is that consumer wearable signals should be read as trends within one individual over time, not as precise absolute values or as a basis for comparing one person to another (JACC State-of-the-Art Review 2023). Because a device’s error is often a fairly stable offset, a number that is wrong in absolute terms can still track real change in one person against their own baseline.
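The stable-offset argument can be sketched in a few lines. The numbers are made up for illustration, and the sketch assumes the device error is a constant additive offset, which real devices only approximate.

```python
def change_from_baseline(readings):
    """Return each reading minus the first one (the personal baseline)."""
    baseline = readings[0]
    return [r - baseline for r in readings]

# Hypothetical example: a fitness estimate that truly rises over four checks,
# read by a watch that is consistently 4 units low (assumed constant offset).
OFFSET = -4.0
true_values = [50.0, 50.5, 51.5, 53.0]
watch_values = [v + OFFSET for v in true_values]

true_trend = change_from_baseline(true_values)
watch_trend = change_from_baseline(watch_values)

# Every absolute reading is wrong, yet the change against baseline is identical.
print(watch_values)
print(watch_trend)
print(true_trend)
```

If the offset drifts, for example after a firmware update or a change of wrist, the trend is no longer clean, which is one reason to judge change over weeks rather than days.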
What watches measure well
- Average pace and distance over a normal run. GPS distance error for modern watches is typically around 3 to 6%, and can be under 3% at steady efforts on open ground (Gilgen-Ammann et al. 2020).
- Heart rate, with caveats. Optical wrist heart rate is within about 5% at rest and steady state (Shcherbina et al. 2017).
What they only estimate
- GPS distance under tree cover, in cities, and on tight bends, where it is systematically underestimated and noisier (Gilgen-Ammann et al. 2020). Instantaneous pace is far noisier than lap or average pace: each position fix is noisy at the metre scale, and dividing the difference between two such fixes by a short time window amplifies that noise. Use lap pace, not the jumping real-time figure.
- Wrist heart rate during intervals and hard efforts, where optical sensors lag and underestimate, with error worst during cycling and resistance work, and larger for darker skin tones at high intensity (Zhang et al. 2020; Hung et al. 2025). A chest strap is far better for interval work.
- Estimated VO₂max. Garmin’s figure comes from the FirstBeat algorithm, which works from the heart-rate-to-pace relationship (FirstBeat VO₂max white paper). The manufacturer states roughly 5% mean error; independent validation found around 7 to 8% overall, valid in moderately trained runners but poor in the highly trained, where it underestimates (Engel et al. 2025; Železnik Mežan 2025). A chest strap markedly improves it.
- Calories. The weakest common metric: no device tested achieved under 20% error, with a median around 27%, even while heart-rate error stayed small (Shcherbina et al. 2017). Treat calorie figures as fiction.
- Sleep staging. Wearables detect sleep well but wake poorly, overestimate total sleep time by 20 to 40 minutes, and get light, deep and REM staging substantially wrong (Schyvens et al. 2025). Total sleep trend, yes; the stage breakdown, no. See sleep.
- Lactate threshold and running power. Watch threshold estimates get heart rate roughly right but pace badly wrong, overestimating threshold pace by over 20% (Lu et al. 2025). Running “power” has no agreed criterion standard, so different brands report different watts for the same run and cannot be compared (Imbach et al. 2020).
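The point about instantaneous pace in the list above can be shown with a small simulation. It is a simplified sketch, not how any watch computes pace: it assumes a runner at a perfectly constant speed, one position fix per second, and independent Gaussian position noise. The speed, noise level and window lengths are hypothetical, and real watches filter and smooth in ways not modelled here.

```python
import random
import statistics

def windowed_speeds(positions, window_s):
    """Speed estimates (m/s) from position differences over window_s seconds.

    Assumes one position sample per second along the direction of travel.
    """
    return [
        (positions[i + window_s] - positions[i]) / window_s
        for i in range(len(positions) - window_s)
    ]

random.seed(1)          # fixed seed so the run is repeatable
TRUE_SPEED = 3.0        # m/s, hypothetical constant running speed
NOISE_SD = 3.0          # m, assumed position noise per fix
DURATION_S = 1800       # a 30-minute run

positions = [
    TRUE_SPEED * t + random.gauss(0.0, NOISE_SD)
    for t in range(DURATION_S + 1)
]

# Spread of the speed estimate for a 5 s window versus a 5 min window.
short_sd = statistics.stdev(windowed_speeds(positions, 5))
long_sd = statistics.stdev(windowed_speeds(positions, 300))

print(f"5 s window:   spread {short_sd:.3f} m/s")
print(f"300 s window: spread {long_sd:.3f} m/s")
```

Under these assumptions the noise in each speed estimate scales as the position noise times √2 divided by the window length, so the same position error that barely moves a lap average swamps a five-second figure.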
Read the manual
The accuracy figures above are best-case, and sloppy setup makes them worse. Most of that error is avoidable by matching the watch’s settings to how you actually run.
- Use the lap button, not the live pace field. Instantaneous pace is the noisiest number the watch shows. Press lap at known points, or set auto-lap, and judge effort from lap or average pace. For intervals, the lap button marks each rep cleanly; staring at jumping real-time pace mid-rep is chasing noise.
- Select the right activity profile. Track, trail, treadmill and road profiles apply different distance logic; a track mode that snaps to a 400 m lane, or treadmill mode that uses the accelerometer rather than absent GPS, is far more accurate than a generic outdoor run.
- Set the satellite mode deliberately. Multi-band or multi-GNSS positioning is markedly more accurate under tree cover and among buildings than the default single-band mode, at some battery cost. Pick it for technical routes; the GPS error you read about is largely the default setting underperforming.
- Pair a chest strap for hard sessions. Wrist optical heart rate lags and underreads during intervals; a paired chest strap fixes both the heart-rate trace and every metric built on it, including estimated VO₂max.
- Calibrate and update. Calibrate a foot pod or treadmill factor, keep the firmware current, and enter an accurate maximum and resting heart rate, because the zone and load models are only as good as those inputs.
- Know what each metric actually is. The branded scores below are models, not measurements; reading the manufacturer’s description of how one is derived is the difference between using it well and being misled by it.
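As a small illustration of the lap-button advice above, the sketch below turns lap records into per-rep paces. The `(distance_m, duration_s)` tuple layout and the rep values are hypothetical, not any watch's export format.

```python
def format_pace(seconds_per_km):
    """Format a pace in seconds per kilometre as m:ss /km."""
    minutes, seconds = divmod(round(seconds_per_km), 60)
    return f"{minutes}:{seconds:02d} /km"

def lap_paces(laps):
    """laps: list of (distance_m, duration_s) pairs, one per lap-button press."""
    return [
        format_pace(duration_s / (distance_m / 1000.0))
        for distance_m, duration_s in laps
    ]

# Hypothetical session: three 1 km reps marked with the lap button.
reps = [(1000.0, 245.0), (1000.0, 242.0), (1000.0, 248.0)]
print(lap_paces(reps))
```

Each figure averages over the whole rep, so a few metres of position error at either end changes it very little.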
The vanity-metric trap
“Training Load”, “Training Status”, “Body Battery”, “Recovery Time”, “Sleep Score” and “Training Readiness” are built mostly on heart-rate-variability, sleep and load models from FirstBeat (FirstBeat stress and recovery white paper). A “Sleep Score” inherits the staging errors above, so it is a rough nightly trend at best. The mechanisms are credible, but the branded absolute outputs have little independent validation, and a headline like “Recovery Time: 37 hours” carries false precision. Their value lies in comparison over time within one person, not in the exact number and not in comparison across people.
This connects directly to training monitoring and the idea of a smallest worthwhile change: a metric is only useful for a decision when its measurement error is smaller than the change you are trying to detect (Hopkins 2000). For most watch-estimated metrics that means watching multi-week trends and ignoring daily wobble, the same discipline that applies to HRV and to judging a shoe or supplement on yourself.
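A rough sketch of why multi-week trends beat daily readings follows. It assumes day-to-day noise is independent, so that averaging n days shrinks the noise by √n. The noise level and the size of the change are invented for illustration, and the simple "change must exceed the error" test is the rule of thumb from the paragraph above, not a formal statistical procedure.

```python
import math

def noise_of_difference(daily_noise_sd, n_days):
    """Noise (SD) in the difference between two n-day averages.

    Assumes independent daily noise: each average has SD sd/sqrt(n),
    and the difference of two such averages has SD sd*sqrt(2/n).
    """
    return daily_noise_sd * math.sqrt(2.0 / n_days)

def detectable(change, noise_sd):
    """Rule of thumb: a change is usable only if it exceeds the noise."""
    return abs(change) > noise_sd

DAILY_NOISE_SD = 2.0   # hypothetical day-to-day wobble, in the metric's units
CHANGE = 1.0           # hypothetical real change you hope to detect

# Comparing one day with another: the wobble hides the change.
single_day = detectable(CHANGE, noise_of_difference(DAILY_NOISE_SD, 1))

# Comparing one four-week average with the next: the change stands out.
four_week = detectable(CHANGE, noise_of_difference(DAILY_NOISE_SD, 28))

print(single_day, four_week)
```

Averaging only removes random wobble. A systematic shift, such as a new sensor position, passes straight through the average and looks like a real change.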