Yesterday we tested whether foreign markets that close before the US opens could improve our stop-loss management. The answer was nuanced — it helped some assets, hurt others, and the FTSE turned out to be the best single predictor of US direction at 63.7%.
But the reader who kicked off this series actually suggested two ideas. The first was global indices. The second was volume.
The logic: price is adversarial, volume is structural. If a pattern reliably predicted price, someone would trade it until it vanished. That's the efficient market hypothesis in action. But volume patterns don't get arbitraged away. High volume on options expiration isn't a "signal" someone can exploit into nonexistence — it's mechanical. Earnings days have high volume because companies report earnings, not because traders detected a pattern.
TimesFM's core architecture is trend extrapolation. That's exactly what failed on price — it learned to extrapolate trends that the market was actively working to destroy. But if volume actually has durable structural patterns, then trend extrapolation might be the right tool for the job.
The Setup
Same framework as our original backtest: 10 ETFs, 30-day walk-forward, 128 days of context, 5-day and 10-day forecasts. The only difference — we fed the model daily volume instead of daily price.
The question: did volume go up or down relative to the trailing 5-day average? Simple directional accuracy. No tricks.
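A minimal sketch of how this scoring could be wired up, assuming the forecast and the realized volume are each averaged over the horizon before being compared with the trailing 5-day average (the post does not spell out that detail). `forecast_fn` and `naive_forecast` are hypothetical stand-ins, not TimesFM calls.

```python
from statistics import mean

CONTEXT = 128  # days of history per forecast (from the setup)
STEPS = 30     # walk-forward steps (from the setup)
TRAIL = 5      # trailing window for the reference average


def walk_forward_windows(n_obs, horizon, context=CONTEXT, steps=STEPS):
    """Yield (context_start, origin, target_end) index triples, one per step."""
    first_origin = n_obs - horizon - steps + 1
    if first_origin < context:
        raise ValueError("series too short for this context/steps/horizon")
    for origin in range(first_origin, first_origin + steps):
        yield origin - context, origin, origin + horizon


def direction(value, reference):
    """+1 if value is above the reference level, otherwise -1."""
    return 1 if value > reference else -1


def directional_accuracy(series, forecast_fn, horizon):
    """Share of steps where forecast and outcome fall on the same side
    of the trailing average. forecast_fn(context, horizon) -> list of values."""
    hits = []
    for start, origin, end in walk_forward_windows(len(series), horizon):
        context = series[start:origin]
        reference = mean(context[-TRAIL:])
        predicted = mean(forecast_fn(context, horizon))  # assumption: horizon mean
        realized = mean(series[origin:end])
        hits.append(direction(predicted, reference) == direction(realized, reference))
    return sum(hits) / len(hits)


def naive_forecast(context, horizon):
    """Hypothetical placeholder model: repeat the last observed value."""
    return [context[-1]] * horizon
```

Swapping `naive_forecast` for a real model call leaves the scoring untouched, which is the point of keeping the "question" separate from the forecaster.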
Averaged across the 10 ETFs, 5-day directional accuracy came in at 59% on volume versus 47% on price. That's not a marginal gain. It's the difference between worse than random and meaningfully above it.
Per-Symbol Results
| ETF | 5d Accuracy | 10d Accuracy | Band Width |
|---|---|---|---|
| XAR | 70% | 67% | 96% |
| XLU | 70% | 77% | 56% |
| TIP | 70% | 60% | 98% |
| SPY | 67% | 60% | 65% |
| XLE | 60% | 53% | 79% |
| GLD | 57% | 67% | 93% |
| IWM | 57% | 53% | 75% |
| QQQ | 53% | 57% | 67% |
| ITA | 53% | 50% | 79% |
| USO | 33% | 37% | 80% |
Four ETFs cleared 65% on the 5-day horizon: XAR, XLU, TIP, and SPY. XLU hit 77% on 10-day volume, the single highest accuracy number we've recorded across any experiment. And look at the bottom: USO at 33%, still terrible. Oil does its own thing regardless of what you ask the model.
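As a quick sanity check, the summary figures can be recomputed from the table itself. The snippet below only re-keys the table values; the variable names are ours.

```python
from statistics import mean

# (5-day accuracy %, 10-day accuracy %) copied from the table above
ACCURACY = {
    "XAR": (70, 67), "XLU": (70, 77), "TIP": (70, 60), "SPY": (67, 60),
    "XLE": (60, 53), "GLD": (57, 67), "IWM": (57, 53), "QQQ": (53, 57),
    "ITA": (53, 50), "USO": (33, 37),
}

mean_5d = mean(a5 for a5, _ in ACCURACY.values())
mean_10d = mean(a10 for _, a10 in ACCURACY.values())
above_65_5d = [etf for etf, (a5, _) in ACCURACY.items() if a5 > 65]
best_etf = max(ACCURACY, key=lambda etf: max(ACCURACY[etf]))

print(f"mean 5d: {mean_5d}%, mean 10d: {mean_10d}%")
print("above 65% at 5d:", above_65_5d)
print("highest single number:", best_etf, max(ACCURACY[best_etf]))
```

The unweighted 5-day mean works out to the 59% headline figure, and the 10-day mean is close behind at 58.1%.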
Why This Works (and Price Doesn't)
Think about what TimesFM actually does. It looks at a sequence, finds the pattern, and extrapolates forward. When you give it price, it finds trends. But in an efficient market, any price trend that can be predicted gets traded away, so what's left is close to unpredictable. The model is trying to pattern-match against a system that punishes pattern matchers.
Volume is different. Volume has:
Seasonality. Options expiration days (monthly, quarterly) create predictable volume spikes.
Earnings cycles. Companies report on schedules. Volume rises before and during.
Open/close dynamics. The first and last 30 minutes of trading carry disproportionate volume, every single day.
Momentum clustering. High-volume days tend to cluster together. A big move yesterday means more trading today.
These patterns are structural. They exist because of how markets mechanically operate, not because of trader sentiment. A foundation model trained on "what comes next in sequences" can actually learn these patterns, because they're not being actively destroyed by people trading against them.
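One way to check the clustering claim on your own data is a lag-1 autocorrelation of daily volume: a clearly positive value means busy days tend to follow busy days. This is a minimal sketch on toy numbers, not the ETF volumes from the experiment.

```python
from statistics import mean


def lag_autocorrelation(values, lag=1):
    """Sample autocorrelation of a series at the given lag."""
    mu = mean(values)
    dev = [v - mu for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("constant series has no defined autocorrelation")
    num = sum(dev[i] * dev[i - lag] for i in range(lag, len(dev)))
    return num / denom


# Toy series: high-volume days arrive in runs vs. strictly alternating
clustered = [1, 1, 1, 1, 5, 5, 5, 5, 1, 1, 1, 1, 5, 5, 5, 5]
alternating = [1, 5] * 8

print(lag_autocorrelation(clustered))    # positive: runs persist
print(lag_autocorrelation(alternating))  # negative: no persistence
```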
TimesFM isn't bad. We were giving it the wrong question. A trend-extrapolation model fails on adversarial data (price) but succeeds on structural data (volume). The tool didn't change. The question did.
The Confidence Signal
Something interesting showed up in the quantile bands — the model's confidence measure.
On price, band width didn't correlate with accuracy. The model was equally wrong whether it was "confident" or uncertain. On volume, it did.
When the model is confident about volume, it's actually right more often. A 14-percentage-point gap between confident and uncertain predictions. On price, that gap was essentially zero. This means confidence on volume is a real signal — the model knows when it knows something, and when it doesn't.
For practical use: only trust the volume forecast when the band is narrow. When the model is uncertain, ignore it.
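A sketch of how that confident-versus-uncertain split could be measured. Two assumptions here: band width is taken as the spread between a low and a high quantile divided by the median forecast, and "narrow" means at or below the median width across all forecasts. The records are toy values, not our results.

```python
from statistics import median


def relative_band_width(q_low, q_mid, q_high):
    """Width of the forecast band as a fraction of the median forecast."""
    return (q_high - q_low) / q_mid


def share_correct(flags):
    """Fraction of True values, or None when the bucket is empty."""
    return sum(flags) / len(flags) if flags else None


def accuracy_by_confidence(records):
    """records: (band_width, was_correct) pairs, one per forecast.

    Returns (narrow_accuracy, wide_accuracy, cutoff), split at the median width.
    """
    cutoff = median(width for width, _ in records)
    narrow = [ok for width, ok in records if width <= cutoff]
    wide = [ok for width, ok in records if width > cutoff]
    return share_correct(narrow), share_correct(wide), cutoff


# Toy records for illustration only
toy = [(0.2, True), (0.3, True), (0.4, True), (0.5, False),
       (0.8, True), (0.9, False), (1.0, False), (1.1, False)]
narrow_acc, wide_acc, cutoff = accuracy_by_confidence(toy)
print(f"narrow: {narrow_acc:.0%}, wide: {wide_acc:.0%}, cutoff: {cutoff:.2f}")
```

If the two accuracies come out the same, as they did on price, the band carries no usable information and gating on it buys nothing.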
Where It Fails: Spike Detection
We also tested whether the model could predict volume spikes — moments when trading surges to 1.5x the trailing average. These are the events that often precede big price moves.
The result: zero spikes predicted out of 36 actual spikes.
The model never once said "volume is about to surge." That's 0% recall on the events that matter most. Perfect failure.
This makes sense architecturally. TimesFM is a smoother. It extrapolates trends, not discontinuities. Spikes are, by definition, the opposite of a trend — they're sudden departures from the pattern. Asking a trend model to predict trend breaks is asking it to do the one thing it's designed not to do.
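The architectural point can be shown with a toy example. Below, a trailing-mean forecaster stands in for "a smoother" (it is not TimesFM), a spike is anything above 1.5x the trailing average, and the 5-day trailing window is our assumption. A forecast that equals the trailing mean can never exceed 1.5 times that same mean, so recall is zero by construction.

```python
from statistics import mean

SPIKE_MULTIPLE = 1.5  # threshold from the text
TRAIL = 5             # assumption: length of the trailing window


def is_spike(value, history, multiple=SPIKE_MULTIPLE, trail=TRAIL):
    """True when value exceeds `multiple` times the trailing average."""
    return value > multiple * mean(history[-trail:])


def smoother_forecast(history, trail=TRAIL):
    """Stand-in smoother: predict the trailing average."""
    return mean(history[-trail:])


def spike_recall(series, start):
    """Return (spikes_caught, actual_spikes) for one-step forecasts from `start`."""
    caught = actual = 0
    for t in range(start, len(series)):
        history = series[:t]
        if is_spike(series[t], history):
            actual += 1
            if is_spike(smoother_forecast(history), history):
                caught += 1
    return caught, actual


toy_volume = [100, 100, 100, 100, 100, 100, 300, 100, 100, 100, 100, 100, 400]
print(spike_recall(toy_volume, start=5))
```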
If we want to predict volume surges, we need an event-driven model, not a sequence model. Earnings calendars, options expiration dates, and macro announcement schedules are deterministic causes of volume spikes. Feed those as inputs to a rule-based system instead of asking TimesFM to discover them from the volume history alone.
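A minimal sketch of what the calendar side of such a rule-based system might look like. Standard monthly options expire on the third Friday of the month, and the March, June, September, and December expirations are the quarterly ones. The function names and the `extra_events` parameter (for earnings or macro dates you supply yourself) are hypothetical, and exchange-holiday shifts are ignored.

```python
import calendar
from datetime import date

QUARTERLY_MONTHS = (3, 6, 9, 12)


def third_friday(year, month):
    """Third Friday of the month (standard monthly options expiration)."""
    fridays = [d for d in calendar.Calendar().itermonthdates(year, month)
               if d.month == month and d.weekday() == calendar.FRIDAY]
    return fridays[2]


def volume_event_flags(day, extra_events=frozenset()):
    """Return labels for scheduled events that tend to lift volume on `day`."""
    flags = []
    if day == third_friday(day.year, day.month):
        flags.append("quarterly_opex" if day.month in QUARTERLY_MONTHS
                     else "monthly_opex")
    if day in extra_events:  # e.g. earnings or macro dates supplied by the caller
        flags.append("scheduled_event")
    return flags


print(volume_event_flags(date(2024, 3, 15)))
print(volume_event_flags(date(2024, 2, 16)))
print(volume_event_flags(date(2024, 3, 14)))
```

Nothing here is learned from data. The dates are known in advance, which is exactly why a sequence model is the wrong place to look for them.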
We Also Tried the Model on Global Indices
Separately from the volume experiment, we tested the second approach from yesterday's entry: running TimesFM on all 11 international indices and building consensus signals.
Three methods. All three either matched baseline or underperformed.
| Method | Accuracy | vs Baseline |
|---|---|---|
| ETF only (baseline) | 47.0% | — |
| Global consensus override | 41.7% | -5.3 pp |
| Sector consensus override | 47.0% | 0.0 pp |
| Majority vote | 37.3% | -9.7 pp |
The global consensus — "if most international indices forecast bearish, go bearish on US" — actually made things worse. And the majority vote (combining ETF forecast, global consensus, and sector consensus) was the worst at 37.3%.
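For readers who want the combination rules spelled out, here is one plausible reading of them, with signals encoded as +1 (bullish) and -1 (bearish). The post only states the bearish side of the override, so leaving the ETF forecast untouched in every other case is our assumption, as are the function names.

```python
def consensus(signals):
    """Majority direction of a list of +1/-1 signals; 0 on a tie."""
    total = sum(signals)
    return (total > 0) - (total < 0)


def consensus_override(etf_signal, index_signals):
    """Go bearish when most indices forecast bearish; otherwise keep the
    ETF's own forecast (assumption: no bullish override)."""
    return -1 if consensus(index_signals) < 0 else etf_signal


def majority_vote(etf_signal, global_signal, sector_signal):
    """Combine the ETF forecast with the global and sector consensus."""
    return consensus([etf_signal, global_signal, sector_signal])


# 11 index forecasts, 7 of them bearish, flip a bullish ETF call
print(consensus_override(+1, [-1] * 7 + [+1] * 4))
print(majority_vote(+1, -1, -1))
```

Both rules can only overwrite the ETF forecast with someone else's opinion, so when those opinions are themselves noisy model outputs, accuracy has nowhere to go but sideways or down.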
The lesson from yesterday holds: the raw overnight return data carries more information than the model's interpretation of it. Simple rules on actual index returns (yesterday's Experiment 4B) outperformed model-based consensus on the same data. Sometimes the answer isn't a better model. It's no model.
What This Means for Tauntaun
Two things go into the playbook from these experiments:
1. Volume as a position sizing signal. When TimesFM predicts volume will rise (with narrow confidence bands), we know the market is entering an active phase. That's useful for sizing — be bolder when volume momentum is building, more cautious when it's fading. This isn't a directional trade signal. It's a conviction multiplier.
2. Different tools for different questions. We're not going to use TimesFM for price direction. That experiment is conclusively negative. But for volume forecasting, it has a genuine edge. And for overnight sentiment, simple rules beat the model. The system that emerges is hybrid — rules where rules work, models where models work, and nothing where nothing works.
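To make the first item concrete, a sizing rule along these lines is one option. Every number in it (the band cutoff, the 1.25x boost, the 0.75x cut) is a hypothetical placeholder rather than a tuned Tauntaun parameter.

```python
def conviction_multiplier(volume_direction, band_width,
                          narrow_cutoff=0.6, boost=1.25, cut=0.75):
    """Scale position size by the volume forecast, but only when the band is narrow.

    volume_direction: +1 if volume is forecast to rise, -1 if forecast to fall.
    band_width: relative width of the forecast band (smaller = more confident).
    """
    if band_width > narrow_cutoff:
        return 1.0  # model is uncertain: ignore the volume forecast
    return boost if volume_direction > 0 else cut


base_size = 100  # hypothetical base position, in shares
print(base_size * conviction_multiplier(+1, band_width=0.5))
print(base_size * conviction_multiplier(-1, band_width=0.5))
print(base_size * conviction_multiplier(+1, band_width=0.9))
```

The multiplier never changes the sign of a position. Direction still has to come from somewhere else.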
Tomorrow
The final entry in this series. We'll bring all five experiments together: what worked, what didn't, and what actually ships into Tauntaun. The complete scorecard.
Three days, six experiments, 1,800+ forecasts. One reader suggestion that turned into the most productive research sprint we've had. That's how this works — you build in the open, someone sees something you missed, and everyone gets smarter. 🧊
Data: 10 US ETFs · 30-day walk-forward · 128-day context · 300 volume forecasts
Scripts: exp3_volume.py · exp4a_global_xreg.py
Results: timesfm_exp3_volume.json · timesfm_exp4a_global.json