The Morning Video
I was watching a YouTube video this morning, one of those deep-dive geopolitics lectures where a professor in Beijing breaks down what's happening in the world through the lens of history, game theory, and structural analysis. He's been making predictions for months. Big ones. War escalation timelines. Energy chokepoints. Alliance shifts. The kind of analysis you can't get from a news feed because it requires 2,000 years of context.
Halfway through, I had a thought: this guy is basically a signal source.
Not in the loose, "interesting perspective" sense. In the literal, "these are falsifiable predictions about geopolitical events that have direct market implications" sense. He's saying things like "Iran will strike GCC oil infrastructure" and "Germany is on a path to national conscription": concrete claims with timeframes and confidence levels baked into his language.
So we built a pipeline to turn his lectures into trading signals. In one morning.
The Pipeline
Here's what we built, start to finish:
Step 1: Detect New Videos
YouTube exposes a free Atom RSS feed for every channel. No API key needed. We poll it every few hours. When a new video appears, we grab its ID, title, publish date, and description. The publish date matters; we'll come back to that.
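A minimal sketch of the feed-parsing half of that step, using only the standard library. The function name `parse_feed`, the dictionary keys, and the sample feed are illustrative assumptions, not the project's actual code. The feed URL pattern and the `yt:` and `media:` namespaces are what YouTube's channel feeds use as far as I know, so check them against a real feed before relying on this.

```python
import xml.etree.ElementTree as ET

# Channel feed URL pattern; the channel ID is a placeholder to fill in.
FEED_URL = "https://www.youtube.com/feeds/videos.xml?channel_id={channel_id}"

# XML namespaces that appear in the feed (Atom, YouTube, Media RSS).
NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "yt": "http://www.youtube.com/xml/schemas/2015",
    "media": "http://search.yahoo.com/mrss/",
}


def parse_feed(xml_text):
    """Return one dict per <entry>: ID, title, publish date, description."""
    root = ET.fromstring(xml_text)
    videos = []
    for entry in root.findall("atom:entry", NS):
        videos.append({
            "video_id": entry.findtext("yt:videoId", default="", namespaces=NS),
            "title": entry.findtext("atom:title", default="", namespaces=NS),
            "published": entry.findtext("atom:published", default="", namespaces=NS),
            "description": entry.findtext(
                "media:group/media:description", default="", namespaces=NS
            ),
        })
    return videos


# Hand-written sample with placeholder values, shaped like one feed entry.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:yt="http://www.youtube.com/xml/schemas/2015"
      xmlns:media="http://search.yahoo.com/mrss/">
  <entry>
    <yt:videoId>VIDEO_ID_1</yt:videoId>
    <title>Example lecture title</title>
    <published>2025-01-01T00:00:00+00:00</published>
    <media:group>
      <media:description>Example description</media:description>
    </media:group>
  </entry>
</feed>"""

if __name__ == "__main__":
    for video in parse_feed(SAMPLE_FEED):
        print(video["video_id"], video["published"], video["title"])
```

Fetching the feed itself is left out; any HTTP client can download the XML and hand the text to `parse_feed`.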
Step 2: Pull Transcripts
YouTube auto-generates captions for most videos. The youtube-transcript-api Python library pulls them directly: no YouTube Data API key, no OAuth, no cost. One function call, full transcript. The professor's latest video came back as 43,000 characters of raw text.
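A rough sketch of the step that turns caption segments into one block of text. The library's entry point has changed between releases, so the fetch call is passed in as a function here instead of being hard-coded. `pull_transcript`, `segments_to_text`, and the assumption that each segment is a dict with a "text" key are illustrative, not the library's documented contract for every version.

```python
def segments_to_text(segments):
    """Join caption segments (assumed: dicts with a 'text' key) into one string."""
    parts = (seg["text"].replace("\n", " ").strip() for seg in segments)
    return " ".join(part for part in parts if part)


def pull_transcript(video_id, fetch_segments):
    """Fetch segments with the injected callable and flatten them.

    fetch_segments is a hypothetical adapter: a function that takes a video ID
    and returns the caption segments using whichever call your installed
    version of youtube-transcript-api provides.
    """
    return segments_to_text(fetch_segments(video_id))
```

Injecting the fetcher also makes the step easy to test with canned segments, without touching the network.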
Step 3: Clean the Accent
Here's a problem nobody warns you about: auto-captions plus a non-native accent equals creative misspellings. The professor teaches from Beijing, and YouTube's speech-to-text has some opinions about what he's saying.
"The Strait of Hormuz" becomes "the Humus." "Clausewitz" becomes "close wits." "Pax Judaica" becomes "packs Judaica." "Xi Jinping" has at least three different spellings depending on the video.
So we built a corrections layer: a dictionary of regex patterns that fixes known mistranscriptions before the text hits the extractor. It's got entries for world leaders, historical figures, geopolitical terms, military vocabulary. The dictionary grows over time as we spot new patterns. 22 of the suite's 47 unit tests verify it works.
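A minimal sketch of such a corrections layer, seeded with the mistranscriptions mentioned above. The names `CORRECTIONS` and `clean_transcript` are assumptions, and the real dictionary is certainly larger. Word boundaries keep the patterns from firing inside other words, which is what lets "hummus" pass through untouched.

```python
import re

# Pattern -> replacement. Order matters: the longer "the humus" rule runs
# before the bare "humus" rule so the full phrase is restored first.
CORRECTIONS = {
    r"\b(the) humus\b": r"\1 Strait of Hormuz",
    r"\bhumus\b": "Hormuz",
    r"\bclose wits\b": "Clausewitz",
    r"\bpacks judaica\b": "Pax Judaica",
    r"\bshe jin ping\b": "Xi Jinping",
}

# Compile once; captions are inconsistent about case, so match case-insensitively.
COMPILED = [
    (re.compile(pattern, re.IGNORECASE), replacement)
    for pattern, replacement in CORRECTIONS.items()
]


def clean_transcript(text):
    """Apply every known correction, in dictionary order."""
    for pattern, replacement in COMPILED:
        text = pattern.sub(replacement, text)
    return text
```

Adding a newly spotted mistranscription is then a one-line change to the dictionary plus a test for it.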
Step 4: Extract Predictions
This is where it gets interesting. We send the cleaned transcript to Claude with a structured extraction prompt. The prompt tells it: you are a geopolitical signal extraction engine. Find falsifiable predictions with market implications. For each one, give me the prediction, a confidence score, a timeframe, a category, affected regions, and specific ETF trades.
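One way to keep the extractor honest is to validate its output against a fixed schema before anything downstream reads it. The sketch below assumes the model returns a JSON array of objects. The field names, the `Prediction` class, and `parse_predictions` are hypothetical, and the allowed timeframes are the three mentioned later in this post.

```python
import json
from dataclasses import dataclass

ALLOWED_TIMEFRAMES = {"days", "weeks", "months"}


@dataclass(frozen=True)
class Prediction:
    prediction: str
    confidence: float
    timeframe: str
    category: str
    regions: tuple
    etfs: tuple


def parse_predictions(raw_json, etf_universe):
    """Parse extractor output, rejecting anything outside the contract."""
    predictions = []
    for item in json.loads(raw_json):
        confidence = float(item["confidence"])
        if not 0.0 <= confidence <= 1.0:
            raise ValueError(f"confidence out of range: {confidence}")
        if item["timeframe"] not in ALLOWED_TIMEFRAMES:
            raise ValueError(f"unknown timeframe: {item['timeframe']}")
        unknown = set(item["etfs"]) - set(etf_universe)
        if unknown:
            raise ValueError(f"ETFs outside the universe: {sorted(unknown)}")
        predictions.append(Prediction(
            prediction=item["prediction"],
            confidence=confidence,
            timeframe=item["timeframe"],
            category=item["category"],
            regions=tuple(item["regions"]),
            etfs=tuple(item["etfs"]),
        ))
    return predictions
```

Failing loudly on an unknown ticker or an out-of-range score is a deliberate choice here: a malformed extraction is cheaper to catch at this point than inside the trading pipeline.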
The result from one 54-minute video:
| Category | Prediction | Confidence | ETF Signals |
|---|---|---|---|
| Energy | Strait of Hormuz closure → loss of 20% global energy supply | 0.70 | USO, XLE, UNG, GLD, VXX, SPY |
| Military | US full-scale ground invasion with national draft | 0.80 | ITA, XAR, GLD, SPY |
| Military | Germany implements national draft for males 17-45 | 0.60 | VEA, GLD |
| Trade War | De-industrialization from energy loss, global trade collapse | 0.80 | XLI, SPY, VEA |
| Crisis | Fertilizer shortage → famine affecting billions | 0.70 | GLD, SPY, VXX |
Five predictions. Eighteen ETF-level signals. All from a single lecture. And that's just the most recent video. We back-processed 15 lectures and pulled 57 total predictions.
The Hard Part: Not Treating Every Prediction the Same
Raw extraction is the easy part. The real engineering is in what happens next.
Problem #1: Predictions go stale. A prediction from March 5th about "what happens this week" shouldn't carry the same weight on April 7th. So every prediction has a timeframe ("days", "weeks", "months") and the bridge applies exponential decay: confidence halves every 14 days. A bold 0.9 call from three weeks ago naturally fades to a whisper. A fresh 0.7 from today rings loud.
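The decay rule is small enough to write down. This is a sketch of the half-life math as described, not the bridge's actual code, and the function and constant names are assumptions.

```python
from datetime import date

HALF_LIFE_DAYS = 14.0  # confidence halves every 14 days


def decayed_confidence(confidence, published, today):
    """Exponential decay anchored on the video's publish date."""
    age_days = max((today - published).days, 0)
    return confidence * 0.5 ** (age_days / HALF_LIFE_DAYS)
```

Before any source scaling, a 0.9 call that is 21 days old works out to 0.9 × 0.5^1.5, roughly 0.32, while a 0.7 call from today stays at 0.7.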
Problem #2: One source shouldn't dominate. The professor is very confident. His predictions frequently come in at 0.8 or 0.9. If we passed those straight through, a single YouTube video could outweigh FRED, RSS, Kalshi, and Credit Spreads combined. So we apply a scaling factor (0.6x) that compresses the range without destroying it. His 0.9 becomes 0.54. His 0.5 becomes 0.30. The ranking is preserved (his strongest calls still outweigh his weaker ones), but no single lecture hijacks the strategy engine.
Problem #3: Corroboration matters. A professor predicting energy disruption is interesting. A professor predicting energy disruption while GDELT is showing a sanctions spike, RSS feeds are lighting up with Hormuz headlines, and Kalshi markets are pricing hot inflation: that's a convergence event. The fusion engine handles this automatically. When the professor's signals align with other sources, confidence compounds. When he's an outlier, he stays an outlier.
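This post doesn't spell out how the fusion engine compounds confidence. One common way to get that behavior is a noisy-OR combination, sketched below purely as an illustration. The function name and the choice of noisy-OR are assumptions, not a description of the real engine.

```python
def fuse_noisy_or(confidences):
    """Combine independent signals for the same ETF and direction.

    Each source leaves (1 - c) of doubt; the doubts multiply, so agreement
    pushes the fused value up while a lone signal is returned unchanged.
    """
    remaining_doubt = 1.0
    for c in confidences:
        if not 0.0 <= c <= 1.0:
            raise ValueError(f"confidence out of range: {c}")
        remaining_doubt *= 1.0 - c
    return 1.0 - remaining_doubt
```

With made-up inputs, a lone 0.42 stays at 0.42, while 0.42 backed by two other sources at 0.30 and 0.20 fuses to about 0.68. That matches the qualitative rule above: alignment compounds, an outlier stays an outlier.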
Here's what the signal spread actually looks like after scaling + decay:
| Confidence | What It Means | Examples |
|---|---|---|
| 65% | Multiple recent predictions agree, strong corroboration | USO LONG, GLD LONG, VXX LONG |
| 55% | Strong single prediction, recent video | ITA LONG (defense buildup) |
| 42-50% | Moderate confidence, some age decay | UNG LONG, XAR LONG |
| 21-29% | Older predictions fading, or lower initial confidence | URA LONG, SPY LONG (contrarian) |
That spread is the whole point. It's not "the professor says buy gold." It's "the professor has made 4 predictions across 3 videos over the last 2 weeks that all point to gold as a safe haven, the most recent was 3 days ago, and his confidence is reinforced by GDELT sanctions spikes and Kalshi inflation pricing." That's a fused signal with provenance.
What Kind of Signal Is This?
Every source in the system fits a category. FRED is economic data. RSS is narrative detection. Credit Spreads are an institutional fear gauge. Google Trends is behavioral.
The professor is something different. He's a structural analyst. He doesn't react to today's headline; he reads today's headline through the lens of the Peloponnesian War, the collapse of the Ottoman Empire, and the game theory of asymmetric warfare. His predictions have longer time horizons than a news cycle and deeper context than a data series.
That's exactly what the system was missing. Eight sources that are all reactive: they read what is happening. Source #9 reads what happened 500 years ago and tells you what happens next.
The Architecture
Important detail: the professor's signals don't add any latency to the trading pipeline. Here's why.
Tauntaun runs every 30 minutes. Zero LLM tokens in the hot path. All eight existing sources are deterministic: pull data, apply rules, emit signals. The professor's pipeline works the same way:
1. A cron job checks YouTube RSS for new videos
2. New video? Pull transcript, clean it, run extraction (this uses Claude, but it runs offline, not during pipeline execution)
3. Extracted predictions are saved to a JSON file
4. At pipeline runtime, the bridge reads the JSON. No API calls. No LLM. Just a file read with decay math.
The pipeline still runs in seconds. The professor's signals are just there, pre-computed, waiting to be fused with everything else.
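Here is roughly what a runtime bridge like that could look like: a file read followed by arithmetic. Everything specific in it is an assumption made for the sketch: the JSON layout, the `published` field stored as `YYYY-MM-DD`, the expiry windows in `MAX_AGE_DAYS`, and deduplicating by keeping the strongest signal per ETF.

```python
import json
from datetime import date

SOURCE_SCALE = 0.6      # per-source scaling factor from the post
HALF_LIFE_DAYS = 14.0   # confidence halves every 14 days

# Hypothetical expiry windows: drop a prediction once it is older than this.
MAX_AGE_DAYS = {"days": 7, "weeks": 30, "months": 120}


def build_signals(predictions, today):
    """Turn stored predictions into {etf: strength}, with decay and expiry."""
    signals = {}
    for p in predictions:
        published = date.fromisoformat(p["published"])
        age_days = max((today - published).days, 0)
        if age_days > MAX_AGE_DAYS[p["timeframe"]]:
            continue  # timeframe expired
        strength = (
            p["confidence"] * SOURCE_SCALE * 0.5 ** (age_days / HALF_LIFE_DAYS)
        )
        for etf in p["etfs"]:
            # Deduplicate: keep only the strongest live signal per ETF.
            signals[etf] = max(signals.get(etf, 0.0), strength)
    return signals


def load_signals(path, today):
    """Runtime entry point: read the pre-computed JSON, no network, no LLM."""
    with open(path, encoding="utf-8") as fh:
        return build_signals(json.load(fh), today)
```

Because `today` is passed in rather than read from the clock, the same function can be replayed against any date in a test.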
47 Tests
We built this test-first. The test suite covers:
• Corrections layer: 22 tests. Every known mistranscription has a test. "Humus" → "Hormuz". "She jin ping" → "Xi Jinping". False positive protection (the word "hummus" should survive uncorrected).
• RSS parsing: Feed structure, state management, idempotent re-processing.
• Extraction contract: ETF universe sync with Tauntaun's config, prompt structure, output schema validation.
• Bridge data contract: Confidence decay math, publish-date anchoring, timeframe expiry, signal deduplication.
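The idempotent re-processing in that list could rest on something as simple as a state file of already-seen video IDs. The sketch below is one way to do it; the file format and the names `load_seen`, `select_new`, and `mark_processed` are assumptions for illustration.

```python
import json
import os


def load_seen(state_path):
    """Return the set of video IDs already processed (empty on first run)."""
    if not os.path.exists(state_path):
        return set()
    with open(state_path, encoding="utf-8") as fh:
        return set(json.load(fh))


def select_new(videos, seen):
    """Keep only feed entries whose ID has not been processed yet."""
    return [v for v in videos if v["video_id"] not in seen]


def mark_processed(state_path, seen, videos):
    """Persist the updated ID set so the next poll skips these videos."""
    updated = seen | {v["video_id"] for v in videos}
    with open(state_path, "w", encoding="utf-8") as fh:
        json.dump(sorted(updated), fh)
    return updated
```

Running the same feed through `select_new` twice, with `mark_processed` in between, yields nothing the second time. That is the property an idempotency test would assert.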
Every test passed on the first full run. The system is live.
What Happens Next
The professor posts roughly twice a week. Each video gets auto-detected, transcribed, cleaned, extracted, and fused into the next pipeline run. The corrections dictionary will grow as we find new mistranscriptions. The extraction prompt will evolve as we learn what kinds of predictions are most valuable.
And somewhere in Beijing, a professor is giving lectures to high school students about the fall of empires and the game theory of war, completely unaware that his words are being parsed into ETF signals by a system running on a Mac Mini in California.
Source count: 9.