CASE STUDY // DATA PIPELINE

FOREX DATA PIPELINE

Automated FX market data ingestion — economic calendar scraping to session dashboards

4 FX Sessions
Economic Calendar
Automated Ingestion
Session Overlap Detection

> the_problem

Forex data is genuinely scattered. Economic calendars live on ForexFactory — behind a web interface designed for human browsing, not programmatic access. Session timings exist in blog posts and spreadsheets. Volatility windows get passed around as rules of thumb.

There was no single source of truth for when to trade and what events to watch. Every decision required tabbing between ForexFactory, a timezone converter, and a manually updated spreadsheet.

The data problem wasn't lack of data — it was fragmentation. ForexFactory has excellent event data. Session overlap calculations are pure math. Neither was hard in isolation. The gap was a pipeline that pulled it all together into a unified, queryable, always-current view.

> my_approach

Three-layer architecture, each layer with a single responsibility:

  • Layer 1 // ForexFactory Scraper: scrapes economic events with impact level (high/medium/low), affected currency, actual vs forecast vs previous values, and event time. Handles pagination across weekly views. Respectful rate limiting prevents bans; session handling maintains cookies across requests.
  • Layer 2 // Normalization Pipeline: raw scraped data lands in a staging table. The pipeline then normalizes it: timezone conversion to UTC, impact level classification, currency pair tagging, null handling for unreleased actual values. Output is a clean events table ready for querying.
  • Layer 3 // Session Dashboard: computes live session status (open/closed), overlap windows (London/NY is the key one — 70% of daily volume), and surfaces upcoming high-impact events in the next N hours. Built to be queried, not just displayed.

> architecture

┌──────────────────────────────────────────────┐
│               FOREXFACTORY.COM               │
│  Weekly calendar pages (impact, currency,    │
│  actual, forecast, previous, event time)     │
└──────────────────────┬───────────────────────┘
                       │  HTTP + session cookies
                       ▼
┌──────────────────────────────────────────────┐
│                SCRAPER LAYER                 │
│  BeautifulSoup HTML parser                   │
│  Rate-limited requests (respectful crawling) │
│  Session handling + weekly pagination        │
└──────────────────────┬───────────────────────┘
                       │  raw event rows
                       ▼
┌──────────────────────────────────────────────┐
│            NORMALIZATION PIPELINE            │
│  UTC timezone conversion                     │
│  Impact level classification (H/M/L)         │
│  Null handling for unreleased actuals        │
│  Currency pair tagging                       │
└──────────┬─────────────────────┬─────────────┘
           │                     │
           ▼                     ▼
┌────────────────────┐   ┌─────────────────────┐
│   PostgreSQL /     │   │   SESSION ENGINE    │
│   Storage Layer    │   │  Sydney / Tokyo /   │
│   (events table)   │   │  London / New York  │
└──────────┬─────────┘   │  overlap detection  │
           │             │  DST-aware UTC math │
           │             └──────────┬──────────┘
           └───────────┬────────────┘
                       ▼
┌──────────────────────────────────────────────┐
│              SESSION DASHBOARD               │
│  Live session status (open / closed)         │
│  Overlap windows highlighted                 │
│  Upcoming high-impact events (next N hrs)    │
│  Volatility window alerts                    │
└──────────────────────────────────────────────┘

> hard_challenges

01 // ForexFactory anti-scraping

ForexFactory doesn't publish an API. The site uses JavaScript-rendered tables in some views and standard HTML in others, inconsistently across weeks. The solution: always request the HTML calendar view directly (not the AJAX endpoint), maintain persistent sessions with realistic headers, and apply per-request delays with jitter. Aggressive crawling triggers 429s; respectful crawling doesn't. This has held stable across multiple months of operation.
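A sketch of that crawling discipline using requests. The URL, query parameter, and header values below are placeholders for illustration, not ForexFactory's documented interface:

```python
import random
import time

import requests

# Illustrative endpoint and headers; not a documented interface.
BASE = "https://www.forexfactory.com/calendar"
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept-Language": "en-US,en;q=0.9",
}

def fetch_week(session: requests.Session, week: str,
               base_delay: float = 4.0, jitter: float = 2.0) -> str:
    """Fetch one weekly HTML calendar view, politely."""
    # Per-request delay with jitter so the access pattern isn't metronomic.
    time.sleep(base_delay + random.uniform(0.0, jitter))
    resp = session.get(BASE, params={"week": week},
                       headers=HEADERS, timeout=30)
    resp.raise_for_status()  # surface 429s instead of parsing an error page
    return resp.text

# One persistent Session keeps cookies across weekly pages:
# with requests.Session() as s:
#     html = fetch_week(s, "jan1.2024")
```

The persistent `Session` is what carries cookies across the weekly pagination; a fresh connection per request looks far more bot-like.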
02 // Timezone hell — DST, UTC offsets, forex session times

Forex session open/close times are defined in local time (London opens at 08:00 London time), but DST shifts that UTC offset twice a year. North American and European DST don't switch on the same day, creating a brief annual window where the London/NY overlap is miscalculated if you use fixed offsets. The fix: store everything in UTC and compute all session boundaries dynamically from pytz/zoneinfo-aware datetimes — never hardcoded UTC offsets.
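That computation looks roughly like this with the standard-library zoneinfo module, assuming the common 08:00–17:00 local convention for both sessions (the write-up only states London's 08:00 open, so the NY hours here are an assumption):

```python
from datetime import datetime, time, timezone
from zoneinfo import ZoneInfo

def london_ny_overlap_utc(day: datetime) -> tuple[datetime, datetime]:
    """London/NY overlap for a given day, derived from aware local
    times rather than fixed UTC offsets, so DST comes out right."""
    london = ZoneInfo("Europe/London")
    new_york = ZoneInfo("America/New_York")
    # Overlap start: New York opens (08:00 local); end: London closes
    # (17:00 local). Each is converted to UTC independently.
    ny_open = datetime.combine(day.date(), time(8, 0), tzinfo=new_york)
    london_close = datetime.combine(day.date(), time(17, 0), tzinfo=london)
    return ny_open.astimezone(timezone.utc), london_close.astimezone(timezone.utc)
```

On a January date this yields a 13:00–17:00 UTC window; on a July date, 12:00–16:00 UTC. During the few days a year when only one region has switched, the window shifts again, which is exactly what fixed offsets get wrong.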
03 // Reliable impact level classification from scraped HTML

ForexFactory encodes impact level as a CSS class on a bull icon element — not as text. The classes are: "high", "medium", "low", and "holiday". These CSS class names have been stable, but the surrounding DOM structure has changed at least twice. The scraper targets the icon element specifically and maps class names to enum values, with a fallback to "unknown" and an alert when a new class appears — making future DOM changes detectable immediately.
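The fallback mapping can be sketched like this with BeautifulSoup. The `impact-icon` selector and the exact class names are illustrative of the approach, not guaranteed to match ForexFactory's current markup:

```python
from bs4 import BeautifulSoup

# Known CSS class names -> impact levels; anything else maps to
# "unknown" so a DOM change is detected rather than misclassified.
IMPACT_CLASSES = {
    "high": "high",
    "medium": "medium",
    "low": "low",
    "holiday": "holiday",
}

def classify_impact(row_html: str) -> str:
    """Classify one event row's impact from its icon element."""
    soup = BeautifulSoup(row_html, "html.parser")
    icon = soup.find("span", class_="impact-icon")  # hypothetical selector
    if icon is None:
        return "unknown"  # icon element moved or renamed: alert upstream
    for cls in icon.get("class", []):
        if cls in IMPACT_CLASSES:
            return IMPACT_CLASSES[cls]
    return "unknown"  # new, unrecognized class: alert upstream
```

For example, `classify_impact('<span class="impact-icon high"></span>')` returns `"high"`, while an unrecognized class falls through to `"unknown"` and triggers the alert path.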

> results

4 FX Sessions Tracked // Sydney, Tokyo, London, New York — live open/close status
3 Impact Levels // High / Medium / Low — classified from ForexFactory HTML
Auto Economic Calendar // Weekly ForexFactory scrape — no manual updates required
Live Session Overlap Alerts // London/NY overlap surfaced — highest-volume window

> lessons_learned

Timezone handling in finance is genuinely hard. The correct approach is non-negotiable: store everything in UTC, convert to local time only at the display layer. Any shortcut — fixed UTC offsets, local time storage, "it works most of the year" logic — will fail during DST transitions at exactly the moment you're watching a high-impact news release.

London/New York session overlap (13:00–17:00 UTC in winter, 12:00–16:00 UTC when both regions are on summer time) accounts for roughly 70% of daily FX volume. This window deserves its own UI emphasis — it's not just another overlap, it's the trading session.

Scraping doesn't have to be adversarial. Respectful rate limiting, real session handling, and targeting stable HTML structures (not JavaScript-rendered state) produces scrapers that run for months without breaking. The investment in resilience upfront beats debugging silent failures later.

The dashboard's value isn't in any individual data point — it's in having all three layers (events, sessions, overlaps) queryable in one place. The integration is the product.