Every forex and macro trader knows ForexFactory's economic calendar — it's the industry standard for tracking high-impact events (NFP, CPI, FOMC, etc.). But it has no official API. The data exists only as HTML rendered by their website, with actual, forecast, and previous values updating in real time as events release.
We needed this data in PostgreSQL — structured, clean, timestamped, and historically preserved — to feed into the Forex Data Pipeline and correlate economic events with price action. Buying a third-party calendar API would cost $200–500/mo and lock us into their schema.
Built a semantic HTML parser specifically for ForexFactory's calendar structure. ForexFactory renders events as HTML tables in its day, week, and month views, but each view's markup differs — the semantic parser extracts events as the same structured objects regardless of which view is scraped.
- Three scraping modes: day (realtime, lightweight), week (daily sync, broader coverage), month (backfill, historical reconstruction).
- UPSERT, not INSERT: Every event has a unique event_uid. If an event already exists, only update if actual/forecast/previous values have changed — this handles ForexFactory's live updates as events release.
- Timezone-first design: ForexFactory displays in US/Eastern. All times are converted to UTC on ingest. A timezone field is stored alongside UTC timestamp for auditability.
- Full audit trail: sync_log records every scrape operation — what was fetched, what was upserted, what was skipped, duration, and any errors. Debugging a missed event means querying sync_log, not trawling server logs.
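A minimal sketch of what one sync_log record might carry. The table name comes from the text above; the individual column names (`mode`, `fetched`, `upserted`, `skipped`, `duration_ms`, `errors`) are assumptions for illustration, not the project's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class SyncLogEntry:
    """One scrape operation's audit record (hypothetical field names)."""
    mode: str                      # "day" | "week" | "month"
    started_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))
    fetched: int = 0               # rows parsed out of the HTML
    upserted: int = 0              # rows inserted or updated
    skipped: int = 0               # rows unchanged, left alone
    errors: list = field(default_factory=list)

    def finish(self) -> dict:
        """Return the row to write to sync_log once the scrape completes."""
        duration_ms = int(
            (datetime.now(timezone.utc) - self.started_at).total_seconds() * 1000)
        return {
            "mode": self.mode,
            "started_at": self.started_at.isoformat(),
            "fetched": self.fetched,
            "upserted": self.upserted,
            "skipped": self.skipped,
            "duration_ms": duration_ms,
            "errors": self.errors,
        }
```

Answering "why is this event wrong?" then becomes a single query filtered on the event's sync window rather than a grep through server logs.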
ForexFactory blocks aggressive scrapers. Solution: respectful rate limiting (1 request per 15 minutes in realtime mode), proper User-Agent headers, and session reuse. We've never been blocked — the key is mimicking a human browsing pattern, not hammering the endpoint.
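The rate-limiting half of that can be sketched as a hard floor between requests — a sketch only, with the 900-second interval taken from the realtime mode described above:

```python
class RateLimiter:
    """Enforce a minimum interval between scrape requests.

    Times are passed in explicitly (e.g. from time.monotonic()) so the
    logic is testable without sleeping.
    """

    def __init__(self, min_interval_s: float = 15 * 60):
        self.min_interval_s = min_interval_s
        self._last_request = None

    def seconds_to_wait(self, now: float) -> float:
        """How long to sleep before the next request is allowed."""
        if self._last_request is None:
            return 0.0
        return max(0.0, self.min_interval_s - (now - self._last_request))

    def record(self, now: float) -> None:
        """Mark that a request was just issued."""
        self._last_request = now


# A single, stable browser-like User-Agent reused across the session;
# the exact string here is illustrative.
HEADERS = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
```

The design choice worth noting: a consistent identity (one session, one UA, one predictable cadence) looks far more human than rotating agents and bursty retries.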
ForexFactory times are US/Eastern — which means UTC-5 in winter and UTC-4 during DST. Storing as-is and converting later is a mistake: you lose information about whether DST was active. Solution: convert to UTC at parse time using pytz with the original Eastern timestamp preserved in a separate column. This makes historical comparison unambiguous.
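The conversion can be sketched like this. The project uses pytz; this sketch swaps in the stdlib `zoneinfo` module (Python 3.9+), which handles the same EST/EDT distinction, and returns the offset name so the original-timezone column can be populated:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # stdlib equivalent of the pytz lookup

EASTERN = ZoneInfo("US/Eastern")

def to_utc(naive_eastern: datetime):
    """Attach the Eastern zone to ForexFactory's naive timestamp,
    convert to UTC, and return the zone name in effect for auditing."""
    localized = naive_eastern.replace(tzinfo=EASTERN)
    return localized.astimezone(timezone.utc), localized.tzname()

# The same 08:30 Eastern release lands on different UTC times
# depending on whether DST was active:
nfp_winter, tz_w = to_utc(datetime(2024, 1, 5, 8, 30))  # 13:30 UTC, "EST"
nfp_summer, tz_s = to_utc(datetime(2024, 7, 5, 8, 30))  # 12:30 UTC, "EDT"
```

Converting at parse time means the DST question is answered once, while the facts are known, instead of re-derived at query time.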
ForexFactory updates actual values as events release throughout the day. A naive scraper would miss these updates since the event row already exists. The UPSERT logic checks if actual/forecast/previous differ from stored values — if they do, it updates and logs the change in sync_log. This means you always have the latest actual values without duplicates.
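In PostgreSQL this maps naturally onto `INSERT ... ON CONFLICT DO UPDATE` with a change guard. A sketch, assuming an `events` table keyed on `event_uid` (the uid comes from the text; the other column names are illustrative). `IS DISTINCT FROM` is used rather than `<>` because a newly released actual replacing NULL must count as a change:

```python
UPSERT_SQL = """
INSERT INTO events (event_uid, title, currency, event_time_utc,
                    actual, forecast, previous)
VALUES (%(event_uid)s, %(title)s, %(currency)s, %(event_time_utc)s,
        %(actual)s, %(forecast)s, %(previous)s)
ON CONFLICT (event_uid) DO UPDATE
SET actual   = EXCLUDED.actual,
    forecast = EXCLUDED.forecast,
    previous = EXCLUDED.previous
WHERE (events.actual, events.forecast, events.previous)
      IS DISTINCT FROM
      (EXCLUDED.actual, EXCLUDED.forecast, EXCLUDED.previous);
"""

def values_changed(stored: dict, scraped: dict) -> bool:
    """Python-side mirror of the WHERE guard, used to decide whether
    the change should also be recorded in sync_log."""
    keys = ("actual", "forecast", "previous")
    return any(stored.get(k) != scraped.get(k) for k in keys)
```

With the guard in place, re-scraping an unchanged row is a no-op at the database level, so realtime polling never inflates write traffic or the audit trail.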
ForexFactory's day, week, and month HTML templates are structurally different — different table layouts, different date rendering, different row structures for multi-day events. Built a semantic parser that understands each view's structure and normalises into the same event object. This is the core intellectual work of the project — 80% of the complexity lives here.
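The normalisation step can be sketched as a per-view dispatch onto one canonical event object. Everything below is illustrative: the raw field names are stand-ins for what each view's markup actually yields, and the uid scheme is one possible choice, not necessarily the project's:

```python
from dataclasses import dataclass
import hashlib

@dataclass(frozen=True)
class Event:
    """Canonical event object every view normalises into."""
    event_uid: str
    date: str          # ISO date
    time: str          # "08:30" Eastern, before UTC conversion
    currency: str
    title: str
    impact: str

def make_uid(date: str, currency: str, title: str) -> str:
    """Stable key for UPSERTs: the same event scraped from day, week,
    or month view must collapse onto one event_uid."""
    return hashlib.sha1(f"{date}|{currency}|{title}".encode()).hexdigest()[:16]

def normalise(view: str, raw: dict) -> Event:
    """Map one view-specific raw row onto the canonical Event.
    Each branch absorbs that view's structural quirks (e.g. the week
    view carries the date in a day-header row, the month view may
    render all-day events with no time cell)."""
    if view == "day":
        date, time_ = raw["date"], raw["time"]
    elif view == "week":
        date, time_ = raw["day_header_date"], raw["time"]
    elif view == "month":
        date, time_ = raw["cell_date"], raw.get("time", "all-day")
    else:
        raise ValueError(f"unknown view: {view}")
    return Event(
        event_uid=make_uid(date, raw["currency"], raw["title"]),
        date=date, time=time_,
        currency=raw["currency"], title=raw["title"],
        impact=raw.get("impact", "unknown"),
    )
```

The payoff is that everything downstream — UPSERT, timezone conversion, sync_log — sees exactly one shape, no matter which view produced the row.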
Building the semantic parser to understand ForexFactory's HTML structure deeply — rather than CSS selector hacking — is what makes v2.2 resilient. When ForexFactory tweaks their template, the semantic parser adapts with minimal changes. CSS selectors break silently.
The sync_log audit table is the most valuable debugging tool in the whole system. When a trader asks "why is the NFP event showing the wrong actual value?", the answer is always in sync_log within 30 seconds.
UPSERT over INSERT is the correct default for any scraping pipeline that touches data that can change after initial publish. Write the UPSERT logic once and never think about data freshness again.