the_problem

Every forex and macro trader knows ForexFactory's economic calendar — it's the industry standard for tracking high-impact events (NFP, CPI, FOMC, etc.). But it has no official API. The data exists only as HTML rendered by their website, with actual, forecast, and previous values updating in real time as events release.

We needed this data in PostgreSQL — structured, clean, timestamped, and historically preserved — to feed into the Forex Data Pipeline and correlate economic events with price action. Buying a third-party calendar API would cost $200–500/mo and lock us into their schema.

ForexFactory is the most important economic data source for retail forex traders. The fact that it has no API is both the problem and the opportunity.
my_approach

Built a semantic HTML parser specifically for ForexFactory's calendar structure. ForexFactory renders events in a consistent table format across day/week/month views — the semantic parser extracts events as structured objects regardless of which view is scraped.

  • Three scraping modes: day (realtime, lightweight), week (daily sync, broader coverage), month (backfill, historical reconstruction).
  • UPSERT, not INSERT: Every event has a unique event_uid. If an event already exists, only update if actual/forecast/previous values have changed — this handles ForexFactory's live updates as events release.
  • Timezone-first design: ForexFactory displays in US/Eastern. All times are converted to UTC on ingest. A timezone field is stored alongside UTC timestamp for auditability.
  • Full audit trail: sync_log records every scrape operation — what was fetched, what was upserted, what was skipped, duration, and any errors. Debugging a missed event means querying sync_log, not trawling server logs.
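The "only update if values changed" rule above can be sketched as a small change-detection guard. The field names mirror the schema described in this write-up; the helper itself is illustrative, not the project's actual code.

```python
# Sketch of the change-detection behind "UPSERT, not INSERT".
# Tracked fields follow the Economic_Calendar_FF schema in this write-up.
TRACKED_FIELDS = ("actual", "forecast", "previous")

def values_changed(stored: dict, scraped: dict) -> bool:
    """Return True if any tracked value differs, i.e. the row needs an update."""
    return any(stored.get(f) != scraped.get(f) for f in TRACKED_FIELDS)

# Typical case: actual releases after the initial scrape
stored  = {"actual": None,   "forecast": "180K", "previous": "150K"}
scraped = {"actual": "227K", "forecast": "180K", "previous": "150K"}
assert values_changed(stored, scraped)        # actual went None -> "227K"
assert not values_changed(scraped, scraped)   # re-scrape with no change: skip
```

Skipped rows still get counted in sync_log's rows_skipped, so a quiet run is distinguishable from a broken one.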
architecture
GitHub Actions — 3 Automated Workflows
├── forexfactory-realtime-15min.yml   → every 15 min
├── forexfactory-daily-sync.yml       → 02:00 UTC daily
└── forexfactory-monthly-backfill.yml → manual trigger
        ↓
scraper_2.2/
├── src/scraper.py  → semantic HTML parser (day/week/month)
├── src/database.py → UPSERT logic + connection pooling
└── src/config.py   → env vars, impact levels, currency map
        ↓
Event Normalization
├── Parse: event name, currency, impact (High/Med/Low/Holiday)
├── Values: actual, forecast, previous (handle "—" and null)
├── Timezone: US/Eastern → UTC conversion + DST handling
└── Dedup: event_uid = hash(date + time + currency + event_name)
        ↓
PostgreSQL
├── Economic_Calendar_FF
│     currency, event, impact, actual, forecast, previous,
│     event_time_utc, event_uid (UNIQUE), created_at, updated_at
└── sync_log
      run_id, mode, rows_fetched, rows_upserted, rows_skipped,
      duration_ms, error, run_at
        ↓
CSV Export (parallel output alongside DB)
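The dedup rule in the normalization step (event_uid = hash(date + time + currency + event_name)) can be sketched as follows. The choice of SHA-256 and the 16-character truncation are assumptions for illustration; the write-up only specifies which fields feed the hash.

```python
import hashlib

def event_uid(date: str, time: str, currency: str, event_name: str) -> str:
    """Deterministic UID from the four identity fields, per the dedup rule.

    SHA-256 and the 16-hex-char truncation are illustrative choices; any
    stable hash over the same key works. The "|" separator prevents
    ambiguous concatenations like ("ab", "c") vs ("a", "bc").
    """
    key = f"{date}|{time}|{currency}|{event_name}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]

# Same inputs always yield the same UID, so re-scrapes hit the UNIQUE
# constraint on event_uid instead of creating duplicate rows.
a = event_uid("2024-12-06", "13:30", "USD", "Non-Farm Employment Change")
b = event_uid("2024-12-06", "13:30", "USD", "Non-Farm Employment Change")
assert a == b and len(a) == 16
```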
hard_parts
Challenge 01
ForexFactory's anti-scraping measures

ForexFactory blocks aggressive scrapers. Solution: respectful rate limiting (1 request per 15 minutes in realtime mode), proper User-Agent headers, and session reuse. We've never been blocked — the key is mimicking a human browsing pattern, not hammering the endpoint.
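The pattern above (shared session, browser-like User-Agent, hard floor between requests) looks roughly like this, assuming the scraper uses the requests library. The header string and helper names are illustrative, not the project's actual code.

```python
import time
import requests

MIN_INTERVAL_S = 900  # realtime mode: at most 1 request per 15 minutes

# One shared session gives connection and cookie reuse across scrapes.
# The exact User-Agent string here is an illustrative choice.
session = requests.Session()
session.headers.update({
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    )
})

def seconds_to_wait(last_ts: float, now: float,
                    min_interval: float = MIN_INTERVAL_S) -> float:
    """How long to sleep so consecutive requests stay min_interval apart."""
    return max(0.0, min_interval - (now - last_ts))

_last_request = 0.0

def polite_get(url: str) -> requests.Response:
    """Fetch url through the shared session, honouring the rate floor."""
    global _last_request
    time.sleep(seconds_to_wait(_last_request, time.monotonic()))
    _last_request = time.monotonic()
    return session.get(url, timeout=30)
```

The sleep happens client-side before the request goes out, so a burst of calls degrades gracefully into a queue rather than a hammering pattern.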

Challenge 02
Timezone and DST hell

ForexFactory times are US/Eastern — which means UTC-5 in winter and UTC-4 during DST. Storing as-is and converting later is a mistake: you lose information about whether DST was active. Solution: convert to UTC at parse time using pytz with the original Eastern timestamp preserved in a separate column. This makes historical comparison unambiguous.
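The convert-at-parse-time step looks roughly like this with pytz. The key detail is `localize()`: it resolves whether DST was active for that specific date, which a plain `tzinfo=` assignment would get wrong.

```python
from datetime import datetime

import pytz

EASTERN = pytz.timezone("US/Eastern")

def to_utc(naive_eastern: datetime) -> datetime:
    """Interpret a naive ForexFactory timestamp as US/Eastern, convert to UTC.

    localize() picks EST vs EDT based on the date itself, so the same
    wall-clock time maps to different UTC hours across the DST boundary.
    """
    return EASTERN.localize(naive_eastern).astimezone(pytz.utc)

# Winter (EST, UTC-5): 08:30 ET -> 13:30 UTC
assert to_utc(datetime(2024, 1, 5, 8, 30)).hour == 13
# Summer (EDT, UTC-4): 08:30 ET -> 12:30 UTC
assert to_utc(datetime(2024, 7, 5, 8, 30)).hour == 12
```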

Challenge 03
Live event updates after initial scrape

ForexFactory updates actual values as events release throughout the day. A naive scraper would miss these updates since the event row already exists. The UPSERT logic checks if actual/forecast/previous differ from stored values — if they do, it updates and logs the change in sync_log. This means you always have the latest actual values without duplicates.
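In PostgreSQL this maps naturally onto INSERT ... ON CONFLICT with a conditional DO UPDATE. A hedged sketch, using the column names from the schema above; the exact statement in src/database.py may differ:

```sql
-- IS DISTINCT FROM makes the no-change case a no-op (and NULL-safe),
-- so unchanged rows are skipped rather than rewritten on every scrape.
INSERT INTO "Economic_Calendar_FF" AS c
    (event_uid, currency, event, impact, actual, forecast, previous,
     event_time_utc, created_at, updated_at)
VALUES
    (%(uid)s, %(ccy)s, %(event)s, %(impact)s, %(actual)s, %(forecast)s,
     %(previous)s, %(ts)s, now(), now())
ON CONFLICT (event_uid) DO UPDATE
SET actual     = EXCLUDED.actual,
    forecast   = EXCLUDED.forecast,
    previous   = EXCLUDED.previous,
    updated_at = now()
WHERE (c.actual, c.forecast, c.previous)
      IS DISTINCT FROM (EXCLUDED.actual, EXCLUDED.forecast, EXCLUDED.previous);
```

Because the WHERE clause suppresses identical updates, updated_at only moves when a value genuinely changed, which is what makes the sync_log deltas trustworthy.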

Challenge 04
Parsing across 3 view modes

ForexFactory's day, week, and month HTML templates are structurally different — different table layouts, different date rendering, different row structures for multi-day events. Built a semantic parser that understands each view's structure and normalises into the same event object. This is the core intellectual work of the project — 80% of the complexity lives here.
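The "same event object regardless of view" contract can be sketched as a normalised record plus a cell cleaner. The field set mirrors the schema in this write-up; the class and function names are illustrative, not the project's actual code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class CalendarEvent:
    """The single normalised shape all three view parsers emit."""
    event_time_utc: str
    currency: str
    event: str
    impact: str                     # High / Medium / Low / Holiday
    actual: Optional[str] = None    # None until the value releases
    forecast: Optional[str] = None
    previous: Optional[str] = None

def clean_value(raw: str) -> Optional[str]:
    """Normalise a scraped table cell: ForexFactory's "—" placeholder
    and empty strings both become None."""
    raw = raw.strip()
    return None if raw in ("", "—", "-") else raw

assert clean_value("—") is None
assert clean_value(" 227K ") == "227K"
```

Each view-specific parser only has to get its rows into this one shape; everything downstream (UPSERT, CSV export, sync_log) is view-agnostic.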

results
15 min · Realtime update cadence
0 · Duplicate events (UPSERT)
100% · UTC-normalised timestamps
Full · Audit trail in sync_log
lessons_learned

Building the semantic parser to understand ForexFactory's HTML structure deeply — rather than CSS selector hacking — is what makes v2.2 resilient. When ForexFactory tweaks their template, the semantic parser adapts with minimal changes. CSS selectors break silently.

The sync_log audit table is the most valuable debugging tool in the whole system. When a trader asks "why is the NFP event showing the wrong actual value?", the answer is always in sync_log within 30 seconds.

UPSERT over INSERT is the correct default for any scraping pipeline that touches data that can change after initial publish. Write the UPSERT logic once and never think about data freshness again.

The three-workflow architecture (15-min / daily / monthly) maps directly to the three use cases: live trading alerts, daily analysis, and historical research. One scraper, three deployment modes.