the_problem

Crypto markets are narrative-driven. Price moves often follow news by minutes, not hours. We had excellent technical data (CryptoPrism-DB) and sentiment indicators — but no structured news feed to explain why signals fired or to validate them with real-world events.

The obvious solution — use a third-party news API directly — hits two problems: (1) they're expensive at production volume, and (2) they don't let you build custom ML on top. We needed our own pipeline: fetch, normalise, deduplicate, and expose to the ML layer.

44 verified crypto publishers. 500+ articles every hour. All flowing into a PostgreSQL schema we own and can query however we want.
my_approach

Two API sources chosen for complementary coverage: CoinDesk (editorial, authoritative) and CryptoCompare (breadth, global aggregation). The pipeline is a clean four-stage process:

  • API Connector — HTTP client with retry logic, rate limiting, and pagination handling for both sources.
  • Feature Extractor — Parses and transforms raw JSON into normalised schema: title, body, source, categories, published_at, tickers mentioned.
  • Data Organizer — Deduplication by content hash, category normalisation across 182+ tags, source verification.
  • DB Connector — PostgreSQL UPSERT — never creates duplicates, always updates if content changes.
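The final stage can be sketched as a single parameterised statement. This is a hedged illustration, not the project's actual schema: the `articles` table, its column names, and the `content_hash` conflict key are assumptions (the hash is assumed to be computed upstream by the Data Organizer).

```python
# Sketch of the DB Connector's UPSERT. Table and column names are
# assumptions, not the real schema; ON CONFLICT is what guarantees
# "never creates duplicates, always updates if content changes".
UPSERT_SQL = """
INSERT INTO articles (content_hash, title, body, source, published_at)
VALUES (%(content_hash)s, %(title)s, %(body)s, %(source)s, %(published_at)s)
ON CONFLICT (content_hash) DO UPDATE
SET title        = EXCLUDED.title,
    body         = EXCLUDED.body,
    published_at = EXCLUDED.published_at;
"""

def upsert(cursor, article: dict) -> None:
    """Write one normalised article; re-fetching updates rather than duplicates."""
    cursor.execute(UPSERT_SQL, article)
```

Any PostgreSQL driver with `pyformat` parameters (e.g. psycopg2) can execute this; the conflict target must be backed by a unique index on `content_hash`.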

On top of the fetch pipeline sits an ML layer (features → nlp → models → inference) that runs sentiment classification, entity extraction, and topic modelling on stored articles. The 55.8% positive sentiment figure comes from this layer running against the full historical corpus.

architecture
GitHub Actions (hourly CRON :00)
└── trigger fetch workflow

Data Sources
├── CoinDesk API (44 verified publishers, editorial)
└── CryptoCompare (global aggregation, breadth)
        │
        ▼
NewsFetcher Core (src/news_fetcher/)
├── API Connector     → HTTP + retry + rate-limit
├── Feature Extractor → parse, normalise, tag tickers
├── Data Organizer    → dedupe by hash, 182+ categories
└── DB Connector      → PostgreSQL UPSERT
        │
        ▼
PostgreSQL Database
├── news_headlines (fast read for dashboards)
├── articles       (full body, metadata)
├── sources        (publisher registry, 44+)
└── categories     (182+ topic tags)
        │
        ▼
ML Pipeline (src/)
├── features/  → TF-IDF, embeddings, ticker mentions
├── nlp/       → sentiment, entity extraction, NER
├── models/    → trained classifiers
└── inference/ → batch scoring + live scoring API

Export
├── CSV  (500+ articles snapshot)
└── JSON (structured data for downstream)
hard_parts
Challenge 01
Deduplication across two different APIs

CoinDesk and CryptoCompare often publish the same story (one originates, one aggregates). Deduplicating by URL fails: the two APIs assign different canonical URLs to the same article. Solution: a content hash over the normalised title plus the first 200 characters of the body. This catches the ~15% of duplicates that URL matching misses.
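The dedupe key can be sketched as follows. The exact normalisation rules (lowercase, strip punctuation, collapse whitespace) are my assumption about what "normalised" means here, not the pipeline's confirmed behaviour:

```python
import hashlib
import re

def _norm(s: str) -> str:
    # Assumed normalisation: lowercase, drop punctuation, collapse whitespace.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", s.lower())).strip()

def content_key(title: str, body: str) -> str:
    """Dedup key: normalised title + first 200 chars of normalised body."""
    return hashlib.sha256(
        (_norm(title) + _norm(body)[:200]).encode("utf-8")
    ).hexdigest()

# The same story under two canonical URLs hashes identically:
a = content_key("Bitcoin ETF Inflows Hit Record", "Spot bitcoin ETFs saw...")
b = content_key("Bitcoin ETF inflows hit record!", "Spot  bitcoin ETFs saw")
assert a == b
```

Hashing normalised content rather than the URL is what lets an aggregated copy collide with the original, regardless of which domain served it.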

Challenge 02
Category normalisation at 182+ tags

Both APIs ship their own taxonomy: CoinDesk has "DeFi"; CryptoCompare has "decentralized-finance" and "defi". Mapping two inconsistent source schemas onto 182+ normalised categories required building a translation table, with fuzzy matching for the edge cases.
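A minimal sketch of that table-plus-fuzzy-fallback approach, using the standard library's `difflib`. The table slice, the cutoff of 0.8, and the canonical names are illustrative assumptions; the real table covers 182+ tags:

```python
from difflib import get_close_matches

# Hypothetical slice of the translation table (real one: 182+ canonical tags).
CANONICAL = {
    "defi": "DeFi",
    "decentralized-finance": "DeFi",
    "decentralised-finance": "DeFi",
    "nft": "NFT",
    "non-fungible-tokens": "NFT",
}

def normalise_category(raw: str):
    """Exact lookup first; fuzzy fallback for edge cases the table misses."""
    key = raw.strip().lower().replace("_", "-")
    if key in CANONICAL:
        return CANONICAL[key]
    match = get_close_matches(key, CANONICAL.keys(), n=1, cutoff=0.8)
    return CANONICAL[match[0]] if match else None

assert normalise_category("DeFi") == "DeFi"                   # exact
assert normalise_category("decentralized finance") == "DeFi"  # fuzzy
```

Unmapped tags return `None` so they can be queued for human review instead of silently polluting the taxonomy.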

Challenge 03
Sentiment drift over time

The sentiment model trained on 2023 data performed poorly on 2025 articles — crypto vocabulary evolves fast ("restaking", "AI agents", "RWA" had no representation). Solution: quarterly retraining pipeline with the latest 90 days of articles as ground truth. The 55.8% positive figure reflects the retrained model.

Challenge 04
ML layer latency vs freshness

Running full NLP inference on 500 articles per hour synchronously would block the fetch pipeline. Solution: decoupled fetch and inference — fetch writes raw articles immediately, inference runs as a separate job on a 15-minute delay, scoring batches of unprocessed articles.
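The decoupling can be sketched with an in-memory stand-in. In the real pipeline the "queue" would be a query for unscored rows in PostgreSQL and the scorer would be the trained NLP model; both are placeholder assumptions here:

```python
from dataclasses import dataclass, field

@dataclass
class Store:
    """Stand-in for the articles table; real pipeline uses PostgreSQL."""
    articles: list = field(default_factory=list)

    def write_raw(self, article: dict) -> None:
        # Fetch job: persist immediately, never wait on NLP.
        self.articles.append({**article, "sentiment": None})

    def unscored(self, batch_size: int = 100) -> list:
        # In SQL this would be: SELECT ... WHERE sentiment IS NULL LIMIT n
        return [a for a in self.articles if a["sentiment"] is None][:batch_size]

def inference_job(store: Store, score=lambda a: "positive") -> int:
    """Separate job (e.g. 15-minute delay): score unprocessed batches.

    `score` is a placeholder; the real scorer is the trained classifier.
    """
    batch = store.unscored()
    for article in batch:
        article["sentiment"] = score(article)
    return len(batch)
```

Because the fetch job only ever appends and the inference job only ever fills in `sentiment`, the two can run on independent schedules and scale independently.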

results
500+
Articles ingested/hour
44+
Verified publishers
182+
Normalised categories
55.8%
Positive sentiment baseline
lessons_learned

Content-hash deduplication is far more robust than URL deduplication for news aggregation. This seems obvious in retrospect, but we tried URL-first and had to migrate.

Decoupling fetch from ML inference was the right architecture from day one — but we didn't do it initially. Running inference synchronously in the fetch pipeline caused timeouts during high-volume hours. Splitting them into independent jobs fixed latency and made each independently scalable.

Crypto NLP models need quarterly retraining at minimum. The vocabulary shifts faster than any other domain I've worked in.