Crypto markets are narrative-driven. Price moves often follow news by minutes, not hours. We had excellent technical data (CryptoPrism-DB) and sentiment indicators — but no structured news feed to explain why signals fired or to validate them with real-world events.
The obvious solution — use a third-party news API directly — hits two problems: (1) they're expensive at production volume, and (2) they don't let you build custom ML on top. We needed our own pipeline: fetch, normalise, deduplicate, and expose to the ML layer.
Two API sources were chosen for complementary coverage: CoinDesk (editorial, authoritative) and CryptoCompare (breadth, global aggregation). The pipeline is a clean four-stage process:
- API Connector — HTTP client with retry logic, rate limiting, and pagination handling for both sources.
- Feature Extractor — Parses and transforms raw JSON into normalised schema: title, body, source, categories, published_at, tickers mentioned.
- Data Organizer — Deduplication by content hash, category normalisation across 182+ tags, source verification.
- DB Connector — PostgreSQL UPSERT — never creates duplicates, always updates if content changes.
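The DB Connector's upsert behaviour can be sketched as follows. This is a minimal stand-in, not the production code: it uses SQLite's `ON CONFLICT ... DO UPDATE` syntax (which mirrors PostgreSQL's), and the table name, columns, and sample article are all illustrative.

```python
import sqlite3

# Illustrative schema -- the real pipeline runs against PostgreSQL,
# whose INSERT ... ON CONFLICT syntax is the same as SQLite's here.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE articles (
        content_hash TEXT PRIMARY KEY,
        title        TEXT,
        body         TEXT,
        source       TEXT,
        published_at TEXT
    )
""")

def upsert_article(conn, article):
    # Insert, or update the existing row on a content-hash collision:
    # never creates duplicates, always refreshes changed content.
    conn.execute("""
        INSERT INTO articles (content_hash, title, body, source, published_at)
        VALUES (:content_hash, :title, :body, :source, :published_at)
        ON CONFLICT (content_hash) DO UPDATE SET
            title        = excluded.title,
            body         = excluded.body,
            source       = excluded.source,
            published_at = excluded.published_at
    """, article)

article = {"content_hash": "abc123", "title": "ETH upgrade ships",
           "body": "...", "source": "coindesk",
           "published_at": "2025-01-01T00:00:00+00:00"}
upsert_article(conn, article)
article["title"] = "ETH upgrade ships (updated)"
upsert_article(conn, article)  # updates in place, still one row
rows = conn.execute("SELECT COUNT(*), MAX(title) FROM articles").fetchall()
```

Keying the upsert on `content_hash` rather than URL is what makes the later deduplication strategy work end to end.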
On top of the fetch pipeline sits an ML layer (features → inference → models → nlp) that runs sentiment classification, entity extraction, and topic modelling on stored articles. The 55.8% positive sentiment figure comes from this layer running against the full historical corpus.
CoinDesk and CryptoCompare often publish the same story (one originates, one aggregates). Deduplication by URL fails — the same article carries different canonical URLs on each source. Solution: content hash on the normalised title + first 200 chars of body. Catches the ~15% of duplicates that URL matching would miss.
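A minimal sketch of that hashing scheme. The exact normalisation rules (Unicode form, whitespace collapsing, the field separator) are assumptions for illustration; the key idea from the text — hash the normalised title plus the first 200 characters of the body — is what matters.

```python
import hashlib
import unicodedata

def content_hash(title: str, body: str) -> str:
    # Normalise so trivial editorial differences between sources
    # (casing, extra whitespace, Unicode variants) hash identically.
    def norm(text: str) -> str:
        return " ".join(unicodedata.normalize("NFKC", text).lower().split())
    # Title plus first 200 chars of body, joined with a separator
    # so "ab" + "c" can't collide with "a" + "bc".
    key = norm(title) + "\x1f" + norm(body)[:200]
    return hashlib.sha256(key.encode("utf-8")).hexdigest()

# Same story, different canonical URLs and trivial formatting drift:
h1 = content_hash("Bitcoin ETF Approved", "The SEC  approved spot Bitcoin ETFs today.")
h2 = content_hash("Bitcoin  ETF approved", "The SEC approved spot Bitcoin ETFs today.")
```

Here `h1 == h2`, so both copies collapse to one row, which a URL-based key would have stored twice.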
Each API has its own taxonomy. CoinDesk has "DeFi"; CryptoCompare has "decentralized-finance" and "defi". Mapping 182 normalised categories across two inconsistent source schemas required building a translation table and fuzzy matching for edge cases.
The sentiment model trained on 2023 data performed poorly on 2025 articles — crypto vocabulary evolves fast ("restaking", "AI agents", "RWA" had no representation). Solution: quarterly retraining pipeline with the latest 90 days of articles as ground truth. The 55.8% positive figure reflects the retrained model.
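The window-selection step of that retraining pipeline is simple enough to sketch. Field names and the ISO-8601 timestamp format are assumptions; the point is that only the trailing 90 days of articles enter the ground-truth corpus.

```python
from datetime import datetime, timedelta, timezone

def training_window(articles, now, days=90):
    # Keep only articles published inside the trailing window; these
    # become the ground-truth corpus for the quarterly retrain.
    # `articles` is assumed to be dicts with an ISO-8601 `published_at`.
    cutoff = now - timedelta(days=days)
    return [a for a in articles
            if datetime.fromisoformat(a["published_at"]) >= cutoff]

now = datetime(2025, 6, 1, tzinfo=timezone.utc)
corpus = [
    {"published_at": "2025-05-20T00:00:00+00:00"},  # inside the window
    {"published_at": "2024-12-01T00:00:00+00:00"},  # stale vocabulary era
]
recent = training_window(corpus, now=now)
```

A rolling window like this is what lets new vocabulary ("restaking", "RWA") enter the model within one retrain cycle instead of never.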
Running full NLP inference on 500 articles per hour synchronously would block the fetch pipeline. Solution: decoupled fetch and inference — fetch writes raw articles immediately, inference runs as a separate job on a 15-minute delay, scoring batches of unprocessed articles.
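The decoupling can be illustrated with a minimal sketch: the fetch path writes rows with a `NULL` sentiment and returns immediately, while a separately scheduled job claims and scores unprocessed batches. The schema, batch size, and keyword "model" are stand-ins, and SQLite stands in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, body TEXT, sentiment TEXT)"
)

def fetch_write(conn, body):
    # Fetch path: persist the raw article immediately, no inference.
    conn.execute("INSERT INTO articles (body, sentiment) VALUES (?, NULL)", (body,))

def inference_job(conn, batch_size=100):
    # Inference path: run on its own schedule (e.g. a 15-minute delay),
    # scoring a batch of rows the fetch pipeline left unprocessed.
    rows = conn.execute(
        "SELECT id, body FROM articles WHERE sentiment IS NULL LIMIT ?",
        (batch_size,)).fetchall()
    for row_id, body in rows:
        # Keyword stand-in for the real sentiment classifier.
        label = "positive" if "surge" in body else "neutral"
        conn.execute("UPDATE articles SET sentiment = ? WHERE id = ?",
                     (label, row_id))
    return len(rows)

fetch_write(conn, "BTC price surge after ETF inflows")
fetch_write(conn, "Exchange schedules maintenance window")
scored = inference_job(conn)
```

Because the two paths share only the table, each can be scaled or restarted independently — which is exactly what fixed the high-volume timeouts described below.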
Content hash deduplication is far more robust than URL deduplication for news aggregation. This seems obvious in retrospect but we tried URL-first and had to migrate.
Decoupling fetch from ML inference should have been the architecture from day one — but we didn't build it that way initially. Running inference synchronously in the fetch pipeline caused timeouts during high-volume hours. Splitting them into independent jobs fixed the latency and made each independently scalable.
Crypto NLP models need quarterly retraining at minimum. The vocabulary shifts faster than any other domain I've worked in.