Python / Binance API / Sentiment Analysis

CryptoPrism News Fetcher

News sentiment analysis pipeline feeding automated spot trading decisions via Binance API.

Live

trading signals


Deep Dives

Leadership Lens

01 The Call

Chose to build a fully automated end-to-end ML signal pipeline — from news ingestion through FinBERT NLP, feature engineering, and LightGBM inference — rather than buying a third-party signal feed or limiting scope to price-based indicators alone.

02 The Bet

Bet that separating the training database (cp_backtest, years of history) from the inference database (dbcp, live snapshots) was the correct architecture, and that fixing this dual-DB wiring would unlock the majority of predictive signal — committing to the refactor before any performance evidence.

03 The Trade-off

Accepted a slower, heavier pipeline (FinBERT NLP is GPU-bound, BTC residual decomposition runs on hourly OHLCV for 250 coins) in exchange for a model that can see independent alpha — coins that move for their own reasons, not just because Bitcoin moved.

04 The Outcome

IC climbed from -0.007 (broken, worse than random) to +0.129 (above the 0.10 target), Sharpe 7.69, 60.7% accuracy on unseen data, max drawdown -3.0% — with 10 GitHub Actions workflows producing daily signals for 1,000 coins without human intervention.

05 Coordinated

Sole engineer-of-record across the full stack: news API integrations (44 sources), FinBERT sentiment pipeline, 10 FE tables, dual-PostgreSQL architecture, 6-model ensemble, and all GitHub Actions scheduling.

06 Where this goes next

Wire ML_SIGNALS into the CryptoPrism API (FastAPI /ml endpoints), expose top-ranked coin predictions to the Saarthi AI advisor, and extend the hourly ensemble to cover 500 coins from the current 250.

01 Chapter 1

The Problem

Crypto markets move on news and price patterns. With 1,000 coins trading 24/7, manual analysis is impossible. A bullish CoinDesk headline about Solana, a volume spike on Arbitrum, a regulatory mention of XRP — these signals are scattered across dozens of sources, mixed with noise, and stale within hours.

We needed a system that could: (1) automatically read hundreds of articles per day, (2) extract sentiment and categorize by coin, (3) combine news signals with price-based technical features, and (4) produce a single ranked prediction for every coin, every day.

The Scale Challenge

1,000 coins × 50 features × 365 days = 18.25 million data points per year. No human team can process this. The system must be fully automated, running without intervention, and producing predictions before markets open each day.

Coins Tracked

1,000

ranked daily

News Sources

44

CoinDesk, CryptoCompare, etc.

Categories

182+

topic classifications

Throughput

500+

articles per hour

02 Chapter 2

Data Architecture: Three Databases

The system spans three PostgreSQL databases on the same server, each serving a distinct purpose. This separation was not just organizational — it was the key architectural insight that unlocked model performance.

Database Roles

PRODUCTION — dbcp: Today's signals, news sentiment, Fear & Greed index, and the model's latest predictions. Trading bots read from here. Only holds recent snapshots — the live dashboard.

HISTORICAL — cp_backtest: Years of daily FE tables — millions of rows. The training ground. All 10 feature engineering tables live here with full history for backtesting and model training.

HOURLY — cp_backtest_h: Hourly OHLCV for 250 coins with 30-day rolling windows. Powers the BTC residual decomposition and hourly neural network models that need fine-grained price data.

Articles Stored

66K+

since Oct 2025

Historical Rows

Millions

across 10 FE tables

Hourly Coins

250

30-day windows

Features Per Coin

50

signals per day

Why Three Databases?

The live database (dbcp) only has snapshots — it knows what happened today, not what happened two years ago. The model needs history to learn, so it reads from cp_backtest during training. When we discovered the model was only reading from dbcp (the "blind training" bug), all the historical features were empty. Connecting it to cp_backtest was the single biggest improvement in the entire project.
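A minimal sketch of how the stage-to-database routing could be made explicit, so that a training job wired to the wrong database fails immediately instead of training "blind". The database names come from the text above; the stage names and both helper functions are hypothetical and not taken from the project's code.

```python
# Database names are from the text; stage names and helpers are illustrative assumptions.
DB_FOR_STAGE = {
    "train": "cp_backtest",      # years of daily FE history
    "inference": "dbcp",         # live snapshots and latest predictions
    "hourly": "cp_backtest_h",   # hourly OHLCV with 30-day windows
}


def database_for(stage: str) -> str:
    """Return the database a pipeline stage is expected to read from."""
    try:
        return DB_FOR_STAGE[stage]
    except KeyError:
        raise ValueError(f"unknown pipeline stage: {stage!r}") from None


def check_wiring(stage: str, connected_db: str) -> None:
    """Raise if a stage is connected to a database other than the expected one."""
    expected = database_for(stage)
    if connected_db != expected:
        raise RuntimeError(
            f"{stage} is reading from {connected_db}, expected {expected}"
        )


# A correctly wired training job passes silently.
check_wiring("train", "cp_backtest")
```

Calling `check_wiring("train", "dbcp")` raises a `RuntimeError`, which is the situation the "blind training" bug describes. A guard like this at the start of a training run is one way such a misrouting could surface early.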

03 Chapter 3

Feature Engineering: 10 FE Tables

Feature Engineering is the process of turning raw data into useful signals the model can learn from. Each FE table holds a different family of signals. Together they give the model a 360-degree view of every coin.

Table | What It Measures | Plain English
FE_PCT_CHANGE | Daily returns, cumulative return, volatility, risk | How much did this coin move today? How risky is it?
FE_MOMENTUM_SIGNALS | Rate of change, Williams %R, CMO, SMI | Is this coin on a hot streak, or losing steam?
FE_OSCILLATORS_SIGNALS | MACD, CCI, ADX, Ultimate Oscillator, Trix | Is the coin overbought or oversold? Reversal coming?
FE_TVV_SIGNALS | On-Balance Volume, SMA/EMA crossovers, CMF | Is money flowing into or out of this coin?
FE_RATIOS_SIGNALS | Alpha, Beta, Sharpe, Sortino, Win Rate | Good returns for the risk you take?
FE_FEAR_GREED_CMC | CoinMarketCap Fear & Greed Index | Is the market feeling greedy or scared today?
FE_NEWS_SIGNALS | Sentiment scores, article volume, event flags | What is the news saying? Is coverage spiking?
FE_BTC_RESIDUALS | Beta, alpha, residual after stripping BTC | How much of this coin's move was just following BTC?
FE_RESIDUAL_FEATURES | Momentum, z-score, vol regime, autocorrelation | After removing BTC, is the coin's own move trending?
FE_CROSS_COIN | Percentile ranks, market breadth, dispersion, HHI | How is this coin doing vs. every other coin today?

The Key Insight

No single table is very predictive on its own. The model's power comes from combining all of them — 50 signals per coin per day. A coin might look great on momentum but terrible on risk-adjusted ratios. The model weighs all trade-offs simultaneously across 1,000 coins every day.
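To make the cross-coin family concrete, here is a small sketch of three of the FE_CROSS_COIN measures computed over one day's returns. The exact definitions used in the project are not given in the text, so these are common textbook versions and should be read as assumptions: percentile rank scaled to [0, 1], breadth as the share of coins with a positive return, and dispersion as the cross-sectional standard deviation.

```python
from statistics import pstdev


def percentile_ranks(returns: dict[str, float]) -> dict[str, float]:
    """Rank each coin within the day's cross-section, scaled to [0, 1].

    Ties are broken by input order here; a production version might
    assign tied coins the same rank.
    """
    ordered = sorted(returns, key=returns.get)
    n = len(ordered)
    if n == 1:
        return {ordered[0]: 0.5}
    return {coin: i / (n - 1) for i, coin in enumerate(ordered)}


def market_breadth(returns: dict[str, float]) -> float:
    """Fraction of coins that closed with a positive return."""
    return sum(1 for r in returns.values() if r > 0) / len(returns)


def dispersion(returns: dict[str, float]) -> float:
    """Cross-sectional (population) standard deviation of returns."""
    return pstdev(list(returns.values()))


# Hypothetical one-day returns for four coins.
day = {"BTC": 0.05, "ETH": 0.08, "SOL": -0.02, "XRP": 0.01}
ranks = percentile_ranks(day)
breadth = market_breadth(day)
spread = dispersion(day)
```

In this toy day, ETH gets the top rank and SOL the bottom one, and three of four coins are up, so breadth is 0.75. A 5% move then reads differently depending on whether it sits at the top or the middle of the day's ranking.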

04 Chapter 4

The Journey: Four Phases

The project started as a simple news collector and evolved into a full prediction engine over seven months. Here is the story told in four phases.

Phase 1 — Collecting the News (Sep 2025)

Built a program connecting to CoinDesk and CryptoCompare that downloads every article published. Runs every hour on GitHub Actions, storing title, body, source, and category in PostgreSQL. By February 2026: 66,000+ articles from 44 sources across 182 categories.

Phase 2 — Teaching It to Read (Feb 2026)

Added FinBERT — a language model trained on financial text. It reads each article and scores it as positive, negative, or neutral. Scores are grouped by coin and averaged into daily "news signals": Is the news about Bitcoin bullish today? Is article volume spiking for Ethereum? Are there regulatory stories about Solana?

Phase 3 — The Big Fix: Dual-DB Bug (Apr 8)

Discovered the model had been "blind": a database wiring bug meant it could see only 1 of 54 available signals during training. The fix connected the training pipeline to cp_backtest (years of history) while keeping real-time news on dbcp. This dual-database approach unlocked all 34 price features overnight, and IC jumped from -0.007 to +0.081.

Phase 4 — Making It Smarter (Apr 9–11)

Added BTC residual decomposition (stripping Bitcoin's influence), cross-coin percentile ranks, market breadth and dispersion metrics. Went from 34 features to 50. Model hit IC +0.129 and Sharpe 7.69 — exceeding our 0.10 IC target.

05 Chapter 5

Key Innovation: BTC Residual Decomposition

In crypto, when Bitcoin goes up 5%, most altcoins go up too. If you are trying to predict which coins will outperform, you first need to strip out this "following Bitcoin" effect. Otherwise the model just learns "buy everything when BTC is up" — which is not useful.

Decomposition Pipeline

Hourly OHLCV → 30-Day Rolling Regression → Beta + Alpha → Residual (Independent Alpha)

How It Works

For each coin, we run a 30-day rolling regression against Bitcoin's returns. This gives us a beta (how much the coin follows BTC) and a residual (whatever is left — the coin's own independent movement).

Example

Ethereum went up 8% and Bitcoin went up 5%. If ETH's beta is 1.2, we'd expect a 6% move (1.2 × 5%) just from following BTC. The residual is the extra 2% — that's Ethereum's own alpha. Our model predicts these residuals, not raw price moves.
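The decomposition can be sketched in a few lines of plain Python. This is an illustrative version, not the project's implementation: it fits an ordinary least squares line `coin = alpha + beta * btc` on a trailing window and scores the next bar against that fit. Fitting only on bars before the current one is an assumption made here to avoid look-ahead, and the window length is a parameter (30 days of hourly bars would be 720).

```python
def ols_beta_alpha(coin_returns, btc_returns):
    """Least-squares fit of coin = alpha + beta * btc over one window."""
    n = len(btc_returns)
    if n != len(coin_returns) or n < 2:
        raise ValueError("need two equally sized series with at least 2 points")
    mean_btc = sum(btc_returns) / n
    mean_coin = sum(coin_returns) / n
    cov = sum((b - mean_btc) * (c - mean_coin)
              for b, c in zip(btc_returns, coin_returns))
    var = sum((b - mean_btc) ** 2 for b in btc_returns)
    if var == 0:
        raise ValueError("BTC returns are constant in this window")
    beta = cov / var
    alpha = mean_coin - beta * mean_btc
    return beta, alpha


def residual(coin_return, btc_return, beta, alpha=0.0):
    """Part of the coin's move not explained by following BTC."""
    return coin_return - (alpha + beta * btc_return)


def rolling_residuals(coin_returns, btc_returns, window):
    """Residual for each bar, using a fit on the preceding `window` bars."""
    out = []
    for t in range(window, len(coin_returns)):
        beta, alpha = ols_beta_alpha(coin_returns[t - window:t],
                                     btc_returns[t - window:t])
        out.append(residual(coin_returns[t], btc_returns[t], beta, alpha))
    return out


# The ETH example from the text: 8% move, BTC +5%, beta 1.2, intercept ignored.
eth_residual = residual(0.08, 0.05, beta=1.2)  # about 0.02
```

The last line reproduces the worked example: with a beta of 1.2 and the intercept ignored, about 2 percentage points of ETH's 8% move are left over once the BTC-driven 6% is removed.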

8 Second-Order Features From Residuals

The FE_RESIDUAL_FEATURES table then extracts patterns from these stripped returns: residual momentum, z-score (mean reversion signal), volatility regime, autocorrelation, cumulative residual drift, residual acceleration, vol-of-vol, and trend strength. These become 8 additional features that capture the coin's independent behavior after removing the Bitcoin tide.
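Three of these second-order features are simple enough to sketch directly. The formulas below are standard definitions chosen for illustration; the text does not specify the exact lookbacks or normalizations the project uses, so the function names, the `lookback` parameter, and the use of population standard deviation are all assumptions.

```python
from statistics import mean, pstdev


def residual_momentum(residuals, lookback):
    """Sum of the most recent `lookback` residual returns."""
    return sum(residuals[-lookback:])


def residual_zscore(residuals):
    """How far the latest residual sits from the window mean, in std units."""
    mu = mean(residuals)
    sigma = pstdev(residuals)
    if sigma == 0:
        return 0.0
    return (residuals[-1] - mu) / sigma


def lag1_autocorrelation(residuals):
    """Lag-1 autocorrelation: positive suggests trending, negative mean reversion."""
    mu = mean(residuals)
    denom = sum((r - mu) ** 2 for r in residuals)
    if denom == 0:
        return 0.0
    num = sum((a - mu) * (b - mu)
              for a, b in zip(residuals[:-1], residuals[1:]))
    return num / denom


# Hypothetical residual series that flips sign every bar.
window = [0.01, -0.01, 0.01, -0.01]
momentum = residual_momentum(window, lookback=2)
zscore = residual_zscore(window)
autocorr = lag1_autocorrelation(window)
```

For the sign-flipping toy series, the autocorrelation is strongly negative and the two-bar momentum is zero, which is the pattern a mean-reverting residual would show.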

06 Chapter 6

Results: Model Performance

We measure prediction quality with two metrics. IC (Information Coefficient) is the correlation between the model's rankings and actual outcomes: 0 is random, 0.05 is useful in traditional finance, and 0.10+ is strong for daily crypto. Sharpe Ratio is risk-adjusted return: above 2 is very good, above 5 is exceptional.
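As a rough illustration of the two metrics, the sketch below computes IC as a Spearman rank correlation between predicted scores and realized returns, and an annualized Sharpe ratio from daily returns. Treating IC as Spearman correlation and annualizing with 365 periods (crypto trades every day) are assumptions made here; the text does not state which conventions the project uses.

```python
from math import sqrt
from statistics import mean, stdev


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally sized sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def information_coefficient(predicted, realized):
    """Spearman rank correlation between predictions and outcomes."""
    return pearson(average_ranks(predicted), average_ranks(realized))


def annualized_sharpe(daily_returns, periods_per_year=365):
    """Mean over sample std of daily returns, scaled to a yearly figure."""
    return mean(daily_returns) / stdev(daily_returns) * sqrt(periods_per_year)


# Hypothetical scores and outcomes for three coins on one day.
ic_perfect = information_coefficient([0.1, 0.3, 0.2], [0.01, 0.05, 0.02])
ic_inverted = information_coefficient([0.1, 0.3, 0.2], [0.05, 0.01, 0.02])
```

A model that orders the coins exactly right scores an IC of 1 and one that orders them exactly backwards scores -1. Real daily values sit far closer to zero, which is why 0.10 counts as strong.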

Model Version | When | Features | IC-3d | Sharpe | Assessment
Original (broken) | pre-Apr 8 | 1 | -0.007 | -2.16 | Worse than random (the "blind" bug)
After dual-DB fix | Apr 8 | 34 | +0.081 | +6.18 | Working! First real signal
+ Ensemble (6 models) | Apr 9 | 53 | +0.086 | not reported | Slight gain, many features cold-starting
+ Residual + Cross-coin | Apr 11 | 50 | +0.129 | +7.69 | Target hit. Stable generalization.

Prediction IC

0.129

Target was 0.10

Sharpe Ratio

7.69

risk-adjusted return

Accuracy

60.7%

on unseen data

Max Drawdown

-3.0%

worst loss in test period

What This Means

If you ranked 1,000 coins every day using this model and bought the top-ranked ones while shorting the bottom-ranked ones, you would have earned positive risk-adjusted returns on unseen data with a maximum dip of only 3%. The signal is real, stable across different time windows, and above the target we set.
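The long-short construction described above can be sketched as follows. This is a simplified, equal-weighted version with no fees, slippage, or borrowing costs, and it is not the project's backtest code; the bucket size `k` and the input dictionaries are hypothetical.

```python
def long_short_return(scores, next_returns, k):
    """Return of buying the top-k and shorting the bottom-k coins, equal-weighted."""
    if k < 1 or 2 * k > len(scores):
        raise ValueError("k must be at least 1 and at most half the universe")
    ranked = sorted(scores, key=scores.get, reverse=True)
    longs, shorts = ranked[:k], ranked[-k:]
    long_leg = sum(next_returns[c] for c in longs) / k
    short_leg = sum(next_returns[c] for c in shorts) / k
    return long_leg - short_leg


def max_drawdown(period_returns):
    """Largest peak-to-trough loss of the compounded equity curve (<= 0)."""
    equity = peak = 1.0
    worst = 0.0
    for r in period_returns:
        equity *= 1 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1)
    return worst


# Hypothetical model scores and next-period returns for four coins.
scores = {"A": 0.9, "B": 0.5, "C": 0.1, "D": -0.3}
realized = {"A": 0.04, "B": 0.01, "C": 0.00, "D": -0.02}
spread = long_short_return(scores, realized, k=1)   # long A, short D
drawdown = max_drawdown([0.10, -0.05, 0.02])
```

In the toy data the long leg gains 4% and the short leg profits from a 2% fall, for a spread of about 6%. The drawdown helper reports the 5% dip after the first period's peak, the same kind of figure as the -3.0% quoted above.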

07 Chapter 7

Pipeline Automation

The entire system runs on autopilot via GitHub Actions — cloud servers that execute code on a schedule. Ten workflows, zero manual intervention.

Schedule | Workflow | What It Does
Every hour | News Fetch | Downloads 30–60 new articles from CryptoCompare, stores in database
00:00 UTC | FE Tables Update | Refreshes all 10 FE_* tables with today's prices and indicators
00:30 UTC | NLP Pipeline | FinBERT reads today's articles, scores sentiment, creates FE_NEWS_SIGNALS
01:00 UTC | ML Signals (daily) | Refreshes features, runs model, ranks all 1,000 coins → ML_SIGNALS
Every 4 hours | Ensemble (6 models) | 6-component model generates granular signals → ML_SIGNALS_V2
Sundays 02:00 | Weekly Retrain | Regenerates labels, refreshes views, retrains model with latest data
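For reference, the schedule above could be written as five-field cron expressions (minute, hour, day of month, month, day of week), the syntax GitHub Actions uses for scheduled workflows, evaluated in UTC. The workflow keys below are hypothetical, and the exact minute at which the hourly and four-hourly jobs fire is an assumption; only the listed times come from the table.

```python
# Keys are illustrative names; cron strings mirror the schedule table.
SCHEDULES = {
    "news_fetch": "0 * * * *",        # every hour (minute 0 is assumed)
    "fe_tables_update": "0 0 * * *",  # 00:00 UTC daily
    "nlp_pipeline": "30 0 * * *",     # 00:30 UTC daily
    "ml_signals_daily": "0 1 * * *",  # 01:00 UTC daily
    "ensemble": "0 */4 * * *",        # every 4 hours (minute 0 is assumed)
    "weekly_retrain": "0 2 * * 0",    # Sundays 02:00 (0 = Sunday)
}


def daily_start_minute(cron: str) -> int:
    """Minutes after midnight for a cron entry with fixed minute and hour."""
    minute, hour, *_ = cron.split()
    return int(hour) * 60 + int(minute)


# The three daily jobs, in the order they are expected to run.
daily_chain = ["fe_tables_update", "nlp_pipeline", "ml_signals_daily"]
start_times = [daily_start_minute(SCHEDULES[name]) for name in daily_chain]
```

The 30-minute gaps between the daily jobs only work if each job finishes before the next one starts. A time-based chain like this trades simplicity for that implicit dependency.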

Daily Pipeline Flow

News Fetch (hourly) → FinBERT NLP → FE Tables → LightGBM → ML_SIGNALS

08 Chapter 8

Key Lessons

Lesson 1 — Data Plumbing > Model Complexity

The single biggest improvement came from fixing a database routing bug — not from using fancier AI. A simple model with good data beats a complex model with broken data every time. IC jumped from -0.007 to +0.081 overnight.

Lesson 2 — Context Is Everything

A coin going up 5% tells you nothing until you know what the rest of the market did. Cross-coin features (ranks, breadth, dispersion) turned an unstable model into a generalizing one and collapsed the gap between validation and test performance.

Lesson 3 — Strip the Market, Find the Alpha

Most crypto movement is just "following Bitcoin." By removing that with BTC residual decomposition, we let the model focus on each coin's unique story — which is where the actual prediction signal lives.

Lesson 4 — Automate Everything

Ten GitHub Actions workflows run the entire pipeline without human intervention. News fetching, sentiment scoring, feature generation, model retraining, and daily predictions all happen automatically, every day, on schedule.

09 Chapter 9

Tech Stack

Python, FinBERT (NLP), LightGBM, PostgreSQL (3 DBs), GitHub Actions, CoinDesk API, CryptoCompare API, pandas, scikit-learn, psycopg2, statsmodels, NumPy

Data Layer — 3 PostgreSQL Databases

dbcp (production), cp_backtest (historical), cp_backtest_h (hourly). Dual-DB architecture separates inference from training.

ML Layer — LightGBM + Ensemble

Gradient boosting for daily signals. 6-model ensemble (LightGBM, XGBoost, Ridge, LSTM, TCN, market regime) for hourly predictions.

NLP Layer — FinBERT Sentiment

Financial domain-specific BERT model. Scores each article as positive/negative/neutral. Aggregated by coin into daily news signals.
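The aggregation step can be sketched without loading the model itself. The example below assumes each article already carries FinBERT's three class probabilities and has been tagged with a coin and a date; collapsing them into a signed score as positive minus negative is a common convention but an assumption here, as are the field names.

```python
from collections import defaultdict
from statistics import mean


def signed_score(probs):
    """Collapse class probabilities into one number in [-1, 1]."""
    return probs["positive"] - probs["negative"]


def daily_news_signals(articles):
    """Group article scores by (coin, date) into mean sentiment and volume."""
    buckets = defaultdict(list)
    for article in articles:
        key = (article["coin"], article["date"])
        buckets[key].append(signed_score(article["probs"]))
    return {
        key: {"mean_sentiment": mean(scores), "article_count": len(scores)}
        for key, scores in buckets.items()
    }


# Hypothetical pre-scored articles; the probabilities are made up for illustration.
articles = [
    {"coin": "BTC", "date": "2026-04-10",
     "probs": {"positive": 0.7, "negative": 0.1, "neutral": 0.2}},
    {"coin": "BTC", "date": "2026-04-10",
     "probs": {"positive": 0.2, "negative": 0.5, "neutral": 0.3}},
    {"coin": "SOL", "date": "2026-04-10",
     "probs": {"positive": 0.1, "negative": 0.8, "neutral": 0.1}},
]
signals = daily_news_signals(articles)
```

Each `(coin, date)` key ends up with a mean sentiment and an article count, the kind of row a table like FE_NEWS_SIGNALS would hold. A spike in `article_count` is itself a signal, independent of the sentiment sign.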