Python / Binance API / Sentiment Analysis
CryptoPrism News Fetcher
News sentiment analysis pipeline feeding automated spot trading decisions via Binance API.

Leadership Lens
01 The Call
Chose to build a fully automated end-to-end ML signal pipeline — from news ingestion through FinBERT NLP, feature engineering, and LightGBM inference — rather than buying a third-party signal feed or limiting scope to price-based indicators alone.
02 The Bet
Bet that separating the training database (cp_backtest, years of history) from the inference database (dbcp, live snapshots) was the correct architecture, and that fixing this dual-DB wiring would unlock the majority of predictive signal — committing to the refactor before any performance evidence.
03 The Trade-off
Accepted a slower, heavier pipeline (FinBERT NLP is GPU-bound, BTC residual decomposition runs on hourly OHLCV for 250 coins) in exchange for a model that can see independent alpha — coins that move for their own reasons, not just because Bitcoin moved.
04 The Outcome
IC climbed from -0.007 (broken, worse than random) to +0.129 (above the 0.10 target), Sharpe 7.69, 60.7% accuracy on unseen data, max drawdown -3.0% — with 10 GitHub Actions workflows producing daily signals for 1,000 coins without human intervention.
05 Coordinated
Sole engineer-of-record across the full stack: news API integrations (44 sources), FinBERT sentiment pipeline, 10 FE tables, dual-PostgreSQL architecture, 6-model ensemble, and all GitHub Actions scheduling.
06 Where this goes next
Wire ML_SIGNALS into the CryptoPrism API (FastAPI /ml endpoints), expose top-ranked coin predictions to the Saarthi AI advisor, and extend the hourly ensemble to cover 500 coins from the current 250.
01 Chapter 1
The Problem
Crypto markets move on news and price patterns. With 1,000 coins trading 24/7, manual analysis is impossible. A bullish CoinDesk headline about Solana, a volume spike on Arbitrum, a regulatory mention of XRP — these signals are scattered across dozens of sources, mixed with noise, and stale within hours.
We needed a system that could: (1) automatically read hundreds of articles per day, (2) extract sentiment and categorize by coin, (3) combine news signals with price-based technical features, and (4) produce a single ranked prediction for every coin, every day.
The Scale Challenge
1,000 coins × 50 features × 365 days = 18.25 million data points per year. No human team can process this. The system must be fully automated, running without intervention, and producing predictions before markets open each day.
Coins Tracked
1,000
ranked daily
News Sources
44
CoinDesk, CryptoCompare, etc.
Categories
182+
topic classifications
Throughput
500+
articles per hour
02 Chapter 2
Data Architecture: Three Databases
The system spans three PostgreSQL databases on the same server, each serving a distinct purpose. This separation was not just organizational — it was the key architectural insight that unlocked model performance.
Database Roles
PRODUCTION — dbcp: Today's signals, news sentiment, Fear & Greed index, and the model's latest predictions. Trading bots read from here. Only holds recent snapshots — the live dashboard.
HISTORICAL — cp_backtest: Years of daily FE tables — millions of rows. The training ground. All 10 feature engineering tables live here with full history for backtesting and model training.
HOURLY — cp_backtest_h: Hourly OHLCV for 250 coins with 30-day rolling windows. Powers the BTC residual decomposition and hourly neural network models that need fine-grained price data.
Articles Stored
66K+
since Oct 2025
Historical Rows
Millions
across 10 FE tables
Hourly Coins
250
30-day windows
Features Per Coin
50
signals per day
Why Three Databases?
The live database (dbcp) only has snapshots — it knows what happened today, not what happened two years ago. The model needs history to learn, so it reads from cp_backtest during training. When we discovered the model was only reading from dbcp (the "blind training" bug), all the historical features were empty. Connecting it to cp_backtest was the single biggest improvement in the entire project.
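One way to make this kind of wiring mistake harder to repeat is to resolve the database from the job type in a single place. The sketch below is an illustration under assumptions, not the project's actual code: the `DB_FOR_JOB` mapping, the job names, the `dsn_for` helper, and the default host and port are hypothetical. Only the three database names come from the text above.

```python
import os

# Database names are from the article; the job names are hypothetical.
DB_FOR_JOB = {
    "train": "cp_backtest",     # years of daily FE history
    "infer": "dbcp",            # live snapshots read by trading bots
    "hourly": "cp_backtest_h",  # hourly OHLCV, 30-day windows
}


def dsn_for(job: str, host: str = "localhost", port: int = 5432) -> str:
    """Return a PostgreSQL connection URI for the given job type.

    Raising on an unknown job avoids silently falling back to the
    live database, which is the failure mode described above.
    """
    try:
        dbname = DB_FOR_JOB[job]
    except KeyError:
        raise ValueError(f"unknown job type: {job!r}") from None
    user = os.environ.get("PGUSER", "postgres")  # assumed environment variable
    return f"postgresql://{user}@{host}:{port}/{dbname}"
```

With routing like this, a training script would ask for `dsn_for("train")` and could never receive the snapshot-only database by default.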
03 Chapter 3
Feature Engineering: 10 FE Tables
Feature engineering is the process of turning raw data into signals the model can learn from. Each FE table holds a different family of signals. Together they give the model a 360-degree view of every coin.
| Table | What It Measures | Plain English |
|---|---|---|
| FE_PCT_CHANGE | Daily returns, cumulative return, volatility, risk | How much did this coin move today? How risky is it? |
| FE_MOMENTUM_SIGNALS | Rate of change, Williams %R, CMO, SMI | Is this coin on a hot streak, or losing steam? |
| FE_OSCILLATORS_SIGNALS | MACD, CCI, ADX, Ultimate Oscillator, Trix | Is the coin overbought or oversold? Reversal coming? |
| FE_TVV_SIGNALS | On-Balance Volume, SMA/EMA crossovers, CMF | Is money flowing into or out of this coin? |
| FE_RATIOS_SIGNALS | Alpha, Beta, Sharpe, Sortino, Win Rate | Good returns for the risk you take? |
| FE_FEAR_GREED_CMC | CoinMarketCap Fear & Greed Index | Is the market feeling greedy or scared today? |
| FE_NEWS_SIGNALS | Sentiment scores, article volume, event flags | What is the news saying? Is coverage spiking? |
| FE_BTC_RESIDUALS | Beta, alpha, residual after stripping BTC | How much of this coin's move was just following BTC? |
| FE_RESIDUAL_FEATURES | Momentum, z-score, vol regime, autocorrelation | After removing BTC, is the coin's own move trending? |
| FE_CROSS_COIN | Percentile ranks, market breadth, dispersion, HHI | How is this coin doing vs. every other coin today? |
The Key Insight
No single table is very predictive on its own. The model's power comes from combining all of them — 50 signals per coin per day. A coin might look great on momentum but terrible on risk-adjusted ratios. The model weighs all trade-offs simultaneously across 1,000 coins every day.
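Combining the tables amounts to a join on coin and date. The sketch below shows that join in plain Python, plus a coverage check that would flag a row where most features are missing. The in-memory layout, the `merge_feature_tables` and `coverage` helpers, and the column names `roc` and `sentiment` are assumptions for illustration. In the real pipeline the join presumably happens in SQL.

```python
from collections import defaultdict


def merge_feature_tables(tables):
    """Join per-table rows into one feature row per (coin, date).

    `tables` maps a table name to {(coin, date): {feature: value}}.
    Feature names are prefixed with the table name so columns from
    different tables cannot collide.
    """
    merged = defaultdict(dict)
    for table_name, rows in tables.items():
        for key, features in rows.items():
            for name, value in features.items():
                merged[key][f"{table_name}.{name}"] = value
    return dict(merged)


def coverage(merged, expected_columns):
    """Fraction of expected columns present for each (coin, date) key."""
    return {
        key: sum(col in row for col in expected_columns) / len(expected_columns)
        for key, row in merged.items()
    }


# Hypothetical sample rows; values and column names are made up.
SAMPLE_TABLES = {
    "FE_MOMENTUM_SIGNALS": {("SOL", "2026-04-11"): {"roc": 0.04}},
    "FE_NEWS_SIGNALS": {("SOL", "2026-04-11"): {"sentiment": 0.31}},
}
```

A coverage value well below 1.0 across all rows is an early warning that a feature family is not reaching the model.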
04 Chapter 4
The Journey: Four Phases
The project started as a simple news collector and evolved into a full prediction engine over seven months. Here is the story told in four phases.
Phase 1 — Collecting the News (Sep 2025)
Built a program connecting to CoinDesk and CryptoCompare that downloads every article published. Runs every hour on GitHub Actions, storing title, body, source, and category in PostgreSQL. By February 2026: 66,000+ articles from 44 sources across 182 categories.
Phase 2 — Teaching It to Read (Feb 2026)
Added FinBERT — a language model trained on financial text. It reads each article and scores it as positive, negative, or neutral. Scores are grouped by coin and averaged into daily "news signals": Is the news about Bitcoin bullish today? Is article volume spiking for Ethereum? Are there regulatory stories about Solana?
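The grouping step can be sketched as follows. It assumes each article already carries FinBERT's three class probabilities and a coin tag. The signed score (positive minus negative probability), the field names, and the output keys are assumptions chosen for illustration, since the article does not say how scores are collapsed.

```python
from collections import defaultdict
from statistics import mean


def signed_score(probs):
    """Collapse class probabilities into one number in [-1, 1]."""
    return probs["positive"] - probs["negative"]


def daily_news_signals(articles):
    """Average sentiment and count articles per (coin, date)."""
    buckets = defaultdict(list)
    for art in articles:
        buckets[(art["coin"], art["date"])].append(signed_score(art["probs"]))
    return {
        key: {"sentiment_mean": mean(scores), "article_count": len(scores)}
        for key, scores in buckets.items()
    }
```

The article count per coin doubles as a simple coverage-spike signal when compared with its own recent average.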
Phase 3 — The Big Fix: Dual-DB Bug (Apr 8)
Discovered the model had been "blind": a database wiring bug meant it could see only 1 of the 54 available signals during training. The fix: connect the training pipeline to cp_backtest (years of history) while keeping real-time news on dbcp. This dual-database approach unlocked all 34 price features overnight, and IC jumped from -0.007 to +0.081.
Phase 4 — Making It Smarter (Apr 9–11)
Added BTC residual decomposition (stripping Bitcoin's influence), cross-coin percentile ranks, market breadth and dispersion metrics. Went from 34 features to 50. Model hit IC +0.129 and Sharpe 7.69 — exceeding our 0.10 IC target.
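The cross-coin metrics named here can be sketched for a single day's cross-section. The exact definitions in FE_CROSS_COIN are not given in the article, so the formulas below are common textbook versions and should be read as assumptions. In particular, HHI is computed here on generic weights such as volume or market-cap shares.

```python
from statistics import pstdev


def percentile_ranks(returns):
    """Map each coin to the share of other coins with a strictly lower return."""
    n = len(returns)
    if n < 2:
        return {coin: 0.5 for coin in returns}
    values = list(returns.values())
    # O(n^2) comparisons; fine for a sketch with about 1,000 coins.
    return {coin: sum(v < r for v in values) / (n - 1) for coin, r in returns.items()}


def market_breadth(returns):
    """Fraction of coins with a positive return today."""
    return sum(r > 0 for r in returns.values()) / len(returns)


def dispersion(returns):
    """Cross-sectional standard deviation of today's returns."""
    return pstdev(list(returns.values()))


def hhi(weights):
    """Herfindahl-Hirschman index: sum of squared shares, in (0, 1]."""
    total = sum(weights.values())
    return sum((w / total) ** 2 for w in weights.values())
```

Because every value is relative to the same day's cross-section, these features stay comparable between calm and volatile market regimes.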
05 Chapter 5
Key Innovation: BTC Residual Decomposition
In crypto, when Bitcoin goes up 5%, most altcoins go up too. If you are trying to predict which coins will outperform, you first need to strip out this "following Bitcoin" effect. Otherwise the model just learns "buy everything when BTC is up" — which is not useful.
Decomposition Pipeline
How It Works
For each coin, we run a 30-day rolling regression against Bitcoin's returns. This gives us a beta (how much the coin follows BTC) and a residual (whatever is left — the coin's own independent movement).
Example
Ethereum went up 8% and Bitcoin went up 5%. If ETH's beta is 1.2, we'd expect a 6% move (1.2 × 5%) just from following BTC. The residual is the extra 2% — that's Ethereum's own alpha. Our model predicts these residuals, not raw price moves.
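A minimal sketch of the decomposition, using ordinary least squares on plain lists. Two details are assumptions rather than facts from the article: the regression for day t is fitted on the preceding `window` observations only, and the worked example sets the intercept (alpha) to zero, as the ETH arithmetic above implicitly does. Function names are hypothetical.

```python
def ols_beta_alpha(coin_returns, btc_returns):
    """Fit coin = alpha + beta * btc by ordinary least squares."""
    n = len(btc_returns)
    mean_x = sum(btc_returns) / n
    mean_y = sum(coin_returns) / n
    var_x = sum((x - mean_x) ** 2 for x in btc_returns)
    if var_x == 0:
        return 0.0, mean_y  # BTC did not move: no beta can be estimated
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(btc_returns, coin_returns))
    beta = cov_xy / var_x
    return beta, mean_y - beta * mean_x


def residual(coin_return, btc_return, beta, alpha=0.0):
    """The part of the coin's move not explained by BTC."""
    return coin_return - (alpha + beta * btc_return)


def rolling_residuals(coin_returns, btc_returns, window=30):
    """Residual for each period t, fitted on the `window` periods before t."""
    out = []
    for t in range(window, len(coin_returns)):
        beta, alpha = ols_beta_alpha(coin_returns[t - window:t], btc_returns[t - window:t])
        out.append(residual(coin_returns[t], btc_returns[t], beta, alpha))
    return out
```

Calling `residual(0.08, 0.05, beta=1.2)` reproduces the example: 8% minus the expected 6% leaves roughly 2%.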
8 Second-Order Features From Residuals
The FE_RESIDUAL_FEATURES table then extracts patterns from these stripped returns: residual momentum, z-score (mean reversion signal), volatility regime, autocorrelation, cumulative residual drift, residual acceleration, vol-of-vol, and trend strength. These become 8 additional features that capture the coin's independent behavior after removing the Bitcoin tide.
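Three of the eight features can be sketched from a list of residuals. The article names the features but not their formulas, so the definitions and lookback lengths below are assumptions: momentum as a trailing sum, z-score of the latest residual against its own window, and lag-1 autocorrelation.

```python
from statistics import mean, pstdev


def residual_momentum(residuals, lookback=7):
    """Sum of the most recent `lookback` residuals."""
    return sum(residuals[-lookback:])


def residual_zscore(residuals, lookback=30):
    """How unusual the latest residual is relative to its recent window."""
    window = residuals[-lookback:]
    sd = pstdev(window)
    return 0.0 if sd == 0 else (window[-1] - mean(window)) / sd


def lag1_autocorrelation(residuals):
    """Positive values suggest trending residuals, negative values reversal."""
    m = mean(residuals)
    dev = [r - m for r in residuals]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return 0.0
    return sum(a * b for a, b in zip(dev[:-1], dev[1:])) / denom
```

A large positive z-score with negative autocorrelation would read as a mean-reversion setup, while positive momentum with positive autocorrelation would read as a trend.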
06 Chapter 6
Results: Model Performance
We measure prediction quality with two metrics. IC (Information Coefficient) is how well the model's rankings correlate with actual outcomes: 0 is random, 0.05 is useful in tradfi, and 0.10+ is strong for daily crypto. Sharpe ratio is risk-adjusted return: above 2 is very good, above 5 is exceptional.
| Model Version | When | Features | IC-3d | Sharpe | Assessment |
|---|---|---|---|---|---|
| Original (broken) | pre-Apr 8 | 1 | -0.007 | -2.16 | Worse than random — the "blind" bug |
| After dual-DB fix | Apr 8 | 34 | +0.081 | +6.18 | Working! First real signal |
| + Ensemble (6 models) | Apr 9 | 53 | +0.086 | — | Slight gain, many features cold-starting |
| + Residual + Cross-coin | Apr 11 | 50 | +0.129 | +7.69 | Target hit. Stable generalization. |
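For readers who want to see how the IC and Sharpe columns could be computed, here is a sketch. It assumes IC is the Spearman rank correlation between predicted scores and realized returns, which is the usual convention but is not stated in the article. It also assumes a zero risk-free rate and 365 trading periods per year for the annualized Sharpe ratio.

```python
from statistics import mean, stdev


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def information_coefficient(predictions, realized):
    """Spearman rank correlation between predictions and outcomes."""
    return pearson(average_ranks(predictions), average_ranks(realized))


def annualized_sharpe(daily_returns, periods_per_year=365):
    """Mean over sample standard deviation, scaled to a yearly horizon."""
    sd = stdev(daily_returns)
    return 0.0 if sd == 0 else mean(daily_returns) / sd * periods_per_year ** 0.5
```

Using ranks rather than raw values makes the IC insensitive to a few extreme returns, which matters in a universe where single coins can move tens of percent in a day.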
Prediction IC
0.129
Target was 0.10
Sharpe Ratio
7.69
risk-adjusted return
Accuracy
60.7%
on unseen data
Max Drawdown
-3.0%
worst loss in test period
What This Means
If you had ranked 1,000 coins every day using this model, bought the top-ranked ones, and shorted the bottom-ranked ones, you would have earned positive risk-adjusted returns on unseen data with a maximum dip of only 3%. The signal is real, stable across different time windows, and above the target we set.
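The long-short idea described above can be sketched in a few lines. This is an illustration under assumptions, not the backtest that produced the reported numbers: the equal weighting, the 10% cut-off, and the function names are hypothetical, and fees, slippage, and borrowing costs are ignored.

```python
def long_short_return(scores, next_returns, quantile=0.1):
    """One-period return of an equal-weight long-short portfolio.

    Goes long the top `quantile` of coins by score and short the
    bottom `quantile`, then reports long leg minus short leg.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    k = max(1, int(len(ranked) * quantile))
    longs, shorts = ranked[:k], ranked[-k:]
    long_leg = sum(next_returns[c] for c in longs) / k
    short_leg = sum(next_returns[c] for c in shorts) / k
    return long_leg - short_leg


def max_drawdown(period_returns):
    """Largest peak-to-trough decline of the compounded equity curve (<= 0)."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in period_returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = min(worst, equity / peak - 1.0)
    return worst
```

Feeding each day's `long_short_return` into `max_drawdown` gives the kind of worst-dip figure quoted in the tiles above.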
07 Chapter 7
Pipeline Automation
The entire system runs on autopilot via GitHub Actions, cloud servers that execute code on a schedule. Ten workflows, zero manual intervention. The table lists six of them.
| Schedule | Workflow | What It Does |
|---|---|---|
| Every hour | News Fetch | Downloads 30–60 new articles from CryptoCompare, stores in database |
| 00:00 UTC | FE Tables Update | Refreshes all 10 FE_* tables with today's prices and indicators |
| 00:30 UTC | NLP Pipeline | FinBERT reads today's articles, scores sentiment, creates FE_NEWS_SIGNALS |
| 01:00 UTC | ML Signals (daily) | Refreshes features, runs model, ranks all 1,000 coins → ML_SIGNALS |
| Every 4 hours | Ensemble (6 models) | 6-component model generates granular signals → ML_SIGNALS_V2 |
| Sundays 02:00 | Weekly Retrain | Regenerates labels, refreshes views, retrains model with latest data |
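The schedule in the table maps onto standard five-field cron expressions, which is the syntax GitHub Actions uses for scheduled workflows (evaluated in UTC). The sketch below writes them out and checks that the daily chain starts in dependency order. The dictionary, the helper names, and the reading of "Sundays 02:00" as UTC are assumptions, and the actual workflow files may differ.

```python
# Cron fields: minute hour day-of-month month day-of-week (UTC).
SCHEDULES = {
    "News Fetch": "0 * * * *",            # every hour
    "FE Tables Update": "0 0 * * *",      # 00:00
    "NLP Pipeline": "30 0 * * *",         # 00:30
    "ML Signals (daily)": "0 1 * * *",    # 01:00
    "Ensemble (6 models)": "0 */4 * * *", # every 4 hours
    "Weekly Retrain": "0 2 * * 0",        # Sundays 02:00
}

# Steps that feed one another each day, in the order they must run.
DAILY_CHAIN = ["FE Tables Update", "NLP Pipeline", "ML Signals (daily)"]


def minute_of_day(cron):
    """Start time in minutes after midnight for a fixed daily cron entry."""
    minute, hour, *_ = cron.split()
    return int(hour) * 60 + int(minute)


def chain_is_ordered(chain=DAILY_CHAIN, schedules=SCHEDULES):
    """True if every step in the chain starts after the one before it."""
    starts = [minute_of_day(schedules[name]) for name in chain]
    return all(a < b for a, b in zip(starts, starts[1:]))
```

Staggered start times only order when jobs begin. They do not guarantee that an upstream job has finished, so a 30-minute gap works only while each step reliably completes within it.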
Daily Pipeline Flow
08 Chapter 8
Key Lessons
Lesson 1 — Data Plumbing > Model Complexity
The single biggest improvement came from fixing a database routing bug — not from using fancier AI. A simple model with good data beats a complex model with broken data every time. IC jumped from -0.007 to +0.081 overnight.
Lesson 2 — Context Is Everything
A coin going up 5% tells you nothing until you know what the rest of the market did. Cross-coin features (ranks, breadth, dispersion) turned an unstable model into a generalizing one and collapsed the gap between validation and test performance.
Lesson 3 — Strip the Market, Find the Alpha
Most crypto movement is just "following Bitcoin." By removing that with BTC residual decomposition, we let the model focus on each coin's unique story — which is where the actual prediction signal lives.
Lesson 4 — Automate Everything
Ten GitHub Actions workflows run the entire pipeline without human intervention. News fetching, sentiment scoring, feature generation, model retraining, and daily predictions all happen automatically, every day, on schedule.
09 Chapter 9
Tech Stack
Data Layer — 3 PostgreSQL Databases
dbcp (production), cp_backtest (historical), cp_backtest_h (hourly). Dual-DB architecture separates inference from training.
ML Layer — LightGBM + Ensemble
Gradient boosting for daily signals. 6-model ensemble (LightGBM, XGBoost, Ridge, LSTM, TCN, market regime) for hourly predictions.
NLP Layer — FinBERT Sentiment
Financial domain-specific BERT model. Scores each article as positive/negative/neutral. Aggregated by coin into daily news signals.