The Problem

When I started building crypto trading strategies, I hit the same wall everyone hits: getting clean, pre-computed technical indicator data at scale is either prohibitively expensive or completely unreliable. Bloomberg Terminal charges thousands per month. Glassnode and Santiment lock the good data behind $500+/mo plans. Free APIs give you raw OHLCV with no indicators, no QA, and no guarantees.

What I actually needed was a production-quality data store I could query in seconds — with momentum indicators, financial ratios, volume analysis, and fundamental metrics all pre-computed for every coin, updated daily, free, and under my own control.

The insight: the compute for these indicators is cheap. The bottleneck is ingestion reliability and duplicate-free storage. If I could nail those two things, I'd have a private Bloomberg terminal for the cost of a VPS.
My Approach

I designed around separation of concerns from day one. Rather than a single monolithic database that grows chaotic over time, I split into four purpose-specific databases: dbcp for live production data, cp_ai for AI-processed analysis outputs, cp_backtest for weekly backtesting runs, and cp_backtest_h for historical backtest snapshots.

For ingestion, I made an unconventional choice: use R instead of Python for the OHLCV layer. The crypto2 R package handles 1,000+ coin requests with robust pagination and rate-limit handling that would have taken weeks to replicate in Python. Python then takes over for all indicator computation — vectorized Pandas/NumPy operations that process 100+ indicators per coin in a single DataFrame pass.
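To make "vectorized indicator computation" concrete, here is a minimal sketch of one such indicator, a Wilder-smoothed RSI in pandas. The function name and 14-period default are illustrative, not the project's actual code; the point is that the whole calculation is column operations with no Python-level loop:

```python
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Wilder-smoothed Relative Strength Index, fully vectorized:
    one diff, two clips, two exponential means per coin."""
    delta = close.diff()
    gain = delta.clip(lower=0.0)
    loss = -delta.clip(upper=0.0)
    avg_gain = gain.ewm(alpha=1.0 / period, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1.0 / period, min_periods=period).mean()
    rs = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + rs)
```

With 100+ indicators expressed this way, each coin's full DataFrame is processed in a single pass.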

For QA, I didn't want to stare at raw log files. I pipe the QA report into Gemini AI, which returns a plain-English summary of anomalies, and that summary goes straight to Telegram. GitHub Actions runs the whole pipeline daily at 00:05 UTC and weekly backtests every Sunday.
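The Telegram leg of that flow can be sketched with the standard library alone. This assumes the Gemini summary arrives as plain text; the bot token and chat id are placeholders, and `format_qa_alert` is an illustrative helper, though `sendMessage` is the real Bot API method:

```python
import json
import urllib.request

def format_qa_alert(score: int, anomalies: list[str]) -> str:
    """Collapse a QA run into a short, human-readable message."""
    status = "OK" if score == 100 and not anomalies else "ATTENTION"
    lines = [f"QA {status}: score {score}/100"]
    lines += [f"- {a}" for a in anomalies]
    return "\n".join(lines)

def send_telegram(text: str, bot_token: str, chat_id: str) -> None:
    """Push the message to a channel via the Bot API's sendMessage method."""
    req = urllib.request.Request(
        f"https://api.telegram.org/bot{bot_token}/sendMessage",
        data=json.dumps({"chat_id": chat_id, "text": text}).encode(),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req, timeout=10)
```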

Key insight: use the best tool for each layer. R's financial data packages are genuinely superior to Python's for this use case. Fighting that fact would have cost weeks of work.
Architecture

The pipeline runs in four sequential stages, each gating the next:

┌─────────────────────────────────────────────────────────────┐
│                    CRYPTOPRISM-DB PIPELINE                  │
│                    Daily @ 00:05 UTC (GitHub Actions)       │
└─────────────────────────────────────────────────────────────┘

  STAGE 1: LISTINGS
  ─────────────────
  CoinMarketCap API → Python → dbcp.listings
  • Fetch top 1,000+ coins by market cap
  • Update slugs, symbols, metadata
  • Gate: must complete before OHLCV runs

  STAGE 2: OHLCV  (R — crypto2 package)
  ──────────────────────────────────────
  R script → PostgreSQL (dbcp)
  • Pull 1,000+ coin OHLCV via crypto2
  • Timestamp-based dedup filter (pre-write)
  • Batch upsert → dbcp.ohlcv_*
  • Gate: must have 0 errors before DMV

  STAGE 3: DMV  (Python — Pandas/NumPy)
  ──────────────────────────────────────
  dbcp.ohlcv_* → Python → dbcp + cp_ai
  • 21 Momentum indicators (RSI×5, ROC, Williams %R...)
  • 33 Oscillators (MACD, CCI, ADX, TRIX...)
  • 23 Financial Ratios (Alpha/Beta vs BTC, Sharpe...)
  • 15+ Fundamentals, 33 Volume indicators
  • DMV Scores: D:19.8  M:-30.8  V:-24.4
  • Output → cp_ai for AI analysis

  STAGE 4: QA  (Python + Gemini AI + Telegram)
  ─────────────────────────────────────────────
  Run health checks → Gemini summary → Telegram
  • Null ratio checks, range validation
  • Duplicate detection sweep
  • Gemini: plain-English anomaly summary
  • Score: 100/100  Uptime: 99.9%
  • Alert → Telegram channel

  ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────────┐
  │ LISTINGS │──▶│  OHLCV   │──▶│   DMV    │──▶│      QA      │
  └──────────┘   └──────────┘   └──────────┘   └──────────────┘
       │               │               │                │
     dbcp           dbcp            dbcp            Telegram
                                   cp_ai
Hard Challenges
Challenge 01
Duplicate prevention at scale without per-row checks

At 1,000+ coins with daily OHLCV inserts, naive per-row duplicate detection caused the pipeline to run for 40+ minutes — sometimes timing out on GitHub Actions' 6-hour limit after weeks of data accumulation. A SELECT EXISTS per row is O(n) database round trips.

Fix: timestamp-based pre-filtering. Before any write, fetch the max timestamp already stored for every coin in one GROUP BY query. Then filter the incoming DataFrame to only rows newer than each coin's high-water mark. Zero duplicates, and one database round trip for the whole batch instead of thousands. Runtime dropped from 40+ minutes to under 8 minutes.
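The pre-filter reduces to a pure pandas step once the high-water marks are in hand. A minimal sketch, assuming `slug`/`timestamp` column names (illustrative, not necessarily the project's schema):

```python
import pandas as pd

# High-water marks come from one query, e.g.:
#   SELECT slug, MAX("timestamp") FROM ohlcv GROUP BY slug;
# loaded into a dict of slug -> last stored timestamp.

def filter_new_rows(incoming: pd.DataFrame, high_water: dict) -> pd.DataFrame:
    """Keep only rows strictly newer than each coin's stored max timestamp.
    Coins absent from high_water have no stored data, so all their rows pass."""
    cutoff = pd.to_datetime(incoming["slug"].map(high_water))  # NaT if coin is new
    return incoming[cutoff.isna() | (incoming["timestamp"] > cutoff)]
```

Everything that survives the filter is guaranteed new, so the batch upsert can proceed without any per-row existence checks.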
Challenge 02
Numeric-prefixed table names breaking SQL syntax

Many cryptocurrency symbols start with numbers (e.g. 1inch, 3crv). PostgreSQL table names cannot start with a digit without being quoted. This caused silent failures when the ORM generated unquoted DDL statements.

Solution: prefix all coin-derived table names with c_ and enforce double-quoting in all dynamic SQL generation. Added a validation step in the LISTINGS stage that flags any symbol requiring special handling before the OHLCV stage runs.
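The sanitization step is small enough to sketch in full; the function name is illustrative, but the `c_` prefix and mandatory double-quoting are the fix described above:

```python
def safe_table_name(symbol: str) -> str:
    """Make a coin symbol safe as a PostgreSQL table name: lowercase it,
    prefix with c_ so it never starts with a digit, and return it
    double-quoted (escaping embedded quotes) for use in dynamic SQL."""
    name = "c_" + symbol.lower()
    return '"' + name.replace('"', '""') + '"'
```

Running every coin-derived identifier through one choke point like this is what makes the LISTINGS-stage validation possible: any symbol that would break the convention is caught before OHLCV ever runs.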

Challenge 03
Multi-language orchestration — Python calling R at scale

Python's subprocess calling R scripts introduces failure modes: R crashes don't propagate as Python exceptions by default, R output pollutes the Python log stream, and environment differences between local and CI runners cause package version conflicts.

Solution: wrapped all R calls in a Python class that captures stdout/stderr separately, checks return codes, and raises typed Python exceptions on failure. R environment pinned via renv.lock. GitHub Actions caches the renv library between runs.
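A minimal sketch of that wrapper, assuming the real class does more (log streaming, retries); the names here are illustrative:

```python
import subprocess

class RScriptError(RuntimeError):
    """Typed failure for an external R call, carrying exit code and stderr."""
    def __init__(self, script: str, returncode: int, stderr: str):
        super().__init__(f"{script} failed (exit {returncode}): {stderr.strip()[:500]}")
        self.returncode = returncode
        self.stderr = stderr

def run_script(args: list[str], timeout: int = 3600) -> str:
    """Run an external script (e.g. ["Rscript", "ohlcv.R"]), capturing
    stdout and stderr separately so R output never pollutes the Python
    log stream, and raising a typed exception instead of failing silently."""
    proc = subprocess.run(args, capture_output=True, text=True, timeout=timeout)
    if proc.returncode != 0:
        raise RScriptError(args[0], proc.returncode, proc.stderr)
    return proc.stdout
```

A crashed R script now surfaces as a normal Python exception with the R stderr attached, so the orchestrator's existing error handling and alerting apply unchanged.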

Challenge 04
Zero-downtime QA on a live production database

Running health checks on dbcp while the DMV stage is still writing to it risks reading partial data and generating false-positive anomaly alerts.

Solution: QA stage uses PostgreSQL REPEATABLE READ transaction isolation. DMV writes continue in parallel without blocking. QA sees a stable snapshot; no false positives.
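In SQL terms, the QA transaction looks roughly like this (table and column names are illustrative, not the project's actual schema):

```sql
BEGIN ISOLATION LEVEL REPEATABLE READ;
-- Every read below sees the same snapshot, even while DMV keeps writing.
SELECT count(*) FROM listings WHERE slug IS NULL;    -- null-ratio check
SELECT slug, count(*) FROM ohlcv
GROUP BY slug, "timestamp" HAVING count(*) > 1;      -- duplicate sweep
COMMIT;
```

Because REPEATABLE READ readers never take locks that block writers in PostgreSQL, the DMV stage proceeds at full speed while QA checks a frozen view of the data.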

Results

After reaching v4.4.0 production release:

2.26M+ Total Records
100/100 QA Health Score
0 Duplicate Rows
99.9% Pipeline Uptime
Average query response time: 2–3 seconds for full-coin scans. The 4-database architecture means the production database is never locked by backtest workloads — they run entirely in cp_backtest.
Lessons Learned
  • R is genuinely better than Python for certain financial data ops.

    The crypto2 package's pagination logic saved weeks of API client work. Pragmatic polyglot beats language purity every time.

  • Gemini QA summaries beat raw logs for operational monitoring.

    A plain-English Telegram message is infinitely more actionable than scrolling through 800-line log files at midnight.

  • Timestamp filtering beats row-level duplicate detection at scale.

    Always know the high-water mark before writing, not after. Applicable to any append-heavy pipeline.

  • Separation of concerns in databases pays compound interest.

    Distinct databases for live, AI, and backtest workloads prevented dozens of hours of query interference.

  • GitHub Actions is a surprisingly capable free production scheduler.

    The 6-hour job timeout forced the pipeline optimization that ultimately made everything faster anyway.