When I started building crypto trading strategies, I hit the same wall everyone hits: getting clean, pre-computed technical indicator data at scale is either prohibitively expensive or completely unreliable. Bloomberg Terminal charges thousands per month. Glassnode and Santiment lock the good data behind $500+/mo plans. Free APIs give you raw OHLCV with no indicators, no QA, and no guarantees.
What I actually needed was a production-quality data store I could query in seconds — with momentum indicators, financial ratios, volume analysis, and fundamental metrics all pre-computed for every coin, updated daily, free, and under my own control.
I designed around separation of concerns from day one. Rather than a single monolithic database that grows chaotic over time, I split into four purpose-specific databases: dbcp for live production data, cp_ai for AI-processed analysis outputs, cp_backtest for weekly backtesting runs, and cp_backtest_h for historical backtest snapshots.
For ingestion, I made an unconventional choice: R instead of Python for the OHLCV layer. The crypto2 R package handles 1,000+ coin requests with robust pagination and rate-limit handling that would have taken weeks to replicate in Python. Python then takes over for all indicator computation: vectorized Pandas/NumPy operations that process 100+ indicators per coin in a single DataFrame pass.
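To illustrate the single-pass style, here is a vectorized Wilder RSI, one of the momentum indicators, as it might be computed — a sketch in plain pandas, not the pipeline's exact code:

```python
import pandas as pd

def rsi(close: pd.Series, period: int = 14) -> pd.Series:
    """Wilder's RSI, computed over the whole series with no per-row loop."""
    delta = close.diff()
    gain = delta.clip(lower=0.0)
    loss = -delta.clip(upper=0.0)
    # Wilder smoothing is an exponential moving average with alpha = 1/period
    avg_gain = gain.ewm(alpha=1 / period, min_periods=period, adjust=False).mean()
    avg_loss = loss.ewm(alpha=1 / period, min_periods=period, adjust=False).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

close = pd.Series([100, 101, 102, 101, 103, 104, 103, 105,
                   106, 105, 107, 108, 107, 109, 110, 111], dtype=float)
print(rsi(close).iloc[-1])  # one value per row, whole column in one pass
```

The same shape extends naturally to the other indicator families: each is a few column-wise expressions on the shared OHLCV DataFrame.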
For QA, I didn't want to stare at raw log files. I pipe the QA report into Gemini AI, which returns a plain-English summary of anomalies, and that summary goes straight to Telegram. GitHub Actions runs the whole pipeline daily at 00:05 UTC and weekly backtests every Sunday.
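The hand-off can be sketched with the model call and the alert transport injected as callables, so the flow is testable without network access. In production, `summarize` would wrap a Gemini `generate_content` call and `send` would POST to the Telegram Bot API's `sendMessage` method; both names here are illustrative:

```python
def qa_alert(report: str, summarize, send) -> str:
    """Condense a raw QA report into plain English and push it to an
    alert channel. summarize/send are injected so the flow stays testable:
    in production they'd call Gemini and the Telegram Bot API."""
    prompt = ("Summarize the anomalies in this pipeline QA report "
              "in plain English, in one short paragraph:\n\n" + report)
    summary = summarize(prompt)
    send(summary)
    return summary

# Demo with stand-in callables (no network access needed)
sent = []
qa_alert("null_ratio c_btc_ohlcv: 0.0", lambda p: "All checks passed.", sent.append)
print(sent[0])  # All checks passed.
```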
The pipeline runs in four sequential stages, each gating the next:
┌─────────────────────────────────────────────────────────────┐
│ CRYPTOPRISM-DB PIPELINE │
│ Daily @ 00:05 UTC (GitHub Actions) │
└─────────────────────────────────────────────────────────────┘
STAGE 1: LISTINGS
─────────────────
CoinMarketCap API → Python → dbcp.listings
• Fetch top 1,000+ coins by market cap
• Update slugs, symbols, metadata
• Gate: must complete before OHLCV runs
STAGE 2: OHLCV (R — crypto2 package)
──────────────────────────────────────
R script → PostgreSQL (dbcp)
• Pull 1,000+ coin OHLCV via crypto2
• Timestamp-based dedup filter (pre-write)
• Batch upsert → dbcp.ohlcv_*
• Gate: must have 0 errors before DMV
STAGE 3: DMV (Python — Pandas/NumPy)
──────────────────────────────────────
dbcp.ohlcv_* → Python → dbcp + cp_ai
• 21 Momentum indicators (RSI×5, ROC, Williams %R...)
• 33 Oscillators (MACD, CCI, ADX, TRIX...)
• 23 Financial Ratios (Alpha/Beta vs BTC, Sharpe...)
• 15+ Fundamentals, 33 Volume indicators
• DMV Scores: D:19.8 M:-30.8 V:-24.4
• Output → cp_ai for AI analysis
STAGE 4: QA (Python + Gemini AI + Telegram)
─────────────────────────────────────────────
Run health checks → Gemini summary → Telegram
• Null ratio checks, range validation
• Duplicate detection sweep
• Gemini: plain-English anomaly summary
• Score: 100/100 Uptime: 99.9%
• Alert → Telegram channel
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────────┐
│ LISTINGS │──▶│ OHLCV │──▶│ DMV │──▶│ QA │
└──────────┘ └──────────┘ └──────────┘ └──────────────┘
│ │ │ │
dbcp dbcp dbcp Telegram
cp_ai
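The gating between stages can be sketched as a small orchestrator: each stage is a subprocess, and a non-zero exit code stops the run before the next stage starts. The script names below are hypothetical stand-ins for the actual entry points:

```python
import subprocess

def run_pipeline(stages) -> None:
    """Run stages in order; any non-zero exit code halts the pipeline,
    so a failed stage gates everything downstream."""
    for name, cmd in stages:
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError(f"stage '{name}' failed:\n{result.stderr}")
        print(f"stage '{name}' ok")

# Illustrative entry points, not the repository's real file names
STAGES = [
    ("listings", ["python", "stage_listings.py"]),
    ("ohlcv",    ["Rscript", "stage_ohlcv.R"]),
    ("dmv",      ["python", "stage_dmv.py"]),
    ("qa",       ["python", "stage_qa.py"]),
]
# run_pipeline(STAGES)  # invoked by the GitHub Actions daily job
```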
At 1,000+ coins with daily OHLCV inserts, naive per-row duplicate detection made the pipeline run for 40+ minutes, sometimes timing out against GitHub Actions' 6-hour job limit after weeks of data accumulation: a SELECT EXISTS per row is O(n) database round trips. Solution: fetch each table's MAX(timestamp) in a single query, drop already-stored rows from the batch in memory, and upsert only what remains. One round trip instead of n.
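The fix that scales is a high-water-mark filter: query each table's MAX(timestamp) once, then drop already-stored rows in memory before the batch upsert. A minimal sketch, with the column name and DataFrame shape illustrative:

```python
import pandas as pd

def filter_new_rows(df: pd.DataFrame, high_water) -> pd.DataFrame:
    """Keep only rows newer than the stored high-water mark.

    high_water comes from one query per table, e.g.
    SELECT MAX("timestamp") FROM "c_btc_ohlcv" -- one round trip, not one per row
    """
    if high_water is None:  # empty table: everything is new
        return df
    return df[df["timestamp"] > high_water]

batch = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-01-01", "2024-01-02", "2024-01-03"]),
    "close": [42000.0, 42500.0, 43000.0],
})
fresh = filter_new_rows(batch, pd.Timestamp("2024-01-02"))
print(len(fresh))  # 1: only the 2024-01-03 row survives
```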
Many cryptocurrency symbols start with a number (e.g. 1inch, 3crv), and PostgreSQL table names cannot start with a digit unless quoted. This caused silent failures when the ORM generated unquoted DDL statements. Solution: prefix all coin-derived table names with c_ and enforce double-quoting in all dynamic SQL generation. A validation step in the LISTINGS stage now flags any symbol requiring special handling before the OHLCV stage runs.
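A minimal sketch of the sanitization helpers — function names are hypothetical, but the c_ prefix and double-quoting rules follow the fix described above:

```python
import re

def table_name_for(symbol: str) -> str:
    """Derive a safe PostgreSQL table name from a coin symbol.

    Symbols like '1inch' or '3crv' start with a digit, which PostgreSQL
    rejects in unquoted identifiers; the c_ prefix sidesteps that entirely.
    """
    safe = re.sub(r"[^a-z0-9_]", "_", symbol.lower())
    return f"c_{safe}"

def quoted(identifier: str) -> str:
    # Always double-quote in dynamic DDL, escaping embedded quotes
    return '"' + identifier.replace('"', '""') + '"'

print(quoted(table_name_for("1INCH")))  # "c_1inch"
```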
Calling R scripts from Python via subprocess introduces its own failure modes: R crashes don't propagate as Python exceptions by default, R output pollutes the Python log stream, and environment differences between local and CI runners cause package version conflicts. Solution: wrap every R call in a Python class that captures stdout/stderr separately, checks return codes, and raises typed Python exceptions on failure. The R environment is pinned via renv.lock, and GitHub Actions caches the renv library between runs.
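A condensed sketch of that wrapper: output streams are captured separately and a non-zero exit becomes a typed exception. The `interpreter` parameter exists only to make the example testable without R installed; in the pipeline it is always Rscript.

```python
import subprocess

class RScriptError(RuntimeError):
    """Typed exception raised when an R stage fails."""

def run_r_script(path: str, *args: str, interpreter: str = "Rscript",
                 timeout: int = 3600) -> str:
    """Run an R script in a subprocess, keeping its output out of the
    Python log stream and turning crashes into real Python exceptions."""
    result = subprocess.run(
        [interpreter, path, *args],
        capture_output=True, text=True, timeout=timeout,
    )
    if result.returncode != 0:
        raise RScriptError(
            f"{path} exited {result.returncode}:\n{result.stderr.strip()}"
        )
    return result.stdout
```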
Running health checks on dbcp while the DMV stage is still writing to it risks reading partial data and generating false-positive anomaly alerts. Solution: the QA stage runs under PostgreSQL's REPEATABLE READ transaction isolation. DMV writes continue in parallel without blocking, QA sees a stable snapshot, and the false positives disappear.
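A sketch of how the QA stage might pin its snapshot, assuming a psycopg2 connection (whose set_session accepts an isolation level and a read-only flag) and illustrative table/column names:

```python
def run_qa_checks(conn) -> dict:
    """Run all QA queries inside one REPEATABLE READ, read-only transaction,
    so every check sees the same snapshot even while DMV keeps writing.
    conn is assumed to be a psycopg2 connection."""
    conn.set_session(isolation_level="REPEATABLE READ", readonly=True)
    results = {}
    with conn.cursor() as cur:
        # Table and column names are illustrative
        cur.execute("SELECT count(*) FROM c_btc_ohlcv WHERE close IS NULL")
        results["null_closes"] = cur.fetchone()[0]
        cur.execute("SELECT count(*) FROM c_btc_ohlcv WHERE high < low")
        results["inverted_ranges"] = cur.fetchone()[0]
    conn.commit()  # end the snapshot; nothing was written
    return results
```

Because both queries run in the same transaction, the null-ratio and range checks can never disagree about which rows exist.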
After reaching the v4.4.0 production release, these are the lessons I'd carry forward:
- R is genuinely better than Python for certain financial data ops. The crypto2 package's pagination logic saved weeks of API-client work. Pragmatic polyglot beats language purity every time.
- Gemini QA summaries beat raw logs for operational monitoring. A plain-English Telegram message is infinitely more actionable than scrolling through 800-line log files at midnight.
- Timestamp filtering beats row-level duplicate detection at scale. Always know the high-water mark before writing, not after. This applies to any append-heavy pipeline.
- Separation of concerns in databases pays compound interest. Distinct databases for live, AI, and backtest workloads prevented dozens of hours of query interference.
- GitHub Actions is a surprisingly capable free production scheduler. The 6-hour job timeout forced the pipeline optimisation that ultimately made everything faster anyway.