***

OHLCV and Me

2024-08-12

Est. 9m read

Introduction

It’s a bit different than what I normally write about, but it is something I recently had fun with: parsing on-chain transactions and turning them into OHLCV candles for analysis. I dove into this without knowing “the right” way of doing things and was initially focused on results.

Once it was working, I took what I learned and made it more efficient by picking better tools and techniques. First we’ll go over the problem, and then we’ll talk about how to improve the initial solution.

To start off, what is OHLCV data? Here’s a picture:

at the top, if we zoom in, there’s already “O H L C” with the Open, High, Low, and the Closing price for the interval- the interval pictured above is for the 1M (1 month.)

To break it down, within the current period:

The Open price (start price) was $5,537.84
The High price was $5,566.16
The Low price was $5,119.26
The Close price (current price until end of period) is $5,334.15
and a special mention to Volume, who sometimes gets left out

To get the most out of our OHLCV data, it makes sense to store this data across many intervals rather than just the monthly. Different intervals can often tell a different story.

One more thing: OHLC is most commonly plotted with the candlestick chart, the OHLC can all be cleanly visualized with the candles and their wicks. The end goal is to be able to run different kinds of analysis on these OHLC values to identify trends.

Creating a Crypto Price Feed

For the average developer, getting trade data from the stock market is… difficult. It is all handled by a few groups that keep this data on central servers. Getting a real-time feed for that data isn’t cheap either- the NYSE is charging between $500 - $20,000 a month depending on what you need.

On the flip side, most blockchains are public ledgers and those can be seen by anyone with a network connection. We’ll be “subscribing” to the “RPC” to read these transactions as soon as they come in.

What the heck is RPC?

RPC, short for “Remote Procedure Call” is a communication protocol to execute remote functions as if it were local. RPC is preferred to REST when the communication is more about commands or procedure rather than domain-specific CRUD. It is very common when reading and interacting with blockchains.

Blockchain RPCs come in tiers, on one end of the spectrum are free RPC endpoints, and on the other is fully-managed, private RPC nodes with the lowest latency possible. Which one you choose, mostly depends on how fast it needs to be and your budget.

Once we have a connection, we need to listen (or “subscribe”) for transactions that are interacting with decentralized exchanges (a.k.a. DEXs.) From the transactions, we can calculate the current price after each swap is completed pretty easily. Calculating the current price from a transaction is unfortunately out of scope for this post.

So we kind of know how to calculate the price, and we know we have to group the data into different intervals. To solve this problem of grouping data across different intervals or “buckets”, we should make it simple and create a separate map for each interval. We created the OHLCCalculator class to handle this:

interface OHLC {
  open: number;
  high: number;
  low: number;
  close: number;
  startTime: number;
  interval: number;
  tokenAddress: string;
}

export class OHLCCalculator {
  private ohlcData: Record<string, OHLC> = {}; // 1m -> OHLCV data for token
  private interval: number;
...

The responsibility of this class is to record OHLC objects in-memory and then when the period “closes”, to commit that record to the database. Looking back, there’s a lot of room for improvement.

To continue with this OHLCCalculator class, we’ll need a function for updating the high and low value. We’ll also need a way of creating a new candle if one doesn’t already exist. The pseudocode can be thought of as:

Lookup the current candle on the current interval (1m, 1h, etc.)

If no existing data for this interval at the current time:
    Create a new record with:
        - Set all values (OHLC) to the current price.
Else:
    Update the existing record:
        - Set the high price if the current price is greater than the previous high.
        - Set the low price if the current price is lower than the previous low.
        - Always update the closing price to the new price.

Attempt to save or update the record in the database:
    Include fields for the token, start time, and interval.

if you’d like to see the full OHLCCalculator.ts for my initial solution: here.

That is where I started for storing OHLCV data. It isn’t terrible, but there are some big issues:

Storing data in-memory until the period ends is a bad solution
- There are weekly, monthly and yearly intervals to consider
setTimeout is not a reliable way of keeping time
Volume is never mentioned in OHLCCalculator.ts
There needs to be an instance of OHLCCalculator for every token
Using MongoDB (and without the time-series features) seems inefficient

A Better Solution

Once I had a working solution, I started looking for a better solution. After some research, I found that there are some popular database options for “financial data.” Financial data is often just time-series data with frequent read/writes.

kdb+, TimescaleDB, InfluxDB, DuckDB, MongoDB, and Redis were all considered. In the end, I decided to run with TimescaleDB because it’s open-source, built atop of Postgres, and has excellent documentation.

Optimized for financial data?

Yeah, each DB has its own feature-set that would make it better suited for financial data. These features typically fall into one of these camps:

More efficient chunking and compression, data partitioning, in-memory storage, real-time aggregates, indexed time columns, and high write throughput.

The next iteration of the OHLCCalculator is honestly… much simpler. Rather than using code to check if the current price is greater than the previous high, it’s just handled in the query. The volume metric that we have yet to talk about? Handled in the query. Let me just show you this query:

INSERT INTO ohlcv (token_id, time_interval, timestamp, open, high, low, close, volume)
VALUES ($1, $2, $3, $4, $4, $4, $4, $5)
ON CONFLICT (token_id, time_interval, timestamp)
DO UPDATE SET
    high = GREATEST(ohlcv.high, EXCLUDED.high),
    low = LEAST(ohlcv.low, EXCLUDED.low),
    close = EXCLUDED.close,
    volume = ohlcv.volume + EXCLUDED.volume

In the second iteration, this query is executed every time a swap is detected. Each ohlcv row represents the OHLCV data for a given interval (1s, 5m, 1d, etc.) and each one is associated with a token_id.

There’s nothing specific to TimescaleDB in the query above, except that the ohlcv table was created as a hypertable:

-- Convert the OHLCV table to a hypertable
SELECT create_hypertable('ohlcv', 'timestamp', chunk_time_interval => INTERVAL '1 day', if_not_exists => TRUE);

This just means that our ohlcv table will be chunked into daily intervals making for more efficient indices.

Switching from NoSQL to SQL

With the MongoDB solution, we stored the token’s “contract address” alongside the OHLCV data. With the TimescaleDB solution, we create a row for each token and associate the token’s ID with the OHLCV data.

Switching from NoSQL to SQL is not as simple as changing DBs, it can often requires changes in implementation. Domain Driven Design and Natural Keys are two concepts that can make this switch much easier.

But this new query handles pretty much all of the heavy lifting.

The only thing we have to get right is the timestamp. If we think about it, the timestamp kind of needs to be consistent until the next period. We can think of the timestamp as a label for a time bucket.

So how do we make the current timestamp consistent for 15 seconds? Well, we ask GPT to write the function for us! Here’s what it came up with:

function truncateTimestamp(timestamp: Date, interval: string): Date {
  const date = new Date(timestamp);
  switch (interval) {
    case "5s":
      date.setMilliseconds(0);
      date.setSeconds(Math.floor(date.getSeconds() / 5) * 5);
      break;
    case "15s":
      date.setMilliseconds(0);
      date.setSeconds(Math.floor(date.getSeconds() / 15) * 15);
      break;
    case "1m":
      date.setSeconds(0, 0);
      break;
    case "5m":
      date.setMinutes(Math.floor(date.getMinutes() / 5) * 5, 0, 0);
      break;
    case "15m":
      date.setMinutes(Math.floor(date.getMinutes() / 15) * 15, 0, 0);
      break;
    case "1h":
      date.setMinutes(0, 0, 0);
      break;
    case "4h":
      date.setHours(Math.floor(date.getHours() / 4) * 4, 0, 0, 0);
      break;
    case "1d":
      date.setHours(0, 0, 0, 0);
      break;
  }
  return date;
}

I like it. It will work better than the setTimeout we were originally using because this solution is based on the absolute time rather than a relative timeout.

That’s pretty much all it takes to record OHLCV data in real-time. We’ve been using this solution for a few weeks now, and I’m happy with the performance, queryability, and schema that we’ve landed on.

The original solution tracked only one token, the new solution tracks every single token on the exchange with less code. That’s a win.

Once the OHLCV data is recorded, it is trivial to add it to a chart like lightweight-charts.

Real-Time Triggers

One of the last things we’ve got to mention is how to listen for updates. We need some kind of pub/sub mechanism that can trigger updates when a new ohlcv row is inserted.

Luckily, TimescaleDB supports NOTIFY/LISTEN.

We’ll just leverage this built-in Postgres feature rather than adding another dependency. To create the trigger, we just have to add this to the SQL:

-- Create the notification function and trigger
CREATE OR REPLACE FUNCTION notify_new_ohlcv_data() RETURNS TRIGGER AS $$
BEGIN
    PERFORM pg_notify('new_ohlcv_data', row_to_json(NEW)::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ohlcv_insert_trigger
AFTER INSERT ON ohlcv
FOR EACH ROW
EXECUTE FUNCTION notify_new_ohlcv_data();

This makes it really easy to create charts that update in real-time or run trading strategies that listen for real-time OHLCV data.

Closing Notes

Is this the absolute best way of storing OHLCV data? I doubt it. For one, there is redundancy in the data. If you think about it, the 30m interval could be pieced together from the 15m or even the 1s intervals.

Storing anything above the smallest interval (typically “1 tick”) seems to be for convenience-sake and I am curious if I should store both or just the smallest interval.

Sites like TradingView allow you to specify a custom interval and this would only be possible if the data was easy to query per tick.

Pages

Blog