This article outlines how efficient insert patterns differ between OLAP (analytical) and OLTP (transactional) databases, and discusses best practices for optimizing inserts in OLAP (specifically ClickHouse), with code examples using MooseStack to set up a Redpanda streaming buffer in front of a ClickHouse database.
You can set all of this up yourself; MooseStack simply makes it easy to do in code.
Most of us know the multivariate calculus of OLTP write tuning: how many workers to run, whether interactive requests should stick to 1 row per transaction (or a tiny bundle) to keep locks short and reduce contention, and whether background ETL should use modest batches to amortize round-trips and fsyncs. Do we use multi-row INSERTs for higher throughput, or jump to bulk loaders like COPY/LOAD DATA when ingesting files? We’re balancing contention, commit latency/visibility (group commit helps), blast radius on failure, and network/parse overhead.
OLTP databases, being row-oriented, are optimized for small, individual transactions and can parallelize them while maintaining ACID guarantees. Indexes are updated with each write (though what gets reindexed can be optimized), so the more indexes you have, the slower ingestion gets. Inserts don’t lock existing rows; deletes and updates lock only their target rows. So, for bulk inserts, you usually drop or disable the indexes, load all the data, and rebuild the indexes from scratch.
In practice, you do whatever the source app allows on the interactive path, and for ETL you batch to your risk appetite: small enough to keep locks and replica lag tame, big enough to hit your throughput target.
Bottom line for OLTP: optimize for many small updates on the request path, use modest batches off the request path, and parallelize sensibly.
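To make that concrete, here is a minimal sketch using node-postgres and a hypothetical events table: single-row inserts on the request path, modest multi-row batches for background ETL. The table, columns, and batching approach are illustrative.

```typescript
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.DATABASE_URL });

// Request path: one small transaction per interactive write keeps locks short.
async function recordEvent(userId: string, payload: object) {
  await pool.query(
    "INSERT INTO events (user_id, payload) VALUES ($1, $2)",
    [userId, JSON.stringify(payload)]
  );
}

// ETL path: modest multi-row INSERTs amortize round-trips and fsyncs
// without tying locks and replica lag to one giant transaction.
async function loadBatch(rows: { userId: string; payload: object }[]) {
  const values: unknown[] = [];
  const placeholders = rows
    .map((r, i) => {
      values.push(r.userId, JSON.stringify(r.payload));
      return `($${i * 2 + 1}, $${i * 2 + 2})`;
    })
    .join(", ");
  await pool.query(
    `INSERT INTO events (user_id, payload) VALUES ${placeholders}`,
    values
  );
}
```

Past a certain volume you would hand the job to a bulk loader like COPY instead of multi-row INSERTs.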
Column stores write data per column and organize it into immutable parts/segments/files (e.g., parts in ClickHouse). Bigger, well-formed inserts mean fewer parts, less merge work, better compression, and stronger data skipping at query time.
It serves to have a look at what ClickHouse does behind the scenes for a given insert (and I note this is for MergeTree-family engines). Each insert is sorted by the table’s ORDER BY key, compressed column by column, and written out as a new immutable part, which background merges later fold into larger parts. These are operations that benefit from scale: sorting one large, well-formed batch improves data locality and reduces merge work later, instead of paying the same fixed costs over and over for tiny parts.
It seems obvious, then, that bigger batches are better in OLAP. But, like OLTP, there is still some light multivariate calculus to be done, most specifically on latency and freshness (large batches mean less freshness). If a batch outgrows RAM, you can run into max_memory_usage issues (obviously). And finally resilience: blast radius on failure, network replication, and streaming consumer lag are all compromised by larger writes.
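You can watch the parts side of this trade-off in system.parts: lots of tiny inserts means lots of small parts for the background merges to clean up. A quick sketch with the @clickhouse/client Node package (connection URL and table name are placeholders):

```typescript
import { createClient } from "@clickhouse/client";

const client = createClient({ url: "http://localhost:8123" });

// Count active parts for a table: a rough proxy for how much merge
// work your insert pattern is generating.
async function activeParts(table: string) {
  const result = await client.query({
    query: `
      SELECT count() AS part_count, sum(rows) AS total_rows
      FROM system.parts
      WHERE table = {table:String} AND active
    `,
    query_params: { table },
    format: "JSONEachRow",
  });
  return result.json();
}

console.log(await activeParts("events"));
```

Run it before and after an ingest job and you can see whether your batching produced a handful of big parts or thousands of small ones.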
Batch your data. 100k rows is good. Or 1s worth of data. Whatever comes first. (For file-oriented loads, target ~100–512 MB compressed files.)
This allows you to balance freshness, risk and efficiency.
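Mechanically, that rule is just a small buffered writer: accumulate rows, flush when you hit the row threshold or the age threshold, whichever fires first. A sketch with @clickhouse/client (thresholds and table name are illustrative; error handling and retries are omitted):

```typescript
import { createClient } from "@clickhouse/client";

const client = createClient({ url: "http://localhost:8123" });

const MAX_ROWS = 100_000; // size threshold
const MAX_AGE_MS = 1_000; // freshness threshold

let buffer: Record<string, unknown>[] = [];
let timer: NodeJS.Timeout | null = null;

async function flush() {
  if (timer) {
    clearTimeout(timer);
    timer = null;
  }
  if (buffer.length === 0) return;
  const rows = buffer;
  buffer = [];
  // One well-formed insert => one new part in ClickHouse.
  await client.insert({ table: "events", values: rows, format: "JSONEachRow" });
}

export async function write(row: Record<string, unknown>) {
  buffer.push(row);
  if (buffer.length >= MAX_ROWS) {
    await flush(); // size threshold hit first
  } else if (!timer) {
    timer = setTimeout(() => void flush(), MAX_AGE_MS); // otherwise flush on age
  }
}
```

If you cannot batch client-side at all, ClickHouse’s async_insert setting can do similar accumulation on the server.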
How do you achieve this in production, without hand-rolling buffering in every producer? There are in-database options (like ClickHouse’s Buffer table engine), but the market standard is to put a streaming buffer, like Kafka or its drop-in sibling Redpanda, in front of the database.
This gives you a durable buffer that decouples your producers from your database.
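On the producer side, decoupling means the application only ever talks to the topic and emits events at whatever rate it likes. A sketch with kafkajs (Redpanda speaks the Kafka protocol, so the same client works; broker address, topic, and event shape are placeholders):

```typescript
import { Kafka } from "kafkajs";

const kafka = new Kafka({ brokers: ["localhost:9092"] });
const producer = kafka.producer();
await producer.connect();

// The app emits individual events; the topic absorbs them durably,
// and batching into ClickHouse happens downstream.
export async function emit(event: { id: string; ts: string; value: number }) {
  await producer.send({
    topic: "events",
    messages: [{ key: event.id, value: JSON.stringify(event) }],
  });
}
```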
If you are implementing this buffer yourself (and assuming your Kafka/Redpanda is already set up in the same region as your OLAP database), you need to: create a topic, point your producers at it, and run a consumer that accumulates messages and writes them to your ClickHouse table in micro-batches, handling offsets, retries, and failed records along the way. A sketch of that consumer follows.
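The consumer is where the micro-batching happens if you roll it yourself. A sketch with kafkajs and @clickhouse/client, turning each broker-delivered batch into a single insert (offset handling, retries, and schema validation are simplified; names are placeholders):

```typescript
import { Kafka } from "kafkajs";
import { createClient } from "@clickhouse/client";

const kafka = new Kafka({ brokers: ["localhost:9092"] });
const clickhouse = createClient({ url: "http://localhost:8123" });

const consumer = kafka.consumer({ groupId: "clickhouse-writer" });
await consumer.connect();
await consumer.subscribe({ topics: ["events"], fromBeginning: false });

// eachBatch hands us whatever the broker has accumulated; we turn it
// into one INSERT so ClickHouse sees few, well-formed parts.
await consumer.run({
  eachBatch: async ({ batch, resolveOffset, heartbeat }) => {
    const rows = batch.messages
      .filter((m) => m.value !== null)
      .map((m) => JSON.parse(m.value!.toString()));
    if (rows.length > 0) {
      await clickhouse.insert({
        table: "events",
        values: rows,
        format: "JSONEachRow",
      });
    }
    for (const m of batch.messages) resolveOffset(m.offset);
    await heartbeat();
  },
});
```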
MooseStack’s Buffer, OLAP and API modules help you set this up in very few lines of code, whilst maintaining best practices. If you want to follow along live, make sure your Moose dev server is up and running with moose dev (see the quickstart guide).
Step 1: Define your data model in TypeScript or Python. It should map to the data you want to ingest, and it will be used to type the data all the way from ingestion to the ClickHouse table.
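For example, in TypeScript (a sketch: the field names are illustrative, and the Key type is assumed from the @514labs/moose-lib package, so check your installed version for the exact imports):

```typescript
import { Key } from "@514labs/moose-lib";

// Illustrative data model: one row per ingested event.
export interface UserEvent {
  id: Key<string>;   // primary key / ORDER BY candidate
  timestamp: Date;   // event time
  userId: string;
  eventType: string;
  value: number;
}
```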
Step 2: Use the Moose OLAP module to set up your ClickHouse table.
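Roughly like this (again a sketch: the OlapTable class name is an assumption based on MooseStack’s OLAP module, and engine/ordering configuration is left at defaults):

```typescript
import { OlapTable } from "@514labs/moose-lib";
import { UserEvent } from "./models";

// Creates (and migrates) a ClickHouse table typed by the data model.
export const userEventTable = new OlapTable<UserEvent>("UserEvent");
```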
Step 3: Use the Moose Buffer module to set up a streaming buffer.
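Something like the following (a sketch: the Stream class and its destination option are assumptions based on MooseStack’s streaming docs, and your version may name them differently):

```typescript
import { Stream } from "@514labs/moose-lib";
import { UserEvent } from "./models";
import { userEventTable } from "./table";

// A Redpanda-backed stream whose consumer micro-batches records
// into the ClickHouse table defined in Step 2.
export const userEventStream = new Stream<UserEvent>("UserEvent", {
  destination: userEventTable,
});
```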
Step 4: Test using a direct insert.
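A quick smoke test might look like this (a sketch: the insert() helper on the table object is an assumption, and its exact signature may differ by MooseStack version):

```typescript
import { userEventTable } from "./table";

// Insert a test record straight into the ClickHouse table defined above,
// then query it to confirm the pipeline is wired up.
await userEventTable.insert([
  {
    id: "evt_1",
    timestamp: new Date(),
    userId: "user_42",
    eventType: "click",
    value: 1,
  },
]);
```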
There you go! MooseStack creates the streaming buffer for you, synced with the ClickHouse table.
This includes setting up and connecting your topic, consumer, and table. You can optionally set up a dead-letter queue for failed inserts (at the API, at the streaming function, or at the consumer).
Best practices for micro-batching, set up out of the box!
And for bulk file loads, the same logic applies: use a bulk loader like COPY and load many 100–256 MB files in parallel.