ORDER BY in OLAP: Why Sorting Data Defines Performance (ClickHouse Guide)

OK, so ORDER BY is really important in OLAP tables. Remember, OLAP queries are usually trying to grab and process some large chunk of a huge chunk of data. ORDER BY helps them do that efficiently.

Relevant docs: https://clickhouse.com/docs/sql-reference/statements/select/order-by

Indices in OLTP databases

Probably best to think about this in contrast with the OLTP version of this. In OLTP, queries are usually trying to retrieve one or a few rows of data. So, they use indexes to do this. An index is computed to help efficiently retrieve or sort a set of columns by the values in a field. You can have multiple indexes, and it takes a bunch of compute to generate them. It isn't related to where the data is physically stored, but rather, works as a map to each item.

Think of a grocery store, where items are randomly distributed around the aisles, and the index is a map to each item. Can help you find one item just fine. If you have a few, you might be running around.

Here’s a toy example:

id	item	category	aisle
1	Milk	Dairy	4
2	Bread	Bakery	2
3	Butter	Dairy	4
4	Apple	Produce	1
5	Yoghurt	Dairy	4

You can see how this helps you find one or two items fast — but if you’re shopping for a basket of dairy, you’ll be zig-zagging all over.

ORDER BY in OLAP databases

ORDER BY in OLAP databases physically stores data near each other, by some parameter. It also takes compute to make sure data is inserted correctly into the order. But then retrieval is super quick.

Think back to our grocery store. ORDER BY is sorting these grocery stores into relevant aisles (and then storing the milk in ascending freshness order, if you fancy—but more on that later). It might be quicker to find one item with the map, but if you are trying to find milk, butter and yoghurt, ORDER BY aisle is probably more helpful!

category	item	use_by
Bakery	Bread	2025-10-03
Dairy	Milk	2025-10-01
Dairy	Butter	2025-10-02
Dairy	Yoghurt	2025-10-05
Produce	Apple	2025-10-04

Now all the dairy is together — efficient for any scan, count, or group-by.

And if we want to torture the metaphor just a bit more, we can get two more insights out of it:

ORDER BY would make it take a little longer to put stuff on the shelf (inserts are slightly slower).
What you ORDER BY becomes critical for retrieval efficiency. In our supermarket, we’d be sorting by Category (e.g. Dairy), then perhaps popularity or best-buy date — that’d make it super easy to retrieve the freshest things we need!

(OK, yes, the metaphor breaks down a bit. But to be realistic, it means it’d be super easy to count the number of cartons of milk going off this week. 😉)

ORDER BY, a worked example with ADSB.LOL data

See the benchmark I created earlier in the year with ADSB.LOL data.

There are two primary considerations when setting your ORDER BY fields.

1️⃣ For the first, I'm glad I wrote about what I did the past couple days! The first is a property of the data you are going to order by: its cardinality. The lower the cardinality of the data, the easier it is to skip chunks of data when looking for the data related to the query.

If you have a look at the project I did benchmarking ClickHouse against Postgres, we were dealing with aircraft telemetry data. So we processed the data coming in such that when the aircraft was on the ground, a boolean, on_ground was set to true. We made that the first ORDER BY field, allowing us to skip so much data with the simplest possible computation.

What makes that column a great one to ORDER BY is that the type was super restrictive, so computing that first filter is so cheap. Other small types are also great (like enums and strings or integers wrapped in LowCardinality()).

2️⃣ The second relates to what you want to do with the data (is it used in WHERE statements at the start of your query? That's a good sign).

In that same aircraft use-case, I was trying to compute certain analytics about aircraft in the air — like the relationship between air speed and altitude. That means I was using WHERE on_ground = false as the first clause in my query. Allowing me to completely ignore all the telemetry of aircraft on the ground (a surprising majority of the data!).

Time's a special case

Since I was also filtering by "in the last week", time was a really good ORDER BY field here too. Because it’s more cardinal than the boolean, I made that a secondary ORDER BY field.

Time is a really interesting field in OLAP analysis. It’s high cardinality, but also one of the most common filters. For append-only data (like telemetry or logs), sorting by time first often makes ingestion cheaper and queries faster, the natural order matches insert order.

You can reduce the impact of time’s cardinality by reducing its granularity (even just in how it’s used in the sort). Millisecond precision is more than enough for most analytics, and half the cardinality of nanosecond (each extra digit of precision multiplies distinct values).

This sorts timestamps by minute, grouping nearby events and improving compression. You still query on full precision — you’re just clustering more coarsely.

Heuristics to go by

Rules for cardinality

Agent docs

🐙 https://github.com/oatsandsugar/olap-agent-ref/blob/main/2-order-by.md

Appendix: Partition By

PARTITION BY works at a higher level than ORDER BY — it splits data into parts (files) before sorting within each part. See https://clickhouse.com/docs/engines/table-engines/mergetree-family/custom-partitioning-key

OLAP on Tap: ORDER BY defines your performance

Published Thu, October 2, 2025 ∙ OLAP, ClickHouse ∙ by Johanan Ottensooser