Like the previous few articles, this article will show how the fundamental differences in the implementation and purpose of OLAP databases make working with them different from what we're used to in other paradigms.
Here, the two relevant differences for OLAP databases like ClickHouse and DuckDB are that they:

- store data by column rather than by row, and
- are built for large scans and aggregations rather than row-level transactions.

This article will talk specifically about how these two differences (which are really two sides of the same coin) affect data modelling and schema design.
When we think about more traditional databases, we immediately think of third normal form for OLTP, and star schemas for more traditional analytical use-cases. Both were designed to reduce storage duplication and to ensure the correctness (immediate correctness, that is) and integrity of the data.
This made sense because storage was expensive, and for the type of queries you were running, joins were relatively cheap: you'd grab the few rows you wanted to enrich and compute joins against only those. So why not save a bunch of money by storing your data more efficiently? Makes sense!
This is a pretty great fit for a system where transactions, atomicity, and reading, editing, or deleting individual rows are important. E.g. if you have 10k rows for "Joj" as a customer, and he lets you know he prefers to be known as "Johanan", you edit one row in one table, and all 10k rows are effectively "edited".
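A minimal sketch of that normalization win, using Python's standard-library `sqlite3` (a row store, but the schema idea is the same); the table and column names here are illustrative, not from any real system:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (customer_key INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE orders (order_id INTEGER, customer_key INTEGER)")
con.execute("INSERT INTO customers VALUES (10, 'Joj')")
# 10k orders all reference the one customer row.
con.executemany("INSERT INTO orders VALUES (?, 10)", [(i,) for i in range(10_000)])

# One UPDATE touches one row; every order now resolves to the new name.
con.execute("UPDATE customers SET name = 'Johanan' WHERE customer_key = 10")
names = con.execute(
    "SELECT DISTINCT c.name FROM orders o JOIN customers c USING (customer_key)"
).fetchall()
print(names)  # [('Johanan',)]
```

The name lives in exactly one place, so it can never disagree with itself; that is the "immediate correctness" normalization buys you.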
If you want to sum across a few hundred thousand rows, though, a star schema would make your query orders of magnitude more expensive.
So, what does a star schema actually look like? (See the `fact_orders` example and its flattened equivalent in the tables below.)
As discussed in yesterday’s session, ClickHouse, DuckDB, and many other OLAP databases are columnar. So they follow different design guidelines, preferring fewer, wider tables.
Why? Partly because in a columnar database, scans are cheap (if you get your `ORDER BY` right; see yesterday’s article) and joins are expensive. There’s a bit more going on underneath the hood, specifically with the way OLAP databases store data, scan data, and compute over data. When you aggregate, say, revenue by state over a date range, the engine reads only the `date`, `state`, and `total_amount` columns, ignoring all other columns. The fact that you added a `category` column wouldn’t affect the computation speed.

How should you organize your table?
Negatives: data duplication, so more storage.

Positives: no joins at query time, so scans and aggregations stay fast.

In analytical use-cases, this positive completely outweighs the increased cost of data duplication.
You can store less frequently used data in separate tables, but if a column appears in your `WHERE`, `GROUP BY`, or `SELECT`, it probably deserves to be part of the denormalized table (and depending on query frequency, it might also deserve to be part of the `ORDER BY`).
We’ll talk more about JSON and unpredictably nested data in the future; for now, it is sufficient to note that flattened data is almost always more efficient in OLAP use-cases than nested data. This can increase the complexity of ingestion (you need to flatten the data as you ingest it), but ClickHouse has tooling for this: if your nested JSON is predictable and typed, ClickHouse can present the data as nested and infer/implement sub-columns automatically using its JSON type.
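As a sketch of the flatten-at-ingest step, here's a small pure-standard-library Python helper (the event structure and key names are hypothetical, just to show the transformation):

```python
import json

def flatten(obj, prefix=""):
    """Collapse nested dicts into a single level, joining key names with '_'."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

event = json.loads(
    '{"order_id": 1, "customer": {"name": "Alice", "address": {"state": "OR"}}}'
)
print(flatten(event))
# {'order_id': 1, 'customer_name': 'Alice', 'customer_address_state': 'OR'}
```

Each nested field becomes its own top-level column (`customer_address_state` etc.), which is exactly the shape a wide, flat OLAP table wants.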
In OLAP, your schema is your query plan.
Every design choice (what you flatten, how you order, what you precompute) shapes how fast the engine can retrieve and aggregate.
Storage is cheap. Compute isn’t. Design for the queries you’ll run most often, not the data you’ll edit.
🐙 https://github.com/oatsandsugar/olap-agent-ref/blob/main/0-schema-design.md
At this small scale, the storage benefits don’t favor star schemas, but at greater scale (and with more repetition), they become more and more space-efficient.
`fact_orders`
order_id | order_date_key | customer_key | product_key | quantity | total_amount |
---|---|---|---|---|---|
1 | 20250101 | 10 | 100 | 2 | 40.00 |
2 | 20250101 | 11 | 101 | 1 | 15.00 |
3 | 20250102 | 10 | 102 | 3 | 90.00 |
`dim_customer`
customer_key | customer_name | city | state |
---|---|---|---|
10 | Alice | Portland | OR |
11 | Bob | Seattle | WA |
`dim_product`
product_key | product_name | category | unit_price |
---|---|---|---|
100 | Coffee Mug | Kitchen | 20.00 |
101 | T-Shirt | Apparel | 15.00 |
102 | Water Bottle | Fitness | 30.00 |
`dim_date`
date_key | date | day_of_week | month | year |
---|---|---|---|---|
20250101 | 2025-01-01 | Wed | January | 2025 |
20250102 | 2025-01-02 | Thu | January | 2025 |
`orders_flat`
order_id | date | day_of_week | month | year | customer_name | city | state | product_name | category | unit_price | quantity | total_amount |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2025-01-01 | Wed | January | 2025 | Alice | Portland | OR | Coffee Mug | Kitchen | 20.00 | 2 | 40.00 |
2 | 2025-01-01 | Wed | January | 2025 | Bob | Seattle | WA | T-Shirt | Apparel | 15.00 | 1 | 15.00 |
3 | 2025-01-02 | Thu | January | 2025 | Alice | Portland | OR | Water Bottle | Fitness | 30.00 | 3 | 90.00 |
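The two shapes can be queried side by side. A runnable sketch with standard-library `sqlite3` (again a row store, so this demonstrates the difference in query shape — join vs. single scan — not the columnar performance gap), loaded with the sample rows from the tables above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_orders (order_id INT, order_date_key INT, customer_key INT,
                          product_key INT, quantity INT, total_amount REAL);
CREATE TABLE dim_customer (customer_key INT, customer_name TEXT, city TEXT, state TEXT);
CREATE TABLE orders_flat (order_id INT, date TEXT, state TEXT, total_amount REAL);
INSERT INTO fact_orders VALUES (1, 20250101, 10, 100, 2, 40.0),
                               (2, 20250101, 11, 101, 1, 15.0),
                               (3, 20250102, 10, 102, 3, 90.0);
INSERT INTO dim_customer VALUES (10, 'Alice', 'Portland', 'OR'),
                                (11, 'Bob', 'Seattle', 'WA');
INSERT INTO orders_flat VALUES (1, '2025-01-01', 'OR', 40.0),
                               (2, '2025-01-01', 'WA', 15.0),
                               (3, '2025-01-02', 'OR', 90.0);
""")

# Star schema: revenue by state needs a join before it can aggregate.
star = con.execute("""
    SELECT c.state, SUM(f.total_amount)
    FROM fact_orders f JOIN dim_customer c ON f.customer_key = c.customer_key
    GROUP BY c.state ORDER BY c.state
""").fetchall()

# Flat table: same answer from a single scan, no join.
flat = con.execute("""
    SELECT state, SUM(total_amount) FROM orders_flat
    GROUP BY state ORDER BY state
""").fetchall()

print(star == flat, flat)  # True [('OR', 130.0), ('WA', 15.0)]
```

Same result either way; the flat version is the one a columnar engine can answer with a straight scan of two columns.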
Aspect | OLTP (e.g. Postgres) | OLAP (e.g. ClickHouse) |
---|---|---|
Primary Goal | Transaction integrity, fast row-level updates | Analytical speed, fast columnar scans |
Workload | Many small reads/writes | Fewer, large aggregations |
Schema Design | Normalized | Denormalized (Wide, Flat) |
Data Access | Point lookups (by PK / index) | Range scans, GROUP BY, aggregations |
Joins | Common, cheap | Expensive, avoid when possible |
Mutation Pattern | Frequent UPDATE / DELETE | Append-only, periodic rebuilds |
Optimization Strategy | Indexing, normalization | Clustering, compression, partitioning |
Duplication | Avoid duplication | Embrace duplication for speed |
Schema Evolution | Flexible, mutable | Write-once, evolve with ETL |