Like the previous few articles, this article will show how the fundamental differences in the implementation and purpose of OLAP databases make working with them different from what we're used to in other paradigms.
Here, the two relevant differences for OLAP databases like ClickHouse and DuckDB are that they:

- store data by column rather than by row, and
- are built for large scans and aggregations rather than row-level transactions.

This article will talk specifically about how these two differences (which are really two sides of the same coin) affect data modelling and schema design.
When we think about more traditional databases, we immediately think of third normal form for OLTP, and star schemas for more traditional analytical use-cases. Both were designed to reduce storage duplication and to ensure the correctness (immediate correctness, that is) and integrity of the data.
This made sense because storage was expensive, and for the type of queries you were running, joins were relatively cheap: you'd grab the few rows you wanted to enrich and compute joins against only those. So why not save a bunch of money by storing your data more efficiently? Makes sense!
This is a pretty great fit for a system where transactions, atomicity, and reading, editing, or deleting individual rows are important. E.g. if you have 10k rows for "Joj" as a customer, and he lets you know he prefers to be known as "Johanan", you edit one row in one table, and all 10k rows are effectively "edited".
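A minimal sketch of that normalization win, using Python's standard-library `sqlite3` (a row store, but the schema idea is the same); the table and column names here are illustrative, not from any real system:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (customer_key INTEGER PRIMARY KEY, name TEXT)")
con.execute("CREATE TABLE orders (order_id INTEGER, customer_key INTEGER)")
con.execute("INSERT INTO customers VALUES (10, 'Joj')")
# 10k orders all reference the one customer row.
con.executemany("INSERT INTO orders VALUES (?, 10)", [(i,) for i in range(10_000)])

# One UPDATE touches one row; every order now resolves to the new name.
con.execute("UPDATE customers SET name = 'Johanan' WHERE customer_key = 10")
names = con.execute(
    "SELECT DISTINCT c.name FROM orders o JOIN customers c USING (customer_key)"
).fetchall()
print(names)  # [('Johanan',)]
```

The name lives in exactly one place, so it can never disagree with itself; that is the "immediate correctness" normalization buys you.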
If you want to sum across a few hundred thousand rows, though, a star schema would make your query orders of magnitude more expensive.
So, what does a star schema actually look like? (See the `fact_orders` example and its flattened equivalent in the tables below.)
As discussed in yesterday’s session, ClickHouse, DuckDB, and many other OLAP databases are columnar. So they follow different design guidelines, preferring fewer, wider tables.
Why? Partly because in a columnar database, scans are cheap (if you get your `ORDER BY` right; see yesterday’s article) and joins are expensive. There’s a bit more going on underneath the hood, specifically with the way OLAP databases store data, scan data, and compute over data. When you aggregate, say, revenue by state over a date range, the engine reads only the `date`, `state`, and `total_amount` columns, ignoring all other columns. The fact that you added a `category` column wouldn’t affect the computation speed.

How should you organize your table?
Negatives: data duplication, so more storage.

Positives: no joins at query time, so scans and aggregations stay fast.

In analytical use-cases, this positive completely outweighs the increased cost of data duplication.
You can store less frequently used data in separate tables, but if a column appears in your `WHERE`, `GROUP BY`, or `SELECT`, it probably deserves to be part of the denormalized table (and depending on query frequency, it might also deserve to be part of the `ORDER BY`).
We’ll talk more about JSON and unpredictably nested data in the future; for now, it is sufficient to note that flattened data is almost always more efficient in OLAP use-cases than nested data. This can increase the complexity of ingestion (you need to flatten the data as you ingest it), but ClickHouse has tooling for this: if your nested JSON is predictable and typed, ClickHouse can present the data as nested and infer/implement sub-columns automatically using its JSON type.
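As a sketch of the flatten-at-ingest step, here's a small pure-standard-library Python helper (the event structure and key names are hypothetical, just to show the transformation):

```python
import json

def flatten(obj, prefix=""):
    """Collapse nested dicts into a single level, joining key names with '_'."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat

event = json.loads(
    '{"order_id": 1, "customer": {"name": "Alice", "address": {"state": "OR"}}}'
)
print(flatten(event))
# {'order_id': 1, 'customer_name': 'Alice', 'customer_address_state': 'OR'}
```

Each nested field becomes its own top-level column (`customer_address_state` etc.), which is exactly the shape a wide, flat OLAP table wants.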
In OLAP, your schema is your query plan.
Every design choice (what you flatten, how you order, what you precompute) shapes how fast the engine can retrieve and aggregate.
Storage is cheap. Compute isn’t. Design for the queries you’ll run most often, not the data you’ll edit.
🐙 https://github.com/oatsandsugar/olap-agent-ref/blob/main/0-schema-design.md
At this small scale, the storage benefits don’t favor star schemas, but at greater scale (and with more repetition), they become more and more space-efficient.
`fact_orders`
order_id | order_date_key | customer_key | product_key | quantity | total_amount |
---|---|---|---|---|---|
1 | 20250101 | 10 | 100 | 2 | 40.00 |
2 | 20250101 | 11 | 101 | 1 | 15.00 |
3 | 20250102 | 10 | 102 | 3 | 90.00 |
`dim_customer`
customer_key | customer_name | city | state |
---|---|---|---|
10 | Alice | Portland | OR |
11 | Bob | Seattle | WA |
`dim_product`
product_key | product_name | category | unit_price |
---|---|---|---|
100 | Coffee Mug | Kitchen | 20.00 |
101 | T-Shirt | Apparel | 15.00 |
102 | Water Bottle | Fitness | 30.00 |
`dim_date`
date_key | date | day_of_week | month | year |
---|---|---|---|---|
20250101 | 2025-01-01 | Wed | January | 2025 |
20250102 | 2025-01-02 | Thu | January | 2025 |
`orders_flat`
order_id | date | day_of_week | month | year | customer_name | city | state | product_name | category | unit_price | quantity | total_amount |
---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 2025-01-01 | Wed | January | 2025 | Alice | Portland | OR | Coffee Mug | Kitchen | 20.00 | 2 | 40.00 |
2 | 2025-01-01 | Wed | January | 2025 | Bob | Seattle | WA | T-Shirt | Apparel | 15.00 | 1 | 15.00 |
3 | 2025-01-02 | Thu | January | 2025 | Alice | Portland | OR | Water Bottle | Fitness | 30.00 | 3 | 90.00 |
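The two shapes can be queried side by side. A runnable sketch with standard-library `sqlite3` (again a row store, so this demonstrates the difference in query shape — join vs. single scan — not the columnar performance gap), loaded with the sample rows from the tables above:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE fact_orders (order_id INT, order_date_key INT, customer_key INT,
                          product_key INT, quantity INT, total_amount REAL);
CREATE TABLE dim_customer (customer_key INT, customer_name TEXT, city TEXT, state TEXT);
CREATE TABLE orders_flat (order_id INT, date TEXT, state TEXT, total_amount REAL);
INSERT INTO fact_orders VALUES (1, 20250101, 10, 100, 2, 40.0),
                               (2, 20250101, 11, 101, 1, 15.0),
                               (3, 20250102, 10, 102, 3, 90.0);
INSERT INTO dim_customer VALUES (10, 'Alice', 'Portland', 'OR'),
                                (11, 'Bob', 'Seattle', 'WA');
INSERT INTO orders_flat VALUES (1, '2025-01-01', 'OR', 40.0),
                               (2, '2025-01-01', 'WA', 15.0),
                               (3, '2025-01-02', 'OR', 90.0);
""")

# Star schema: revenue by state needs a join before it can aggregate.
star = con.execute("""
    SELECT c.state, SUM(f.total_amount)
    FROM fact_orders f JOIN dim_customer c ON f.customer_key = c.customer_key
    GROUP BY c.state ORDER BY c.state
""").fetchall()

# Flat table: same answer from a single scan, no join.
flat = con.execute("""
    SELECT state, SUM(total_amount) FROM orders_flat
    GROUP BY state ORDER BY state
""").fetchall()

print(star == flat, flat)  # True [('OR', 130.0), ('WA', 15.0)]
```

Same result either way; the flat version is the one a columnar engine can answer with a straight scan of two columns.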
Aspect | OLTP (e.g. Postgres) | OLAP (e.g. ClickHouse) |
---|---|---|
Primary Goal | Transaction integrity, fast row-level updates | Analytical speed, fast columnar scans |
Workload | Many small reads/writes | Fewer, large aggregations |
Schema Design | Normalized | Denormalized (Wide, Flat) |
Data Access | Point lookups (by PK / index) | Range scans, GROUP BY, aggregations |
Joins | Common, cheap | Expensive, avoid when possible |
Mutation Pattern | Frequent UPDATE / DELETE | Append-only, periodic rebuilds |
Optimization Strategy | Indexing, normalization | Clustering, compression, partitioning |
Duplication | Avoid duplication | Embrace duplication for speed |
Schema Evolution | Flexible, mutable | Write-once, evolve with ETL |