OLAP on Tap: The Art of Letting Go (of Normalization)

Published Fri, October 3, 2025 ∙ OLAP, ClickHouse ∙ by Johanan Ottensooser

Like the previous few articles, this article will show how the fundamental differences in the implementation and purpose of OLAP databases make working with the database different from what we’re used to in other paradigms.

Here, OLAP databases like ClickHouse and DuckDB:

  • Per the name, are analytically focused, not transactionally focused: they want to retrieve large batches of rows from huge datasets and run calculations over them, rather than retrieve or amend specific rows.
  • Are columnar, focusing on retrieving many values of one column and doing calculations against those, rather than retrieving whole rows (even where parts of those rows aren’t related to the analytics you are doing).

This article will talk specifically about how these two differences (which are really two sides of the same coin) affect data modelling and schema design.

Now you’re just a star schema that I used to know

So, when we think about more traditional databases, we immediately think of third normal form for OLTP, and star schemas for classic analytical use-cases.

These designs were built around reducing storage duplication and ensuring the correctness (immediate correctness, that is) and integrity of the data.

A quick refresher on star schemas

  • You have a central “fact table” (orders, transactions) that references other tables for its dimensions
  • You have many “dimension tables” (users, products) that color the fact table
  • To do almost anything, you need to join your fact and dimension tables (often with many joins), as sketched below
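
Using the toy tables from the appendix, here’s a sketch of what even a simple question costs in a star schema (the SQL is illustrative, not tied to any particular engine):

```sql
-- "How much did each Oregon customer spend in January 2025?"
-- Three joins are needed before any aggregation can happen.
SELECT
    c.customer_name,
    SUM(f.total_amount) AS total_spent
FROM fact_orders AS f
JOIN dim_customer AS c ON f.customer_key   = c.customer_key
JOIN dim_date     AS d ON f.order_date_key = d.date_key
WHERE c.state = 'OR'
  AND d.month = 'January'
  AND d.year  = 2025
GROUP BY c.customer_name;
```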

This made sense because storage was expensive and, for the type of queries you were doing, joins were relatively cheap. You’d grab the few rows you wanted to enrich, and only compute joins against them. So why not save a bunch of money by storing your data more efficiently? Makes sense!

This is a pretty great design for a system where transactions, atomicity, and reading, editing, or deleting individual rows are important. E.g. if you have 10k rows for “Joj” as a customer, and he lets you know he prefers to be known as “Johanan”, you edit one row in one table, and all 10k rows are “edited”.
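
A sketch of that rename in standard SQL (assuming the dim_customer table from the appendix held a customer named “Joj”):

```sql
-- One row changes in the dimension table; every fact row that
-- references this customer_key now "sees" the new name via the join.
UPDATE dim_customer
SET customer_name = 'Johanan'
WHERE customer_name = 'Joj';
```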

If you want to be able to sum across a few hundred thousand rows though, a star schema would make your query orders of magnitude more expensive.

So, star schemas:

  • Low storage costs
  • High efficiency for row retrieval and use-cases that require mutability
  • Higher compute costs because of required joins, especially for large analytical queries

The OLAP approach

As discussed in yesterday’s article, ClickHouse, DuckDB and many other OLAP databases are columnar. So they follow different design guidelines, preferring fewer, wider tables.

Why?

  • Storage is cheap
  • Compression is highly efficient (especially for low cardinality columns)
  • Joins are expensive
  • Scans are lightning fast if you do your ORDER BY right (see yesterday’s article).

There’s a bit more going on underneath the hood, specifically with the way OLAP databases store data, scan data, and compute over data.

  1. The database engine doesn’t need to look at columns that aren’t part of the query, so making the table wider doesn’t increase the cost of a scan or computation (e.g. if you are looking for customers in Portland in the past month, and how much they spent, you’d only look at date, city, and total_amount, ignoring all other columns; the fact that you added a category column wouldn’t affect the computation speed).
  2. Once the database engine finds the right “section” of the column, it can use low-level CPU optimizations to increase computation efficiency: SIMD (“Single Instruction, Multiple Data”) lets the CPU apply one instruction to many values at once, so columnar storage allows the engine to process thousands of values per CPU tick.
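
To make point 1 concrete, here’s what the Portland query might look like against the flat orders_flat table from the appendix (a sketch in ClickHouse SQL; the column names come from that toy example):

```sql
-- The engine reads only the date, city, and total_amount columns.
-- orders_flat could have 200 other columns; none of them are scanned.
SELECT SUM(total_amount) AS portland_spend
FROM orders_flat
WHERE city = 'Portland'
  AND date >= today() - INTERVAL 1 MONTH;
```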

OLAP schema design heuristics

How should you organize your table?

  • Fewer tables, fewer joins
  • Precompute relationships
  • Wider tables, more duplication
  • Flatter tables, less nesting
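
A minimal ClickHouse sketch of such a wide, flat table, mirroring the orders_flat example in the appendix (the types and the ORDER BY choice are assumptions for illustration):

```sql
CREATE TABLE orders_flat
(
    order_id      UInt64,
    date          Date,
    day_of_week   LowCardinality(String),  -- low-cardinality columns compress extremely well
    month         LowCardinality(String),
    year          UInt16,
    customer_name String,
    city          LowCardinality(String),
    state         LowCardinality(String),
    product_name  String,
    category      LowCardinality(String),
    unit_price    Decimal(10, 2),
    quantity      UInt32,
    total_amount  Decimal(10, 2)
)
ENGINE = MergeTree
ORDER BY (city, date);  -- sort to match the filters you query most often
```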

Negatives:

  • This increases storage duplication (but not by as much as you think, because of extremely efficient compression!)
  • This increases the cost and difficulty of editing and deleting rows
  • Increases ingest complexity (depending on the source format, you must enrich data on ingest rather than at query time; see the sketch below)

Positives:

  • This drastically reduces the cost and time for retrieving and calculating over data.

In analytical use-cases, this positive completely outweighs the increased cost of data duplication.
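
As flagged in the negatives above, enrichment moves to ingest time. A sketch of what that could look like, assuming a hypothetical staging_orders table alongside the dimension tables from the appendix:

```sql
-- Pay the join cost once at ingest, so every later query is join-free.
INSERT INTO orders_flat
SELECT
    s.order_id,
    d.date, d.day_of_week, d.month, d.year,
    c.customer_name, c.city, c.state,
    p.product_name, p.category, p.unit_price,
    s.quantity,
    s.total_amount
FROM staging_orders AS s  -- staging_orders is a hypothetical raw landing table
JOIN dim_customer AS c ON s.customer_key   = c.customer_key
JOIN dim_product  AS p ON s.product_key    = p.product_key
JOIN dim_date     AS d ON s.order_date_key = d.date_key;
```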

Exceptions

You can store lesser-used data in separate tables, but if a column appears in WHERE, GROUP BY, or SELECT, it probably deserves to be part of the denormalized table (depending on the frequency, it might also deserve to be part of the ORDER BY).

Nesting

We’ll talk more about JSON and unpredictably nested data in the future, but for now it is sufficient to note that flattened data is almost always more efficient in OLAP use-cases than nested data. This can increase the complexity of ingestion (you need to flatten the data as you ingest it). ClickHouse has some tools for this: if your nested JSON is predictable and typed, its JSON type can present the data as nested while inferring and storing typed sub-columns automatically.
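
A hedged sketch of that JSON type in action (available in recent ClickHouse versions; treat the exact syntax as an assumption and check the ClickHouse docs):

```sql
CREATE TABLE events
(
    event_id UInt64,
    payload  JSON  -- ClickHouse infers typed sub-columns for each JSON path
)
ENGINE = MergeTree
ORDER BY event_id;

INSERT INTO events VALUES (1, '{"user": {"name": "Alice", "city": "Portland"}}');

-- Sub-columns can be addressed directly and are stored columnar under the hood.
SELECT payload.user.city FROM events;
```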

Conclusion: your schema is your query plan

In OLAP, your schema is your query plan.

Every design choice (what you flatten, how you order, what you precompute) shapes how fast the engine can retrieve and aggregate.

Storage is cheap. Compute isn’t. Design for the queries you’ll run most often, not the data you’ll edit.

Agent docs

🐙 https://github.com/oatsandsugar/olap-agent-ref/blob/main/0-schema-design.md

Appendices

Star schema toy example

At this scale, the storage numbers don’t favor star schemas, but at greater scale (and with more repetition), they become more and more space-efficient.

fact_orders

| order_id | order_date_key | customer_key | product_key | quantity | total_amount |
|---|---|---|---|---|---|
| 1 | 20250101 | 10 | 100 | 2 | 40.00 |
| 2 | 20250101 | 11 | 101 | 1 | 15.00 |
| 3 | 20250102 | 10 | 102 | 3 | 90.00 |

dim_customer

| customer_key | customer_name | city | state |
|---|---|---|---|
| 10 | Alice | Portland | OR |
| 11 | Bob | Seattle | WA |

dim_product

| product_key | product_name | category | unit_price |
|---|---|---|---|
| 100 | Coffee Mug | Kitchen | 20.00 |
| 101 | T-Shirt | Apparel | 15.00 |
| 102 | Water Bottle | Fitness | 30.00 |

dim_date

| date_key | date | day_of_week | month | year |
|---|---|---|---|---|
| 20250101 | 2025-01-01 | Wed | January | 2025 |
| 20250102 | 2025-01-02 | Thu | January | 2025 |

Flat schema of the same data

orders_flat

| order_id | date | day_of_week | month | year | customer_name | city | state | product_name | category | unit_price | quantity | total_amount |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2025-01-01 | Wed | January | 2025 | Alice | Portland | OR | Coffee Mug | Kitchen | 20.00 | 2 | 40.00 |
| 2 | 2025-01-01 | Wed | January | 2025 | Bob | Seattle | WA | T-Shirt | Apparel | 15.00 | 1 | 15.00 |
| 3 | 2025-01-02 | Thu | January | 2025 | Alice | Portland | OR | Water Bottle | Fitness | 30.00 | 3 | 90.00 |

Schema goals comparison table

| Aspect | OLTP (e.g. Postgres) | OLAP (e.g. ClickHouse) |
|---|---|---|
| Primary Goal | Transaction integrity, fast row-level updates | Analytical speed, fast columnar scans |
| Workload | Many small reads/writes | Fewer, large aggregations |
| Schema Design | Normalized | Denormalized (wide, flat) |
| Data Access | Point lookups (by PK / index) | Range scans, GROUP BY, aggregations |
| Joins | Common, cheap | Expensive, avoid when possible |
| Mutation Pattern | Frequent UPDATE / DELETE | Append-only, periodic rebuilds |
| Optimization Strategy | Indexing, normalization | Clustering, compression, partitioning |
| Duplication | Avoid duplication | Embrace duplication for speed |
| Schema Evolution | Flexible, mutable | Write-once, evolve with ETL |