Wed, October 29, 2025 · 5 min read

Data modeling for OLAP with AI ft. Michael Klein, Director of Technology at District Cannabis

Michael Klein, Director of Technology, District Cannabis
Johanan Ottensooser, Product

TL;DR

  1. Dropping CDC into OLAP “as is” leaves serious performance on the table (null maps, read-time joins, poor skipping).
  2. Manual modeling is hard and iterative; there are reliable LLM patterns if you give them structure (CDC/DDL), behavior (query logs), and guardrails (types, engines, sort-key policy).
  3. You can apply this to your own data: use those inputs + guardrails, review the AI’s DDL/MVs, and iterate against system.query_log, just like Mike did.

Quick Links: OLAP Best Practices Agent Docs | MooseStack Repo | MooseDev MCP

The last two posts covered moving data from OLTP to OLAP via CDC.

But CDC patterns only get you as far as landing the data. Now, querying that naively ingested data might still get you improved performance, but the order-of-magnitude gains you expect from OLAP require optimized data modeling: thoughtful grain, types, sort keys, denormalization, and materialized views.

This used to require a bunch of manual work, and there was some question about whether the trade-off in extra engineering effort was worth the increased efficiency. Now, however, we close that gap with AI: giving copilots the context they need to correctly model this data.

This article features our customer, Michael Klein, Director of Technology at District Cannabis, whose ingestion of Cannabis industry data into ClickHouse is exactly the kind of work AI-assisted modeling is meant to scale.

The problem with naively adding data to your OLAP database

Piping CDC data directly into OLAP preserves OLTP habits (normalized tables, loose types, nullable fields). The result is correct but slow at scale.

In brief, this comes from the tradeoffs inherent in OLAP / OLTP:

  • OLTP optimizes for transactional correctness and write atomicity.
  • OLAP optimizes for scans, compression, and aggregate throughput.

When you carry over OLTP defaults (normalized tables, nullable columns, and generic types), ClickHouse dutifully executes them but wastes CPU on null maps, performs joins at read time, and misses data-skipping opportunities.

Mike saw some of these data issues when ingesting his raw industry data.

The strain field he was given by his upstream data provider shipped as an excessively unconstrained TEXT type that allows up to 16 million characters. Now, this industry’s product names are certainly unique (Florida Wedding, Sherbsicle Cookies, Rotten Teeth), but 16 million characters is generous.

The inventory table shipped as a pivoted table, with each column representing a reporting date (pathological for OLAP performance).

Here’s the shape of the raw Snowflake export (note: this schema was created to power a BI tool’s visualization, but it contained data that was otherwise inaccessible, so it had to serve as a source).

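A representative sketch of that shape (table and column names are illustrative; the real export carried one column per reporting date, and far more of them):

```sql
-- Illustrative only: a pivoted inventory export with one column per reporting date
CREATE TABLE RAW_INVENTORY_SNAPSHOT (
    PRODUCT_ID      TEXT,
    STORE_ID        TEXT,
    "Oct 20, 2025"  NUMBER,   -- units on hand as of that date
    "Oct 21, 2025"  NUMBER,
    "Oct 22, 2025"  NUMBER
    -- ...a new column appears for every new reporting date
);
```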

Those column names are literal dates, so each week the source system adds a new column (Oct 20, Oct 21, …) instead of adding rows with an inventory_date. If ingested straight into ClickHouse, we get a super-wide table packed with nulls, and any analytics by date turn into awkward multi-column scans.

The OLAP-friendly rewrite flattens those into rows:

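A sketch of the flattened, row-per-date shape (names illustrative):

```sql
-- Illustrative only: one row per product, store, and date; defaults instead of NULLs
CREATE TABLE inventory_daily (
    product_id      String,
    store_id        LowCardinality(String),
    inventory_date  Date,
    units_on_hand   UInt32 DEFAULT 0
)
ENGINE = MergeTree
ORDER BY (store_id, product_id, inventory_date);
```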

What OLAP systems want

OLAP modeling isn’t mysterious, but its principles are easy to get wrong when you’re rushing to ship data.

The principles are stable, but nuanced:

  • Grain and denormalization. Define the atomic row of analysis and flatten parent fields to avoid read-time joins. Like Mike’s treatment of the pivoted Inventory data above.
  • Tight types. Use fixed-width numerics and LowCardinality strings to shrink scanned bytes and improve vectorization. (see Appendix §1 and §4).
  • Nulls vs defaults. Null maps kill SIMD performance; use defaults where possible. Instead of nullable text fields from the Snowflake source, Mike assigned defaults (empty strings or 0) so ClickHouse stays branch-free at scan time. (see Appendix §1).
  • Sort keys. Pick ORDER BY that matches real filters so ClickHouse can skip parts. The transactions fact was clustered by transaction_date, channel, and product, matching the dominant dashboard filters and keeping slices efficient. (see Appendix §2 and the code snippet below).
  • JSON usage. Prefer typed subcolumns for hot paths, not raw blobs. (see Appendix §3).

Here’s what that looks like in code for the transactions fact table:

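A minimal ClickHouse sketch with illustrative column names (the production model was defined in MooseOLAP TypeScript):

```sql
-- Illustrative only: sort key mirrors the dominant dashboard filters;
-- the synced_at version column lets ReplacingMergeTree deduplicate re-synced rows
CREATE TABLE fact_transactions (
    transaction_date  Date,
    channel           LowCardinality(String),
    product_id        String,
    store_id          LowCardinality(String),
    quantity          UInt32 DEFAULT 0,
    net_sales         Decimal(18, 2) DEFAULT 0,
    synced_at         DateTime
)
ENGINE = ReplacingMergeTree(synced_at)
ORDER BY (transaction_date, channel, product_id);
```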

Why manual data modeling for OLAP is hard

Manual OLAP modeling takes deep cross-domain intuition:

  • OLAP engineers know compression, sort keys, and data-skipping.
  • Domain experts know which filters and rollups matter.

Bridging those worlds takes iteration: measuring workloads, refactoring schemas, regenerating materialized views. It’s slow, repetitive, and hard to scale across many data sources.

In District Cannabis’ case, one of the sources would have required hand‑wiring Cannabis Industry Data’s 100‑plus Snowflake views into ClickHouse.

That’s where AI fits: not replacing judgment, but capturing it. LLMs can infer the requirements for modeling (schema, grain, and sort keys) from objective signals:

  1. Structure: CDC payloads, source DDL, ORM types.
  2. Behavior: ClickHouse query logs that reveal how data is actually used.

When we prompt these models with expert heuristics and the right context, they start producing candidate OLAP models that follow the same patterns human engineers rely on: fixed types, safe defaults, sort keys on real filters, and rollups pre-defined for the top query clusters.

LLM-Based Data Modeling

Michael Klein’s rebuild of District Cannabis’ warehouse turned this theory into production results.

"I did four hours of work, and I had my data modeled in Moose and ingested into ClickHouse, ready to query. And, I think that would have taken me weeks of my time."

Mike Klein, on using MooseDev MCP + our OLAP best-practices context to model hundreds of Cannabis Industry Data views from his upstream source

He rebuilt the entire Cannabis Industry Data warehouse in four hours, migrating from a Snowflake star schema to ClickHouse with raw facts, six tuned dimension tables, and an 8.8-million-row fact table, all modeled for OLAP performance.

The setup relied on five key components:

  1. Context. The model consumed the same inputs Mike used: Snowflake schema exports, Moose OLAP best-practice docs, and query logs showing real filter and aggregation patterns.
  2. Prompting. A staged sequence ensured reasoning before code generation: analyze raw shape, propose grain and types, then emit candidate DDL and materialized views. (see Appendix §5 for a concrete MV pattern).
  3. Guardrails. Validation against MooseOLAP typing and ClickHouse engine rules prevented hallucinated syntax and enforced schema correctness.
  4. Validation. Each output replayed against system.query_log to verify that sort keys and rollups aligned with actual query behavior (for example, the lack of CDC fields in the Cannabis Industry Data source led to a rolling 30-day sync).
  5. Workflow and tooling. Claude Code and Codex paired with MooseDev MCP tooling so every suggestion could be validated live (detailed below).

The result: data models built from the source data that comply with OLAP best practices, which Mike was able to run and then tweak to his performance requirements.

Context

He fed the LLM two main bodies of context in this workflow: data subject-matter context, and infrastructure / best-practice context.

For data subject matter, Mike pulled in:

  • Snowflake schema exports from CANNABIS_STRUCTURE_AND_QUERIES.md so the model could see every column, type, and naming quirk straight from the source.
  • Inline documentation inside the new Cannabis Industry Source models, such as a comment on the inventory model, explaining how raw structures should be interpreted downstream.

For infrastructure and best-practice context, Mike pulled in:

  • The agent guide in CLAUDE.md, which maps the repository layout and links every prerequisite doc.
  • FiveOneFour’s OLAP best-practice markdowns—especially docs/OLAP_BEST_PRACTICES/1-type-rules.md and friends—so the LLM would apply the same typing and ordering rules the team enforces by hand.

These covered OLAP heuristics that AI could enforce:

  • Flatten aggressively, but predictably. Each fact table defines one row of analysis and pre-joins hot parent fields.
  • Prefer ReplacingMergeTree(version_col) for CDC. Dedup at ingest, not query time.
  • Sort on real filters first, then time. Example: ORDER BY (status, event_time) instead of just event_time.
  • Defaults beat nulls. Guard every nullable column.
  • Typed JSON. Use subcolumns for hotspots, keep raw JSON only for cold paths.
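A condensed sketch of those heuristics in ClickHouse DDL (illustrative names, not the actual District Cannabis schema):

```sql
-- Illustrative only: flattened grain, CDC-friendly engine, filter-first sort key,
-- defaults instead of NULLs, and a typed column promoted out of the raw JSON
CREATE TABLE events (
    status       LowCardinality(String) DEFAULT 'unknown',
    event_time   DateTime,
    entity_id    String,
    amount       Decimal(18, 2) DEFAULT 0,
    payload      String DEFAULT '',   -- cold path: raw JSON kept as-is
    payload_sku  String DEFAULT '',   -- hot path: extracted into a typed column
    version      UInt64
)
ENGINE = ReplacingMergeTree(version)
ORDER BY (status, event_time, entity_id);
```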

Prompting

The prompt stack mirrors how Mike plans schema work by hand:

  • Grounding – the agent first reads the Snowflake schema export, OLAP best-practice notes, and the existing Moose pricing module before it’s allowed to propose anything.
  • Planning – a second prompt makes it outline phases (models → workflows → views → APIs) so it delivers a plan before touching TypeScript.
  • Synthesis – only after those steps does it emit Moose OLAP tables, materialized views, and workflows.

The meta-prompt spelled this sequencing out explicitly, so the model couldn’t jump straight to boilerplate: it had to justify grain, types, and sort keys before emitting any code.

Guardrails & Validation

Every LLM-generated schema runs through the same loop:
generate → constrain → verify → adjust.

Guardrails

  • Typed contracts: Moose types (Decimal, LowCardinality, ClickHouseDefault) and engine rules (ReplacingMergeTree(ver) for facts, MergeTree for dims) fail fast at compile time (the MooseDev server picks this up and feeds the output to the LLM via MooseDev MCP).
  • Sort-key policy: always (hot filters, time, entity), enforced in table builders.
  • Pivot handling: Wide snapshots (like inventory date columns) land raw as JSON, then unpivoted in views. (see Appendix §3)

"Also, not to beat a dead horse but the ease of modeling my data in Typescript using MooseOLAP is so refreshing and comforting. It really protects against errors and helps a ton when ingesting data because you have type safety built into the ETL workflow. I don't really know many tools that have that end-to-end type safety. Like you have things like parquet and all that but they don't provide that level of code safety."

Mike Klein, on using MooseOLAP typed data models

Validation

  • Query-log checks: Compare 7–14 days of system.query_log to confirm sort-key alignment and MV hit rate (see the sketch below).
  • CDC sanity: Missing version columns fall back to rolling windows stamped with syncedAt.
  • Regression tests: Measure bytes/rows read and latency before vs. after; adjust ORDER BY or defaults if regressions appear.
  • Safe rollout: Deploy via shadow tables → backfill → compare → cutover.

For example: Snowflake snapshots lacked CDC fields, so the model switched to 30-day rolling syncs versioned by syncedAt, keeping ClickHouse merges clean.
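A sketch of the kind of query-log check involved (illustrative; run against ClickHouse’s built-in system.query_log):

```sql
-- Illustrative only: surface the heaviest query shapes over the last 14 days
-- to confirm that ORDER BY keys and materialized views match real usage
SELECT
    normalized_query_hash          AS query_shape,
    any(query)                     AS example_query,
    count()                        AS runs,
    sum(read_rows)                 AS total_rows_read,
    round(avg(query_duration_ms))  AS avg_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_time > now() - INTERVAL 14 DAY
GROUP BY normalized_query_hash
ORDER BY total_rows_read DESC
LIMIT 20;
```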

Workflow & tooling

Coding happened with a pair of copilots and a tight feedback loop. Mike paired Claude Code and Codex with MooseDev MCP tooling so every suggestion could be validated live: the copilot proposes a change, and the MooseDev server pushes back if it drifts from convention.

Keeping that flow in sync required fresh context. The MooseDev MCP (bundled with the MooseDev server) (docs) lets the agent pull console logs and errors, peek at data, and check schema diffs, recent query patterns, and workflow status.

"When I got MooseDev MCP dialed in, that was particularly impressive. My copilot was constantly using the MCP tools to validate work, just like my standard manual workflow, but even faster because it was using the MCP tools."

Mike Klein, on using MooseDev MCP

With copilots, MCP, and MooseOLAP, the LLM edits inside a scaffold designed for operators instead of inventing scripts in the dark.

Result

In four hours, Mike and the copilot completed a migration that would have taken weeks by hand. It delivered:

  • Dimension models that typed and flattened the Snowflake exports so analysts stop fighting TEXT columns.
  • Fact tables with version-aware deduplication and lookback semantics tuned for Snowflake’s nightly snapshots.
  • Materialized views that pre-aggregate the hottest queries into daily product performance.
  • A matching API/workflow layer that exposes the data to merch teams without bespoke scripts.

District Cannabis now sees sub-second query responses instead of the 20–30 second waits they had endured, and every improvement is tied back to artifacts the team can audit, extend, or regenerate.

This was fast enough that Mike was able to provision new APIs for data consumption, and even chat-based analytics (more on that soon).

Appendix: From Raw Cannabis Industry Data Schemas to OLAP-Ready Models

The raw Snowflake exports were never meant for analytics: they’re wide, loosely typed, and packed with pivoted columns. Here are five “before → after” snapshots that show how we brought them in line with the OLAP rules referenced above.

1. Product Dimension Typing

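A before → after sketch with illustrative columns:

```sql
-- Before (illustrative): loose Snowflake-style types, everything effectively nullable
CREATE TABLE DIM_PRODUCT_RAW (
    PRODUCT_ID   TEXT,
    BRAND_NAME   TEXT,
    CATEGORY     TEXT,
    STRAIN_NAME  TEXT      -- the 16-million-character field from earlier
);

-- After (illustrative): tight key, LowCardinality for repeated text, defaults instead of NULLs
CREATE TABLE dim_product (
    product_id   String,
    brand_name   LowCardinality(String) DEFAULT '',
    category     LowCardinality(String) DEFAULT '',
    strain_name  String DEFAULT ''
)
ENGINE = MergeTree
ORDER BY product_id;
```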

Practices: keep keys tight, apply LowCardinality for repeated text, eliminate NULLs with defaults.

2. Transaction Fact Ordering & Versioning

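A sketch of the ordering and versioning change (illustrative, and simplified relative to the fact table shown in the article body):

```sql
-- Before (illustrative): sorted by a surrogate key, no version column,
-- so dashboard filters scan everything and re-syncs leave duplicates
CREATE TABLE fact_transactions_naive (
    transaction_id    String,
    transaction_date  Date,
    channel           Nullable(String),
    product_id        Nullable(String),
    quantity          Nullable(Int64)
)
ENGINE = MergeTree
ORDER BY transaction_id;

-- After (illustrative): filter-aligned sort key plus an explicit version for deduplication
CREATE TABLE fact_transactions_versioned (
    transaction_date  Date,
    channel           LowCardinality(String),
    product_id        String,
    quantity          UInt32 DEFAULT 0,
    synced_at         DateTime
)
ENGINE = ReplacingMergeTree(synced_at)
ORDER BY (transaction_date, channel, product_id);
```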

Practices: order by the filters dashboards actually use, add an explicit version for ReplacingMergeTree.

3. Pivoted Inventory Snapshots → JSON + Views

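A sketch of the pattern (illustrative; assumes the pivoted snapshot is landed as a JSON string keyed by date):

```sql
-- Illustrative only: land the pivoted snapshot losslessly as a raw JSON string...
CREATE TABLE inventory_snapshot_raw (
    product_id     String,
    snapshot_json  String,      -- e.g. {"2025-10-20": 12, "2025-10-21": 9, ...}
    synced_at      DateTime
)
ENGINE = ReplacingMergeTree(synced_at)
ORDER BY product_id;

-- ...then unpivot late, in a view, so downstream tables can ORDER BY real filters
CREATE VIEW inventory_by_date AS
SELECT
    product_id,
    toDate(kv.1)  AS inventory_date,
    kv.2          AS units_on_hand
FROM inventory_snapshot_raw
ARRAY JOIN JSONExtractKeysAndValues(snapshot_json, 'Int64') AS kv;
```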

Practices: keep the raw shape lossless, unpivot late so ORDER BY can follow real filters.

4. Menu Products with Real Measurements

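A sketch of separating clean numerics from messy source strings (illustrative columns):

```sql
-- Illustrative only: real measurements become fixed-width numerics with defaults;
-- messy free text stays a defaulted String rather than a Nullable column
CREATE TABLE dim_menu_product (
    product_id        String,
    product_name      String DEFAULT '',
    thc_percent       Float32 DEFAULT 0,
    cbd_percent       Float32 DEFAULT 0,
    unit_weight_g     Decimal(10, 3) DEFAULT 0,
    size_description  String DEFAULT ''    -- raw source string, kept but quarantined
)
ENGINE = MergeTree
ORDER BY product_id;
```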

Practices: separate clean numerics from messy strings, still enforce defaults to avoid NULL bitmaps.

5. Materialized View for Product Velocity

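A minimal sketch of the rollup (illustrative; builds on the fact_transactions sketch above and omits the date-dimension join for brevity):

```sql
-- Illustrative only: pre-aggregate daily product performance so dashboards
-- read a small summary table instead of re-scanning the fact table
CREATE MATERIALIZED VIEW mv_product_velocity_daily
ENGINE = SummingMergeTree
ORDER BY (product_id, transaction_date)
AS
SELECT
    product_id,
    transaction_date,
    sum(quantity)   AS units_sold,
    sum(net_sales)  AS gross_revenue
FROM fact_transactions
GROUP BY product_id, transaction_date;
```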

Practices: join to a proper date dimension, pre-aggregate hot queries, keep ORDER BY aligned with filters.

These transformations turn unwieldy raw exports into OLAP-compliant tables that ClickHouse and Moose can optimize automatically.

Interested in learning more?

Sign up for our newsletter — we only send one when we have something actually worth saying.
