As we discussed yesterday, how strictly you type a column is closely tied to the read and aggregate performance of OLAP databases. The number of distinct values in a column is called its "cardinality". Fields with higher cardinality have more distinct values, like UUIDs; fields with lower cardinality have fewer, like booleans.
The direct consequence is compression and encoding efficiency. Low cardinality, great compression. High cardinality, poor compression.
| Column | Distinct Count | Compression | Storage Strategy |
|---|---|---|---|
| country | 200 | Excellent | LowCardinality(String) |
| state | 50 | Excellent | Enum('CA','TX',...) |
| user_id | 50M | Poor | Plain UInt64 |
Cardinality comparison table
Notice that for country, the column type is String, but it's wrapped in LowCardinality().
What that wrapper does is dictionary-encode the column: the 200 country strings are stored once, and each row holds only a small integer key into that dictionary. This drastically reduces the amount of information the column can carry, which makes it far more compressible.
This is really powerful, and you should use it when a column has fewer than roughly 10-50k unique values (assuming OLAP-scale tables, in the millions of rows). Past that limit, the overhead of the dictionary lookup can outweigh the performance gains from the wrapper.
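Here's a minimal sketch of what this looks like in ClickHouse DDL; the events table and its column names are hypothetical, echoing the comparison table above:

```sql
-- Hypothetical events table: 'country' is dictionary-encoded,
-- 'user_id' is left as a plain integer because of its high cardinality.
CREATE TABLE events
(
    ts      DateTime,
    user_id UInt64,                  -- ~50M distinct values: no wrapper
    country LowCardinality(String)   -- ~200 distinct values: dictionary-encoded
)
ENGINE = MergeTree
ORDER BY ts;
```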
When the space of possible values is small and fixed, manually defining them as enums (e.g. CA = 1, NY = 2, TX = 3) makes for an even faster and smaller column than LowCardinality(). Don't use enums when you expect new values, since adding one requires a schema migration, as the sketch below shows.
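A sketch of the enum approach, with a hypothetical events_enum table; the ALTER at the end shows the migration cost of admitting a new value:

```sql
-- Small, fixed set: each value is stored as a 1-byte integer.
CREATE TABLE events_enum
(
    ts    DateTime,
    state Enum8('CA' = 1, 'NY' = 2, 'TX' = 3)
)
ENGINE = MergeTree
ORDER BY ts;

-- Adding a new state later requires a schema migration:
ALTER TABLE events_enum
    MODIFY COLUMN state Enum8('CA' = 1, 'NY' = 2, 'TX' = 3, 'WA' = 4);
```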
Use these heuristics with Claude, Cursor, or other agents:
| Cardinality | Type | Notes |
|---|---|---|
| ≤ 10 | Enum8 | perfect static set |
| 10–10k | LowCardinality(String) | best trade-off |
| 10k–100k | LowCardinality (watch memory) | depends on workload |
| >100k | base String / numeric | avoid dictionary |
Table showing heuristics for Types / Wrappers and expected cardinality
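To check which row of this table a real column lands in, you can measure cardinality and compression directly. A sketch against the hypothetical events table from earlier, using ClickHouse's uniq() function and the system.columns table:

```sql
-- Approximate distinct count per column
SELECT uniq(country) AS country_cardinality,
       uniq(user_id) AS user_id_cardinality
FROM events;

-- On-disk compression ratio per column
SELECT name,
       type,
       formatReadableSize(data_compressed_bytes)   AS compressed,
       formatReadableSize(data_uncompressed_bytes) AS uncompressed,
       round(data_uncompressed_bytes / data_compressed_bytes, 1) AS ratio
FROM system.columns
WHERE table = 'events';
```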