What's the difference between a column oriented database and a column-family store?
3 min read
Hello everyone. Today I'm going to briefly explain the difference between column oriented databases (e.g Vertica, MonetDB, Clickhouse) and column-family stores (e.g Cassandra) because throughout many system design courses I've seen and also throughout the information on the internet, there's a lot of misconception about the difference and I see many places refer to Cassandra as a column oriented database which is not correct. Let's start
Column Oriented Databases
They are databases that store data differently, they physically store the column values one after the other on disk unlike for example row oriented database (PostgreSQL for example) which physically stores the one whole row after the other on disk. This can be beneficial in data aggregations and pulling whole columns without an IO overhead but comes with the disadvantage of a bad performance when using for example select * which has to access every single file for each column to fetch the whole row.
This is a brief example of how data is stored in column oriented databases
ID Last First Salary 1 Doe John 8000 2 Smith Jane 4000 3 Beck Sam 1000 1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;
This is a traditional relational data model and so column values are physically stored next to each other.
Column Family Store
It's considered part of NoSQL, Column Family databases store parts of a data entity or “row” in separate column-families, and has the ability to access these column-families separately. This means that not all parts of a row are picked up in a single I/O operation from storage, which is considered a good thing if only a subset of a row is relevant for a particular query.
Column oriented databases store columns from a traditional relational database table separately so that they can be accessed independently. Like column families, this is useful for queries that only access a subset of table attributes in any particular query. However, the main difference is that every column is stored separately, instead of families of columns.
Column Family Databases use a multi-dimensional map (something along the lines of a sparse, distributed, persistent multi-dimensional sorted map), Typically a row-name, column-name, and timestamp are sufficient to uniquely map to a value in the database. As we saw above how column oriented databases roughly get stored on disk, column family databases as I said use a sparse model (different rows can have a very different set of columns defined), Hence these systems will explicitly have column-name/value pairs for each element in a row within a column-family, or row-name/value pairs for each element within a single column-family Assume we have the following column family called Fruits
apple -> colour weight price variety "red" 100 40 "Cox" orange -> colour weight price origin "orange" 120 50 "Spain"
The row key here is the fruit type and each column name maps to a value, however as we see apple doesn't have an origin column. That's the beauty of sparse models, you can have any column you want in a specified row and not have it in the row after. It's like if it was a relational database but with MANY NULL columns but it doesn't physically store null values here it omits them saving storage space.
Column Family Databases generally does better for individual row queries, and does not perform well on aggregation-heavy workloads. Much of the reason for this difference can be explained in the “pure column” vs “column-family” difference between the systems. They can put attributes that tend to be co-accessed in the same column-family; this saves the seek cost that results from column-stores needing to find different attributes from the same row in many different places.
Cassandra can be named as a partitioned row-store which stores rows in column families partitioned on several nodes. While column oriented databases are a completely different concept. Hope you got an idea of the misconception and everything got even a bit clearer for you. That was it for this article see you next time!