What's the difference between a column oriented database and a column-family store?

What's the difference between a column oriented database and a column-family store?

Hello everyone. Today I'm going to briefly explain the difference between column oriented databases (e.g Vertica, MonetDB, Clickhouse) and column-family stores (e.g Cassandra) because throughout many system design courses I've seen and also throughout the information on the internet, there's a lot of misconception about the difference and I see many places refer to Cassandra as a column oriented database which is not correct. Let's start

Column Oriented Databases

They are databases that store data differently, they physically store the column values one after the other on disk unlike for example row oriented database (PostgreSQL for example) which physically stores the one whole row after the other on disk. This can be beneficial in data aggregations and pulling whole columns without an IO overhead but comes with the disadvantage of a bad performance when using for example select * which has to access every single file for each column to fetch the whole row.

This is a brief example of how data is stored in column oriented databases

  ID         Last    First   Salary
  1          Doe     John    8000
  2          Smith   Jane    4000
  3          Beck    Sam     1000

1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;

This is a traditional relational data model and so column values are physically stored next to each other.

Column Family Store

It's considered part of NoSQL, Column Family databases store parts of a data entity or “row” in separate column-families, and has the ability to access these column-families separately. This means that not all parts of a row are picked up in a single I/O operation from storage, which is considered a good thing if only a subset of a row is relevant for a particular query.

Column oriented databases store columns from a traditional relational database table separately so that they can be accessed independently. Like column families, this is useful for queries that only access a subset of table attributes in any particular query. However, the main difference is that every column is stored separately, instead of families of columns.

Column Family Databases use a multi-dimensional map (something along the lines of a sparse, distributed, persistent multi-dimensional sorted map), Typically a row-name, column-name, and timestamp are sufficient to uniquely map to a value in the database. As we saw above how column oriented databases roughly get stored on disk, column family databases as I said use a sparse model (different rows can have a very different set of columns defined), Hence these systems will explicitly have column-name/value pairs for each element in a row within a column-family, or row-name/value pairs for each element within a single column-family Assume we have the following column family called Fruits

apple -> colour  weight  price variety
         "red"   100     40    "Cox"
orange -> colour    weight  price  origin
          "orange"  120     50     "Spain"

The row key here is the fruit type and each column name maps to a value, however as we see apple doesn't have an origin column. That's the beauty of sparse models, you can have any column you want in a specified row and not have it in the row after. It's like if it was a relational database but with MANY NULL columns but it doesn't physically store null values here it omits them saving storage space.

Column Family Databases generally does better for individual row queries, and does not perform well on aggregation-heavy workloads. Much of the reason for this difference can be explained in the “pure column” vs “column-family” difference between the systems. They can put attributes that tend to be co-accessed in the same column-family; this saves the seek cost that results from column-stores needing to find different attributes from the same row in many different places.

Summary

Cassandra can be named as a partitioned row-store which stores rows in column families partitioned on several nodes. While column oriented databases are a completely different concept. Hope you got an idea of the misconception and everything got even a bit clearer for you. That was it for this article see you next time!

References

  1. Link 1
  2. Link 2

Did you find this article valuable?

Support Amr Elhewy by becoming a sponsor. Any amount is appreciated!