Thursday 14 November 2013

Cassandra Data Modelling - Tables


In this series of posts I will discuss the basic concepts of data modelling in Apache Cassandra. It is important to understand those concepts as it will help us to design an efficient system using Cassandra. In this post I will discuss basic concept called tables in Cassandra.

I am assuming that you have installed Cassandra 1.2 or above with CQL3. You can download Cassandra from Apache Cassandra website. CQL 3.0 is the language which helps to work with Cassandra object same as SQL language helps to works with SQL objects.


Concept

We are all familiar with the map data structure. The map data structure is a list of key/value pairs in which we assign values against keys. Suppose we have a person profile with fields such as First Name (fname), Last Name(lname), DateOfBirth(dateofbirth), email address(email) and phone(phone). We can store this information in a map structure as shown in figure 1.
Figure 1: Person profile stored in a map.

We can use a unique number to point to this map. We will call this unique number as a row key as shown in figure 2:
Figure 2: Person profile referenced by the rowkey.

Similarly we can view other person profiles in map structures and refer them using unique row keys as shown in figure 3:
Figure 3: Person profiles referenced by the row keys.

Note that the maps in figure 3 do not have all the key/value pairs as in figure 2. For example person profile for Adam Jones have key/value pairs: fname, lname and dateofbirth. It does not have key/value pairs for phone number and email address. Similarly person profile Richard Douglas has three columns: fname, lname and email. It does not have dateofbirth and phone key/value pairs. This is a common scenario in normal life where we have different information available for different persons. Suppose we are collecting this information for a census in a city. In that case it makes sense to group those person profiles in a structure for processing and reporting. One way to group those data would be to introduce two dimensional map. In that two dimensional map the first dimension is the map in which keys are unique row keys and the values are the pointers to the map structure of the person profiles. The second dimension is the map structure which contains key/value pairs of person profile information. This is shown in figure 5:
Figure 5: Two dimensional map. First map has key as rowkey and value points to the person profile map.
In the above diagram we have a map of row keys which point to person profile maps. For example row key 789 points to a person profile map of Adam Jones.
The two dimensional map structure which we have just discussed will help us to to understand the concepts of tables, rows and columns in Cassandra. Before moving forward I would suggest to run Cassandra so that you can run CQL statements on your machine. Open command terminal and using cd command go to the directory where cassandra is installed. Run the following command in the terminal window:

bin/cassandra -f

This will run cassandra application which resides in the bin folder of Cassandra installation. Open another terminal tab or window and while cassandra is running, start Cassandra command line interface by typing following command.
bin/cqlsh

After cqlsh has launched successfully, you will see cqlsh command prompt. We will type all the CQL commands in cqlsh interface.

Keyspace:


The concept of keyspace is some what similar to creating a database in a SQL database. A keyspace is a namespace that defines how data is replicated on nodes. At the moment we will not go into details of replication but for our purpose it is important to understand that the keyspace is a container and contains several tables. We can create a keyspace named census with the following command:
create keyspace census with replication = {'class' : 'SimpleStrategy', 'replication_factor':1};
With this we create a keyspace called census with some replication policies. We will discuss about replication policies in later posts. After the keyspace is set successfully you can set it as the current keyspace by typing the following command in cqlsh terminal window:
use census;

Table:

In Cassandra a table can have number of rows. Each row is referenced by a primary key called row key. There are number of columns in a row but the number of columns can vary in different rows. For example one row in a table can have three columns where as the other row in the same table can have ten columns. It is also important to note that in Cassandra both column names and values have binary types. Which means column names can have binary values such as string, timestamp or an integer etc. This is different from SQL databases where each row in an SQL table has fixed number of columns and column names can only be text.
Now we will create person profile table which we have discussed before. Go to the cqlsh terminal window and type the following command. Make sure that your current keyspace is census.
A person table can be created using CQL 3.0 in keyspace Census as follows:
create table person ( person_id int primary key, fname text, lname text, dateofbirth timestamp, email text, phone text );
In the above statement we have created a table person which has a person_id column as a primary key. The primary key is same as a rowkey which we have mentioned before. The person table has few other columns such as fname, lname, dateofbirth, email and phone.  If you have worked with SQL databases than the above create statement will look familiar.  Next we will insert some data and see how they are stored in the database.  Type the following command on cqlsh terminal window:

insert into person (person_id,fname, lname, dateofbirth, email, phone) values ( 123, 'Chris', 'Williams', '1960-05-20', 'chris.williams@test.com', '099045769');
The above statement will insert Chris Williams person profile into the database. We can check if the data has been added successfully by using the select statement. Type the following command in cqlsh terminal window:
select * from person;

The output is:

 person_id | dateofbirth              | email                   | fname | lname    | phone

-----------+--------------------------+-------------------------+-------+----------+-----------

       123 | 1960-05-20 00:00:00-0400 | chris.williams@test.com | Chris | Williams | 099045769

We can see that the information is same as we inserted using insert statement. Note that the syntax of insert and select statements are similar to SQL database. At this point you may find it difficult to relate the output of select statement to the map structure concept which we have discussed before. With the cqlsh interface we do not really see how data is stored into the database. But hopefully you will see the relation by using another tool called cassandra-cli.  cassandra-cli tool comes with the standard Cassandra installation so you do not need to install any additional software. This tool resides in the bin folder along with the cassandra and cqlsh executables. This tool is useful to see how data is stored in the Cassandra system.  Open another terminal tab or window and run the following command:
bin/cassandra-cli

When this application is launched you will see a command prompt. On this command prompt type:
use census;
This is the same statement which we used on cqlsh prompt to select the keyspace. We have already created census keyspace so the application should successfully switch to that keyspace. We will now access person table to see the records in that table. Type the following command on cassandra-cli command prompt.

list person;

RowKey: 123
=> (name=, value=, timestamp=1384058627117000)

=> (name=dateofbirth, value=ffffffb955773e00, timestamp=1384058627117000)

=> (name=email, value=63687269732e77696c6c69616d7340746573742e636f6d, timestamp=1384058627117000)

=> (name=fname, value=4368726973, timestamp=1384058627117000)

=> (name=lname, value=57696c6c69616d73, timestamp=1384058627117000)

=> (name=phone, value=303939303435373639, timestamp=1384058627117000) 


As we can see that the value of primary key column i.e person_id has become RowKey. The value of person_id is 123.  This row has several columns. The first column has no name and value. This column represents the Rowkey or primary_key. If you look at the above output carefully you will notice that the name of primary key column i.e. person_id is missing. The column name of rowkey or primary key is stored in another table which we will discuss later in this article. The missing column name and value is useful for a case when table has only one column. But at the moment this is not important to understand the concept of data modelling so we will skip it. The second column has name: dateofbirth and value: ffffffb955773e00. The value is the binary representation of the date which we entered i.e. '1960-05-20'. Similarly email has value  63687269732e77696c6c69616d7340746573742e636f6d which is binary representation of chris.williams@test.com. This is same for other columns and values. Note that each column has a timestamp value. The timestamp value represents the time when the values were added into the database.

If you compare the above output with the map structure which we have discussed before you will find that they are similar. In the map structure each row is represented by the rowkey and it has number of key/value pairs. The above row output also has rowkey and column name/value pairs. It also has additional field of timestamp. In the map structure we mentioned that different rows can have different number of columns. To check how it works in Cassandra we will insert another row with less number of columns. Type the following command on cqlsh command line interface. Please note that it is cqlsh and not cassandra-cli prompt.
insert into person ( person_id, fname, lname,dateofbirth) values ( 456, 'Adam', 'Jones', '11985-04-18');

We now retrieve the data by using the select statement.

select * from person;

person_id | dateofbirth               | email                   | fname | lname    | phone
-----------+---------------------------+-------------------------+-------+----------+-----------
       123 |  1960-05-20 00:00:00-0400 | chris.williams@test.com | Chris | Williams | 099045769
       456 | 11985-04-17 23:00:00-0500 |                    null |  Adam |    Jones |      null
The first row person_id 123 contains all the columns where as the second row person_id 456 has only four columns: person_id, dateofbirth, fname and lname. The row person_id 456 does not have email and phone number columns. The value of email and phone number for person_id 456 is shown as null in the output of select statement. You now wonder how it is different from SQL database where if column value does not exist then it is set to null. We see how it is represented under the hood using cassandra-cli. Type the following command in cassandra-cli interface.
list person;

RowKey: 123

=> (name=, value=, timestamp=1384058627117000)
=> (name=dateofbirth, value=ffffffb955773e00, timestamp=1384058627117000)
=> (name=email, value=63687269732e77696c6c69616d7340746573742e636f6d, timestamp=1384058627117000)
=> (name=fname, value=4368726973, timestamp=1384058627117000)
=> (name=lname, value=57696c6c69616d73, timestamp=1384058627117000)
=> (name=phone, value=303939303435373639, timestamp=1384058627117000)
-------------------
RowKey: 456
=> (name=, value=, timestamp=1384112124419000)
=> (name=dateofbirth, value=00011f72a1036e00, timestamp=1384112124419000)
=> (name=fname, value=4164616d, timestamp=1384112124419000)
=> (name=lname, value=4a6f6e6573, timestamp=1384112124419000)

It returns two rows. The first row is RowKey 123 which we have discussed before. The second row is RowKey 456. The RowKey 456 has one primary key column which has empty name and value. The other columns are dateofbirth, fname and lname. Notice that it does not store column name/value pairs for email and phone number as they do not exists for this row. This means that the null values which we saw in cqlsh command interface are only for display purposes. The table does not store null values for columns which are not present in the system.

Primary key names:

As we mentioned before that the value of primary key is not displayed when we use list person command. These values are stored in the schema_columnfamilies table in the keyspace called system. system keyspace is managed by cassandra and we do not need to create this keyspace.

Type the following command in cqlsh interface:

SELECT key_aliases, column_aliases FROM system.schema_columnfamilies WHERE keyspace_name='census' AND columnfamily_name='person';

The output will  be:

 key_aliases   | column_aliases
---------------+----------------
 ["person_id"] |             []

The key alias column contains values for the primary key.

In our next post we will discuss composite primary key, partition keys and cluster keys. Please feel free to comment if there are things which I have missed or you think is not correct.

12 comments:

  1. I have worked with RDBMS for years and cassandra is the first NoSQL DB I am reading about. It was very difficult to get rid of the relational mentality. Your post has helped me. Thanks.

    ReplyDelete
  2. It is really simple and deep explanation about how data is stored in Cassandra.

    ReplyDelete
  3. brilliant explanation

    ReplyDelete
  4. Amazing post millian thanks sirjii

    ReplyDelete
  5. Awesome explanation. Made me understand about cassandra much better.

    ReplyDelete
  6. amazing post..:)

    ReplyDelete
  7. This comment has been removed by the author.

    ReplyDelete
  8. In cassandra , if 2 tables have same partition key ,how will the data get stored internally? Will both the tables data of the same partition key rows hit the same partition?

    ReplyDelete
  9. well done..very clear..thank you so much

    ReplyDelete