18 3 / 2013

Cassandra : Demystifying Column Families

The last two posts were fairly trivial, but now comes the fun part. I am seeing a lot of UTF8Type and validation_class type of syntax, which tells me (at least initially) that there is a lot of flexibility when it comes to creating structures and types. I will find out what these things are - but without getting into a lot of detail, from whatever documentation I could google, a column family is what defines the structure of the data. In other words, it's a family of columns, just like a relational table is. For Mongo folks, this would be hard to grasp, as there is no concept of a column or a group of columns there. All Mongo has is a collection to group a bunch of documents, which in turn have their own keys (or columns, if you like). So, a family has a name, and you define the columns that form that family. This felt super odd to me until I found out that I can also have dynamic families, where I do not have to specify the columns ahead of time. We'll get there, and we'll also try to add/delete columns from a family that we declare.

My initial instinct was to just declare a column family with no columns.

[default@myspace] create column family SomeFamily;
9b08e2e8-f0d8-3788-a23c-0c4a720b766c

Apparently it worked. Now, this just means that we've run the equivalent of a SQL create table SomeFamily. What's different is that the SQL version would not have worked, as we did not specify any columns.

Inserting data into a column family is similar to putting data into a map. Let's analyze the statement below (it will not actually work - we'll find out why soon).

[default@myspace] set SomeFamily['lobster1234']['name']='Manish Pandit';
org.apache.cassandra.db.marshal.MarshalException: cannot parse 'name' as hex bytes

Looking at the set statement, we're using a key (which is my Yahoo! ID, lobster1234) and setting name to the value of my full name. This looks familiar, doesn't it? Like a map of maps, or a multimap. The key is the Yahoo! ID, and the value of this key is another map, with name as the key and the user's full name as the value. I can also add another key-value pair to the value map, this time adding a gender. Extending it further, I can keep adding more folks with their Yahoo! IDs as keys to the multimap, the value of each key being another map whose keys are name and gender. A point to be noted here is that this key (lobster1234) is referred to as a row key, as it identifies the key-value pairs that are mapped to it. Like a key in a multimap.

In Scala, this is what SomeFamily boils down to:

scala> var someFamily = Map[Any,Map[Any,Any]]()
someFamily: scala.collection.immutable.Map[Any,Map[Any,Any]] = Map()

scala> someFamily = someFamily+("lobster1234"->Map("name"->"Manish Pandit"))
someFamily: scala.collection.immutable.Map[Any,Map[Any,Any]] = Map(lobster1234 -> Map(name -> Manish Pandit))
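
To mirror the rest of that description (the gender column and the second row key below are made-up illustrations, not real data), adding a column to an existing row and adding another row both boil down to merging into the map-of-maps:

```scala
// Same immutable map-of-maps model as above. The second row key
// and the gender values are hypothetical, just for illustration.
var someFamily = Map[Any, Map[Any, Any]]()

// One "row": the row key maps to its own map of columns.
someFamily += ("lobster1234" -> Map("name" -> "Manish Pandit"))

// Adding a column to an existing row = merging into the inner map.
someFamily += ("lobster1234" ->
  (someFamily("lobster1234") + ("gender" -> "Male")))

// A second row key, with its own inner map of columns.
someFamily += ("janedoe42" -> Map("name" -> "Jane Doe", "gender" -> "Female"))

println(someFamily("lobster1234"))
// Map(name -> Manish Pandit, gender -> Male)
```

The same merge-into-the-inner-map move is what the CLI set statement is doing for us behind the scenes.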

Hope this clarifies, or at least gives you an image of, what a column family is like. The closest counterpart to this in MongoDB would be a collection. A collection contains a bunch of documents, with every document having an _id and key-value pairs, where keys are Strings and values can be of any type supported by the BSON spec (including another set of key-value pairs).

In Mongo terms, the above would look like:

use  myspace
db.someFamily.save({_id:"lobster1234","name":"Manish Pandit"})

Adding another value to this document looks something like:

 db.someFamily.update({_id:"lobster1234"},{$set:{"gender":"Male"}})

Going back to the error we saw when trying to insert a record into our column family (SomeFamily) - it appears that Cassandra is assuming that we are providing hex values instead of UTF8. So there has to be some way to tell Cassandra that the column family is expected to hold UTF8 data. Again, Google and the Cassandra documentation say that we need to provide additional parameters while creating the column family. Let's try this one more time. We can drop the SomeFamily column family so we can reuse the name.

[default@myspace] drop column family SomeFamily;
36cac489-882a-345f-94f4-10dfb61d9395

Note that column family names are case sensitive - I learnt this by messing with the case of the column family name. SomeFamily and somefamily would be two distinct column family names in a given keyspace.

So, let's change our create column family statement to account for UTF8 data as the key as well as the value(s), and try to insert a record.

[default@myspace] create column family SomeFamily with comparator=UTF8Type and key_validation_class = UTF8Type and default_validation_class=UTF8Type;
27b66cfa-76ce-340c-bd31-4079873368db
[default@myspace] set SomeFamily['lobster1234']['name']='Manish Pandit';
Value inserted.
Elapsed time: 4.1 msec(s).

We added comparator and validation classes to the definition. These are often referred to as comparators and validators. The data type of a column name is called a comparator. By contrast, the comparator in a relational database is in effect always String, because all the column names in a table have to be Strings. A validator defines the data type of column values. We used key_validation_class so that the key 'lobster1234' is validated against the specified type, and default_validation_class so that the value 'Manish Pandit' is validated against the specified type.

The default for comparators and validators is hex bytes, hence the error that we saw earlier.
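
To stretch the earlier Scala analogy (purely illustrative - this is not how Cassandra is implemented), declaring comparators and validators is a bit like tightening the loose Map[Any, Map[Any, Any]] into concrete types, so that ill-typed data is rejected up front instead of being treated as raw bytes:

```scala
// Rough analogy only: the three type slots of the map-of-maps
// line up with the three classes we set on the column family.
type RowKey      = String // key_validation_class = UTF8Type
type ColumnName  = String // comparator = UTF8Type
type ColumnValue = String // default_validation_class = UTF8Type

var someFamily = Map[RowKey, Map[ColumnName, ColumnValue]]()
someFamily += ("lobster1234" -> Map("name" -> "Manish Pandit"))

// someFamily += (42 -> Map("name" -> "oops"))
// ...would not compile, much like a non-UTF8 row key fails validation.
```

The compiler playing the role of the validator is the whole point of the analogy.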

Cassandra does support type-casting values, so even if we had not provided the default_validation_class during the creation of the column family, the following would work:

[default@myspace] set SomeFamily['mpandit.meetup']['name']=UTF8('MPandit Meetup');

Note the UTF8(..) conversion.

MongoDB does not need any of this, as it supports the BSON types for values and figures the type out at runtime. As long as the provided value is one of the BSON types, MongoDB will not complain. There is no type tied to a value. In other words, a key can hold different types of values in two documents within the same collection, like this (note that the value of the key someType is an int in one document, and a String in another):

> db.test.insert({"someType":2})
> db.test.insert({"someType":"Manish"})
> db.test.find()
{ "_id" : ObjectId("5147d3f995500e98b8f7a6cd"), "someType" : 2 }
{ "_id" : ObjectId("5147d3fd95500e98b8f7a6ce"), "someType" : "Manish" }

Also, the keys (or the names of the keys) in a MongoDB document can only be Strings.

To keep the discussion lively and the post length in check, I'll discuss static and dynamic column families and CRUD operations in the next post.

18 3 / 2013

Cassandra-CLI : Do not forget your semicolons!

I knew it was too good to be true. After being spoilt by Scala and the MongoDB client, I need to sober up and start using semicolons. I really hope there is some kind of switch I can use here, because this gets annoying.

mpandit$ ./cassandra-cli
Column Family assumptions read from /Users/mpandit/.cassandra-cli/assumptions.json
Connected to: "Test Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 1.2.2

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] help
...    
...	;
Getting around:
?                       Display this help.
help;                   Display this help.
quit;                   Exit this utility.

[default@unknown] quit
...    
...	
...	
...	;

Now let's build the databases and stick some data in. A db in Mongo == a keyspace in Cassandra. So, Mongo's show dbs is the same as show keyspaces; (do not forget the semicolon!) in Cassandra. However, this produces huge output, and unlike MongoDB it isn't JSON, so you cannot wrap it in a bit of javascript and filter it. But more data means more information - reading through the output, I am starting to see the various attributes that can be tuned and configured.

Creating a DB (keyspace) is explicit, unlike Mongo where it's implicit. In Mongo you can say use somedb, and the moment you save something in that context, like db.somecollection.save({"a":"b"}), that db will be created for you. In Cassandra, you create a keyspace explicitly, like so:

[default@unknown] create keyspace myspace;
e5bb6d2e-36ea-365b-85cc-065d9cf15941

This returns a hash, which I am not sure what to do with at this moment. I think it's the UUID for every entity that gets created within Cassandra. I could be wrong; we'll find out. Maybe it's something like MongoDB's getLastErrorObj, which is returned after every operation.

Now that we’ve created a keyspace, we use it. Similar to MongoDB’s use somedatabase, except that the keyspace has to exist before using it. 

[default@unknown] use myspace;
Authenticated to keyspace: myspace
[default@myspace] use myotherspace;
Keyspace 'myotherspace' not found.    

Notice the prompt has changed from default@unknown to default@myspace.

You can then use show schema; to introspect the keyspace. This is very much like MySQL's show create table, and I cannot imagine a MongoDB counterpart to this.

[default@myspace] show schema;
create keyspace myspace
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {datacenter1 : 1}
  and durable_writes = true;

use myspace;

You will also notice that the create statement shows a list of default settings. Since we specified none, this gives an idea of what options are available.

In the next post we will create Cassandra counterparts for Collections and Documents.

18 3 / 2013

Tiptoeing on Cassandra after running a MongoDB Marathon

As mentioned in my previous post, I am weeks away from starting at Netflix Engineering. One of the things that I absolutely have never touched is Apache Cassandra. While I do not think of myself as a NoSQL n00b, Cassandra is something I really never got around to.

With this change coming, I decided to play around with it, and I'll use this blog as a series to post my thoughts in the context of my MongoDB inertia.

Getting Started:

As with anything else, the best way to learn is to *not* RTFM. Let's see how far I can get with this.

I downloaded 1.2.2 and extracted it on my Mac. When running

bin/cassandra -f

I got a boatload of errors.

xss = -ea -javaagent:./../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms2048M -Xmx2048M -Xmn400M -XX:+HeapDumpOnOutOfMemoryError 

log4j:ERROR setFile(null,true) call failed. java.io.FileNotFoundException: /var/log/cassandra/system.log (No such file or directory)
........
........
INFO 23:54:25,132 Scheduling key cache save to each 14400 seconds (going to save all keys).
 INFO 23:54:25,133 Initializing row cache with capacity of 0 MBs and provider org.apache.cassandra.cache.SerializingCacheProvider
 INFO 23:54:25,141 Scheduling row cache save to each 0 seconds (going to save all keys).
ERROR 23:54:25,167 Failed to create /var/lib/cassandra/data/system/batchlog directory
ERROR 23:54:25,172 Failed to create /var/lib/cassandra/data/system/batchlog directory
ERROR 23:54:25,172 Failed to create /var/lib/cassandra/data/system/batchlog directory
ERROR 23:54:25,173 Failed to create /var/lib/cassandra/data/system/peer_events directory
ERROR 23:54:25,173 Failed to create /var/lib/cassandra/data/system/peer_events directory
ERROR 23:54:25,174 Failed to create /var/lib/cassandra/data/system/hints directory
ERROR 23:54:25,175 Failed to create /var/lib/cassandra/data/system/Schema directory
ERROR 23:54:25,176 Failed to create /var/lib/cassandra/data/system/schema_keyspaces directory
ERROR 23:54:25,177 Failed to create /var/lib/cassandra/data/system/schema_keyspaces directory
ERROR 23:54:25,177 Failed to create /var/lib/cassandra/data/system/schema_keyspaces directory
.....


Clearly I need to fix the folder locations. Usually all servers have a conf, config, or settings folder, and so does Cassandra - I found the conf folder. Looking at the logs, the server reads its config from a yaml file, and fortunately there is only one yaml file there. I guessed that the log location is in the log4j properties file. Next up: change the locations to ones the process has permission to write to.

README.txt    		cassandra-rackdc.properties	cassandra.yaml
cqlshrc.sample			log4j-tools.properties
cassandra-env.sh		cassandra-topology.properties
commitlog_archiving.properties	log4j-server.properties

I vi'd the cassandra.yaml file and started changing the locations, and did the same with log4j*.properties. Let's try using ~/cassandra/? instead of /var/? and see where we go with it.
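
For the record, the edits boiled down to something like this - a sketch, not gospel: the paths are just what I picked under my home directory, while the key names are the stock ones from the 1.2 cassandra.yaml:

```yaml
# cassandra.yaml - point everything at directories my user can write to
data_file_directories:
    - /Users/mpandit/cassandra/data
commitlog_directory: /Users/mpandit/cassandra/commitlog
saved_caches_directory: /Users/mpandit/cassandra/saved_caches
```

And in log4j-server.properties, the log4j.appender.R.File property points system.log somewhere similar.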

mpandit$ ./cassandra -f
xss =  -ea -javaagent:./../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms2048M -Xmx2048M -Xmn400M -XX:+HeapDumpOnOutOfMemoryError
 INFO 00:28:36,783 Logging initialized
 INFO 00:28:36,799 JVM vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.6.0_43
 INFO 00:28:36,799 Heap size: 2105540608/2105540608
 INFO 00:28:36,814 Loading settings from file:/Users/mpandit/software/apache-cassandra-1.2.2/conf/cassandra.yaml
 .....
 ......
 INFO 00:28:39,610 Binding thrift service to localhost/127.0.0.1:9160
 INFO 00:28:39,641 Using TFramedTransport with a max frame size of 15728640 bytes.
 INFO 00:28:39,652 Using synchronous/threadpool thrift server on localhost : 9160
 INFO 00:28:39,653 Listening for thrift clients...
 INFO 00:28:49,628 Created default superuser 'cassandra'
    

Great! Looks like we got our Cassandra server working. In the next post I'll google around to figure out the Cassandra CLI and create the MongoDB equivalents of dbs, collections, and documents. To be honest, I've no idea if these constructs are even valid or make sense in the Cassandra world - we will find out.

12 1 / 2012

Evolving Scala Ecosystem

After spending a few months with Scala, I am at a point where I've picked up quite a few tidbits on digging myself into a hole and managing to get out of it. So far I've found that sbt is a pain in the ass, maven is not as bad as it's made out to be, salat is a decent object (well, case class) to DBObject mapper, and lift is not that scary to tiptoe into.

IGN's Video API was the first service to roll onto the new stack. We call it v3 (apply for a job at IGN Engineering and I'd be more than happy to go over v1 and v2, which is a very interesting story, I promise).

The new stack has:

We use Newrelic to monitor performance, and memcached as our caching layer. The numbers were *very* impressive compared to where we ported the Video services from, which was good proof as well as a vote of confidence. Soon enough, we started rolling this new stack out to new services, as well as porting some old ones to it.

Along the way, we found that dealing with update-merges and juggling around DBObjects was turning out to be bug-prone and labor intensive. The solution? lift-mongodb-record. No more case classes, and easy transformation to/from JSON to top it. The next set of services are using this when it comes to talking to MongoDB.

One pain in the ass was sbt - when we tried to upgrade, we ran into incompatibilities with the Idea plugin and the jetty-web plugin.

mpandit-mbp:video-api mpandit$ sbt gen-idea
[info] Set current project to default (in build file:/Users/mpandit/.sbt/plugins/)
[info] Set current project to default (in build file:/Users/mpandit/work/git/video-api/project/plugins/)
[info] Set current project to default (in build file:/Users/mpandit/work/git/video-api/)
[info] Trying to create an Idea module default
[error] java.lang.NoSuchMethodError: sbt.Attributed.get(Lsbt/AttributeKey;)Lscala/Option;
[error] Use 'last' for the full log.
mpandit-mbp:video-api mpandit$

The hacks to fix this (per google search) were too time consuming - after many failed attempts, frankly, we had to move on. We like to keep things simple, and since we were not writing custom sbt plugins, we'd rather stick to what works. At this point we decided to brush the dust off maven, and surprise surprise - it worked like a charm. We plan to stick with Maven for now, as we do not touch the configuration that often anyway. Also, ScalaIDE 2.0 from Typesafe made Scala development in Eclipse fairly easy (compared to the super buggy earlier releases), and I was able to get back to my favorite IDE for my newfound favorite programming language.

It's going great so far, and we're very happy with the progress on the new stack. Needless to say there is a lot more we can improve on, but that's all part of the learning! I am sure the stack will change going forward as we learn more through pair programming and tech-talks. We are not wrangling implicits and other advanced (academic? cryptic?) scala capabilities yet, but given the progress the team has made so far, it's not too far on the horizon.

08 10 / 2011

Silicon Valley Code Camp is today, and I am talkin’

Today is SVCC 2011, and I am giving an introductory talk on MongoDB, and another one on the Play! Framework. I have been a speaker at SVCC for the last 2 years, but had never imagined giving 2 talks in a day. With a lot of Red Bull and coffee, I am all set. See you at Foothill College in Los Altos Hills!

03 10 / 2011

Lightning Talk at NoSQL Camp SF

I had an opportunity to deliver a lightning talk at SF NoSQL Camp 2011. It's amazing how much can be covered in 5 minutes! Here are the slides, and here is a great writeup about the session.

24 5 / 2011

My talk from MongoSF 2011

Had an awesome day at MongoSF 2011, and I would like to thank 10gen for giving me the opportunity to speak at their conference. Here is the video of my talk at MongoSF 2011 on how MongoDB is used at IGN. Great event!

The slides are available on slideshare too.