13 9 / 2013

My talk from Scala at Netflix - 9/9/2013, Scala Bay Meetup.

11 4 / 2013

Migrating a Perforce repo into git

Here is the git version that I used for this, on a Mac running OS X 10.8.3.

$ git --version
git version 1.8.2

It is pretty much a one-liner thereafter.

$ git p4 clone //path_to_p4_repo@all /tmp/path_to_git_repo

Note the @all at the end of the P4 path; it makes the clone grab the full commit history of the P4 repo.

Once the clone is done, cd into the git repo and you’ll notice the .git folder there. To push it to a remote, do this:

Step 1. Create a remote repo to push the newly created git repo to. Let’s call it some_git_project. I assume you’re in the newly created git repo folder for step 2.

Step 2.  

$ git remote add origin http://user@server/path/some_git_project.git

Step 3.  

$ git push origin master

You may see this error during step 3 if the code base and/or the revision history is huge.

Delta compression using up to 8 threads.
Compressing objects: 100% (427/427), done.
Writing objects: 100% (2132/2132), 15.92 MiB | 17.65 MiB/s, done.
Total 2132 (delta 1163), reused 2132 (delta 1163)
Unable to rewind rpc post data - try increasing http.postBuffer
error: RPC failed; result=65, HTTP code = 302
fatal: The remote end hung up unexpectedly
fatal: The remote end hung up unexpectedly

This can be fixed by running the command below (thanks to the post here) and then retrying Step 3.

$ git config http.postBuffer 209715200
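
For reference, the value given to http.postBuffer is a byte count, and 209715200 works out to 200 MiB. A quick check in Python (the helper name mib_to_bytes is just for illustration):

```python
def mib_to_bytes(mib: int) -> int:
    """Convert mebibytes to bytes (1 MiB = 1024 * 1024 bytes)."""
    return mib * 1024 * 1024

# The value used above corresponds to a 200 MiB buffer.
post_buffer = mib_to_bytes(200)
print(post_buffer)  # 209715200
```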

Enjoy!

18 3 / 2013

Cassandra : Demystifying Column Families

The last 2 posts were fairly trivial, but now comes the fun part. I am seeing a lot of UTF8Type and validation_class type of syntax, which tells me (at least initially) that there is a lot of flexibility when it comes to creating structures and types. I will find out what these things are - but without getting into a lot of details, from whatever documentation I could google, a column family is what defines the structure of the document. In other words, it’s a family of columns, just like a relational table is. For Mongo folks, this would be hard to grasp as there is no concept of a column or a group of columns there. All Mongo has is a collection to group a bunch of documents, which, in turn, have keys (the closest thing to columns). So, a family has a name, and you define the columns that form that family. This felt super odd to me till I found out that I can also have dynamic families, where I do not have to specify the columns ahead of time. We’ll get there, and we’ll also try to add/delete columns from a family that we declare.

Initial instinct would be to just declare a column family with no columns.

[default@myspace] create column family SomeFamily;
9b08e2e8-f0d8-3788-a23c-0c4a720b766c

Apparently it worked. Now, this just means that we’ve run something similar to SQL’s create table SomeFamily. What’s different is that the SQL version would not work, since we did not specify any columns.

Inserting data into a column family is similar to inserting data into a map. Let’s analyze the statement below (it will not actually work; we’ll find out why soon).

[default@myspace] set SomeFamily['lobster1234']['name']='Manish Pandit';
org.apache.cassandra.db.marshal.MarshalException: cannot parse 'name' as hex bytes

Looking at the set statement, we’re using a key (which is my Yahoo! ID, lobster1234) and setting name to the value of my full name. This looks familiar, doesn’t it? Like a map of maps or a multimap. The key is the Yahoo! ID, and the value of this key is another map, with name as the key and the name of the user as the value. I can also add another key-value pair to the value map, this time adding a gender. Now if I were to extend it, I can keep adding more folks with their Yahoo! IDs as keys to the multimap, with the value of each key being another map, whose keys are name and gender. A point to be noted here is that this key (lobster1234) is referred to as a row key, as it identifies the key-value pairs that are mapped to it. Like a key in a multimap.

In Scala, this is what SomeFamily boils down to:

scala> var someFamily = Map[Any,Map[Any,Any]]()
someFamily: scala.collection.immutable.Map[Any,Map[Any,Any]] = Map()

scala> someFamily = someFamily+("lobster1234"->Map("name"->"Manish Pandit"))
someFamily: scala.collection.immutable.Map[Any,Map[Any,Any]] = Map(lobster1234 -> Map(name -> Manish Pandit))
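
The same picture carries over to Python dictionaries. The sketch below is only a mental model (nothing here talks to Cassandra). It adds the gender column described above and a second row key, mpandit.meetup, borrowed from later in this post. The set_column helper is a made-up name that mimics the shape of the CLI’s set statement.

```python
# A column family modelled as a map of maps: row key -> {column name -> value}.
some_family = {}

def set_column(family, row_key, column, value):
    """Mimic `set Family[row_key][column] = value` on the in-memory model."""
    family.setdefault(row_key, {})[column] = value

set_column(some_family, "lobster1234", "name", "Manish Pandit")
set_column(some_family, "lobster1234", "gender", "Male")
set_column(some_family, "mpandit.meetup", "name", "MPandit Meetup")

# Rows do not have to share the same set of columns.
print(some_family["lobster1234"])
print(sorted(some_family))
```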

Hope this clarifies or at least gives you an image of what a column family is like. The closest counterpart to this for MongoDB would be a collection. A collection contains a bunch of documents, with every document having an _id, and key-value pairs, where keys are Strings and values can be of the types supported by BSON spec (including another key-value pair).

In Mongo terms, the above would look like:

use myspace
db.someFamily.save({_id:"lobster1234","name":"Manish Pandit"})

Adding another value to this document will be something like

db.someFamily.update({_id:"lobster1234"},{$set:{"gender":"Male"}})

Going back to the error we saw when trying to insert a record in our column family (SomeFamily) - it appears that Cassandra is assuming that we are providing hex values instead of UTF8. So there has to be some way to tell Cassandra that the column family is expected to hold UTF8 data. Again, Google and the Cassandra documentation say that we need to provide additional parameters while creating the column family. Let’s try this one more time. We can delete the SomeFamily column family so we can reuse the name.

[default@myspace] drop column family SomeFamily;
36cac489-882a-345f-94f4-10dfb61d9395

Note that column family names are case sensitive - I learnt it by messing with the case of the column family name. SomeFamily and somefamily would be two unique column family names in a given keyspace.

So, let’s change our create column family statement to account for UTF8 data as the key as well as the value(s), and try to insert a record.

[default@myspace] create column family SomeFamily with comparator=UTF8Type and key_validation_class = UTF8Type and default_validation_class=UTF8Type;
27b66cfa-76ce-340c-bd31-4079873368db
[default@myspace] set SomeFamily['lobster1234']['name']='Manish Pandit';
Value inserted.
Elapsed time: 4.1 msec(s).

We added a comparator and validation classes to the definition. These are often referred to as comparators and validators. The data type of the column name is called a comparator. For comparison, the comparator in a relational database is always String, because all the column names in a table have to be Strings. A validator defines the data type of the column values. We used key_validation_class so that the key 'lobster1234' can be validated against the specified type, and default_validation_class so that the value 'Manish Pandit' can be validated against the specified type.

The default for comparators and validators is BytesType, which the CLI reads as hex bytes, hence the error that we saw earlier.
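
To see why 'name' was rejected, it helps to try the same parse by hand. The Python sketch below assumes the CLI treats an untyped column name as a hex string, which is what the error message suggests; it is not Cassandra’s own parsing code.

```python
column_name = "name"

# 'n' and 'm' are not hex digits, so parsing the text as hex fails,
# much like the MarshalException from the first attempt.
try:
    bytes.fromhex(column_name)
    parsed_as_hex = True
except ValueError:
    parsed_as_hex = False

# The UTF-8 bytes of the same text, written out as hex.
utf8_hex = column_name.encode("utf-8").hex()

print(parsed_as_hex)  # False
print(utf8_hex)       # 6e616d65
```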

Cassandra does support type-casting values, so if we did not provide the default_validation_class during the creation of the column family, the following would work:

[default@myspace] set SomeFamily['mpandit.meetup']['name']=UTF8('MPandit Meetup');

Note the UTF8(..) conversion.
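
One way to picture how the comparator and the two validators fit together is a tiny in-memory model. Everything in the Python sketch below is hypothetical (the ColumnFamily class, its parameter names, and the utf8 and hex_bytes checkers). It only imitates the behaviour seen in this post and is not how Cassandra is implemented.

```python
def utf8(text):
    """Accept any string and store its UTF-8 bytes."""
    return text.encode("utf-8")

def hex_bytes(text):
    """Accept only strings of hex digits; raises ValueError otherwise."""
    return bytes.fromhex(text)

class ColumnFamily:
    def __init__(self, comparator=hex_bytes, key_validator=hex_bytes,
                 default_validator=hex_bytes):
        self.comparator = comparator                # type of column names
        self.key_validator = key_validator          # type of row keys
        self.default_validator = default_validator  # type of column values
        self.rows = {}

    def set(self, key, column, value):
        # Check every part against its own type before storing anything.
        k = self.key_validator(key)
        c = self.comparator(column)
        v = self.default_validator(value)
        self.rows.setdefault(k, {})[c] = v

typed = ColumnFamily(comparator=utf8, key_validator=utf8, default_validator=utf8)
typed.set("lobster1234", "name", "Manish Pandit")  # accepted

untyped = ColumnFamily()  # everything defaults to hex bytes
try:
    untyped.set("lobster1234", "name", "Manish Pandit")
    rejected = False
except ValueError:
    rejected = True
print(rejected)  # True
```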

MongoDB does not need any of this, as it supports the BSON types for the values, and figures the type out at runtime. As long as the provided value is one of the BSON types, MongoDB will not complain. There is no type tied to a key. In other words, a key can hold values of different types in 2 documents within the same collection, like this (note that the value of the key someType is a number in one document, and a String in another).

> db.test.insert({"someType":2})
> db.test.insert({"someType":"Manish"})
> db.test.find()
{ "_id" : ObjectId("5147d3f995500e98b8f7a6cd"), "someType" : 2 }
{ "_id" : ObjectId("5147d3fd95500e98b8f7a6ce"), "someType" : "Manish" }

Also, the keys (or the names of the keys) in a MongoDB document can only be Strings.

To keep the discussion lively and post length in check, I’ll discuss static and dynamic column families and CRUD operations in the next post.

18 3 / 2013

Cassandra-CLI : Do not forget your semicolons!

I knew it was too good to be true. After being spoilt by Scala and the MongoDB client, I need to sober up and start using semicolons. I really hope there is some kind of switch I can use here, because this gets annoying.

mpandit$ ./cassandra-cli
Column Family assumptions read from /Users/mpandit/.cassandra-cli/assumptions.json
Connected to: "Test Cluster" on 127.0.0.1/9160
Welcome to Cassandra CLI version 1.2.2

Type 'help;' or '?' for help.
Type 'quit;' or 'exit;' to quit.

[default@unknown] help
...    
...	;
Getting around:
?                       Display this help.
help;                   Display this help.
quit;                   Exit this utility.

[default@unknown] quit
...    
...	
...	
...	;
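
The transcript above suggests the CLI keeps collecting input lines until it sees a terminating semicolon (the help text lists ? as the one command without it). The Python sketch below is a toy model of that behaviour, not the CLI’s actual code; read_statements is a made-up helper.

```python
def read_statements(lines):
    """Group input lines into statements, each ended by a ';'."""
    statements, pending = [], []
    for line in lines:
        pending.append(line.strip())
        if line.rstrip().endswith(";"):
            # Join the buffered lines and drop the terminator.
            text = " ".join(part for part in pending if part)
            statements.append(text[:-1].strip())
            pending = []
    return statements

# 'help' alone is not a complete statement until ';' arrives.
print(read_statements(["help", "", ";", "quit", "", "", ";"]))
# ['help', 'quit']
```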

Now let’s build the databases and stick some data in. A db in Mongo == a keyspace in Cassandra. So, a Mongo show dbs is the same as show keyspaces; (do not forget the semicolon!) in Cassandra. However, this produces huge output and, unlike MongoDB’s, it isn’t JSON, so you cannot wrap it in JavaScript and filter it. But more data means more information - reading through the output, I am starting to see the various attributes that can be tuned and configured.

Creating a DB (keyspace) is explicit, unlike Mongo where it’s implicit. In Mongo you can say use somedb and the moment you save something in that context, like db.somecollection.save({"a":"b"}), that db will be created for you. In Cassandra, you create a keyspace explicitly, like so:

[default@unknown] create keyspace myspace;
e5bb6d2e-36ea-365b-85cc-065d9cf15941

This returns a hash, which I am not sure what to do with at the moment. I think it’s the UUID for every entity that gets created within Cassandra. I could be wrong, we’ll find out. Maybe it’s something like MongoDB’s getLastErrorObj, which is returned after every operation.
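
The guess that the value is a UUID can at least be checked mechanically. The Python sketch below parses the four identifiers printed in these posts with the standard uuid module. It confirms only that they are well-formed UUIDs and reports their version field; it says nothing about what Cassandra uses them for.

```python
import uuid

# Identifiers echoed back by the CLI after create/drop statements.
returned = [
    "e5bb6d2e-36ea-365b-85cc-065d9cf15941",
    "9b08e2e8-f0d8-3788-a23c-0c4a720b766c",
    "36cac489-882a-345f-94f4-10dfb61d9395",
    "27b66cfa-76ce-340c-bd31-4079873368db",
]

# uuid.UUID raises ValueError for malformed input.
versions = [uuid.UUID(value).version for value in returned]
print(versions)  # [3, 3, 3, 3]
```

Version 3 is the name-based (MD5) flavour of UUID, as opposed to the time-based or random ones.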

Now that we’ve created a keyspace, we use it. Similar to MongoDB’s use somedatabase, except that the keyspace has to exist before using it. 

[default@unknown] use myspace;
Authenticated to keyspace: myspace
[default@myspace] use myotherspace;
Keyspace 'myotherspace' not found.    

Notice the prompt has changed from default@unknown to default@myspace.
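
The implicit versus explicit difference can be pictured with plain Python containers. This is an analogy only: a defaultdict stands in for Mongo’s create-on-first-write behaviour, and a small hypothetical Cluster class stands in for Cassandra’s explicit create keyspace.

```python
from collections import defaultdict

# Mongo-like: a database appears the first time something is saved into it.
mongo = defaultdict(lambda: defaultdict(list))
mongo["somedb"]["somecollection"].append({"a": "b"})

class Cluster:
    """Cassandra-like: a keyspace must be created before it can be used."""

    def __init__(self):
        self.keyspaces = {}
        self.current = None

    def create_keyspace(self, name):
        self.keyspaces[name] = {}

    def use(self, name):
        if name not in self.keyspaces:
            return f"Keyspace '{name}' not found."
        self.current = name
        return f"Authenticated to keyspace: {name}"

cluster = Cluster()
cluster.create_keyspace("myspace")
print(cluster.use("myspace"))
print(cluster.use("myotherspace"))
```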

You can then use show schema; to introspect the keyspace. This is very much like show create table of MySQL and I cannot imagine a MongoDB counterpart of this.

[default@myspace] show schema;
create keyspace myspace
  with placement_strategy = 'NetworkTopologyStrategy'
  and strategy_options = {datacenter1 : 1}
  and durable_writes = true;

use myspace;

Also, notice that the create statement lists the default settings. Since we specified none, this gives an idea of which options are available.

In the next post we will create Cassandra counterparts for Collections and Documents.

18 3 / 2013

Tiptoeing on Cassandra after running a MongoDB Marathon

As mentioned in my previous post, I am weeks away from starting at Netflix Engineering. One of the things that I absolutely have never touched is Apache Cassandra. While I do not think of myself as a NoSQL n00b, Cassandra is something I really never got around to.

With this change, I decided to play around with it, and I’ll use this blog as a series to post my thoughts in the context of MongoDB inertia.

Getting Started:

As with anything else, the best way to learn is to *not* RTFM. Let’s see how far I can get with this.

I downloaded 1.2.2 and extracted it on my Mac. When running

bin/cassandra -f

I did get a boatload of errors.

xss = -ea -javaagent:./../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms2048M -Xmx2048M -Xmn400M -XX:+HeapDumpOnOutOfMemoryError 

log4j:ERROR setFile(null,true) call failed. java.io.FileNotFoundException: /var/log/cassandra/system.log (No such file or directory)
........
........
INFO 23:54:25,132 Scheduling key cache save to each 14400 seconds (going to save all keys).
 INFO 23:54:25,133 Initializing row cache with capacity of 0 MBs and provider org.apache.cassandra.cache.SerializingCacheProvider
 INFO 23:54:25,141 Scheduling row cache save to each 0 seconds (going to save all keys).
ERROR 23:54:25,167 Failed to create /var/lib/cassandra/data/system/batchlog directory
ERROR 23:54:25,172 Failed to create /var/lib/cassandra/data/system/batchlog directory
ERROR 23:54:25,172 Failed to create /var/lib/cassandra/data/system/batchlog directory
ERROR 23:54:25,173 Failed to create /var/lib/cassandra/data/system/peer_events directory
ERROR 23:54:25,173 Failed to create /var/lib/cassandra/data/system/peer_events directory
ERROR 23:54:25,174 Failed to create /var/lib/cassandra/data/system/hints directory
ERROR 23:54:25,175 Failed to create /var/lib/cassandra/data/system/Schema directory
ERROR 23:54:25,176 Failed to create /var/lib/cassandra/data/system/schema_keyspaces directory
ERROR 23:54:25,177 Failed to create /var/lib/cassandra/data/system/schema_keyspaces directory
ERROR 23:54:25,177 Failed to create /var/lib/cassandra/data/system/schema_keyspaces directory
.....
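
Every one of these errors points at a location under /var that the current user could not write to. A small Python sketch for checking a set of paths before starting the server follows. can_create is a hypothetical helper, and the two paths are simply the parent locations that appear in the log above.

```python
import os

def can_create(path):
    """Return True if directory `path` is writable or could be created.

    Walks up to the nearest existing ancestor and checks that the
    current user may write to (and traverse) it.
    """
    probe = os.path.abspath(path)
    while not os.path.exists(probe):
        parent = os.path.dirname(probe)
        if parent == probe:  # reached the filesystem root
            break
        probe = parent
    return os.path.isdir(probe) and os.access(probe, os.W_OK | os.X_OK)

for location in ["/var/log/cassandra", "/var/lib/cassandra/data"]:
    print(location, "ok" if can_create(location) else "not writable")
```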


Clearly I need to fix the folder locations. Usually all servers have a conf or config or settings folder. So does Cassandra. I found the conf folder. Looking at the logs, it read the config from a yaml file. Fortunately there is only one yaml file there. I guessed the log location is in the log4j properties file. Next up: changing these to locations where the process has permission to write.

README.txt    		cassandra-rackdc.properties	cassandra.yaml
cqlshrc.sample			log4j-tools.properties
cassandra-env.sh		cassandra-topology.properties
commitlog_archiving.properties	log4j-server.properties

I vi’d the cassandra.yaml file and started changing the locations. Same with log4j*.properties. Let’s try using ~/cassandra/? instead of /var/? and see where we go with it.

mpandit$ ./cassandra -f
xss =  -ea -javaagent:./../lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms2048M -Xmx2048M -Xmn400M -XX:+HeapDumpOnOutOfMemoryError
 INFO 00:28:36,783 Logging initialized
 INFO 00:28:36,799 JVM vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.6.0_43
 INFO 00:28:36,799 Heap size: 2105540608/2105540608
 INFO 00:28:36,814 Loading settings from file:/Users/mpandit/software/apache-cassandra-1.2.2/conf/cassandra.yaml
 .....
 ......
 INFO 00:28:39,610 Binding thrift service to localhost/127.0.0.1:9160
 INFO 00:28:39,641 Using TFramedTransport with a max frame size of 15728640 bytes.
 INFO 00:28:39,652 Using synchronous/threadpool thrift server on localhost : 9160
 INFO 00:28:39,653 Listening for thrift clients...
 INFO 00:28:49,628 Created default superuser 'cassandra'
    

Great! Looks like we got our Cassandra server working. The next post is where I’ll google around to figure out the Cassandra CLI and create the MongoDB equivalent of dbs, collections, and documents. To be honest I’ve no idea if these constructs are even valid, or make sense in the Cassandra world - we will find out.

15 3 / 2013

So long, IGN - Hello Netflix!

After a long hiatus from the blogosphere, I am back. A lot has changed. IGN got acquired by Ziff Davis, and I, along with a lot of awesome engineers and friends, was let go as a part of the restructuring during the last week of February. Life happens.

After a few weeks of rigorously interviewing at several companies in the SF Bay Area, I accepted the offer from Netflix. Not only do I love the product (Netflix subscriber since 2000), but the technology brand that Netflix carries is like none other. Their culture and talent density intrigue me, and I cannot wait to start at a company with such well defined, practical values, and a track record of following them religiously.

Technology wise, Netflix is the poster child for AWS deployments, and distributed, highly available architectures. They have also started the Netflix OSS initiative. The initiative involves open sourcing parts of Netflix’s tech stack. The OSS roadmap is very exciting, and so are the projects that have already been open sourced. The number of forks and stars tells the same story. Too bad I was 80 miles north (in SF) when the OSS meetups happened in Los Gatos, but now I’ve all the reasons and more to be a part of them.

One of the first things I would like to do is hiring - if you want to be a part of the Netflix Engineering Team, and are as intrigued about the company and the culture as I am - please let me know. Take a look at the Netflix Technology Blog, Netflix OSS repos - and drop me a line!

26 9 / 2012

On Thursday, 10/04/2012 : Evolving IGN’s API on Scala


02 4 / 2012

Happy Birthday to my daughter, Tanvi!


17 3 / 2012

Lone Hydrant at Yosemite View Lodge in Yosemite, CA


16 3 / 2012

roybahat:

Or, the return of the human.

There was (once upon a time, in the early 2000’s) a generation of Internet products built on the premise that if you mastered Google’s search rankings, you could grow traffic. Services like IMDb, About.com, Wikipedia, etc. provided value to their users, but were…
