Ryan Angilly

Hi, I'm Ryan Angilly. I'm a founder of Ramen. I'm a geek.

I founded Signal Genius.

I blogged about my failed startup, MessageSling, at The Day Series.

3 reasons to use MongoDB

Note: This is a precursor post to my talk at Mongo Boston on September 20th.  It’s gonna be at Microsoft’s NERD (which I hear is COMPLETELY AWESOME).  If you haven’t signed up yet, stop being lame.  It’s gonna be awesome.  http://www.10gen.com/conferences/mongoboston2010

People have asked me why to use MongoDB.  I used to answer with “it’s SO FREAKING AMAZING!!”, and talk about how “new” and “cool” and “hawt” it felt.  I would wax on about how it felt like the first time I used Rails and so forth.

That was cheating.  It’s a crappy answer of no value.  So today, I am going to give you three reasons to use MongoDB over MySQL, Postgres, Tokyo Cabinet, or CouchDB.

1. Simple queries

MongoDB is a document store with no transactions and no joins.  When an application warrants using this type of database[1], the result is that your queries become much simpler.  They are easier to write.  They are easier to tune.  They make it easier for developers to do their job.  In Punchbowl land, ‘users’ have ‘events.’  There is a table for each, with a user_id on the events table.  Let’s say I want to get all the users who have published an event.

In an SQL database, I have two tables: users and events.  I could write this query like so:

SELECT `users`.* FROM `users` INNER JOIN `events` ON `events`.`user_id` = `users`.`id` WHERE `events`.`published_at` IS NOT NULL GROUP BY `users`.`id`;

Analogously, in a MongoDB database, let’s say I have just one collection: users.  Each user document has an attribute called ‘events’, which is a list of embedded documents.  It looks something like this in JSON:

{
  "name" : "Ryan Angilly",
  "events" : [
    {
      "title" : "First one!"
    },
    {
      "title" : "Whoa!"
    },
    {
      "title" : "Oh hi",
      "published_at" : true
    }
  ]
}

To perform the same query in MongoDB query syntax:

db.users.find( { 'events.published_at': {$ne: null} } )

Simpler.  Simpler to read.  Simpler to write.  I glossed over the fact that we are drastically changing how we store our data, but that’s the whole point.  And you can clearly see how it makes things easier to understand.
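If you want to poke at the difference without installing anything, here is a rough sketch in Python.  It uses the standard library's sqlite3 for the two-table version and plain dicts for the embedded-document version.  The sample users, emails, and dates are made up for illustration, and the list comprehension only mimics the intent of the query ("users with at least one published event"); it is not how MongoDB evaluates it.

```python
import sqlite3

# Relational shape: two tables and a join.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, email TEXT)")
conn.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, user_id INTEGER, "
    "title TEXT, published_at TEXT)"
)
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(1, "Ryan", "ryan@example.com"), (2, "Sam", "sam@example.com")],
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        (1, 1, "First one!", None),
        (2, 1, "Oh hi", "2010-09-01"),
        (3, 2, "Draft", None),
    ],
)
rows = conn.execute(
    "SELECT DISTINCT users.name FROM users "
    "INNER JOIN events ON events.user_id = users.id "
    "WHERE events.published_at IS NOT NULL"
).fetchall()
relational_names = [row[0] for row in rows]

# Document shape: events live inside each user, so there is nothing to join.
user_docs = [
    {
        "name": "Ryan",
        "events": [
            {"title": "First one!"},
            {"title": "Oh hi", "published_at": "2010-09-01"},
        ],
    },
    {"name": "Sam", "events": [{"title": "Draft"}]},
]
document_names = [
    user["name"]
    for user in user_docs
    if any(event.get("published_at") is not None for event in user["events"])
]

print(relational_names, document_names)
```

Both versions pick out the same user.  The second one just never has to stitch two tables back together.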

2. Sharding

Sharding is a simple concept.  If you have lots of data and you are getting disk-bound and/or running out of space, take your data and split it across several machines.  You get more disk throughput and more storage.  In a perfect world, as your storage and performance needs grow, just add more shards.

MongoDB is pretty close to this perfect world.  If you have a mongod process running, and you want to set up sharding, you:

1) Bring up a new machine
2) Start a new mongod process to act as a member of your shard cluster
3) Start a new mongod process to act as a separate ‘config’ database for maintaining configuration information about which data are in which shard
4) Start a mongos process & tell it how to find the current db, the new shard member, the config database
5) Enter ~5 commands to enable sharding on whichever databases and collections you want
6) Modify your apps to connect to the mongos process instead of the old mongod process
7) Profit.

All interprocess communication is done over IP, so the configuration mongod process and mongos process can run on their own machines or run on the same machine as one of your shard members.  This can be done with no downtime, guys.  And you don’t have to have an eye towards sharding when you start.  You can take a regular old mongod process and it will “just work.”
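To make the moving parts a little more concrete, here is a toy model in Python of the three roles described above: shard members that hold data, a config store that knows which key ranges live on which shard, and a router that apps talk to instead of a single database.  The class names, the split point, and the shard names are all invented for illustration.  This is a sketch of the idea, not MongoDB's actual implementation.

```python
import bisect


class ConfigDB:
    """Toy stand-in for the 'config' database: maps key ranges to shards."""

    def __init__(self, split_points, shard_names):
        # N sorted split points divide the key space into N + 1 ranges.
        if len(shard_names) != len(split_points) + 1:
            raise ValueError("need exactly one more shard than split points")
        self.split_points = split_points
        self.shard_names = shard_names

    def shard_for(self, key):
        # Find which range the key falls into and return that range's shard.
        return self.shard_names[bisect.bisect_right(self.split_points, key)]


class Router:
    """Toy stand-in for mongos: apps talk to this, not to a shard directly."""

    def __init__(self, config):
        self.config = config
        # Each shard is just a dict here; in real life it is a mongod process.
        self.shards = {name: {} for name in config.shard_names}

    def insert(self, key, doc):
        self.shards[self.config.shard_for(key)][key] = doc

    def find(self, key):
        return self.shards[self.config.shard_for(key)].get(key)


# Keys below "m" go to shard_a, everything else goes to shard_b.
config = ConfigDB(split_points=["m"], shard_names=["shard_a", "shard_b"])
router = Router(config)
for email in ["alice@example.com", "zed@example.com", "mona@example.com"]:
    router.insert(email, {"email": email})

print({name: sorted(docs) for name, docs in router.shards.items()})
```

The app only ever calls the router, which is why step 6 is the only change your application code has to see.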

There are solutions to this problem in MySQL [2], but they require massaging data at a layer above the database.  The database itself does not support this feature.  Also, you don’t have to think about sharding until you need it.  You don’t have to pre-optimize.  When you don’t need sharding, just start up a mongod process and go.  When you do need sharding, fire up a few more machines, and issue a few commands.  No downtime.

A common quip that I’ve heard and read is something to the effect of: “How many people reading this post actually have enough data to worry about the need to shard?  Not many.”  My response to this is simple.  Most people who use MySQL w/ master/slave replication probably don’t need that either.  Lots of apps could get away with sqlite and a cronjob that backed the file up every hour.  But MySQL & master/slave replication is the status quo, so we all do it.  Now, think about HD video, geolocation, realtime messaging, augmented reality, closer-to-realtime-satellite imagery.  Think about all that data, and how much faster people will want it (and mashups & derivatives of it), 5 years from now.  Then think about what database you want to start using right now.

3. GridFS

For reasons that I’m not experienced enough to talk about, you don’t store files in MySQL.  Let’s say you have an application where a user can upload a profile pic.  The standard practice is to store the file on the filesystem or S3, and store the path to that file in the database.  If you use a filesystem, some kind of backup is usually performed as well, and if you have multiple app servers, it has to be a shared filesystem.

With GridFS, you store files in the database.  MongoDB was built to do this.  Why is this a “reason to use MongoDB”?  Because MongoDB has replication and sharding of collections built in.  And guess what?  You can apply that stuff seamlessly to GridFS collections as well.  When you store assets in MongoDB, you get all the replication and sharding capability for free.  Want to back up your user assets?  Just replicate the GridFS collections.  Running out of space on your NFS share?  Have fun dealing with IT.  Running out of space on your MongoDB GridFS installation?  Bring up another machine and shard that collection.
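The reason files can ride along with replication and sharding is that GridFS keeps them as ordinary documents: one metadata document per file, plus a series of chunk documents holding the bytes.  Here is a small in-memory sketch of that layout in Python.  The put_file and get_file helpers, the tiny chunk size, and the sample bytes are made up for illustration; the field names (files_id, n, data, length, chunkSize) follow the GridFS layout as I understand it, and a real driver does more than this.

```python
def put_file(files, chunks, file_id, filename, data, chunk_size=4):
    """Store one metadata document and the file's bytes as numbered chunks."""
    files.append(
        {
            "_id": file_id,
            "filename": filename,
            "length": len(data),
            "chunkSize": chunk_size,
        }
    )
    for n, start in enumerate(range(0, len(data), chunk_size)):
        chunks.append(
            {"files_id": file_id, "n": n, "data": data[start:start + chunk_size]}
        )


def get_file(chunks, file_id):
    """Reassemble a file by reading its chunks back in order."""
    parts = sorted(
        (chunk for chunk in chunks if chunk["files_id"] == file_id),
        key=lambda chunk: chunk["n"],
    )
    return b"".join(chunk["data"] for chunk in parts)


# Two plain lists stand in for the files and chunks collections.
files, chunks = [], []
put_file(files, chunks, 1, "avatar.png", b"not really a png")

print(len(chunks), get_file(chunks, 1))
```

Since the chunks are just documents in a collection, anything you can do to a collection (replicate it, shard it) you can do to your files.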

Storing assets in a database is the way we should be doing it from now on.

FINI

So there you have it.  MongoDB is teh awesome because of a simple query syntax, the ability to shard data across machines easily, and the ability to store files in GridFS while taking advantage of replication & sharding.

If you can make it to Mongo Boston, sign up and come say hi.  I’ll be the guy getting lynched by the MySQL and CouchDB fanboys.

[1] “Well when the crap is that?!” you may be asking.  Look for my next post about when you should be using MongoDB.

[2] http://axonflux.com/mysql-sharding-for-5-billion-p