Gordon Myers

Articles on Life, Truth, Love, Computers, and Music


How to do Joins in MongoDB

If you've come here looking how to perform a JOIN operation on two or more collections in MongoDB, you've come to the wrong place. In fact, that is exactly what I'm about to show you how to do, but trust me, you shouldn't be doing it. If you're trying to do that, that's a clear indication that you have relational data. And if you have relational data, that means you've made the wrong choice for your database. Don't use Mongo; use a relational database instead. I know it's cool and shiny and new, but that is not a good rationale to use it. SQL is your best bet. I'd recommend you read Why You Should Never Use MongoDB by Sarah Mei. Now, with that disclaimer out of the way, I'll dive right into it.

We've been using MongoDB for some of our data storage at work. The choice to use Mongo was a decision made well before I arrived, and incidentally, we might be changing that out at some point in the future. But nevertheless, as it's the system currently in place, I've had to work within that system.

There were a number of pre-established Mongo queries in the codebase I've been working on, and I'm sorry to say many of them were really quite slow. So over the course of a weekend, I tried out a couple of ideas that seemed intuitive enough and managed to speed up some of these common queries by an order of magnitude. The queries I'm talking about grabbed data from multiple collections simultaneously, hence why they were initially so slow, and hence the title of this blog post. In this post I'm going to dissect a number of the techniques I used to speed up these Mongo queries, with plenty of code examples along the way.

Let's say you've got a collection in Mongo called Transactions. This table has a variety of fields on each row, including one field called userId, which is just the ObjectID (aka foreign key, for you SQL folks) of a document in the separate Users collection. You might want to retrieve a list of transactions in a given time period, and show some information on the screen, like the date, the total amount, and the first and last name of that user. But for this first part, let's hold off on any attempts at JOINs, and just look at accessing the Transactions collection alone.

I ran some benchmarks with the following code on my local machine, which was also running a local instance of MongoDB.

MongoClient conn = new MongoClient(new ServerAddress("127.0.0.1", 27017));
DB db = conn.getDB("MyDatabase");
DBCollection transactions = db.getCollection("Transactions");

SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd");
Date startDate = f.parse("2014-06-01");
DBObject gt = new BasicDBObject("$gt", startDate);
DBObject match = new BasicDBObject("created", gt);

Cursor cursor = transactions.find(match);
List<Map> rows = new ArrayList<>();

while ( cursor.hasNext() ) {
    Map<String, Object> row = new HashMap<>();
    DBObject dbObject = cursor.next();
    
    row.put("total", dbObject.get("total"));
    row.put("approved", dbObject.get("approved"));
    row.put("canceled", dbObject.get("total"));
    row.put("location", dbObject.get("canceled"));
    row.put("items", dbObject.get("items"));
    row.put("coupons", dbObject.get("coupons"));
    row.put("created", dbObject.get("created"));
    row.put("updated", dbObject.get("updated"));
    row.put("deleted", dbObject.get("deleted"));
    row.put("userId", dbObject.get("userId"));
    
    rows.add(row);
}
$conn = new Mongo('mongodb://localhost:27017');
$db = $conn->selectDB('MyDatabase');
$transactions = $db->selectCollection('Transactions');

$match = array(
    'created' => array('$gt' =>
        new MongoDate(strtotime('2014-06-01'))
    ));

$cursor = $transactions->find($match);
$rows = array();

while ( $cursor->hasNext() ) {
    $dbObject = $cursor->getNext();
    $rows[] = array(
        'total'    => $dbObject['total'],
        'approved' => $dbObject['approved'],
        'canceled' => $dbObject['canceled'],
        'location' => $dbObject['location'],
        'items'    => $dbObject['items'],
        'coupons'  => $dbObject['coupons'],
        'created'  => $dbObject['created'],
        'updated'  => $dbObject['updated'],
        'deleted'  => $dbObject['deleted'],
        'userId'   => $dbObject['userId']
    );
}

This code is obviously sanitized a bit here to highlight what I'm doing, but you might extend this to do any number of things. You might have some sort of POJO that corresponds to a single document in the collection, and instantiate a new one within each loop. Then you would call dbObject.get() to retrieve each of the properties of that row. I iterated this simple test hundreds of times on my local machine, and found, on average, it took 0.663 seconds to complete. And in case you're curious, the date range I've given here corresponds to roughly 30,000 documents. So that's not so bad.

But this pared-down example was not my use case, and my use case was performing poorly. So I thought, well, I don't need all those pieces of data in the dbObject. I really only needed two. So I formulated a hypothesis. My hypothesis was simple: if I only grab the data I need, and none of the data I don't, the query would perform better. This is akin to avoiding SELECT * in SQL (which is something you should always avoid). So to start, I made the most basic modification to the code possible, which you can see here:

SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd");
Date startDate = f.parse("2014-06-01");
DBObject gt = new BasicDBObject("$gt", startDate);
DBObject match = new BasicDBObject("created", gt);

Cursor cursor = transactions.find(match);
List<Map> rows = new ArrayList<>();

while ( cursor.hasNext() ) {
    Map<String, Object> row = new HashMap<>();
    DBObject dbObject = cursor.next();
    
    row.put("created", dbObject.get("created"));
    row.put("total", dbObject.get("total"));
    row.put("userId", dbObject.get("userId"));
    
    rows.add(row);
}
$match = array(
    'created' => array('$gt' =>
        new MongoDate(strtotime('2014-06-01'))
    ));

$cursor = $transactions->find($match);
$rows = array();

while ( $cursor->hasNext() ) {
    $dbObject = $cursor->getNext();
    $rows[] = array(
        'created'  => $dbObject['created'],
        'total'    => $dbObject['total'],
        'userId'   => $dbObject['userId']
    );
}

All I've done here is remove the .get() call on the properties I didn't need. In fact, at this point, the database driver is still returning all of those properties; I'm just not accessing them. I wanted to see if that alone would make any difference. And in fact, it did. Hundreds of iterations of this code averaged in at 0.405 seconds. That's a 63% speed improvement. Of course the percentage makes it seem more grandiose than it really is, since that's only a 0.25 second improvement, which is not that big of a gain. But it is still an improvement, and it was consistent. Accessing fewer properties from the cursor results in a speed improvement. But while this sped things up a tiny bit, I knew that we could do better by forcing the database driver to stop returning the extraneous data, a la a project clause:

SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd");
Date startDate = f.parse("2014-06-01");
DBObject gt = new BasicDBObject("$gt", startDate);
DBObject match = new BasicDBObject("created", gt);

DBObject project = new BasicDBObject("total", true);
project.put("userId", true);
project.put("created", true);

Cursor cursor = transactions.find(match, project);
List<Map> rows = new ArrayList<>();

while ( cursor.hasNext() ) {
    Map<String, Object> row = new HashMap<>();
    DBObject dbObject = cursor.next();
    
    row.put("created", dbObject.get("created"));
    row.put("total", dbObject.get("total"));
    row.put("userId", dbObject.get("userId"));
    
    rows.add(row);
}
$match = array(
    'created' => array('$gt' =>
        new MongoDate(strtotime('2014-06-01'))
    ));
$project = array(
    'created' => true,
    'total'   => true,
    'userId'  => true
);

$cursor = $transactions->find($match, $project);
$rows = array();

while ( $cursor->hasNext() ) {
    $dbObject = $cursor->getNext();
    $rows[] = array(
        'created'  => $dbObject['created'],
        'total'    => $dbObject['total'],
        'userId'   => $dbObject['userId']
    );
}

This isn't much different than the last example. I'm still selecting only the data points I care about, but now I've added a project clause to the find() call. This means the database driver is no longer returning all the extraneous properties in the first place. The results? On average, this call took 0.029 seconds. That's a 2,186% speed increase over our original query. And that is worth paying attention to. While my last example wasn't all that telling, this one, on the other hand, confirms my hypothesis. If you only select the data you need, and none of the data you don't need, your queries will perform better. (This is true on any database platform.) The consequence of this is that you can't really use a general-purpose POJO for your collection -- not if you want your code to perform well, that is. Instead, you might have any number of contextual POJOs that access different parts of the same collection. It's a trade-off that may prove worth it for the sheer speed.

And I had one more test, just because I was curious. Up until now I've been using the find() command to grab my data, but Mongo also has another way of retrieving data: the aggregate pipeline. I remember reading somewhere that the AP actually spun up multiple threads, whereas a simple find() call was restricted to one. (Don't ask me for a source on that, I'm vaguely remembering heresay.) So I wanted to see if simply switching out those method calls would have any added bonus. Here's what I tried:

SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd");
Date startDate = f.parse("2014-06-01");
DBObject gt = new BasicDBObject("$gt", startDate);
DBObject match = new BasicDBObject("created", gt);

DBObject project = new BasicDBObject("total", true);
project.put("userId", true);
project.put("created", true);

AggregationOptions aggregationOptions = AggregationOptions.builder()
     .batchSize(100)
     .outputMode(AggregationOptions.OutputMode.CURSOR)
     .allowDiskUse(true)
     .build();

List<DBObject> pipeline = Arrays.asList(match, project);
Cursor cursor = transactions.aggregate(pipeline, aggregationOptions);
List<Map> rows = new ArrayList<>();

while ( cursor.hasNext() ) {
    Map<String, Object> row = new HashMap<>();
    DBObject dbObject = cursor.next();
    
    row.put("created", dbObject.get("created"));
    row.put("total", dbObject.get("total"));
    row.put("userId", dbObject.get("userId"));
    
    rows.add(row);
}
$match = array(
    'created' => array('$gt' =>
        new MongoDate(strtotime('2014-06-01'))
    ));
$project = array(
    'created' => true,
    'total'   => true,
    'userId'  => true
);

$pipeline = array(array('$match' => $match), array('$project' => $project));
$cursor = $transactions->aggregateCursor($pipeline);
$rows = array();

while ( $cursor->hasNext() ) {
    $dbObject = $cursor->getNext();
    $rows[] = array(
        'created'  => $dbObject['created'],
        'total'    => $dbObject['total'],
        'userId'   => $dbObject['userId']
    );
}

That test ran, on average, in 0.035 seconds. That's still a 1,794% speed increase over our first test, but it's actually 0.006 seconds slower than then last one. Of course a number that small is a negligible difference. But the fact that there is no difference is worth noting. There is no tangible benefit to using the aggregate pipeline, without a $group clause, versus an ordinary call to find(). So we'd might as well stick with find(), especially considering we weren't aggregating anything, anyway.

But now comes the question of how we go about getting data from other collections. That userId is effectively a foreign key, so we need to do additional queries to get that information. (Side note: you could just duplicate the relevant information instead of, or along with, the foreign key, since that's kind of the Mongo way. But what happens when a person changes their name? This is the problem with non-relational databases.)

The code that I had originally set out to improve did something that I immediately recognized as bad: it looped over the cursor on Transactions, and for each value, ran another query to the Users collection. I refer to these kind of queries as "one-off" queries, since that's kind of what they are. Let me show you some code to better explain what I mean.

SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd");
Date startDate = f.parse("2014-06-01");
DBObject gt = new BasicDBObject("$gt", startDate);
DBObject match = new BasicDBObject("created", gt);

DBObject project = new BasicDBObject("total", true);
project.put("userId", true);
project.put("created", true);

Cursor cursor = transactions.find(match, project);
List<Map> rows = new ArrayList<>();

while ( cursor.hasNext() ) {
    Map<String, Object> row = new HashMap<>();
    DBObject dbObject = cursor.next();
    
    ObjectId userId = (ObjectId) dbObject.get("userId");
    row.put("created", dbObject.get("created"));
    row.put("total", dbObject.get("total"));
    row.put("userId", userId);
    
    // one-off query to the Users collection
    DBObject userProject = new BasicDBObject("firstName", true);
    userProject.put("lastName", true);
    DBObject user = users.findOne(userId, userProject);
    
    row.put("firstName", dbObject2.get("firstName"));
    row.put("lastName", dbObject2.get("lastName"));
    
    rows.add(row);
}
$match = array(
    'created' => array('$gt' =>
        new MongoDate(strtotime('2014-06-01'))
    ));
$project = array(
    'created' => true,
    'total'   => true,
    'userId'  => true
);

$cursor = $transactions->find($match, $project);
$rows = array();

while ( $cursor->hasNext() ) {
    $dbObject = $cursor->getNext();
    $userId = $cursor['userId'];
    
    // one-off query to the Users collection
    $userMatch = array('_id' => $userId);
    $userProject = array(
        'firstName' => true,
        'lastName'  => true
    );
    
    $user = $users->findOne($userMatch, $userProject);
    $rows[] = array(
        'created'   => $dbObject['created'],
        'total'     => $dbObject['total'],
        'userId'    => $dbObject['userId'],
        'firstName' => $user['firstName'],
        'lastName'  => $user['lastName']
    );
}

I can tell you that when I ran this code against my local instance of MongoDB, it took, on average, 0.319 seconds to run. That doesn't seem so bad at all, especially considering all it's doing. Again, this matches about 30,000 documents in the Transactions collection, and clearly we're making just as many calls to the Users collection. But while this seems fine on my local machine, that is not a realistic test. In real world circumstances, you would not have your database on the same server as your codebase. And even if you did, you won't forever. Inevitably you're going to need some code to run on a different server. So I re-ran the same tests using a remote instance of MongoDB. And that made a BIG difference. Suddenly this same little routine took 1709.182 seconds, on average. That's 28-and-a-half minutes. That is ridiculously bad. I will admit, though, that my wifi speed here at home is not the best. I re-ran the same test later, on a better network, and it performed at 829.917 seconds. That's still 14 minutes, which is dreadful.

Why would this simple operation take so long? And what can be done about it? Imagine this: let's say you went into the DMV office and needed to look up a bunch of records. You have a list of the records you need on a handy-dandy clipboard. So you stand in line, and once you're called to the desk, you ask the clerk, one by one, for the records on your clipboard. "Can you give me the details for Record A?" "Can you give me the details for Record B?" That's straight-forward enough, and will be efficient as it can be. But if the clerk you're talking to only has part of the data you need, and tells you you'll need to visit a different office to retrieve the other missing puzzle pieces, then it would be a bit slower.

If you're querying against your own local machine, it would go something like this:

  • Ask Cleark A for Record 1
  • Clerk A gives you everything they have about Record 1
    • Clerk A tells you to talk to Clerk B for the rest of the information
  • Leave Clerk A's line, and stand in Clerk B's line
  • Ask Clerk B for the rest of Record 1
  • Clerk B gives you the rest of Record 1
  • Leave Clerk B's line, and return to Clerk A's line
  • Repeat for Record 2...

That doesn't seem very efficient, does it? But that's the trouble with non-relational databases; disparate collections are in different places. And keep in mind that this analogy actually represents the best case scenario, where Clerk A and Clerk B are in the same room. But that isn't realistic. A more realistic illustration would involve Clerk A at the DMV office, and Clerk B located a mile or two down the road, at the Social Security office. So for each record on your clipboard, you drive back and forth from the DMV to the Social Security office. You can see why it'd be so slow. That "drive" back and forth is called network latency.

But we can do better than that. What if, instead of driving back and forth for each record, you simply asked the clerk at the DMV for all the data they had all at once, and then afterwards you compiled a comprehensive list of all the records you'd need from the Social Security office? That way, you'd only have to make the drive over there once, rather than making 30,000 drives back and forth. In Mongo, you'd only be doing two calls to the database: one bulk call to the Transactions collection, and then a subsequent bulk call to the Users collection. Here's some code to illustrate:

SimpleDateFormat f = new SimpleDateFormat("yyyy-MM-dd");
Date startDate = f.parse("2014-06-01");
DBObject gt = new BasicDBObject("$gt", startDate);
DBObject match = new BasicDBObject("created", gt);

DBObject project = new BasicDBObject("total", true);
project.put("userId", true);
project.put("created", true);

MongoJoinCache join = new MongoJoinCache();
Cursor cursor = transactions.find(match, project);
List<Map> rows = new ArrayList<>();

int i = 0;
while ( cursor.hasNext() ) {
    Map<String, Object> row = new HashMap<>();
    DBObject dbObject = cursor.next();
    
    Object userId = (ObjectId) dbObject.get("userId");
    row.put("created", dbObject.get("created"));
    row.put("total", dbObject.get("total"));
    row.put("userId", userId);
    
    join.add(userId.toString(), i);
    rows.add(row);
    i++;
}

DBObject userMatch = join.resolveCache();
DBObject userProject = new BasicDBObject("firstName", true);
userProject.put("lastName", true);

cursor = users.find(userMatch, userProject);
while ( cursor.hasNext() ) {
    DBObject dbObject = cursor.next();
    Object userId = (ObjectId) dbObject.get("_id");
    Set<Integer> indexes = join.get(userId.toString());
    
    for (Integer index : indexes) {
        Map<String, Object> row = rows.get(index);
        row.put("firstName", dbObject.get("firstName"));
        row.put("lastName", dbObject.get("lastName"));
        rows.add(index, row);
    }
}

public class MongoJoinCache {
    private final Set<String> objectIds;
    private final Map<String, Set<Integer>> objectToIndexMapping;
    private int total = 0;
    
    public MongoJoinCache() {
        objectIds = new HashSet<>();
        objectToIndexMapping = new HashMap<>();
    }
    
    public void add(String objectId, Integer index) {
        objectIds.add(objectId);
        Set<Integer> indexes;
        if (objectToIndexMapping.containsKey(objectId)) {
            indexes = objectToIndexMapping.get(objectId);
        } else {
            indexes = new HashSet<>();
        }
        indexes.add(index);
        objectToIndexMapping.put(objectId, indexes);
        total++;
    }
    
    public Set<Integer> get(String objectId) {
        return objectToIndexMapping.get(objectId);
    }
    
    public Integer size() {
        return total;
    }
    
    public DBObject resolveCache() {
        if (size() == 0) {
            return null;
        }
        
        final BasicDBList ids = new BasicDBList();
        for (String id : objectIds) {
            ids.add(new ObjectId(id));
        }
        
        DBObject match = new BasicDBObject("_id", new BasicDBObject("$in", ids));
        return match;
    }
}
$match = array(
    'created' => array('$gt' =>
        new MongoDate(strtotime('2014-06-01'))
    ));
$project = array(
    'created' => true,
    'total'   => true,
    'userId'  => true
);

$userIds = array();
$cursor = $transactions->find($match, $project);
$rows = array();

$i = 0;
while ( $cursor->hasNext() ) {
    $dbObject = $cursor->getNext();
    $userId = strval($dbObject['userId']);
    $userIds[$userId][] = $i;
    
    $rows[] = array(
        'created'  => $dbObject['created'],
        'total'    => $dbObject['total'],
        'userId'   => $dbObject['userId']
    );
    $i++;
}

$userMatch = array('_id' => array('$in' => array()));
foreach ( $userIds as $userId => $indexes ) {
    $userMatch['_id']['$in'][] = $userId;
}

$userProject = array('firstName' => true, 'lastName' => true);
$cursor = $users->find($userMatch, $userProject);

while ( $cursor->hasNext() ) {
    $dbObject = $cursor->getNext();
    $userId = strval($dbObject['userId']);
    
    foreach ( $indexes as $userIds[$userId] ) {
        $rows[$index]['firstName'] = $dbObject['firstName'];
        $rows[$index]['lastName'] = $dbObject['lastName'];
    }
}

Phew. That's a lot more code! Well, on the Java side anyway. (PHP arrays are awesome; Java people don't even understand.) But the million dollar question is whether there is any tangible benefit to all that extra code I just threw at you. Here are the results: When running against my local machine, this ran for an average of 0.339 seconds. Against the previously-mentioned 0.319 seconds on my local machine, this is slightly slower. So why bother with all this extra code if it's slower? Because that's not really a fair test. The difference of 10 milliseconds is negligible, but more importantly, there is no universe in which you can reliably expect to have zero network latency. There will always be network latency in production environments. So a real test is against a remote instance.

So what happens when I ran this code against a remote MongoDB server? Remember that it took 28 ½ minutes before. With this code shown above, however, on average, it ran in 6.191 seconds. You read that correctly. That is an astronomical speed improvement of 27,508%. And even if your network latency isn't as bad as mine was (and it shouldn't be), you can still see nonetheless that this method will always be faster, by an order of magnitude.

If you think about it, it makes sense. We took a query which had been O(n²) in complexity and reduced it down to O(2n) -- at most. In fact, it's probably less than that in practice, since there are bound to be duplicate foreign keys. I'm guessing that the developers of relational databases do something exactly like this, under-the-hood in their database code. In order to join two tables, they probably first have to collect all the foreign keys from the first table into a set, and then grab the corresponding rows en masse from the second table. The big difference is that with SQL, you don't have to think about any of this. You just type INNER JOIN and bam, it's all magically done for you. But because Mongo doesn't include this as a concept, you have to write your own database wrapper code in order to be able to use it efficiently.

So what's the takeaway from all of this? First, you can write your own wrapper code to effectively perform a join operation in Mongo, and doing so is hugely advantageous in terms of speed, albeit slightly annoying. But it's vital to your success if you're going to use Mongo. Secondly, what I hope is the bigger takeaway is that you shouldn't have to. Like I've been saying all along, if you have to resort to writing what is essentially your own low-level database code, it's a sure sign that you're using the wrong database. If you've read this far, that means you should probably just use SQL.

TL;DR

To conclude, here are the more practical takeaways broken down:

  • Only select the data you absolutely need, and nothing more
    • Use a project clause to limit your data
    • Don't instantiate general purpose objects with every property
  • Don't do one-off queries inside of loops!
  • Minimize network latency by grabbing batches

And that is how you do a "Join" in Mongo.


Post a Comment

Name:
Email:
(Will not be published)
Website:
(Optional)
Comment: