MongoDB: Optimizing Aggregation

MongoDB: Optimizing AggregationElvis RozarioBlockedUnblockFollowFollowingJan 29Recently I went through a couple of code reviews involving MongoDB aggregations.

The aggregations were giving the desired output but consuming a lot more time than it should, making the developers believe that the fault lies in either MongoDB (the database) or their server.

Some digging into their code made us realize that the fault lies in the way the aggregation stages were written.

While writing aggregation stages, make sure that you’re writing the minimum number of stages required and the cost of each stage should be minimal as well.

This is where efficiency comes into the picture.

For example, the cost of a group stage is definitely more than the cost of a project stage (in most use-cases).

I’ll be explaining one of the aggregation queries I was assigned to review.

I’ll be discussing the query written by the developer, how he came to that conclusion and how that query could be improved.

Optimizing AggregationI’m assuming that you have a basic understanding of MongoDB.

This can be called as an intermediate level article and may not be suited for those who’ve never written aggregation queries in MongoDB.

Here’s the collection on which the query needs to run.

Obviously the above is just a part of the entire collection.

The requirement is to count the unique products per day in the entire collection.

The result should look like this,As you can see, there are only 3 unique products in the sample collection, Book, iPhone 7 and MacBook Pro (couldn’t come up with better names), that means the sum of all the count values should be 3.

Since all three books were bought on the same day (27th), and no other products were bought that day, the unique product count for that day should be 1, whereas, on 28th, two different products were bought and hence the unique product count for that day should be 2.

Now let’s discuss the solution the developer came up with.

It includes four stages, two project stages and two group stages,He started off with a project stage for getting the day, month and year from the timestamp using $dayOfMonth, $month and $year operator respectively and stored it in different variables.

Then, since he needs the entire date in one variable, he used another project stage to concatenate those values and storing it in the time field.

Now that he has the time field, he used a group stage to group according to time and product, such that he gets a group for each unique time-product combination (to remove duplicates), then another group stage to get unique time with count as the sum of the products.

The output is as expected but he used four pipeline stages to achieve it.

Usually, any medium level developer should be able to do it in about 2-3 stages, considering he knows about the $dateToString operator.

The solution would be,As you can see, the $dateToString operator allows us to get the date in string format directly, thus combining the first two project stages into one.

This itself made the query quite efficient but it can be optimized even more.

The project stage can be completely avoided if we use the $dateToString operator in the group stage itself,As you can see, we’ve used the $dateToString operator in the group stage itself, thus avoiding another project stage.

Now our entire query contains only 2 stages, but can we improve it anymore?Actually, yes we can, as I said in the beginning, in most cases a project stage is cheaper than a group stage and the $addFields stage is even more so.

So the above query can be even more optimized as such,In the first group stage itself, we’re adding a $addToSet operator.

The $addToSet operator is similar to the $push operator but it makes sure only unique values are added to the array.

It takes care of getting the unique products in the count variable.

That itself gives us all the necessary information but in order to bring it to the required format, we added a $addFields stage which replaces the count variable to the length of the count array.

Obviously, the time required to run this query is a lot lesser than the query we initially had.

I’ll try to put in more examples of optimizing aggregations as far as possible.

That’s all folks!.

. More details

Leave a Reply