MongoDB & Node.js: Writing Efficient Operations and Aggregations
How to handle common data aggregation tasks in MongoDB as data scales
MongoDB is the dominant document-model database in use today. Mongo’s own homepage bills it as the most popular database for modern apps, and this is hard to argue against given the success of the database and the surrounding ecosystem of tools that support every stage of the database pipeline, from deployment to testing and performance optimisation.
The mongodb package boasts over 2.5 million weekly downloads at the time of writing, with clients ranging from hobbyists to large enterprises.
MongoDB will likely be the database of choice for you or your team if you are:
- Building a backend for iOS / Android apps and require a robust and reliable database to store app data.
- Running a cloud service that handles a large amount of data, where processing power or storage requirements go beyond a single-instance solution. Such services can range from data analytics to social media content.
- A student who wishes to learn a relevant database to boost your career prospects. Education is well supported at MongoDB, with a range of courses available on the MongoDB University platform.
Beyond a comprehensive library of educational material and wide-ranging platform & programming language support, MongoDB’s ability to scale is one of its major attractions.
This article first notes the high-level features that allow MongoDB to scale, before delving into specific query syntax to ensure your operations remain efficient as your data scales up.
What Makes MongoDB Scalable
A range of factors make MongoDB a powerful and flexible database that can effectively scale to collections consisting of billions of records.
The following list highlights a few features that allow this:
- Document-based data need not adhere to a particular schema, allowing you to store a variety of data structures in a single collection. This gives much greater flexibility in how data is represented and managed.
- Complementing this flexibility, operators offer simple APIs that can perform complex queries, sampling techniques and data aggregations. We’ll be exploring some of these further down the article.
- Replication and sharding capabilities allow app data to scale globally. These features improve availability and act as effective load balancers. Replica sets also speed up content delivery, as instances running in different regions can serve the users closest to them.
- MongoDB Community is free and actively maintained along with their Enterprise product. The database is well supported with deployment software like Kubernetes, and data is easy to browse with the Compass client.
- MongoDB has further streamlined database management with the Atlas cloud service, which also supports auto-scaling capabilities.
These features undoubtedly make MongoDB an attractive database for a range of applications, but they do not by themselves guarantee high performance.
What makes MongoDB performant: Good data and query design
What determines consistent performance as your data scales boils down to two main factors:
- How you structure your document data.
- How you write your queries to fetch or manipulate that data.
Whether you are running simple queries or complex aggregations, MongoDB operations will load the data of interest into memory while a query executes. It is therefore important to consider how much memory will potentially be allocated when you write your queries, with the understanding that a single query could consume all your available memory.
The key to MongoDB efficiency is therefore twofold: ensure documents include the necessary data to avoid further querying (such as storing product details in an order document rather than just a productId, which would require an additional query to a separate products collection), and fetch only the required data from a collection.
Storing additional metadata in documents can dramatically speed up your query or aggregation pipeline, trading some additional storage for added query processing speed. On the flip side, the query itself should fetch only the fields you require and ignore as much data as possible in the process, avoiding unneeded memory allocation.
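To illustrate the first point, here is a minimal sketch of an order document with embedded product details. The field names and the db handle are assumptions for illustration, not a prescribed schema:
// Embedding the product details needed to render an order means
// no follow-up query to a separate products collection is required.
const order = {
  status: 'PENDING',
  product: {
    productId: 'prod_1234', // reference kept for when the full product record is needed
    name: 'Espresso Machine',
    price: 299,
  },
  createdTimestamp: Math.floor(Date.now() / 1000),
};

await db.collection('orders').insertOne(order);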
These are the areas of interest the rest of this article explores, highlighting some key MongoDB features in the process.
Optimisation with Projections and Sampling
Projection enables you to filter the properties returned through a query; it is something that can dramatically increase performance and should never be ignored.
Projection is applied differently between CRUD operations and aggregation. For findOne(), the projection property is used within the query options object:
const sample = await db
  .collection('samples')
  .findOne({
    _id: new ObjectId(sampleId)
  }, {
    projection: {
      name: 1,
      overview: 1,
    }
  });
When applied to find(), the project() method is used instead. The following example fetches some “active” samples with a projection applied to the query. The query is deliberately head-scratching to introduce some operators:
const samples = await db
  .collection('samples')
  .find({
    active: {
      $not: { $eq: false }
    }
  })
  .sort({ _id: 1 })
  .project({
    system: 0,
    'meta.log': 0
  })
  .toArray();
Applying 1 to a projection property includes that property in the returned results, whereas 0 omits it. In this latter example the query will omit the meta.log and system properties from the returned results. (Note that $not must be applied to a field; it is not valid as a top-level query operator.)
Applying projection to an aggregation works in a similar manner: it is simply added as another stage in the aggregation pipeline, commonly after a $match stage.
For a full list of aggregation pipeline stages, consult the official MongoDB documentation.
The following example fetches 1,000 samples at random from a collection of people. It matches records with a location of the United Kingdom or Hong Kong, defines a projection to return only the name, age and net worth of each individual, and finally defines a sample size of 1,000:
const docs = await db
  .collection('samples')
  .aggregate([
    {
      $match: {
        $or: [
          { country: { $eq: 'United Kingdom' } },
          { country: { $eq: 'Hong Kong' } },
        ]
      }
    }, {
      $project: {
        name: 1,
        age: 1,
        net_worth: 1,
      }
    }, {
      $sample: { size: 1000 },
    }
  ])
  .toArray();
Sampling a set of records to discover trends is a lot more efficient than querying the entire data set, and with a big enough sample size you should discover the underlying trends in the data.
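As a hypothetical extension of the pipeline above, a $group stage appended after $sample could estimate a trend, such as average net worth, from the sample alone:
// Estimate the average net worth from a 1,000-document random sample,
// rather than scanning the entire collection.
const [stats] = await db
  .collection('samples')
  .aggregate([
    { $sample: { size: 1000 } },
    {
      $group: {
        _id: null,
        avgNetWorth: { $avg: '$net_worth' },
      }
    }
  ])
  .toArray();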
Logical and comparison query operators
The $or operator is good for filtering individual properties at a granular level. It is used above for the country property, but it can combine conditions across a range of properties too. $or is part of a small collection of logical query operators, each of which can be used individually or in combination within a single $match stage.
$eq, on the other hand, is one of the comparison query operators, which are also very useful for fine-tuning your queries. A brief sketch of how these compose follows below.
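Reusing the samples fields from the examples above, logical operators can be nested to express conditions across multiple properties in one find() call:
// Match active samples that are either UK-based with a net worth
// above 1,000,000, or Hong Kong-based with a net worth above 500,000.
const wealthy = await db
  .collection('samples')
  .find({
    $and: [
      { active: { $eq: true } },
      {
        $or: [
          { country: 'United Kingdom', net_worth: { $gt: 1000000 } },
          { country: 'Hong Kong', net_worth: { $gt: 500000 } },
        ]
      }
    ]
  })
  .toArray();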
Consider a scenario where I have a list of 100 users of interest, and I wish to fetch some activity data for them within a certain time period, all within a single query.
Using comparison query operators this can be achieved with ease:
let userIds = [...];
let weekStart = 1626627600;
let weekEnd = 1627232400;

let activity = await db
.collection('activity')
.find({
user_id: {
$in: userIds
},
'meta.createdTimestamp': {
$gte: weekStart,
$lt: weekEnd,
}
})
.project({
meta: 0,
})
.toArray();
Note that the $in operator matches only those records whose user_id appears in the provided array. In addition, the $gte and $lt operators are used to return results within a certain time period.
Although MongoDB supports native date objects, the easiest way to deal with time-related data, in my experience, is with raw Unix timestamps.
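Generating those timestamps in Node is straightforward; for example, the weekStart and weekEnd bounds above could be derived like so:
// Current time as a Unix timestamp (in seconds, not milliseconds)
const weekEnd = Math.floor(Date.now() / 1000);

// Seven days earlier, giving a one-week query window
const weekStart = weekEnd - 7 * 24 * 60 * 60;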
Counting Documents
When you simply need a raw figure for how many records meet certain criteria, CRUD operations should be set aside in favour of the countDocuments function.
countDocuments simply returns an integer count of the documents that satisfy a query. It is useful for fetching figures for analytical purposes:
const outstanding_invoices = await db
  .collection('invoices')
  .countDocuments({
    status: { $ne: 'SETTLED' },
    createdTimestamp: { $gte: 1626627600 },
  }, {
    limit: 999
  });
The above query returns the number of invoices that have not yet been settled; that is to say, documents with any status apart from SETTLED will be counted. Note that a limit should also be provided to prevent the query from utilising too many resources, especially in the event you have millions of records.
A limit should ideally be a lower bound at which a realistic measure of your data set is reflected in the final count. If there are 999 unsettled invoices, that gives me the general idea that many of my clients are probably avoiding their payment obligations; a limit of 9,999 or 99,999 would have yielded the exact same insight, but would have consumed more resources in the process.
Avoiding Loops with $elemMatch
Executing MongoDB queries within for loops will dramatically slow down your API resolution times, and doing so should be a last resort in a production environment.
We have already discussed some ways to avoid looping (such as with the $in operator for arrays of basic types), but $elemMatch takes this approach a step further by matching object properties of array entries.
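For contrast, the loop-based approach these operators replace looks something like the following sketch, reusing the activity collection from the example above (one query per user, multiplying round trips and memory pressure):
// Anti-pattern: N round trips for N users, instead of one $in query
for (const userId of userIds) {
  const activity = await db
    .collection('activity')
    .findOne({ user_id: userId });
  // ... process each result individually
}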
An example scenario is checking whether an authentication token is valid. It is common practice to generate multiple authentication tokens, as apps these days are multi-platform and therefore support signing in from multiple devices.
The data structure for authentication tokens in a multi-device setting could resemble the following:
{
authTokens: [
{
platform: 'ios',
token: 'vjb2oi2es...',
}, {
platform: 'web',
token: 'vbg49qdj...',
}, {
platform: 'android',
token: 'nknhj39ad...',
}]
}
When a single authentication token is provided (with or without the platform), $elemMatch can be used to check whether that token exists within the stored authentication tokens of a MongoDB collection:
let token = 'nknhj39ad...';
let platform = 'android';

const user = await db
.collection('users')
.findOne({
authTokens: {
$elemMatch: {
token: token,
platform: platform
}
}
}, {
projection: {
email: 1,
name: 1,
username: 1,
}
});
The above query will return a single user record, as the Android authentication token was matched.
Other device-related information can be used here to boost authentication security, such as a deviceId in a mobile app setting. For React Native apps, for example, the react-native-device-info package can be used to extract a unique identifier for the user’s device:
import { Platform } from 'react-native'
import { getUniqueId } from 'react-native-device-info'

const deviceId = getUniqueId();
const platform = Platform.OS;
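Assuming each authTokens entry also stores a deviceId field (a hypothetical extension of the structure above), the identifier can then be added to the $elemMatch filter so a token only matches when presented from the device it was issued to:
const user = await db
  .collection('users')
  .findOne({
    authTokens: {
      $elemMatch: {
        token: token,
        platform: platform,
        deviceId: deviceId, // hypothetical field holding the device identifier
      }
    }
  });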
Directly Manipulating Arrays
If an authentication token is no longer valid, we can also manipulate that array directly within an update, avoiding the need to first fetch the data to determine whether it should be modified.
The $pull operator is useful in this scenario. If I have recently introduced a deviceId property to my authentication process and wish to inject that deviceId into an existing record, I can first remove the old entry with $pull, then use $addToSet to insert the re-formatted data. One caveat: MongoDB rejects an update that applies two operators to the same field path within one updateOne call, so the two steps are issued below as an ordered bulkWrite, which still reaches the database in a single round trip:
await Connection.db
  .collection('users')
  .bulkWrite([
    // Step 1: remove the old, flat token entry
    {
      updateOne: {
        filter: { _id: userId },
        update: {
          $pull: {
            authTokens: { token: token }
          }
        }
      }
    },
    // Step 2: insert the re-formatted entry
    {
      updateOne: {
        filter: { _id: userId },
        update: {
          $addToSet: {
            authTokens: {
              deviceInfo: {
                deviceId: deviceId,
                platform: platformOs,
              },
              tokenInfo: {
                token: token,
                createdTimestamp: created,
              }
            }
          }
        }
      }
    }
  ], { ordered: true });
This approach is useful as it allows us to refactor an array entry without a preceding read, all in a single request.
In Summary
This article has demonstrated only some of the ways to keep your MongoDB queries optimised, highlighting the most critical approaches to keeping resource utilisation to a minimum.
To recap, the following strategies effectively keep your queries efficient as your data scales:
- Projection and sampling to avoid the entire data set being returned. Sampling can give you a large enough data set without having to return the entire set, which is sometimes not even possible in big data settings.
- Mix and match operators in both CRUD operations and aggregation pipelines to extract exactly what you need from your database, all within a single query.
- Counting documents for raw integer values with lower-bound limits is an efficient means of generating statistics.
- Accessing and manipulating arrays via queries is a powerful concept that can simplify your overall API and prevent unnecessary looping. The authentication token scenario is a good use case for this, but the technique can ultimately be applied to any list of objects you need to update or refactor.
These are query-specific strategies, and they can work alongside Atlas’s auto-scaling capabilities, replication and sharding, and the other means of scaling that the MongoDB ecosystem offers.
