An Excuse for MongoDB – Sample Document Schema and Queries

In my earlier post, I used MongoDB to identify some trends in several years of documents from a previous volunteer role. In this post, I’ll share about the process, a sample document schema, and the queries used to derive the insights to the questions.

Process Overview

This was a relatively simple project, so there are just 3 simple steps:

  1. Convert data from PDF to JSON
  2. Ingest into MongoDB Atlas
  3. Run queries

When I started converting the documents, I was glad that MongoDB was schemaless as some of the documents had missing/additional fields than what I expected. These usually stemmed from several experiments in how we divided up leadership responsibilities within the cohort (that I forgotten had taken place).

For step 2, the files were imported into the collection “schedule” via the mongoimport utility:

$ mongoimport --uri (cluster-uri) --collection schedule --file (path-to-file)

Sample Document Schema

A typical weekly schedule document schema looks like this:

{
	"trainingType": "",
	"startDateTime": "",
	"endDateTime": "",
	"reportingTimeOffsetMinutes": 0,
	"activities": [{
		"squadName": "",
		"startTime": "",
		"endTime": "",
		"topic": "",
		"subTopic": [""],
		"venue": "",
		"attire": "",
		"inCharge": [""]
	}]

}

Most of these fields appear consistently throughout the years, with the exception when there are other types of activities that require the students to be outside the school compounds.

Queries Used

Below are the queries used to answer the questions in the previous post, as well as a short description of what is going on. I found MongoDB’s Aggregation Framework really useful for complex queries that involve doing some pre-processing before aggregating the results.

1. What was the average duration of the weekly session across the years?

As there are several types of activities, I first do a filter to keep documents from the weekly sessions. Next, I find the duration in hours by taking the (endDateTime - startDateTime) / 3600000. Lastly, I $group the records by the year in which they started, and calculate the average duration.

db.schedule.aggregate([ 
  { $match: { $expr: {$eq: ["$trainingType", "weeklyParade" ]}}},

  { $addFields: { "trainingDurationHours": { $divide: [{ $subtract: [{$toDate: "$endDateTime"}, {$toDate: "$startDateTime"}]}, 3600000]} } },  

  { $group: { "_id": {$year: {$toDate: "$startDateTime"}}, 
              "averageDurationHours" : {$avg: "$trainingDurationHours"} } },

  { $sort: {"_id": 1} }
])

2. What is the cumulative duration spent, per cohort, in their school club activities?

As the documents are nested, some pre-processing work is required. To avoid complicating the problem, I first filter out all unnecessary fields with the $project operation. Next, I $unwind the activities array to create one document per element, and calculate the duration of that activity. Lastly, I $group and $sort the records based on the cohort name, adding up all the activity durations.

db.schedule.aggregate([ 
  { $project: {"activities": 1} },

  { $unwind: { path: "$activities", preserveNullAndEmptyArrays: false } },

  { $addFields: { "activityDurationHours": { $divide: [{ $subtract: [{$toDate: "$activities.endTime"}, {$toDate: "$activities.startTime"}]}, 3600000]} } },

  { $group: { "_id": "$activities.squadName", 
              "totalActivityDurationHours" : { $sum: "$activityDurationHours"} } },

  { $sort: {"_id": 1} }
])

3. Which cohort had the longest leadership position?

This is pretty much an iteration from the previous query, with the changes being 1) the addition of a $match operation that filters the unwound activities documents based on the topic, and 2) a different sorting approach based on the cohort name and decreasing cumulative activity duration.

db.schedule.aggregate([ 
  { $project: {"activities": 1} },

  { $unwind: { path: "$activities", preserveNullAndEmptyArrays: false } },

  { $match: {"activities.topic": "leadership role"} },

  { $addFields: { "activityDurationHours": { $divide: [{ $subtract: [{$toDate: "$activities.endTime"}, {$toDate: "$activities.startTime"}]}, 3600000]} } },

  { $group: { "_id": {"squadName": "$activities.squadName", "topic": "$activities.topic"},
              "totalActivityDurationHours" : { $sum: "$activityDurationHours"} } },

  { $sort: {"_id.squadName": 1, "totalActivityDurationHours": -1} }
])

4. What is the most frequently used venue for each year?

By now, the structure should be pretty much self explanatory, though the method to perform counting within the $group operation is interesting. After running the query, I had to do some manual formatting to get the first entry per year.

db.schedule.aggregate([ 
  { $project: {"activities": 1} },

  { $unwind: { path: "$activities", preserveNullAndEmptyArrays: false } },

  { $addFields: { "year": {$year: {$toDate: "$activities.startTime"} } } },

  { $group: { "_id": {"venue": "$activities.venue", "year": "$year"}, "counter":{$sum:1} } },

  { $sort: {"counter": -1} }
])

5. Who are the top 10 individuals who had the longest duration of being an activity in charge?

As each activity could have more than one individual, I had to do two $unwind operations to tease apart the information, before calculating the duration and aggregating them together. The rest of the query should be pretty much familiar at this point 🙂

db.schedule.aggregate([
  { $project: {"activities": 1} },

  { $unwind: { path: "$activities", preserveNullAndEmptyArrays: false } },

  { $unwind: { path: "$activities.inCharge", preserveNullAndEmptyArrays: false } },

  { $addFields: { "activityDurationHours": { $divide: [{ $subtract: [{$toDate: "$activities.endTime"}, {$toDate: "$activities.startTime"}]}, 3600000]} } },

  { $group: { "_id": "$activities.inCharge",  "totalActivityDurationHours" : { $sum: "$activityDurationHours"} } },

  { $sort: {"totalActivityDurationHours": -1} }
])

That’s it for the behind-the-scenes of this simple project using MongoDB 🙂

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s