Question

So in an application I'm currently working on, I'm having an issue designing a collection. I can't give away too much about the application itself, but here is a simplified (albeit less realistic) example that fits a similar situation:

Let's say you have a construction company. It would have a number of attributes, like name, office address, current projects, employees, etc.

Now in this example, a construction company might only work during certain times of the year. For example, a company may not work during the winter due to weather. The application requires the company to list the periods when it will be actively working on projects. These are specific dates (including the year), because they won't necessarily be the same every year, so the idea is that the company would list them explicitly.

Let's say ACME Construction Co. works every year from May 1 to Labour Day (a Canadian holiday that is often considered the end of summer), which happens to be Sept 7 in 2020. Let's say they also take the first 7 days of July off. So the 2 working seasons would be 2020-05-01 to 2020-06-30 and 2020-07-08 to 2020-09-07. Each season will also hold other information that isn't directly relevant to this issue.

Now here comes the trickier part: let's say Umbrella Corp needs a company to renovate their office kitchen. They would want to search for a construction company to do this, so Umbrella Corp would search for companies that are available during the month of June to do their renovation.

Now back to the data,

I could store availability as an array on the company document, but this technically grows without bound, which would be an antipattern. It would take ages for this array to grow to a size that actually matters, but it still feels wrong to do it. Am I wrong to think this way?

I could also store these seasons in their own collection, referencing back to the company, but then querying for companies that have availability becomes quite the issue. I could use an aggregation, but in my testing it appears to be slower than ideal. Something like:

db.companies.aggregate([
  {
    $lookup: {
      from: 'seasons',
      as: 'seasons',
      let: { companyId: '$_id' },
      pipeline: [
        {
          $match: {
            $expr: {
              $and: [
                { $eq: ['$companyId', '$$companyId'] }
                /* other conditions to filter by availability */
              ]
            }
          }
        }
      ]
    }
  },
  {
    $match: { 'seasons.0': { $exists: true } }
  }
])

The other option was to break it into 2 queries, the first being db.seasons.find({ /* by availability */ }) and then db.companies.find({ _id: { $in: [/* company ids found from the first result */] } }). Now these 2 queries themselves are very fast, but the first one slows down when all the records need to be loaded into memory. In my test I created 20,000 companies and 40,000 seasons; the first query took around 20ms to execute, but about 4s to load it all into an array in code to work with.
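
For reference, a minimal sketch of that two-query approach, assuming a companyId field on seasons and the string dates used above (the June window is just an illustration):

// 1) find the seasons that overlap the requested window (June 2020 here),
//    projecting only the field the second query actually needs
var matchingSeasons = db.seasons.find(
  { startDate: { $lte: '2020-06-30' }, endDate: { $gte: '2020-06-01' } },
  { companyId: 1 }
).toArray()

// 2) fetch the companies referenced by those seasons
var companyIds = matchingSeasons.map(function (s) { return s.companyId })
db.companies.find({ _id: { $in: companyIds } })

Projecting only companyId keeps the amount of data pulled into application memory small, which may help with the load time you measured.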

My third option, I think, would be: when a season is created or updated, store the current/future season dates partially denormalized on the company, like so:

{
  name: 'Acme Construction Co',
  seasons: [
    { _id: ObjectId(), startDate: '2020-05-01', endDate: '2020-06-30' },
    /* ... etc */
  ]
}

That way the company document itself would directly contain the current information, and as new information is added the old records would slowly be purged, so it would not grow without bounds.
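
As a rough sketch of what that maintenance and the search could look like (companyId below is a placeholder for the company's _id, and the purge cutoff stands in for "today"):

// When a season is saved, mirror it onto the company document...
db.companies.updateOne(
  { _id: companyId },
  { $push: { seasons: { _id: ObjectId(), startDate: '2020-07-08', endDate: '2020-09-07' } } }
)

// ...and, in a separate update, purge seasons that have already ended
// (MongoDB rejects $push and $pull on the same field within one update).
db.companies.updateOne(
  { _id: companyId },
  { $pull: { seasons: { endDate: { $lt: '2020-05-01' } } } }
)

// Umbrella Corp's search then becomes a single query on companies:
db.companies.find({
  seasons: {
    $elemMatch: { startDate: { $lte: '2020-06-30' }, endDate: { $gte: '2020-06-01' } }
  }
})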

What would be the recommended approach here? Is there maybe another solution I'm not thinking of?

Now the obvious answer might be to change seasons to be recurring, but that is not an avenue we're looking to explore right now.


Solution

Some tricks that help with performance issues:

  • Use database limit/skip (both in queries and aggregations), and implement application pagination. If you expect users to only look at 50 records at a time, why are you sending 50,000? (There is a sketch after this list.)

  • When doing the aggregation, use the $lookup stage after filtering the collection ([$match, $lookup]). Avoid doing the whole aggregation and then filtering (see the same sketch below).

  • If your application algorithm can do operations in parallel, use objects like Futures/Promises so you can query collections concurrently (also sketched below).

  • If your application can be considered a client-side/server-side/database-side architecture:

    • When searching, let the client-side receive the object_id of the Company, and fetch/cache Company information independently and asynchronously. As many operations could involve a Company, having a local cache avoids an additional server-side trip.

    • Let the client-side do the entire Company/Seasons aggregation.

    • If server-side holds database records in cache, server-side aggregation could be faster than database-side.

    • If server-side holds aggregation results in cache, even faster.
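
For the first two points, here is a sketch of a paginated aggregation that filters companies before the $lookup. The region filter and the page size are made-up illustrations, not fields from the question:

var page = 0, pageSize = 50

db.companies.aggregate([
  // filter companies first, ideally on an indexed field
  { $match: { region: 'Ontario' } },
  {
    $lookup: {
      from: 'seasons',
      as: 'seasons',
      localField: '_id',
      foreignField: 'companyId'
    }
  },
  { $match: { 'seasons.0': { $exists: true } } },
  // paginate so only one page of results travels to the application
  { $skip: page * pageSize },
  { $limit: pageSize }
])

For the parallel-query point, a minimal sketch with the Node.js driver and Promise.all (the db handle and the filters are placeholders):

// Run the two queries concurrently instead of one after the other.
async function findAvailableCompanies(db) {
  const [companies, seasons] = await Promise.all([
    db.collection('companies').find({ /* company filter */ }).toArray(),
    db.collection('seasons').find({ /* availability filter */ }).toArray()
  ]);
  return { companies, seasons };
}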

If it is still slow, I advise not letting junk into the Company collection. It is better to create a Company_Seasons collection that has everything you need already calculated, and keep Company and Seasons clean (a sketch follows the reasons below). Reasons:

  • Easier to analyze performance and manage db indexes
  • Easier to fix or rebuild if something goes very wrong
  • Extensible: when you need to add more searchable fields, there is less risk of reaching the document size limit
  • Only keep data that is current/valid (like you said in the third option)
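
A rough sketch of what such a precomputed collection could look like (the collection name, field shape, and index are assumptions about one possible layout, not something prescribed by the question):

// One document per precomputed working window, duplicated from Company/Seasons
db.company_seasons.insertOne({
  companyId: ObjectId(),            // the Company's _id (placeholder here)
  companyName: 'Acme Construction Co',
  startDate: '2020-05-01',
  endDate: '2020-06-30'
})

// Index that supports the availability search
db.company_seasons.createIndex({ startDate: 1, endDate: 1 })

// "Who is available at some point in June 2020?" now touches only this collection
db.company_seasons.find(
  { startDate: { $lte: '2020-06-30' }, endDate: { $gte: '2020-06-01' } },
  { companyId: 1, companyName: 1 }
)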