Question

I have a huge DB, and CalculateTasksFromDB() takes a long time (and lots of memory). Once that method is done, there is a huge list of tasks. There are worker processes in the system (at any point in time there might be zero to 100 of them), and each needs to get the next tasks to work on. The tasks different processes get must be mutually exclusive, e.g. the task with id 234234 should be processed by one process and one process only.

I have chosen to have a 'service process' (not sure if that's the correct terminology) running in the system, implemented as an HTTP server. So, to get its next list of tasks, each worker process literally goes to http://127.0.0.1:2131/tasks and receives a sublist of that huge list of tasks. To ensure that the tasks are mutually exclusive, I use Flask with threaded=False in the constructor.
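For reference, the design described above can be sketched with just the standard library. The task-generating function and the batch size are hypothetical stand-ins, and a lock plays the role that threaded=False plays in Flask (serialising hand-outs so no two workers receive the same task):

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

def calculate_tasks_from_db():
    # Hypothetical stand-in for the expensive CalculateTasksFromDB().
    return [{"id": i} for i in range(100)]

class TaskState:
    """Holds the remaining tasks; the lock makes each hand-out atomic,
    which is what threaded=False gives you implicitly in Flask."""

    def __init__(self, tasks, batch_size=10):
        self._tasks = list(tasks)
        self._lock = threading.Lock()
        self._batch = batch_size

    def next_batch(self):
        with self._lock:
            batch, self._tasks = self._tasks[:self._batch], self._tasks[self._batch:]
            return batch

STATE = TaskState(calculate_tasks_from_db())

class TaskHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/tasks":
            self.send_error(404)
            return
        body = json.dumps(STATE.next_batch()).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # silence per-request logging

if __name__ == "__main__":
    ThreadingHTTPServer(("127.0.0.1", 2131), TaskHandler).serve_forever()
```

Each worker calls GET /tasks until it receives an empty list. Note that this sketch already hints at the gaps the answer below discusses: nothing re-queues a batch if the worker that took it dies.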

I looked for info about my design (whether it makes sense, what the alternatives are, etc.) but couldn't find any. So:

1. Does my approach make sense?
2. What are the alternatives?
3. What improvements would you suggest?


Solution

Load Balancing

The problem sounds like one of Load Balancing, which I'd say is quite a common thing to want to do.

OP Proposed Solution

As with every software engineering problem, a hand-rolled solution is a valid way forward; however, I can think of several reasons why it would not be my preferred approach.

  • Load balancing is a solved problem; many people (who are all much smarter than me, I'm sure) have already thought long and hard about it, and have made the fruits of their labour freely available so that the rest of us don't have to re-invent the wheel.
  • Rolling your own for anything always involves the need to test that code and do all the other zillions of things which you won't need to worry about when choosing someone else's already working/tested/stable solution.
  • Workers need some mechanism for knowing when a new task is available to start. HTTP is a request-response protocol: the server cannot push notifications out to individual workers, so either the workers would need to "poll" the load balancer (not very elegant or efficient) or it would need to use another protocol such as WebSockets.
  • A Load balancer is an infrastructure component, so to roll-your-own often involves the need to tackle issues such as Resilience, Recovery, High Availability, Throughput, Scalability, Logging, Monitoring...
  • The load balancer needs to take responsibility for assigning tasks to workers: if an individual worker suddenly dies or drops out in the middle of a task, the load balancer ideally needs a way to re-assign that task to another available worker.
  • In reality, there are sometimes valid reasons why a task may not be able to be completed (for example, the task may have been malformed, or there may be a bug in the worker code, etc.) - in this situation, a worker should be able to fail or reject the task to trigger a system monitoring alert somewhere.

Message Queueing

Whilst Message Queues are certainly not the only possible way of achieving load balancing, they tend to be among the simplest to use, since most "MQ" technologies are deliberately built with load balancing in mind (amongst other things).

Typical reasons for using a Message Queue (this answer is biased towards RabbitMQ, since that's the tool I am most familiar with, but I'm sure other tools are equally capable):

  • Message Queues tend to use protocols such as AMQP which support push notifications, so there is no need for polling; worker processes simply subscribe to a queue, and the client library usually provides quite a simple way for the worker to receive each task, which the worker acknowledges when it is finished.
  • The message queue ensures that tasks are evenly and fairly distributed out to each worker
  • It provides delivery guarantees and ensures a task is delivered to only one worker at a time; the message queue figures out all the networking and message-delivery details so you don't have to.
  • Message broker technologies such as RabbitMQ can easily be configured as a cluster, and client libraries often have all the networking "stuff" taken care of without you needing to write any extra code at all to get high availability and failover capability. E.g. they're often able to automatically failover to other nodes in that cluster without you needing to write any additional code, nor need to worry about handling things like HTTP error codes, Retries, Timeouts, etc.
  • Message queue technologies often have "work queue" patterns built-in straight out-of-the-box with no need to write any additional code - some even take care of message serialisation too (no need to write any extra queries or database code).
  • Messaging technologies are often able to auto-detect when workers die unexpectedly, automatically re-queueing any messages which were consumed without an acknowledgement.
  • Some tools allow bad/malformed tasks which cause errors to be 'rejected' by a worker, landing them in a "Dead Letter" queue that can be observed separately.
  • The message technology can also be used as a way for workers to respond or send "follow up" messages back to the task producer(s) confirming if/when those tasks are completed, or maybe publish a message on to another worker which can persist the task result back into the original database.
  • The popular and well-known Message queue technologies are very well tested and supported, maintained by people who have a lot of expertise in messaging and infrastructure, so can generally be trusted to work as-intended in a production environment.
  • Message Queueing is generally quite a simple concept which pretty much just works straight out of the box with minimal fuss (At least, that is my experience with those I've used anyway). Of course, every new tool and technology has a learning curve, so there is always a need to invest time familiarising yourself with the tool, its patterns, and its client library (of course, some client libraries may require more effort than others).
  • A well-supported 3rd-party solution also potentially solves a whole host of infrastructure/environmental problems which most people wouldn't even think of until the software is deployed into the real live/production environment (because "it works on my machine" doesn't necessarily mean "it works in the real world").
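The acknowledge/re-queue behaviour mentioned above is the key difference from a plain HTTP hand-out. A toy illustration of the idea (this is not RabbitMQ, just a sketch of the semantics a broker gives you): a task handed to a worker stays "in flight" until acknowledged, and if the worker dies without acking, the broker puts the task back on the queue for someone else.

```python
import queue
import threading

class AckQueue:
    """Toy model of broker ack/requeue semantics (not a real MQ client)."""

    def __init__(self, tasks):
        self._q = queue.Queue()
        for task in tasks:
            self._q.put(task)
        self._in_flight = {}   # delivery tag -> task
        self._lock = threading.Lock()
        self._next_tag = 0

    def get(self):
        """Hand out the next task; it stays 'in flight' until acked."""
        task = self._q.get()
        with self._lock:
            self._next_tag += 1
            tag = self._next_tag
            self._in_flight[tag] = task
        return tag, task

    def ack(self, tag):
        """The worker finished the task; forget it for good."""
        with self._lock:
            self._in_flight.pop(tag)

    def requeue_unacked(self):
        """What a broker does when it detects a consumer has died:
        every unacknowledged task goes back on the queue."""
        with self._lock:
            for task in self._in_flight.values():
                self._q.put(task)
            self._in_flight.clear()
```

With a plain "GET /tasks" hand-out there is no equivalent of requeue_unacked(): a batch taken by a worker that crashes is simply lost.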

Why not use a hand-rolled solution

One thing to stress about "hand-rolled" solutions, particularly when it comes to infrastructure code, is that problems which may seem rather trivial can turn out to be deceptively complex and difficult once you take into account the fact that you are at the mercy of networks, databases, operating systems, and many other moving parts that are always somewhat out of your control.

The tough reality for any distributed solution is that individual applications and services have to cope with these things; they simply have no choice in the matter. It's important not to underestimate the difference between good and bad infrastructure. The really time-consuming problems often don't manifest until long after a solution has been deployed into the production environment, and in the long term, the real-world cost of diagnosing faults in production and building resilience against these issues is usually several orders of magnitude greater than the cost of building the initial solution.

Even if your hand-rolled solution starts out being only 50-100 lines of trivial code that you could throw together in 30 minutes, the time spent writing that code is usually only a tiny part of delivering a working solution and keeping it working for many months/years.

For example: by writing your own load balancer, you also need to think about testing it (preferably with automated tests), maybe writing your own supporting client library (also with tests), perhaps including extra code to protect against unexpected environmental problems, adding sufficient logging/monitoring to debug live issues, creating the CI/CD pipeline, writing sufficient documentation, providing app settings/config, managing the release process, etc. These are zillions of things that aren't about the core 'load balancing' problem, but which are done for you when you invest in a popular, tried, tested, well-supported, well-maintained and well-documented 3rd-party solution.

Example: RabbitMQ in Python

There's a very simple example using RabbitMQ Work Queues in Python here (there may be better client libraries for Python than the one used there), but it's hopefully enough to get the idea and try it out (a single queue in RabbitMQ can have as many consumers/workers as you like to distribute work): https://www.rabbitmq.com/tutorials/tutorial-two-python.html
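The gist of the worker side from that tutorial looks roughly like the sketch below. It requires the third-party pika client library and a RabbitMQ broker running on localhost, and the queue name task_queue follows the tutorial's convention; treat it as a starting point rather than a finished worker:

```python
import pika  # third-party: pip install pika; needs a RabbitMQ broker on localhost

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# A durable queue survives a broker restart.
channel.queue_declare(queue="task_queue", durable=True)

def on_task(ch, method, properties, body):
    print(f"processing {body.decode()}")
    # ... do the actual work here ...
    ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only once finished

# prefetch_count=1: don't hand this worker a new task until it acks the
# previous one; this is what spreads tasks fairly across all running workers.
channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue="task_queue", on_message_callback=on_task)
channel.start_consuming()
```

The producer side is a matter of calling channel.basic_publish(exchange="", routing_key="task_queue", body=...) with one message per task, so the huge list from CalculateTasksFromDB() simply becomes a stream of published messages.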

Licensed under: CC-BY-SA with attribution