Question

We have a web service that accepts images and metadata uploaded by end users; the uploaded images then go through multiple steps of processing and review with humans involved.

To monitor the status of the service, one of my colleagues suggested we develop a monitoring program that emulates the behavior of end users, i.e., submits synthesized data to the service on a regular basis (a few thousand times per day). Going down this path, we would also need to hide the synthesized data from the UI (and statistics) so that it doesn't confuse end users. I personally believe this is a bit too heavy for monitoring, and quite invasive in that it does much more than just watch the service.

In general, is it good practice to submit phantom data to a web service as a means of day-to-day monitoring? What would be some more lightweight alternatives that would yield approximately the same level of assurance of system health?


Solution

tl;dr Your colleague's solution is both wrong and right.

Reading between the lines, it sounds like the real problem you're trying to solve is detecting liveness: you're trying to check if the service is responding to requests or not.

You should not mix liveness with correctness. Correctness is what your colleague is trying to solve by answering the question: "is my app doing what it's supposed to do?". Correctness is a very hard problem and cannot be solved by monitoring solutions (though see [1]). Liveness is the real problem - it asks "is my app able to accept requests?"

In the business, liveness is solved via a health check, and you never need to emulate the full user behaviour to do this. Instead, have your application developers add an endpoint that returns HTTP 200 when pinged. Then, poll that endpoint from a service colocated with it (e.g. a service that is as close to the application as possible, network-wise). If the endpoint doesn't return or throws a 500, the service is down and liveness is violated.
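A minimal sketch of that pattern, assuming a Python stack; the `/healthz` path, the port, and the Flask/requests libraries are illustrative choices, not something prescribed by the answer:

```python
# Health-check endpoint plus a colocated poller (two processes in practice,
# shown together here for brevity).
from flask import Flask
import requests

app = Flask(__name__)

@app.route("/healthz")
def healthz():
    # Returns 200 as long as the process can serve requests.
    return "OK", 200

def check_liveness(url="http://localhost:8080/healthz", timeout=2) -> bool:
    # Run this from a poller as close to the application as possible, network-wise.
    try:
        return requests.get(url, timeout=timeout).status_code == 200
    except requests.RequestException:
        # No response at all also counts as a liveness violation.
        return False

if __name__ == "__main__":
    app.run(port=8080)
```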

While you can have the healthcheck issue a network request (e.g. contact the database to see if it can connect, or have the healthcheck issuer contact your service over the Internet), I don't recommend it. This adds latency and you also end up testing an unreliable network, which can lead to false positives.

Instead, have local checks on each component (database + web app), and aggregate them all together for an overall picture of liveness. This ensures you have specificity (your monitoring measures what you think it measures), recency (your monitoring is real-time), and scalability (you have an architecture for monitoring new components when they arise).
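For illustration, a local-checks-plus-aggregation sketch; the component names, ports, and the extra disk check are hypothetical:

```python
# Each check runs locally next to its component; the aggregator combines them.
import shutil
import socket

def check_web_app() -> bool:
    # Local TCP probe of the app's own port; no Internet round-trip involved.
    try:
        with socket.create_connection(("127.0.0.1", 8080), timeout=1):
            return True
    except OSError:
        return False

def check_database() -> bool:
    # Same idea, run on (or next to) the database host.
    try:
        with socket.create_connection(("127.0.0.1", 5432), timeout=1):
            return True
    except OSError:
        return False

def check_disk() -> bool:
    # Example of adding a new component check later: enough free disk for uploads.
    return shutil.disk_usage("/").free > 1_000_000_000  # > 1 GB free

def overall_liveness() -> dict:
    checks = {
        "web_app": check_web_app(),
        "database": check_database(),
        "disk": check_disk(),
    }
    checks["healthy"] = all(checks.values())
    return checks
```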

Bear in mind this won't detect cases where your application is itself broken, i.e., it returns errors when people try to use it. That, however, is fine, because this class of errors is addressed by better testing before deploy (and see [1]).


[1] Your colleague's approach is a good way to do integration testing - create test accounts that emulate various workflows to catch regressions or service bugs. You should adopt the idea, but definitely not at the frequency a real monitoring solution needs (i.e. once a minute or more often, depending on your SLAs), because it's heavyweight and will almost certainly cause resource exhaustion. If you add it as a pre-release/pre-deploy action, you can catch bugs and "soft failures" (errors that persist despite the app being available) better. Liveness checks suffice for "hard failures" (errors because the app is simply not available).
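As a rough illustration of that pre-deploy variant, a pytest-style sketch; the staging URL, endpoints, test-account token, and sample payload are all hypothetical:

```python
# Pre-deploy integration test: submit a synthetic upload through a dedicated
# test account and check it enters the pipeline.
import requests

BASE_URL = "https://staging.example.com"
TEST_ACCOUNT_TOKEN = "replace-with-a-dedicated-test-account-token"
AUTH = {"Authorization": f"Bearer {TEST_ACCOUNT_TOKEN}"}

def test_upload_workflow():
    # Submit a synthetic image plus metadata, just as a real user would.
    with open("fixtures/sample.jpg", "rb") as f:
        resp = requests.post(
            f"{BASE_URL}/uploads",
            headers=AUTH,
            files={"image": f},
            data={"title": "integration-test"},
        )
    assert resp.status_code == 201

    # Verify the upload reached the first processing step.
    upload_id = resp.json()["id"]
    status = requests.get(f"{BASE_URL}/uploads/{upload_id}", headers=AUTH).json()["status"]
    assert status in {"queued", "processing"}
```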

OTHER TIPS

I would use the real users' submissions to figure out whether the service is running correctly.

The general approach would be:

Store some statistics about the real users' submissions, such as requests per second, number of failures, number of successes, and the last request timestamp.

The monitoring program would then only retrieve the collected statistics and decide, based on them, whether the system is healthy.
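One way this could look in practice (a sketch; the counter fields, thresholds, and idle limit are illustrative, not from the answer):

```python
# Collect per-request statistics inside the app, then evaluate health from a snapshot.
import threading
import time

class SubmissionStats:
    def __init__(self):
        self._lock = threading.Lock()
        self.successes = 0
        self.failures = 0
        self.last_request_ts = None

    def record(self, ok: bool):
        # Call this once per real user submission.
        with self._lock:
            if ok:
                self.successes += 1
            else:
                self.failures += 1
            self.last_request_ts = time.time()

    def snapshot(self) -> dict:
        with self._lock:
            total = self.successes + self.failures
            return {
                "successes": self.successes,
                "failures": self.failures,
                "failure_rate": self.failures / total if total else 0.0,
                "last_request_ts": self.last_request_ts,
            }

def is_healthy(stats: dict, max_failure_rate=0.05, max_idle_seconds=600) -> bool:
    # Unhealthy if too many failures, or no real traffic for too long.
    if stats["failure_rate"] > max_failure_rate:
        return False
    if stats["last_request_ts"] is None:
        return False
    return time.time() - stats["last_request_ts"] <= max_idle_seconds
```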

Monitoring is a feature of the application. In principle, there's no reason why it can't involve pushing data through the system. That's fundamentally what a heartbeat is, for example.

But before you implement a complex system that creates dummy data, then implement another complex system to make sure the data is removed and doesn't affect any other parts of the system, you should have a clear picture of what you really need.

What metrics are you trying to obtain? What will those metrics indicate? What are the thresholds? How can you respond? If obtaining the metrics is difficult, are there other metrics that could be used as a proxy?

Some raw metrics to watch might be

  • How many new uploads do you have per minute/hour/day/whatever?
  • What percentage makes it to each subsequent step?
  • How long does each step take?

These metrics themselves may actually not be that interesting. But they allow you to monitor for significant changes in those metrics. When one step's success rate goes from an average of 80% to 0%, you'll have something to investigate.
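A small sketch of that kind of change detection; the step names, baseline values, and the 50% drop threshold are made up for illustration:

```python
# Compute per-step success rates from funnel counts and flag significant drops.
def step_success_rates(counts: dict) -> dict:
    """counts maps step name -> uploads that reached that step, in pipeline order,
    e.g. {"uploaded": 1000, "processed": 820, "reviewed": 800}."""
    steps = list(counts)
    rates = {}
    for prev, curr in zip(steps, steps[1:]):
        rates[curr] = counts[curr] / counts[prev] if counts[prev] else 0.0
    return rates

def detect_drops(current: dict, baseline: dict, drop_factor=0.5) -> list:
    # Flag any step whose success rate fell below drop_factor * its historical average.
    return [
        step
        for step, rate in current.items()
        if step in baseline and rate < baseline[step] * drop_factor
    ]

# Example: the review step's success rate collapses from an average of 80% to 0%.
baseline = {"processed": 0.85, "reviewed": 0.80}
current = step_success_rates({"uploaded": 500, "processed": 420, "reviewed": 0})
print(detect_drops(current, baseline))  # -> ['reviewed']
```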

Licensed under: CC-BY-SA with attribution