Question

This might be too opinionated, but I've been struggling with this for far too long and can't seem to make up my mind.

I am trying to compare two approaches to designing a system. Let's assume you want to crawl a site periodically and, if some condition is met, take a snapshot of the webpage. All of that should be recorded in a datastore.

I split the job into 3 services:

  1. crawler-svc: queries the web page and provides its content data
  2. analytics-svc: receives content data and decides its relevance (if relevant, we should snapshot)
  3. evidence-svc: collects evidence about the web page, incl. a snapshot of the site

I struggle to decide between two approaches, described from 50 miles up:

Option 1: Orchestrating the operation with a "master" service (either a new service or the analytics-svc). The master-svc triggers a request to crawler-svc, hands the results to analytics-svc, and, depending on the result, invokes evidence-svc.

Option 2: Each service is tailor-made for the environment: e.g. crawler-svc knows it should trigger periodically, analytics-svc waits for "content-data-ready" events, and evidence-svc waits for "relevant-page" events.

Please give your opinion and the major reasons you would prefer one over the other.


Solution

There are no absolutes here: for some systems you want orchestration, for others individual independent services. Just analyse the problem to come up with a reasonable architecture and a workable design using components you are comfortable with.

For the given example I would choose an approach that puts the analytics-svc in charge since it is the single high level decision point of the system.

I can't see a reason for the crawler to be its own service and would make it a sub-component of the analytics-svc instead. That would save you the headache of transferring the crawler result to the analytics-svc out of process.

For the snapshot generation I would use some existing solution that the analytics-svc triggers to run asynchronously. The snapshot will most likely be produced in the file system, so tar/gzip it up and store a reference (such as the file name) in the DB.
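A minimal sketch of that last step, assuming the snapshot tool has already written its output into a directory, and using SQLite as a stand-in for whatever datastore you pick (all names here are illustrative):

```python
import sqlite3
import tarfile
from pathlib import Path

def archive_snapshot(snapshot_dir: str, archive_dir: str,
                     db_path: str, url: str) -> str:
    """Tar/gzip a snapshot directory and record the archive's path in the DB."""
    src = Path(snapshot_dir)
    archive = Path(archive_dir) / f"{src.name}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src, arcname=src.name)
    # Store only a reference (the file name), not the blob itself.
    with sqlite3.connect(db_path) as conn:
        conn.execute("CREATE TABLE IF NOT EXISTS evidence (url TEXT, archive TEXT)")
        conn.execute("INSERT INTO evidence (url, archive) VALUES (?, ?)",
                     (url, str(archive)))
    return str(archive)
```

Keeping the blob on disk (or in object storage) and only the reference in the DB keeps the database small and the evidence cheap to move around.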

As an optimization, you could check the HTTP cache-control headers before starting the crawler on an individual page, and skip pages that haven't changed.
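For example, a small helper that turns a previously cached response's validators into conditional-request headers, so the server can answer `304 Not Modified` and the crawler can skip the page (a sketch; the header names are the standard HTTP ones, the dict shape is an assumption):

```python
def conditional_headers(cached: dict) -> dict:
    """Build conditional-request headers from a cached response's validators.

    `cached` maps previously seen response header names to values; the
    resulting headers let the server reply 304 instead of a full body.
    """
    headers = {}
    if etag := cached.get("ETag"):
        headers["If-None-Match"] = etag
    if last_modified := cached.get("Last-Modified"):
        headers["If-Modified-Since"] = last_modified
    return headers
```

On a 304 response the crawler reuses the stored content and never hands anything to the analytics step.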

OTHER TIPS

The alternative to orchestration is choreography. Now that the fancy terms are out of the way let’s get to the point.

Orchestration gives you a single point of management. It also gives you a single point of failure.

Choreography gives you systems that work on their own without being told what to do. It also gives you systems that have to be changed on their own, each with its own flavor and overhead.

So while this choice impacts the mechanics of how the system will work I’m far more concerned with what it will be like to maintain it.

If your maintenance team is fine working with each little service directly every time a change is needed then choreography is fine. If you want to manage workflows in one place using one system then you want orchestration.
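The "manage workflows in one place" point can be made concrete with a small sketch. Here the master owns the entire flow in one function; the three services are stubbed as plain callables (names are hypothetical):

```python
from typing import Callable

def run_workflow(crawl: Callable[[str], str],
                 is_relevant: Callable[[str], bool],
                 record_evidence: Callable[[str, str], None],
                 url: str) -> bool:
    """Orchestration: one place owns the sequence crawl -> analyse -> record.

    Changing the workflow means changing this one function, not N services.
    Returns True if evidence was recorded for the page.
    """
    content = crawl(url)
    if not is_relevant(content):
        return False
    record_evidence(url, content)
    return True
```

With choreography, by contrast, this sequence exists only implicitly, spread across each service's reaction to events.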

Well, you're not quite there. You're not accounting for complexity at all.

As long as we're talking about a truly distributed system with precise SLAs, you have to provide a certain availability rate. And here is where your ideas fall apart: as soon as you have a master managing slaves, you have to think about what happens if the master goes down. You either restart it quickly and make sure it's OK (which is not always possible) or make sure a takeover eventually happens and one of the slaves is elected to become the new master. In other words, you're dealing with a distributed consensus protocol, which is extremely hard to get right and even harder to test end-to-end.

More than that: such complexity is objective and you can't bypass it. There is no way you'll solve this problem without a consensus protocol (and other terribly complicated distributed machinery) hidden somewhere, so the question is where to put it. You could implement it yourself, at the application level: either roll your own solution that doesn't work as expected all the time and causes endless headaches, or take something like ZooKeeper and implement a known solution on top of it. The other option is to build your architecture on top of a system that already implements it, like PostgreSQL, or Kafka, or <name your favorite distributed database>. But that implies the database and its capabilities dictate and predetermine the set of possible solutions. As an example, if you decide to stick with PostgreSQL, you probably want your processing services to work in pull mode, i.e. they periodically try to fetch a new batch of fresh data, process it, and write the outcome within a single transaction. That's going to be different from a RabbitMQ solution, which is a push-mode approach.
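The pull-mode shape described above can be sketched as follows (using SQLite for brevity; on PostgreSQL you would add `FOR UPDATE SKIP LOCKED` to the SELECT so competing workers claim disjoint batches — the table and names here are illustrative):

```python
import sqlite3
from typing import Callable

def pull_and_process(conn: sqlite3.Connection,
                     process: Callable[[str], None],
                     batch_size: int = 10) -> int:
    """Pull-mode worker: claim a batch of unprocessed rows, process them,
    and mark them done, all within a single transaction.

    Returns the number of rows processed; a scheduler calls this periodically.
    """
    rows = conn.execute(
        "SELECT id, payload FROM jobs WHERE done = 0 ORDER BY id LIMIT ?",
        (batch_size,)).fetchall()
    for job_id, payload in rows:
        process(payload)
        conn.execute("UPDATE jobs SET done = 1 WHERE id = ?", (job_id,))
    conn.commit()  # outcome and the "done" marks land atomically
    return len(rows)
```

The key property is that the "done" marks commit together with the outcome, so a crash mid-batch just means the batch is retried.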

So what? Rather than the question you've asked, I suggest another one:

Given that this complexity is objective and thus must exist somewhere, and with respect to the SLAs I have, would I rather be responsible for it myself, or build my system on top of existing solutions made by other people and keep my own part less complex?

Note that for long-term product companies it is pretty common to choose the first option over the second one. So there is no right answer without a context.

I'd have them be self-managing using queues, which is your option 2.

  1. Crawler looks at websites and puts content onto a "data" queue. That's it.
  2. Analytics pops site data off the data queue and decides whether to put the data onto the "evidence" queue (or throw it away, or put it on some other queue, whatever the fate of failures is).
  3. Evidence service pops the evidence queue and does whatever it is designed to do.

Make each part responsible for reading its input queue and creating work for downstream components in the form of another queue item. The advantage is that the components are stand-alone and don't really need orchestrating, since each has only one entry point (its queue), which is self-managed. The data can come from anywhere at any time, so you've added some flexibility here.
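The three steps above can be sketched with in-process queues standing in for a real broker (the fetch is stubbed and all names are illustrative; each step only knows its own input and output queues):

```python
from queue import Queue
from typing import Callable

def crawler_step(url: str, data_q: Queue) -> None:
    """Crawler: fetch a page (stubbed) and put its content on the data queue."""
    data_q.put({"url": url, "content": f"<html>content of {url}</html>"})

def analytics_step(data_q: Queue, evidence_q: Queue,
                   is_relevant: Callable[[str], bool]) -> None:
    """Analytics: pop one item, forward it to the evidence queue if relevant."""
    item = data_q.get()
    if is_relevant(item["content"]):
        evidence_q.put(item)

def evidence_step(evidence_q: Queue, store: list) -> None:
    """Evidence: pop one item and record it (a list stands in for the datastore)."""
    store.append(evidence_q.get())
```

No component calls another directly; replacing the crawler, or adding a second analytics consumer, touches nothing downstream.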

Licensed under: CC-BY-SA with attribution