Question

I want to ask for a review of my big data app plan. I don't have much experience in that field, so every single piece of advice would be appreciated.

Here is a link to a diagram of the architecture: My architecture overview

Our app is a web analytics app. I know there are already many of them; however, we use it to analyse our own traffic data, and we have already built in some useful features. Our app (like any traffic analyser) should be able to grow and process large volumes of data. I designed the system with that principle in mind. Not everything in the plan is my original idea; I have drawn inspiration from the Lambda Architecture (from the book Big Data by Nathan Marz, who worked at Twitter). So, after this quick introduction, let's have a look at my idea.

At the bottom, we have the 'client'. This is the person who uses the app to analyse their traffic. In front of the client, I put an Nginx proxy to load-balance traffic. I have heard that Node is not that good at serving static files and that Nginx is better (Apache would be an equivalent; for example, Eran Hammer posted an article on GitHub where he wrote that Walmart uses Apache to handle HTTPS traffic and then proxies it to Node.js). Then we have the Node.js app (running under Upstart), which is the heart of the system: it analyses data and saves configuration to a relational database. I think a relational database is better, because in our current system, which uses MongoDB, we have some ridiculous nesting, and documents in a single collection can range from 200 bytes up to an unbounded size. If we consider the padding factor that Mongo creates for the bigger ones, then when someone edits a small document it has a side effect: Mongo has to rewrite the whole document and store it at the end of the collection. We could set a manual padding factor, but that still feels a little dirty, isn't best practice, and probably wouldn't prevent the rewriting anyway. I had chosen Oracle, but my boss suggested PostgreSQL because Oracle might be a little costly. I thought we could go one step further and just use MySQL; that database shouldn't have any kind of big load, so let's keep it simple.
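To make the configuration part more concrete, here is a rough TypeScript sketch (just an illustration of my idea, not tested code) of how the Node.js app could save client configuration into PostgreSQL, assuming Express and the node-postgres (pg) driver; the dashboard_config table and its columns are made up for the example.

// Hypothetical sketch: Node.js app behind the Nginx proxy,
// persisting per-client dashboard configuration in PostgreSQL.
import express from "express";
import { Pool } from "pg";

const db = new Pool({ connectionString: process.env.DATABASE_URL });
const app = express();
app.use(express.json());

// Save a client's report configuration.
// Assumes dashboard_config(client_id UNIQUE, config jsonb).
app.put("/config/:clientId", async (req, res) => {
  await db.query(
    `INSERT INTO dashboard_config (client_id, config)
     VALUES ($1, $2)
     ON CONFLICT (client_id) DO UPDATE SET config = EXCLUDED.config`,
    [req.params.clientId, req.body]
  );
  res.sendStatus(204);
});

// Nginx terminates HTTPS and serves static files; only dynamic
// requests are proxied to this port.
app.listen(3000);

The point is only that the Node app stays simple: it validates and stores configuration, while the proxy in front of it handles static files and load balancing.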

Now let's move to the other side and analyse how data gets into the system. Every report goes to a reporting endpoint. Obviously this is some kind of AJAX request, so the path to the endpoint is in JavaScript files on our server. I thought we could manipulate that path in case of heavy load, so we have reporting 'endpoints', not a single 'endpoint' (maybe on subdomains). Every endpoint should send each report to two places: Apache Hadoop and Cassandra. Apache Hadoop is extremely scalable and can store petabytes of data without any problem. Here we store the immutable data; if anything goes wrong, we can restore it or process it again and again if we wish to. It is the batch layer, so it inherently has high latency. To store and partition the data we can use Pail, and for processing Cascalog (two JVM libraries), or just raw MapReduce. Cassandra should compensate for that latency; when batch processing is finished, the equivalent data in Cassandra should be erased. The batch layer should generate 'reports' for the client ('images' of queries for hours, days, months, etc.) and then store them in ElephantDB, which is also a scalable platform developed by Nathan Marz, among others. Having images is a big advantage over querying raw data, because reads become much faster. I want to compensate for the high latency of the data in ElephantDB by querying Cassandra only for the most recent period, the one that hasn't yet been processed from Hadoop into ElephantDB. The batch layer will process data endlessly in a 'while (true)' loop, generating new images or updating old ones.
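To show what I mean by sending every report to two places, here is a rough TypeScript sketch of one reporting endpoint, assuming Express and the DataStax cassandra-driver; the spool file standing in for the Hadoop ingestion path and the pageviews_raw table are just assumptions for the example, not fixed parts of the plan.

// Hypothetical sketch of a reporting endpoint that writes each event
// to both the batch layer (via a local spool file destined for HDFS)
// and Cassandra (the speed layer).
import express from "express";
import { randomUUID } from "crypto";
import { appendFile } from "fs/promises";
import { Client } from "cassandra-driver";

const cassandra = new Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
  keyspace: "analytics", // assumption
});

const app = express();
app.use(express.json());

app.post("/report", async (req, res) => {
  // Unique id so duplicates can be dropped during batch processing.
  const event = { id: randomUUID(), url: req.body.url, ts: Date.now() };

  // Batch layer: append the immutable raw event; a separate job moves
  // this file into HDFS, where Pail/Cascalog jobs pick it up.
  await appendFile(
    "/var/spool/analytics/events.ndjson",
    JSON.stringify(event) + "\n"
  );

  // Speed layer: the same event goes to Cassandra until the batch
  // views catch up, after which it can be erased.
  await cassandra.execute(
    "INSERT INTO pageviews_raw (id, url, ts) VALUES (?, ?, ?)",
    [event.id, event.url, new Date(event.ts)],
    { prepare: true }
  );

  res.sendStatus(202);
});

app.listen(8080);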

If you have read everything up to here, I'm very impressed by your patience ;). If you have any suggestions, see that something is wrong or missing, or want clarification, let me know, and I will respond to every piece of feedback you post.


PS. Here is an example of how a write and a read (query) request are processed. We have (at the top of my plan) requests with the information that a page has been visited. The endpoint sends the request (with a unique id, so that when the network is partitioned we can remove duplicates during processing) to Apache Hadoop and Cassandra. Hadoop processes all incoming requests and generates, for example (the data format can be different, but this is the general idea):

//by day
 _______ ________________ ______________ 
|  url  |      views     |     date     |
|  '/'  |      3455      |  13-02-2016  |

//by hour
 _______ ________________ __________________
|  url  |     views      |       hour       |
|  '/'  |      355       | 13-02-2016 08:00 |

Then, when we want to get the views for a specific URL and a specific period, we don't have to query our whole master dataset, just a few documents with our pre-processed data images. When someone wants more recent data, we also send a request to Cassandra, where we want to have that kind of images too, updated on every query request (though this depends on how much data we expect in a specific table/collection; sometimes it can be better to update on every write request, or not to update anything at all and not generate images). This way we also get 'caching' of requests in the 'images', and if someone changes the period, we already have processed data to query and only have to add a small amount of the most recent data.
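And this is roughly how I imagine the read side merging the two views, again only a TypeScript sketch with made-up names: getBatchDailyViews is a stub standing in for whatever client we end up using for ElephantDB, and the Cassandra part assumes a pageviews_by_hour table keyed by url with hour as a clustering column and views as an int.

// Hypothetical merge of batch views (ElephantDB) with the most
// recent hours that so far exist only in Cassandra.
import { Client } from "cassandra-driver";

const cassandra = new Client({
  contactPoints: ["127.0.0.1"],
  localDataCenter: "datacenter1",
  keyspace: "analytics", // assumption
});

// Stub for the ElephantDB lookup so the sketch is self-contained;
// in the real system this would read the precomputed per-day images.
async function getBatchDailyViews(url: string, from: Date, to: Date): Promise<number> {
  return 0; // stub
}

async function viewsForPeriod(
  url: string,
  from: Date,
  to: Date,
  batchHorizon: Date // end of the last completed batch run
): Promise<number> {
  // Everything up to the batch horizon is already an 'image' in ElephantDB.
  const batchTotal = await getBatchDailyViews(url, from, batchHorizon);

  // Hours newer than the last batch run come from Cassandra.
  const recent = await cassandra.execute(
    "SELECT views FROM pageviews_by_hour WHERE url = ? AND hour > ? AND hour <= ?",
    [url, batchHorizon, to],
    { prepare: true }
  );
  const recentTotal = recent.rows.reduce((sum, row) => sum + Number(row.views), 0);

  return batchTotal + recentTotal;
}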

Is the connection between each part of the system OK and does it make sense, or am I missing something important and, for example, shouldn't use a relational database or a proxy for some reason? As I said, I have only theoretical experience, and I know that someone more experienced may have some useful advice. And if everything is OK, someone who wants to build this kind of architecture could use it as an example in the future.


Solution

The architecture is good enough to handle many requests per second, as long as you test and profile it and it proves able to handle the load it is required to handle.

Let me quote Donald Knuth, Computer Programming as an Art, 1974:

The real problem is that programmers have spent far too much time worrying about efficiency in the wrong places and at the wrong times; premature optimization is the root of all evil (or at least most of it) in programming.

What does it even mean to handle "many requests per second"? Ten requests? Ten thousand? Ten million? This is all very relative; any recommendation will be very subjective, and even if you spend ten years analysing it before implementation, you will most likely be very surprised the first time it actually starts handling the real load.

Licensed under: CC-BY-SA with attribution