Question

We have a problem sharing session data across many separate threads and nodes.

The scenario looks like this: there is a global pool of String values, for example 123ABC-1299AA. When a user sends a command to the application, the session needs to take one of these values. Two sessions can never be connected with the same value.

In our environment we have 4 servers with 50 threads waiting for requests.

Current situation: we store the value-to-session mapping in an Oracle database, with a UNIQUE constraint on the value. When we receive a request, we query the table for all used values and iterate over the pool, skipping the used values, to find a free one. If another thread inserts the same value while we are iterating, we get a unique constraint violation and try another one. Also, the database is designed for persistent data, and this kind of mapping is not persistent data.
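For readers who want to see the flow just described, here is a minimal sketch. It uses an in-memory SQLite table as a stand-in for the Oracle table; the table name `mapping`, the column names, and the helper names are assumptions made for illustration, not the real schema.

```python
import sqlite3

# Hypothetical pool; the real one is larger.
POOL = ["123ABC-1299AA", "123ABC-1299AB", "123ABC-1299AC"]

def new_mapping_table():
    # In-memory SQLite stands in for the Oracle table (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE mapping (value TEXT UNIQUE, session_id TEXT)")
    return conn

def assign_value(conn, session_id, pool=POOL):
    # Read every used value, then walk the pool looking for a free one.
    used = {row[0] for row in conn.execute("SELECT value FROM mapping")}
    for value in pool:
        if value in used:
            continue
        try:
            conn.execute(
                "INSERT INTO mapping (value, session_id) VALUES (?, ?)",
                (value, session_id),
            )
            conn.commit()
            return value
        except sqlite3.IntegrityError:
            # Another thread inserted this value after our read: try the next one.
            conn.rollback()
    return None  # pool exhausted
```

Every request reads the whole table and then writes to it, which is where the contention described below comes from.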

Problems: Oracle doesn't cope well with one small table under a huge number of read/write operations: row lock contention, redo log issues, etc. Besides, when we do maintenance on the database we observe timeouts, because database performance drops.

Requirements:

  • One value per session, no duplicates.
  • Low response time, under 100 ms.
  • Data consistency across all application nodes.
  • No lost mappings after a failure.
  • The solution should handle traffic up to 100 TPS.

What would you suggest for this kind of case?

Currently we're evaluating Couchbase for session management, but it has a lot of bugs and poor failover (it can take up to 10 minutes).


Solution

I think you don't just need a good choice of database but also some application restructuring. I assume that you are using a centralized Oracle database and that each of your 4 servers writes to the same database. I am also assuming that you are doing client-side load balancing, i.e. your clients randomly choose one of the application servers to send the request to.

So here are some of my thoughts about this:

-- Why don't you shard the database into n shards (n = 4 in your case) and have each app server look at its own shard whenever a new request comes in? The idea is to avoid any kind of contention. There is a worst-case scenario: all the requests go to one app server and that server's shard fills up, while the shards of the other servers still have empty slots. If your clients select an app server at random you will be fine, but you would still want to handle that worst case. My solution would be: if an app server finds that its shard is full, it randomly picks another app server's shard and queries that shard for an empty slot. Notice that you end up in the same situation, where two app servers could be reading/writing the same table row. But the chances of this happening are greatly reduced and your architecture is more stable.
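The shard-with-fallback idea can be sketched roughly as follows. This is only an in-process model: each "shard" is a Python set guarded by a lock rather than a real database shard, and the class and method names (`ShardedPool`, `acquire`, `home_shard`) are made up for illustration.

```python
import random
import threading

class ShardedPool:
    """In-process model of n shards; real shards would be separate tables or stores."""

    def __init__(self, values, n_shards):
        self.shards = [set() for _ in range(n_shards)]
        self.locks = [threading.Lock() for _ in range(n_shards)]
        # Distribute the pool round-robin across the shards.
        for i, value in enumerate(values):
            self.shards[i % n_shards].add(value)

    def _take(self, shard_id):
        # Atomically remove one free value from a shard, or return None if it is empty.
        with self.locks[shard_id]:
            return self.shards[shard_id].pop() if self.shards[shard_id] else None

    def acquire(self, home_shard):
        # Normal path: each app server only touches its own shard.
        value = self._take(home_shard)
        if value is not None:
            return value
        # Worst case: the home shard is full, so try the other shards in random order.
        others = [i for i in range(len(self.shards)) if i != home_shard]
        random.shuffle(others)
        for shard_id in others:
            value = self._take(shard_id)
            if value is not None:
                return value
        return None  # every shard is exhausted
```

With evenly spread traffic the fallback loop is rarely entered, so servers seldom contend on the same shard.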

But we can do more to handle the situation where 2 app servers read/write the same shard because one server's own shard is full. Here are some ideas:

-- Check-Then-Act is always a problem in any kind of state-sharing system. If two app servers look at the same shard for an empty slot, they will compete (race condition). What I would do in that case is write the query so that if the current value differs from the value I read, my app server logic searches again; if the current value is the same as the value I read, it marks the slot as non-empty. Note that the check and the update must happen in the same transaction so that they are atomic. It needs to be something like Compare-And-Swap (CAS). Whatever database you use, SQL or NoSQL, you will have to write logic in your application to handle Check-Then-Act scenarios under race conditions.
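One way to get the compare-and-swap behaviour in SQL is a conditional UPDATE whose WHERE clause repeats the expected state, then checking the affected row count. The sketch below uses in-memory SQLite as a stand-in; the `pool` table, its columns, and the function names are assumptions for illustration.

```python
import sqlite3

def make_pool(values):
    # In-memory SQLite stands in for the real store (assumption).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pool (value TEXT PRIMARY KEY, session_id TEXT)")
    conn.executemany("INSERT INTO pool (value) VALUES (?)", [(v,) for v in values])
    conn.commit()
    return conn

def claim_slot(conn, session_id):
    """Claim one free value for session_id; return it, or None if the pool is exhausted."""
    while True:
        # Check: find a slot that looks free.
        row = conn.execute(
            "SELECT value FROM pool WHERE session_id IS NULL LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        value = row[0]
        # Act: the UPDATE only matches if the slot is still free (compare-and-swap).
        cur = conn.execute(
            "UPDATE pool SET session_id = ? WHERE value = ? AND session_id IS NULL",
            (session_id, value),
        )
        conn.commit()
        if cur.rowcount == 1:
            return value
        # rowcount == 0: another server took the slot in between, so search again.
```

The retry loop is the application-side handling of the race: a lost race costs one extra round trip instead of producing a duplicate assignment.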

I also think that Oracle is not a great choice. First of all, it's expensive. Secondly, it looks like all you need is a key-value store, so Oracle is overkill for that in my opinion. Also, if you have one Oracle database, you have a single point of failure there.

So overall, a clustered (or sharded) key-value store is a good approach for you to take. In my opinion, Couchbase is not a bad choice. I have used it for large-scale session management. In our case we used Moxi to do the mapping to Couchbase nodes for us. If a Couchbase node went down, Moxi took some time (~2-5 minutes) to notice that the node was down and elect a new master Couchbase replica. But if your application is more sensitive, you can use a Couchbase cluster without Moxi and keep the logic that maps requests to Couchbase nodes in your app.

But in your current scenario, what happens if your Oracle database goes down? I am sure your downtime is more than 3-5 minutes. Also, our failover time was 2-5 minutes because we had 90+ Couchbase nodes. In your case you would probably need just a few (4-5), and failover would take seconds.

Another thing I would do in your case is have each application server reserve some slots in advance so that reads and writes become fast. When an application server has reserved slots, those slots are marked as "reserved". If that application server ever goes down, all its reserved slots are freed. Or you can write your query like this:

if session == "reserved" and app_server.health != "ACTIVE" then the slot is considered free.

So merely by writing app_server.health = INACTIVE in your database, you automatically free all the slots that server had marked. I hope I conveyed the idea.
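The reservation rule can be expressed as a small predicate. This is a sketch under assumed data shapes: each slot is a dict with hypothetical `session` and `owner` fields, and server health is a dict keyed by server name; a real system would evaluate the same condition inside the query.

```python
def is_free(slot, server_health):
    """Return True if the slot can be handed out to a new session."""
    if slot["session"] is None:
        return True  # never assigned
    if slot["session"] == "reserved":
        # A reservation only counts while the server that made it is ACTIVE.
        return server_health.get(slot["owner"]) != "ACTIVE"
    return False  # bound to a real session

def free_slots(slots, server_health):
    # Values of all slots that are currently available.
    return [s["value"] for s in slots if is_free(s, server_health)]
```

For example, flipping one server's health entry to "INACTIVE" makes every slot it reserved show up in `free_slots` without touching the slots themselves.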

Data infrastructure scaling is an interesting problem and something I love. Let me know if you have any questions and I will be happy to help.

My main suggestions are:

-- Can you not use a GUID for each user session instead of allocating a pre-calculated session ID (or value) to each user?

-- Think about sharding. Sharding does not have to mean a cluster. You can shard within a single node to avoid contention on a single database table.

-- Evaluate NoSQL for your use case. Couchbase (a document store) is a good option, and I have used it for storing user sessions at large scale. If you want free open-source alternatives, Riak and Redis are good too. Redis does not come with built-in clustering (at the time of writing), but most likely you do not need it. Redis is very powerful and has very fast reads and writes. Twitter uses Redis for keeping tweets of all users.

-- In the end you do have to handle race conditions in your code. So avoid race conditions where you can, but be ready to deal with them when they happen.

-- If you have just one Oracle DB, you already have a single point of failure, and your failover time will be multiple minutes (or hours) anyway. So do not be afraid to venture into clustered key-value stores like Couchbase, Voldemort, etc. For a cluster of 4-5 nodes your failover would be pretty quick (in seconds).
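As a small illustration of the first suggestion in the list above, a random GUID can be generated locally on each app server with no shared state. This only applies if the per-session value does not have to come from the fixed pool; the function name is made up for the example.

```python
import uuid

def new_session_value():
    # uuid4 is randomly generated, so no coordination between servers is needed.
    # Collisions are extremely unlikely, though not strictly impossible.
    return str(uuid.uuid4())
```

Because nothing is read or locked, this path has no contention at all, which is why it is worth ruling in or out first.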

Other tips

To be exact, it's a cluster of 3 nodes, with failover to a second cluster. Oracle was our first choice because of its reliability and data consistency.

Splitting the data across databases is an option, but there is a problem when one instance is down for some reason. The others will first use up their own parts and then try to use the 4th part. That's a problem because an instance will then have to check all of its resources every time, so getting a value will be much slower. We aim to get a value in less than 0.5 s. We have failover, so when one node goes down there is no problem, because the switch takes a few seconds. The problem is when an instance hangs (the node accepts queries but does not process them), or when the DB interconnect is slow and one node locks some rows while others also try to lock them (row lock contention).

Regarding your questions:

-- We need to assign a unique value to each session; maintaining all these values in code will be hard, but it's possible.

-- Our governance process won't agree to that (security, failover, etc.).

-- I'm evaluating Couchbase right now and it looks OK so far, but we hit a problem with failover: the switch to another node can take a few minutes, and there is no guarantee that the failing node will flush its data to the second one. I will read about the other ones.

-- With Oracle, failover to another cluster/node is transparent for us. The worst case is when a node hangs, usually because of some Oracle bug. Then we have something like 10 minutes during which 50-80% of traffic fails.

Thanks for the response. I'll evaluate Couchbase more deeply and read about the other solutions.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow