Session Management across multiple Thread/nodes

Question 1

I think you dont just need a good choice of database but also some application restructure. I assume that you are using a centralized Oracle database and each of your 4 servers write to the same database. And i am also assuming that you are doing client side load balancing i.e your clients randomly choose one of the application servers to send the request to.

So here are some of my thoughts about this :

-- Why dont you shard the database into n shards ( n = 4 in your case ). And have each app server look at its own shard anytime a new request comes. Idea is you want to avoid any kind of contention. There is a worst case scenario here. What if you end up in a situation where all the requests are coming to one of the app server and that app server's shard is now full, while there are empty slots in shards of other slots. If your clients are randomly selecting an app server you will be fine. But you would still want to take care of above worst case scenario. So my solution to that would be : If an app server realizes that its shard is full, it will randomly pick another shard of another app server and will query that shard for empty slot. So notice that you end up in the same situation where two app servers could be reading/writing same table row. But the chances of this happening are reduced by orders of magnitude and your architecture is more stable.

But we can do more to take care of the situation where 2 app servers could be reading/writing to the same shard because one of the app server's shard is full. Here are some ideas : -- Check-Then-Act is always a problem in any kind of state sharing system. So if two app servers are looking at same shard for an empty slot they will compete ( race condition ). What i will do in that case is write my query in such a way that if the current value is different than the value i read then my app server logic will do the search again. If current value is same as the value i read then mark the slot as NON-Empty. Please note that you need to do this as the same transaction so as to be atomic. It needs to be something like Compare-And-Swap ( CAS ). No matter what database you use SQL or NoSQL, you will have to write logic in your application to handle Check-Then-Act scenarios in case of race conditions

I also think that Oracle is not a great choice. First of all its expensive and secondly it looks like all you need is key-value store and so oracle is an overkill for that in my opinion. Also if you have one Oracle database you have a single point of failure there.

So overall, a clustered ( or sharded ) keyvalue store is a good approach for you to take. In my opinion, Couchbase is not a bad choice. I have used it for large scale session management. In our case we used Moxi to to the mapping ( to couchbase nodes for us ). In our case if the couchbase node went down Moxi took some time ( ~ 2-5 mins ) to realize that node has gone down and elect a new master couchbase replica. But if your application is more sensitive, you can use Couchbase cluster without Moxi and keep the logic of mapping request to couchbase node in your app.

But in your current scenario, what happens if your Oracle database goes down ? I am sure your downtime is more than 3-5 minutes. Also our failover time was 2-5 minutes because we had 90+ couchbase nodes. In your case you would probably just need a few ( 4-5 ) and failover would be in seconds.

Another thing i would do in your case is that i would want application to reserve some slots in advance so that my read and write become fast. So when an application has reserved slots those nodes will be marked as "reserved". If that application server ever goes down, all its reserved nodes will be freed. Or you can write your query like this

if session == "reserved" and app_server.health != "ACTIVE" then node is considered free.

So by mere writing of app_server.health = INACTIVE in your database you automatically free all the nodes marked by it. I hope i conveyed the idea

Data infrastructure scaling is an interesting problem and something i love. Let me know if you have any question and i will be happy to help.

My main suggestions are :

-- Can you not take an approach where you use GUID for each user session instead of allocating a pre-calculated session-id ( or value ) to each user ?

-- Think of sharding. Sharding does not mean go to cluster approach. You can shard with in a single node to avoid contention between database tables.

-- Evaluate NOSQL for your use. Couchbase ( Document store ) is good option and i have used it for storing user sessions in a large scale. If you want to go for free open source alternatives, Riak, Redis are good too. Redis would not come up with built-in clustering. But most likely you do not need it. Redis is super powerup and will have very very fast read and writes. Twitter uses Redis for keeping tweets of all users.

-- Eventually you do have to handle race conditions in your code. So avoid race conditions, but be ready to deal with them when they happen

-- If you just have one Oracle DB, you already have single point of failure and your failover time will be multiple minutes ( or hours ) anyways. So do not be afraid to venture into clustered key-value stores like Couchbase, Voldemort etc. For a cluster of 4-5 notes your failover would be pretty quick ( in seconds )

Question 2

Exactly it's cluster of 3 nodes, with failover to second cluster. Oracle was a first choice because it's reliability and data consistency.

Sharing data between data base is option, but there is problem when one instance is down for some reason. Others will use first theirs parts and next they will try to use 4th part. It's problem because this instance will have to check every time all it resource. So getting value will be much slower. We aimed to get value in less than 0.5 s. We've got failover so when one node goes down, there is no problem because switch took few seconds. Problem is when instance hangs (node take queries and not processing), or when db interconnect is slow, one node lock some rows, and other try also to lock(raw lock contention).

About question: --We need to assign unique value to session; maintenance of all this value will be hard in code, but it's possible

-- Our governance process won't agree on that, security, failover etc.

-- I'm evaluating CouchBase right now, and it looks ok right now, but we hit a problem with failover, switch to another node could take few minutes, and also there is not guarantee that failing node will flush data to second one. I will read about other ones.

-- With Oracle our failover to another cluster/node is transparent for us. The worst case is when node hangs, usually because some oracle bugs. Then we have some like 10 min when 50-80% of traffic is failing.

Thanks for response, I'll evaluate CB deeply and read about others solutions.