Question

I know that a big part of Couchbase's performance comes from serving documents out of memory, and for many of my data types that seems like an entirely reasonable aspiration. But given how user data scales and is accessed, I'm wondering if it's reasonable to plan for only a small percentage of the user documents (maybe 10-15% at any given time) to be in memory. Is this a reasonable assumption considering:

  • At any given time, only a fraction of users will be using the system.
  • In this case, users only access their own data (or predominantly so)
  • Recently entered data is exponentially more likely to be viewed than historical user documents

UPDATE:

Some additional context:

  • Let's assume there's a user base of 1 million customers: 20% rarely if ever access the site, 40% access it once a week, and 40% access it every day.
  • At any given moment, only 5-10% of the user population would be logged in
  • When a user logs in, they are likely to re-query certain documents within a single session (although the client does some object caching to minimise this)
  • For any user, the most recent records are very active, the very old records very inactive

In summary, I would say the majority of user-triggered transactional documents are queried quite infrequently, but there is a core set (records produced in the last 24-48 hours and relevant to the currently logged-in group) that would benefit significantly from being in memory.
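
To make that concrete, here is the back-of-envelope arithmetic I'm working from; the documents-per-user and document-size figures are guesses on my part:

```python
# Back-of-envelope working-set estimate; all inputs are assumptions.
TOTAL_USERS = 1_000_000
CONCURRENT_FRACTION = 0.10     # 5-10% of users logged in at any moment
DOCS_PER_ACTIVE_USER = 20      # hot documents from the last 24-48 hours (guess)
AVG_DOC_SIZE_BYTES = 2_048     # assumed average document size

active_users = int(TOTAL_USERS * CONCURRENT_FRACTION)
hot_docs = active_users * DOCS_PER_ACTIVE_USER
working_set_gib = hot_docs * AVG_DOC_SIZE_BYTES / 2**30

print(f"Active users:  {active_users:,}")           # 100,000
print(f"Hot documents: {hot_docs:,}")               # 2,000,000
print(f"Working set:   {working_set_gib:.2f} GiB")  # ~3.81 GiB
```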

Two sub-questions are:

  1. Is there a way to set a timestamp on a per-document basis to indicate that a document needs to be kept in memory?
  2. How does Couchbase cope with the growing list of document IDs held in memory? It is my understanding that all IDs must always be in memory; isn't this too memory-intensive for some apps?

Solution

First, one of the major benefits of CB is the fact that it is spread across multiple nodes. This also means your queries are spread across multiple nodes, and you gain performance as a result (several similar NoSQL databases are also distributed across nodes, so this may not be a differentiator for your comparison).

Next, I believe this question is a little too broad, as the answer really depends on your usage. Does a given user query their data only once, at random? If so, then by your own numbers there will only be an in-memory benefit 10-15% of the time. If instead, once a user is on the site, they query their data multiple times, there is a definite performance benefit.

Regardless, Couchbase has pretty fast disk-access performance, particularly on SSDs, so it may not make much difference either way; without specifics there is no way to be sure. If the documents are relatively small and a user is waiting on a single load, the user is unlikely to notice whether the document was served from RAM or disk.
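
If you want to quantify the RAM-versus-disk difference for your own documents, a quick timing loop against your bucket tells you more than any general benchmark. A rough sketch using the current Python SDK (the 2.2-era client API was different; the connection details, bucket name, and key scheme below are placeholders):

```python
import time

from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

# Placeholder connection details; substitute your own cluster and bucket.
cluster = Cluster("couchbase://localhost",
                  ClusterOptions(PasswordAuthenticator("user", "password")))
collection = cluster.bucket("user-data").default_collection()

def avg_get_ms(keys):
    """Average GET latency in milliseconds across the given keys."""
    start = time.perf_counter()
    for key in keys:
        collection.get(key)
    return (time.perf_counter() - start) / len(keys) * 1000

# Compare keys you expect to be resident (recent) against keys you expect
# to have been evicted (old); this key scheme is hypothetical.
recent = [f"user::42::event::2014-06-{d:02d}" for d in range(1, 29)]
print(f"avg GET latency (recent keys): {avg_get_ms(recent):.2f} ms")
```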

Here is an interesting article on benchmarks comparing CB against similar NoSQL platforms.

Edit:

After reading your additional context, I think your scenario lines up almost exactly with how Couchbase was designed to operate. From an eviction standpoint, CB keeps the newest and most-frequently accessed items in RAM. As RAM fills up, the oldest and least-frequently accessed items are "evicted" from RAM (their values already reside on disk). This link from the Couchbase Manual explains how this works in more detail.
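
As a rough illustration of that behaviour (a toy model only, not Couchbase's actual NRU implementation), picture a fixed-size cache that keeps recently touched items resident and drops the least-recently-used item once its quota is exceeded:

```python
from collections import OrderedDict

class ToyNRUCache:
    """Toy LRU-style cache: recently read/written values stay resident;
    the least-recently-used value is evicted when the quota is reached.
    (Couchbase's real NRU algorithm is more sophisticated than this.)"""

    def __init__(self, quota):
        self.quota = quota
        self.items = OrderedDict()

    def get(self, key):
        if key in self.items:
            self.items.move_to_end(key)       # touching an item keeps it resident
            return self.items[key]
        return None                           # cache miss: would go to disk

    def put(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.quota:
            self.items.popitem(last=False)    # evict the least recently used

cache = ToyNRUCache(quota=3)
for k in ("a", "b", "c"):
    cache.put(k, k.upper())
cache.get("a")            # "a" is now recently used
cache.put("d", "D")       # evicts "b", the least recently used
print(list(cache.items))  # ['c', 'a', 'd']
```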

I think you are on the right track with Couchbase. In any case, its flexibility with scaling will easily allow you to tune the database to your application. I really don't think you can go wrong here.

Regarding your two questions:

  1. Not in Couchbase 2.2.
  2. You should use relatively small document IDs. While it is true that all keys are stored in RAM, with small IDs the overhead stays manageable; a deployment is not "right-sized" if a significant percentage of the available cluster RAM goes to storing keys. This link talks about keys and gives details relevant to key size (e.g. the 250-byte limit on key size, per-item metadata, etc.).
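
To put rough numbers on point 2, here is a quick estimate; the ~56 bytes of per-item metadata is an approximation of the figure commonly cited for the 2.x era, and the average key size is assumed:

```python
# Rough estimate of RAM consumed by keys plus per-item metadata.
NUM_DOCS = 10_000_000
AVG_KEY_BYTES = 40        # e.g. "user::123456::event::20140101" style keys
META_BYTES_PER_ITEM = 56  # approximate per-item metadata overhead in CB 2.x

key_ram_gib = NUM_DOCS * (AVG_KEY_BYTES + META_BYTES_PER_ITEM) / 2**30
print(f"RAM for keys and metadata: {key_ram_gib:.2f} GiB")  # ~0.89 GiB

# Under 1 GiB for 10 million documents is small relative to a multi-node
# cluster's RAM, which is why modest key sizes rarely dominate a deployment.
```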

OTHER TIPS

Basically, the decision point you are weighing is how to size the Couchbase cluster's bucket RAM: allowing a reduced residency ratio (the percentage of document values held in RAM) and using cache misses to pull the rest from disk.
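
A sketch of that sizing decision; every input here is an assumption you would replace with your own measurements:

```python
# Residency-ratio estimate: what fraction of document values fits in RAM?
NUM_DOCS = 10_000_000
AVG_VALUE_BYTES = 2_048              # assumed average document value size
BUCKET_RAM_QUOTA_BYTES = 8 * 2**30   # assumed 8 GiB bucket quota, cluster-wide
OVERHEAD_FRACTION = 0.30             # rough allowance for keys, metadata, watermarks

usable_ram = BUCKET_RAM_QUOTA_BYTES * (1 - OVERHEAD_FRACTION)
resident_docs = int(usable_ram / AVG_VALUE_BYTES)
residency_ratio = min(1.0, resident_docs / NUM_DOCS)

print(f"Resident documents: {resident_docs:,}")      # ~2.9 million
print(f"Residency ratio:    {residency_ratio:.0%}")  # ~29% with these inputs
```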

However, there are caveats in this scenario as well. You will also have relatively constant "cache eviction", where not-recently-used values are removed from the RAM cache as cache-missed documents are pulled from disk into RAM; this is because you will always be floating at the high water mark for the bucket RAM quota. If you simultaneously have a high write velocity (new/updated data), those writes also need to be persisted. The two processes can compete for disk I/O, and if the write velocity exceeds your capacity to evict and retrieve, your SDK client will receive a temporary OOM error when the server cannot evict fast enough to free RAM for new writes. As you scale horizontally this becomes less likely, because you have more disk I/O capacity spread across more machines, all performing this process simultaneously.
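
A common way to handle that temporary OOM error on the client is retry with exponential backoff. A minimal sketch with the Python SDK, assuming the transient condition surfaces as an exception (the exact exception name varies by SDK version):

```python
import random
import time

from couchbase.exceptions import TemporaryFailException  # name varies by SDK version

def upsert_with_backoff(collection, key, doc, max_retries=5):
    """Retry an upsert with exponential backoff when the server reports a
    temporary OOM (it cannot evict fast enough to accept the write)."""
    for attempt in range(max_retries):
        try:
            return collection.upsert(key, doc)
        except TemporaryFailException:
            # Give the node time to evict back below the high water mark;
            # jitter avoids synchronized retries from many clients at once.
            time.sleep((2 ** attempt) * 0.05 + random.uniform(0, 0.05))
    raise RuntimeError(f"upsert of {key!r} still failing after {max_retries} retries")
```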

If by "queried" you mean querying indexes (i.e. views), the index is a separate data structure on disk, and getting results back from it is of course not subject to eviction/NRU. But if you follow the view query with a multi-get of the matching documents, the above still applies. (Don't emit entire documents into your index!)
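
For example, the pattern of querying a view for matching IDs and then fetching the documents might look like this with the current Python SDK (the design document, view name, and bucket are hypothetical, and the 2.2-era API differed):

```python
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

cluster = Cluster("couchbase://localhost",
                  ClusterOptions(PasswordAuthenticator("user", "password")))
bucket = cluster.bucket("user-data")        # hypothetical bucket
collection = bucket.default_collection()

# 1) Query the view index. The view's map function should emit only keys
#    (and small values), never entire documents.
result = bucket.view_query("user_docs", "recent_by_user")  # hypothetical view

# 2) Fetch the matching documents by ID. These reads go through the managed
#    cache, so the eviction/NRU behaviour described above applies here.
docs = [collection.get(row.id).content_as[dict] for row in result.rows()]
```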
