Let's say I have an interface IDataAccessObject<TSource> with CRUD methods to access a data source.

I have an implementation StrategyDataAccessObject<TSource>, which uses the strategy pattern in order to fetch and update data.

All persistence strategies should implement IPersistenceStrategy<TSource>, so for instance, I may have:

  • CachePersistenceStrategy<TSource>
  • FileSystemPersistenceStrategy<TSource>
  • SQLServerPersistenceStrategy<TSource>
  • ...

StrategyDataAccessObject<TSource> is injected with a list of these strategies.
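
For concreteness, the two interfaces might look roughly like this in Java (a minimal sketch; the method names are my assumptions, not taken from any actual code):

interface IPersistenceStrategy<TSource> {
    TSource findById(int id);   // returns null when this strategy has no result
    void insert(TSource item);
    void update(TSource item);
    void delete(int id);
}

interface IDataAccessObject<TSource> {
    TSource findById(int id);
    void insert(TSource item);
    void update(TSource item);
    void delete(int id);
}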

The workflow is somewhat like this:

  • When the user queries the data source (i.e., does not change the inner data model), the DAO iterates over all strategies; if any of them returns a non-empty result, it breaks out of the loop and returns that result to the user.
  • When the user requests a change to the data model (i.e., an insertion, update, or deletion), the DAO applies the change to all the strategies.

So if we have the following list of strategies, [cache, fileSystem, sqlServer], and the user queries for the user with ID = 1, we first inspect the cache. If the user is not there, we move on to inspect the file system. If it is there, we don't go to the SQL Server database; we return what we found on the file system.
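
Sketched against the hypothetical interfaces above, the DAO's behavior might look like this:

import java.util.List;

class StrategyDataAccessObject<TSource> implements IDataAccessObject<TSource> {
    private final List<IPersistenceStrategy<TSource>> strategies;

    StrategyDataAccessObject(List<IPersistenceStrategy<TSource>> strategies) {
        this.strategies = strategies; // e.g. [cache, fileSystem, sqlServer]
    }

    @Override
    public TSource findById(int id) {
        // Query: return the first non-empty result and stop looking.
        for (IPersistenceStrategy<TSource> strategy : strategies) {
            TSource result = strategy.findById(id);
            if (result != null) {
                return result;
            }
        }
        return null;
    }

    @Override
    public void insert(TSource item) {
        // Mutation: apply the change to every strategy.
        for (IPersistenceStrategy<TSource> strategy : strategies) {
            strategy.insert(item);
        }
    }

    @Override
    public void update(TSource item) {
        for (IPersistenceStrategy<TSource> strategy : strategies) {
            strategy.update(item);
        }
    }

    @Override
    public void delete(int id) {
        for (IPersistenceStrategy<TSource> strategy : strategies) {
            strategy.delete(id);
        }
    }
}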

A cache strategy is initialized empty, so I have to fill in the data somehow.

Provided the DAO does not know anything about the strategies' implementations (it just sees an interface), and knowing that each strategy is unaware of the existence of the others, how can I fill my cache strategy with data from the next strategies?


Solution

I'm far from a design patterns wizard, but this doesn't sound to me like a natural fit for the strategy pattern, given this kind of synchronization going on between the strategies, as opposed to simply choosing one appropriate strategy based on runtime input.

Instead, it sounds strikingly similar to the memory hierarchy used by our hardware, ranging from disk to DRAM to CPU cache to registers.

Provided that you can find a common, abstract interface for all these data sources, one thing that might be worth trying is to instead aggregate them using an intrusive doubly-linked list representation, like so (note that the list pointers/references can and probably should be stored in concrete objects):

A singly-linked representation would also do just fine here, though the doubly-linked representation may simplify the implementation quite a bit at a trivial cost, given the coarseness of these nodes.

(Diagram: data-source nodes exposing a common interface, each linked to its smaller/faster and bigger/slower neighbor.)
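
As a minimal sketch, such a common interface might look like this (the name IDataSource and the method signatures are assumptions for illustration):

// A common abstract interface shared by every level of the hierarchy
// (memory cache, disk cache, SQL Server). The bigger/smaller links
// themselves live in the concrete classes or a shared base class.
interface IDataSource<TSource> {
    TSource read(int key);
    void create(int key, TSource data);
    void update(int key, TSource data);
    void delete(int key);
}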

This allows you to do the dependency injection at the data source level, which is useful for automating the synchronization/caching that occurs between these data sources in a very symmetrical fashion. Caching then stops being an external concern of the DAO and becomes an internal concern of the data storage type.

We can then link together the data sources like so:

(Diagram: the data sources linked into a chain: memory cache ↔ disk cache ↔ SQL Server.)

The DAO can then be injected with an IDataSource-compliant node, preferably your fastest one (the memory cache), since it's typically the first node we check for the availability of data prior to checking the bigger but more expensive nodes linked to it.

(Diagram: the DAO injected with the memory cache node at the head of the chain.)

This allows quite a bit of flexibility with how you choose to cache data, since you can skip the caching by simply injecting the DAO directly with the SQL Server data source if caching is undesirable. It's also easy to do things like take the disk cache out of the picture by simply connecting the memory cache to the server and injecting the DAO with the memory cache.
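
Wiring it up might look something like the following sketch, where every class name (SqlServerDataSource, DiskCacheDataSource, MemoryCacheDataSource, DataAccessObject, User) is hypothetical:

// Wire the chain from biggest/slowest to smallest/fastest, then inject
// the DAO with the head node. Each constructor takes its bigger node.
IDataSource<User> sqlServer   = new SqlServerDataSource<>(null);        // last node
IDataSource<User> diskCache   = new DiskCacheDataSource<>(sqlServer);
IDataSource<User> memoryCache = new MemoryCacheDataSource<>(diskCache);

IDataAccessObject<User> dao = new DataAccessObject<>(memoryCache);

// Caching undesirable? Inject the SQL Server node directly instead:
IDataAccessObject<User> uncached = new DataAccessObject<>(sqlServer);

// Disk cache out of the picture? Link the memory cache straight to the server:
IDataAccessObject<User> noDisk =
    new DataAccessObject<>(new MemoryCacheDataSource<>(sqlServer));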

Read Propagation

A cache strategy is initialized empty, so I have to fill in the data somehow.

If you apply this kind of memory hierarchy design, then the process of reading can fill these intermediary caches implicitly/automatically as the DAO simply requests operations from its injected data source.

We begin with the DAO requesting to read from its injected data source, the memory cache. The memory cache then checks in its load implementation if the data is available. If not, it recursively requests a read operation from the bigger, linked data source, the disk cache, and this repeats.

Here's the twist: if a smaller data source does not find the requested data and thus propagates a read request to a bigger data source, the returned data is then stored (cached) in the smaller, faster data source prior to returning it to the caller. This logic is symmetrical across data sources and can be provided as part of an abstract base class. The pseudocode would be as follows:

DataSource::read(key):                 // implemented once in a base class
    data = load(key)                   // abstract
    if data != null:
        return data
    if bigger_node != null:
        data = bigger_node.read(key)
        if data != null:               // don't cache a miss
            store(key, data)           // abstract: cache data prior to returning
        return data
    return null
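
In Java, that shared logic might live in an abstract base class, roughly like this (a sketch built on the hypothetical IDataSource interface above):

// Shared read propagation in an abstract base class (sketch).
abstract class DataSourceBase<TSource> implements IDataSource<TSource> {
    protected final IDataSource<TSource> biggerNode; // null for the last node

    protected DataSourceBase(IDataSource<TSource> biggerNode) {
        this.biggerNode = biggerNode;
    }

    @Override
    public TSource read(int key) {
        TSource data = load(key);            // check this level first
        if (data != null) {
            return data;
        }
        if (biggerNode != null) {
            data = biggerNode.read(key);     // recurse into the bigger node
            if (data != null) {
                store(key, data);            // cache here before returning
            }
            return data;
        }
        return null;
    }

    // Each concrete data source supplies its own load/store.
    protected abstract TSource load(int key);
    protected abstract void store(int key, TSource data);
}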

Eviction

These intermediary storage caches may eventually reach a state of being "full". If so, an eviction strategy is necessary. A simple and effective one is to evict the least-recently-used (LRU) data whenever a cache is full. In that case, we need some kind of timestamp to update when data is accessed in the intermediary cache storage (not for the SQL Server, since it's modeling nonvolatile storage), like so:

Memory::load(key):                     // overridden by each data storage type
    if data[key] != null:
        update_timestamp(data[key])
    return data[key]

Memory::store(key, new_data):          // overridden by each data storage type
    if not key_exists(key) and full():
        evict_lru_data()               // evict (remove) least-recently-used data
    data[key] = new_data
    update_timestamp(data[key])
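
For the in-memory level specifically, Java's LinkedHashMap can provide the LRU bookkeeping almost for free when constructed in access order; a sketch, with the capacity an assumed value:

import java.util.LinkedHashMap;
import java.util.Map;

// In-memory level with LRU eviction: a LinkedHashMap in access order
// tracks recency for us, evicting the eldest entry once over capacity.
class LruStore<TSource> {
    private static final int CAPACITY = 1024; // assumed capacity

    private final Map<Integer, TSource> data =
        new LinkedHashMap<Integer, TSource>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, TSource> eldest) {
                return size() > CAPACITY; // evict the LRU entry when full
            }
        };

    TSource load(int key) {
        return data.get(key);     // get() refreshes the entry's recency
    }

    void store(int key, TSource value) {
        data.put(key, value);     // may trigger an LRU eviction
    }
}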

Create, Update, Delete

For these kinds of mutating operations, an easy strategy is as you described: simply mirror the operations to all the data sources.

create(key, data):
    store(key, data)
    if bigger_node != null:
        bigger_node.create(key, data)

... and so forth. The update operation can work the same way, copying the data into all three storage types, with create, as implemented above, likewise filling all three data storage types.
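
Continuing the hypothetical DataSourceBase sketch, the mirrored mutations can likewise be implemented once in the base class; remove(key) here is an assumed abstract method, analogous to load/store:

// Mirrored mutations in the shared base class (sketch).
@Override
public void create(int key, TSource data) {
    store(key, data);                 // apply at this level first
    if (biggerNode != null) {
        biggerNode.create(key, data); // then mirror down the chain
    }
}

@Override
public void update(int key, TSource data) {
    store(key, data);
    if (biggerNode != null) {
        biggerNode.update(key, data);
    }
}

@Override
public void delete(int key) {
    remove(key);                      // assumed abstract, like load/store
    if (biggerNode != null) {
        biggerNode.delete(key);
    }
}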

A more complex strategy is to only perform these operations in the faster data storage types and then propagate modifications to the bigger ones later in a deferred fashion (ex: during eviction, at periodic intervals, on shutdown). However, that might screw with atomicity and make transaction-safety more difficult in addition to exacerbating synchronization issues with the latest data available on the server.

Going Beyond

There are potentially some more complex issues here, like the fact that an intermediary cache may not necessarily be up to date with what's on the server if there are other DAO clients in the mix. There's also transaction-safety to consider and, of course, thread-safety. However, these concerns apply regardless of how you choose to design this, as they're conceptually attached to the notion of intermediary caches sitting between your application and your database server.

In any case, I hope this kind of design helps a bit and potentially offers some solutions to these problems, or at least a new perspective on it, whether or not this kind of design is used.
