Architecture recommendation using SQL Server for real-time aggregation and denormalization

https://stackoverflow.com/questions/16950380

31-05-2022

Question

We have an enterprise LOB application for managing millions of bibliographic (lots of text) records using SQL Server (2008). The database is very normalized (a complete record might easily be made up of ten joined tables plus nested collections). Write transactions are fine, and we have a very responsive search solution for now, which makes generous use of full-text indexing and indexed views.

The issue is that in reality, much of what the research users need could be better served by a read-only warehouse-type copy of the data, but it would need to be continually copied near real-time (latency of a few minutes is fine).

Our search is optimized by several calculated columns or composite tables already, and we would like to add more. Indexed views cannot cover all needs because of their constraints (such as no outer joins). There are dozens of 'aspects' to this data, much like a read-only data warehouse might provide, involving permissions, geography, category, quality, and counts of associated documents. We also compose complex xml representations of the records that are fairly static and could be composed and stored once.

The total amount of denormalization, calculation and search optimization provokes an unacceptable delay if done completely via triggers, and is also prone to lock conflicts.

I've researched some of Microsoft's SQL Server suggestions, and I would like to know if anyone with experience of similar requirements can offer a recommendation from the following three (or other suggestions that use the SQL Server/.NET stack):

1. Transactional replication to a read-only copy - but it is unclear from the documentation how much one can change the schema on the subscriber side and add triggers, calculated columns or composite tables;
2. Table partitioning - not to alter the data, but perhaps to segment large areas of data that currently are recalculated constantly, such as permissions, record type (60), geographical region, etc. Would that allow triggers on the transactional side to run with fewer locks?
3. Offline batch processing - Microsoft uses that phrase often, but does not give great examples, except for 'checking for signs of credit card fraud' on the subscriber side of transactional replication, which would be a great sample, but how exactly is that done in practice? SSIS jobs that run every 5 minutes? Service Broker? External executables that poll continually? We want to avoid the 'run a long process at night' solution, and we also want to avoid locking up the transactional side of things by running an update-intensive aggregating/compositing routine every 5 minutes on the transactional server.
- Update to #3: after posting, I found this SO answer with a link to Real Time Data Integration using Change Tracking, Service Broker, SSIS and triggers - looks promising - would that be a recommended path?
- Another Update: which, in turn, has helped me find rusanu.com - all things ServiceBroker by SO user Remus Rusanu. The asynchronous messaging solutions seem to match our scenario much better than the Replication scenarios...

Solution

Service Broker is a good fit for this task, although there may be drawbacks depending on your particular system configuration. The most valuable feature, IMO, is the ability to decouple the two kinds of processing: writing and aggregation. You can do this very reliably, even across different databases, SQL Server instances, or physical servers. Of course, you need to spend some time designing the message exchange process (specifying message formats, planning conversations, etc.), because this has a huge influence on how satisfactory the resulting system turns out to be.
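To make the decoupling concrete, here is a minimal T-SQL sketch of a one-way "record changed" notification between a write side and an aggregation side. It is an illustration under assumptions, not a tested design: every object name (the `//Biblio/...` message type, contract and services, the two queues, the `RecordId` attribute) is hypothetical, Service Broker is assumed to be enabled on the database, and both services are assumed to live in the same database for simplicity.

```sql
-- Assumption: Service Broker is already enabled on this database
-- (ALTER DATABASE ... SET ENABLE_BROKER). All names are hypothetical.
-- GO is the batch separator used by SSMS/sqlcmd.

-- 1. Message format and contract: only the write side (initiator) sends.
CREATE MESSAGE TYPE [//Biblio/RecordChanged] VALIDATION = WELL_FORMED_XML;
CREATE CONTRACT [//Biblio/RecordChangedContract]
    ([//Biblio/RecordChanged] SENT BY INITIATOR);

-- 2. One queue and one service per side.
CREATE QUEUE dbo.WriterQueue;
CREATE QUEUE dbo.AggregationQueue;
CREATE SERVICE [//Biblio/WriterService] ON QUEUE dbo.WriterQueue;
CREATE SERVICE [//Biblio/AggregationService]
    ON QUEUE dbo.AggregationQueue ([//Biblio/RecordChangedContract]);
GO

-- 3. Write side: enqueue a small notification instead of doing the
--    denormalization inside the write transaction.
DECLARE @handle UNIQUEIDENTIFIER;
DECLARE @msg XML = N'<RecordChanged RecordId="42" />';

BEGIN DIALOG CONVERSATION @handle
    FROM SERVICE [//Biblio/WriterService]
    TO SERVICE N'//Biblio/AggregationService'
    ON CONTRACT [//Biblio/RecordChangedContract]
    WITH ENCRYPTION = OFF;  -- keeps the sketch free of master key setup

SEND ON CONVERSATION @handle
    MESSAGE TYPE [//Biblio/RecordChanged] (@msg);
GO

-- 4. Aggregation side: dequeue one message (waiting up to 5 seconds)
--    and rebuild the derived data for that record.
DECLARE @h UNIQUEIDENTIFIER, @type SYSNAME, @body XML;

WAITFOR (
    RECEIVE TOP (1)
        @h    = conversation_handle,
        @type = message_type_name,
        @body = CAST(message_body AS XML)
    FROM dbo.AggregationQueue
), TIMEOUT 5000;

IF @type = N'//Biblio/RecordChanged'
BEGIN
    DECLARE @recordId INT =
        @body.value('(/RecordChanged/@RecordId)[1]', 'INT');
    -- Placeholder: recompute the denormalized rows / stored XML
    -- for @recordId here.
END
```

A real system would also need to decide how conversations are reused and ended, how errors and poison messages are handled, and whether the receiver runs as an activated stored procedure on the queue or as an external .NET service; these are the design decisions the paragraph above refers to.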

I've used Service Broker (SSBS) for a broadly similar task: near real-time creation of an analytic data warehouse from a regular data flow.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow