Question

Today we are using SQL Server with multiple indexed views. Whenever we update the source tables for a view, there is too long a delay.

I have no experience with Spark, so the question is: can we load the data from the source tables into Spark, create the "data view" there, and then select only the data we need from it, while still keeping the "data view" ready for later reads (not saved to disk, but available for as long as the Spark server is running)?

A very simplified SQL example of the data we need (not actual code):

SELECT tbl1.id AS id1, tbl2.id AS id2, tbl3.id AS id3
FROM tbl1
CROSS JOIN tbl2
CROSS JOIN tbl3

So we add the data from tbl1, tbl2 and tbl3 to Spark, run the data transformation, and can then select data from Spark like:

SELECT * FROM SPARK WHERE id1 = 74 AND id2 = 85 AND id3 > 45 AND id3 < 90

(The question is about Spark, not about tuning SQL queries and SQL indexes.)


Solution

TL;DR: yes, you can do this, but I would not recommend this path.

Spark is lazy: it waits for a request for output, and only then determines what to read from its inputs and how to transform it. That means it will read from the source tables when you run a query. Intermediate results can be cached in RAM or on disk by Spark, so yes, it can work the way you describe. It was not designed to be used this way, however, so you may run into difficulties. Additionally, operating Spark is complicated in itself, especially as a long-running job, and you will likely spend much more effort getting Spark to work reliably for your use case than you would exploring alternate paths.
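
To illustrate, here is a minimal PySpark sketch of the flow you describe, assuming the SQL Server JDBC driver is on the classpath; the connection URL, credentials, and the data_view name are placeholders, not a prescribed setup:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("indexed-view-cache").getOrCreate()

# Placeholder connection details for the source database.
jdbc_url = "jdbc:sqlserver://myserver:1433;databaseName=mydb"
props = {"user": "someuser", "password": "somepassword",
         "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"}

# Load the three source tables and register them for Spark SQL.
for name in ("tbl1", "tbl2", "tbl3"):
    spark.read.jdbc(jdbc_url, name, properties=props).createOrReplaceTempView(name)

# Build the "data view" once and keep it in memory.
view = spark.sql("""
    SELECT tbl1.id AS id1, tbl2.id AS id2, tbl3.id AS id3
    FROM tbl1
    CROSS JOIN tbl2
    CROSS JOIN tbl3
""")
view.cache()   # caching is lazy: it takes effect on the first action
view.count()   # force materialization now
view.createOrReplaceTempView("data_view")

# Later queries hit the cached view instead of SQL Server.
spark.sql("""
    SELECT * FROM data_view
    WHERE id1 = 74 AND id2 = 85 AND id3 > 45 AND id3 < 90
""").show()

Note that the cache lives only as long as the Spark application does, and it is not refreshed when the source tables change; you would have to re-read and re-cache the data yourself.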

I suggest trying to find a solution in-database (e.g. in-memory tables) or using a dedicated caching server (e.g. Redis).
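
For comparison, a caching-server approach can be as simple as keeping query results keyed by their parameters. This is a minimal sketch using Python and the redis-py client; the key scheme, TTL, and JSON serialization are illustrative assumptions:

import json
import redis

r = redis.Redis(host="localhost", port=6379)

def get_rows(id1, id2, run_query):
    key = f"view:{id1}:{id2}"               # hypothetical cache key per parameter pair
    cached = r.get(key)
    if cached is not None:
        return json.loads(cached)           # cache hit: skip the database entirely
    rows = run_query(id1, id2)              # cache miss: query SQL Server as usual
    r.set(key, json.dumps(rows), ex=300)    # keep the result for five minutes
    return rows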

Licensed under: CC-BY-SA with attribution