Question

I am developing a web application with Angular (frontend) and Scala (backend) for a big data team. Because they work with large files for export/import, I am building a module that mimics Microsoft Excel.

So, this is the flow for importing files:

1. The client sends the file to api1.
2. api1 saves the file in a temp folder and responds to the client that the import has begun.
3. At this point, api1 calls api2 (a service) to process the file: it maps rows and columns into a list of objects, creates a table in the database, and bulk-inserts the rows (2500 rows per query).
4. After processing, the file is deleted from the temp folder.

For import, that's fine: it is a background process and the client doesn't need to wait for it, because results are visible from the first second (the first bulk insert is really fast). We are talking about Excel files with several hundred thousand rows, maybe millions.

Now, the problem is the export: how should I think about exporting a database table like this? If I don't save the file anywhere, I need to fetch all the data from the table and build a temp file (an Excel worksheet), which takes some time (maybe 5, 10, or 15 minutes), and I can't keep a client-server connection open that long. Even if I use sockets instead of HTTP requests for this, the client will still have to wait while the file is computed, which is annoying for the user.

One solution is to keep the temp file on the server/cloud, but the data will probably be (or can be) altered by users, so the file would need to be updated before downloading. My question is: how can I map database tables to Excel files so that users get them instantly when they want to download a table?

For api2 I'll use Apache Spark to read the files and write to the database. In any case, this will remain a background process, decoupled from the user request.


Solution

My question is: how can I map database tables to Excel files so that users get them instantly when they want to download a table?

Simple answer: you can't, unless you have the files prepared in advance.
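A minimal sketch of what "prepared in advance" could look like, given the concern that users may alter the table after the file was built. Everything here is hypothetical: `ExportCache`, `PreparedExport`, the `build` function, and the idea of a per-table version number (for example, a counter bumped on every write) are assumptions for illustration, not part of the original setup.

```scala
import scala.collection.mutable

// Hypothetical record of a prepared export file and the table version it was built from.
final case class PreparedExport(tableVersion: Long, path: String)

// `build` is an assumed function that generates the file for (table, version)
// and returns its path; in practice this is the slow step.
class ExportCache(build: (String, Long) => String) {
  private val cache = mutable.Map.empty[String, PreparedExport]

  // Returns the path of an up-to-date file, rebuilding only when the table has changed.
  def fileFor(table: String, currentVersion: Long): String = synchronized {
    cache.get(table) match {
      case Some(entry) if entry.tableVersion == currentVersion =>
        entry.path // file is still current, serve it immediately
      case _ =>
        val path = build(table, currentVersion) // stale or missing: regenerate
        cache.update(table, PreparedExport(currentVersion, path))
        path
    }
  }
}
```

With this shape, a download is instant only when the file is already current; the first request after a change still pays the full build time unless the rebuild is triggered in the background when the table is modified.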

.xlsx files are ZIP archives which can't easily be produced in a streaming fashion, so before you can give one to the user you will need to build the whole file from the database contents. You may well be able to speed up the creation process by using appropriate tools, but you can't get it down to zero.

If your users will only accept .xlsx files you can stop here. They will have to wait for their files.

If .csv files are acceptable, you may be able to create them as a stream, so file creation and network transmission happen in parallel, and the user gets immediate feedback.
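As a rough illustration of streaming CSV, the sketch below writes rows to an output stream one at a time, so earlier rows can already be on the wire while later ones are still being read. It uses only the standard library. The names `CsvStreaming`, `writeCsv`, and `flushEvery` are made up for this example, and it assumes the rows arrive as an `Iterator` (for instance, wrapped around a database cursor) and that `out` is the HTTP response body; how you obtain either depends on your framework.

```scala
import java.io.{BufferedWriter, OutputStream, OutputStreamWriter}
import java.nio.charset.StandardCharsets

object CsvStreaming {
  // Quote a field only when it contains a comma, a double quote, or a line break;
  // embedded double quotes are doubled.
  def escape(field: String): String =
    if (field.exists(c => c == ',' || c == '"' || c == '\n' || c == '\r'))
      "\"" + field.replace("\"", "\"\"") + "\""
    else field

  // Writes rows as they are pulled from the iterator and flushes periodically,
  // so the full table is never held in memory. Returns the number of rows written.
  def writeCsv(rows: Iterator[Seq[String]], out: OutputStream, flushEvery: Int = 2500): Long = {
    val writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))
    var count = 0L
    rows.foreach { row =>
      writer.write(row.map(escape).mkString(","))
      writer.write("\r\n")
      count += 1
      if (count % flushEvery == 0) writer.flush() // push a batch to the client
    }
    writer.flush()
    count
  }
}
```

A call might look like `CsvStreaming.writeCsv(rowIterator, responseStream)`, where both arguments are placeholders for whatever your database layer and web framework provide. The default flush interval of 2500 simply mirrors the batch size mentioned in the question; it is not a tuned value.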

For data scientists working with specialized tools, binary interchange formats may also be appropriate. Which formats are usable by your users, and whether these formats are suitable for streaming, is something that you'd have to find out.

Licensed under: CC-BY-SA with attribution