SQL Database VS. Multiple Flat Files (Thousands of small CSV's)

https://stackoverflow.com/questions/11143724

16-06-2021

Question

We are designing an update to a current system (C++/CLI and C#). In the near future, the system will gather small (~1Mb) amounts of data from ~10K devices. Currently, device data is saved in CSV files (each one a table), and all of these are stored in a wide folder structure.

Data is only ever inserted (create or append to a file, create a folder), never updated or removed. Data processing is done by reading many CSVs into an external program (like Matlab), and the data will mainly be used for statistical analysis.

There is an option to start saving this data to an MS-SQL database. Processing time (reading the CSVs into the external program) could be up to a few minutes.

How should we choose which method to use?
Does one of the methods take significantly more storage than the other?
Roughly, when does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)

I'd appreciate your answers, Pros and Cons are welcome.

Thank you for your time.

Solution

Well, if you are using data in one CSV to get data in another CSV, I would guess that SQL Server is going to be faster than whatever you have come up with. I suspect SQL Server would be faster in most cases, but I can't say for sure. Microsoft has put a lot of resources into making a DBMS that does exactly what you are trying to do.

Based on your description it sounds like you have almost created your own DBMS based on table data and folder structure. I suspect that if you switched to using SQL Server you would probably find a number of areas where things are faster and easier.
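To make the "folder structure as a DBMS" point concrete, here is a minimal sketch of loading such a tree into one append-only table. It is only an illustration under stated assumptions: Python's built-in sqlite3 stands in for SQL Server, and the layout (one folder per device), the column names ts and value, and the function name import_csv_tree are all hypothetical, not taken from the question.

```python
import csv
import sqlite3
from pathlib import Path


def import_csv_tree(root: Path, conn: sqlite3.Connection) -> int:
    """Load every <root>/<device_id>/*.csv file into a single table.

    Assumed (hypothetical) layout: one folder per device, and each CSV
    has a header row with the columns "ts" and "value".
    Returns the number of rows inserted.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings ("
        " device_id TEXT NOT NULL,"
        " ts TEXT NOT NULL,"
        " value REAL NOT NULL)"
    )
    inserted = 0
    for csv_path in sorted(root.glob("*/*.csv")):
        device_id = csv_path.parent.name  # folder name doubles as device id
        with csv_path.open(newline="") as fh:
            rows = [
                (device_id, rec["ts"], float(rec["value"]))
                for rec in csv.DictReader(fh)
            ]
        # Insert-only workload: rows are appended, never updated or deleted.
        conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
        inserted += len(rows)
    conn.commit()
    return inserted
```

With the data in one table, "all readings for device X in June" becomes a single query instead of a walk over folders. With SQL Server the idea would be the same, but the connection code and bulk-load mechanism would differ.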

Possible Pros:

Faster access
Easier to manage
Easier to expand should you need to
Easier to enforce data integrity
Easier to design more complex relationships

Possible Cons:

You would have to rewrite your existing code to use SQL Server instead of your current system
You may have to pay for SQL Server; you would have to check whether you can use Express
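As a rough illustration of the "easier to enforce data integrity" item, the sketch below lets the database itself reject bad rows. It uses Python's built-in sqlite3 as a stand-in for SQL Server, and the devices/readings schema and the helper names make_db and try_insert are assumptions made up for this example.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a hypothetical two-table schema."""
    conn = sqlite3.connect(":memory:")
    # SQLite-specific: foreign keys are not enforced unless switched on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(
        """
        CREATE TABLE devices (device_id TEXT PRIMARY KEY);
        CREATE TABLE readings (
            device_id TEXT NOT NULL REFERENCES devices(device_id),
            ts        TEXT NOT NULL,
            value     REAL NOT NULL,
            PRIMARY KEY (device_id, ts)  -- one reading per device per time
        );
        """
    )
    return conn


def try_insert(conn: sqlite3.Connection, device_id, ts, value) -> bool:
    """Insert one reading; return False if a constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO readings VALUES (?, ?, ?)", (device_id, ts, value)
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

With CSV files, checks like "no duplicate timestamp for a device" or "no readings from an unknown device" would have to live in your own code; here they are declared once in the schema.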

Good luck!

Other tips

I'd like to try hitting those questions a bit out of order.

Roughly, when does reading the raw data from a database become quicker than reading the CSVs? (10 files, 100 files? ...)

Immediately. The database is optimized (assuming you've done your homework) to read data out at incredible rates.
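One reading of "done your homework" is indexing the columns you filter on. The sketch below is an assumption-laden illustration of that idea, not SQL Server code: it uses Python's built-in sqlite3, a hypothetical readings table, and a made-up index name, and it asks SQLite for its query plan before and after the index exists.

```python
import sqlite3


def create_readings_table(conn: sqlite3.Connection) -> None:
    """Hypothetical table holding all device readings."""
    conn.execute(
        "CREATE TABLE readings (device_id TEXT, ts TEXT, value REAL)"
    )


def add_device_time_index(conn: sqlite3.Connection) -> None:
    """Index the columns that the typical query filters on."""
    conn.execute(
        "CREATE INDEX idx_readings_device_ts ON readings (device_id, ts)"
    )


def plan_for_device_query(conn: sqlite3.Connection) -> str:
    """Return SQLite's plan for 'one device, one time range'."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT ts, value FROM readings "
        "WHERE device_id = ? AND ts >= ? AND ts < ?",
        ("dev1", "2021-01-01", "2021-02-01"),
    ).fetchall()
    # The last column of each plan row is the human-readable detail text.
    return " | ".join(row[-1] for row in rows)
```

Without the index the plan is a scan of the whole table; with it, the plan names idx_readings_device_ts. SQL Server has its own index types and plan tools, so treat this only as a picture of the principle.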

Does one of the methods take significantly more storage than the other?

Until you're up in the tens of thousands of files, it probably won't make too much of a difference. Space is cheap, right? However, once you get into the big leagues, you'll notice that the DB is taking up much, much less space.

How should we choose which method to use?

Great question. Everything in the database always comes back to scalability. If you had only a single CSV file to read, you'd be good to go. No DB required. Even dozens, no problem.

It looks like you could end up in a position where you scale up to levels where you'll definitely want the DB engine behind your data pretty quickly. When in doubt, creating a database is the safe bet, since you'll still be able to query that 100 GB worth of data in a second.

This is a question many of our customers have where I work. Unless you need flat files for an existing infrastructure, you just don't think you can figure out SQL Server, or you will only have a few files with small amounts of data to manage, you will be better off with SQL Server.

If you have the option to use an MS-SQL database, I would do that.

Maintaining data in a wide folder structure is never a good idea. Reading your data would involve reading several files, which could be stored anywhere on your disk, so your file I/O time would be quite high. SQL Server, being a production database, already has these problems taken care of.

You are reinventing the wheel here. This is how FoxPro manages data: one file per table. It is usually a good idea to use proven technology unless you are actually making a database server.

I do not have any test statistics here, but reading several files will almost always be slower than a database if you are dealing with any significant amount of data. Given your roughly 10K devices, you should consider using a standard database.
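Since no numbers are given here, the simplest way to settle the "10 files or 100 files?" question is to time both paths on a sample of your own data. The following is a minimal measurement sketch under assumptions: Python's built-in sqlite3 stands in for SQL Server, the readings table and the helper names are hypothetical, and each CSV is assumed to have one header row.

```python
import csv
import sqlite3
import time
from pathlib import Path


def read_all_csv(root: Path) -> list:
    """Read every CSV under root (any depth), skipping each header row."""
    rows = []
    for path in sorted(root.rglob("*.csv")):
        with path.open(newline="") as fh:
            reader = csv.reader(fh)
            next(reader, None)  # assumed: first line is a header
            rows.extend(reader)
    return rows


def read_all_db(conn: sqlite3.Connection) -> list:
    """Read the same data back from a single (hypothetical) table."""
    return conn.execute(
        "SELECT device_id, ts, value FROM readings"
    ).fetchall()


def timed(fn, *args):
    """Return (elapsed_seconds, result) for a single call of fn."""
    start = time.perf_counter()
    result = fn(*args)
    return time.perf_counter() - start, result
```

Run timed(read_all_csv, root) and timed(read_all_db, conn) several times on realistic volumes, because disk caching and network latency to the database server can easily dominate a single measurement.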

License: CC-BY-SA with attribution

Not affiliated with StackOverflow