Question

Does anyone have any experience using R or Python with data stored on solid-state drives? If you are doing mostly reads, this should in theory significantly improve the load times of large datasets. I want to find out whether this is true and whether it is worth investing in SSDs to improve the I/O rates of data-intensive applications.

Solution

My 2 cents: an SSD only pays off if your applications are stored on it, not your data. And even then, only if a lot of disk access is necessary, as for an OS. People are right to point you to profiling. I can tell you without doing it that almost all of the reading time goes to processing, not to reading from the disk.
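
If you want to see this for yourself, a rough check is to compare a raw read of a file with a full parse into a data frame; a sketch (the file name is just a placeholder for any large text file you have):

# Raw read: essentially just disk I/O plus splitting into lines.
system.time(lines <- readLines("test.txt"))

# Full parse: the same I/O plus type guessing, string conversion and object creation.
system.time(df <- read.table("test.txt", header = TRUE))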

It pays off far more to think about the format of your data rather than where it's stored. A speedup in reading can be obtained by using the right application and the right format, such as R's internal format instead of fumbling around with text files. Make that an exclamation mark: never keep fumbling around with text files. Go binary if speed is what you need.
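
For a single object, R exposes its serialized format directly via saveRDS/readRDS; a minimal sketch, assuming a data frame like the test one used in the experiment below (file name is a placeholder):

saveRDS(test, file = "test.rds")   # binary, compressed by default
test2 <- readRDS("test.rds")       # restores the object under whatever name you choose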

Due to the overhead, it generally doesn't make a difference whether you read your data from an SSD or a normal disk. I have both, and use the normal disk for all my data. I do juggle around big datasets sometimes, and have never had a problem with it. Of course, if I have to go really heavy, I just work on our servers.

So it might make a difference when we're talking gigs and gigs of data, but even then I doubt very much that disk access is the limiting factor. Unless you're continuously reading from and writing to the disk, but then I'd say you should start thinking again about what exactly you're doing. Instead of spending that money on SSDs, extra memory could be the better option. Or just convince the boss to get you a decent calculation server.

A timing experiment using a bogus data frame, reading and writing in text format vs. binary format, on an SSD vs. a normal disk:

> tt <- 100
> longtext <- paste(rep("dqsdgfmqslkfdjiehsmlsdfkjqsefr",1000),collapse="")
> test <- data.frame(
+     X1=rep(letters,tt),
+     X2=rep(1:26,tt),
+     X3=rep(longtext,26*tt)
+ )

> SSD <- "C:/Temp" # My SSD, with my 2 operating systems on it.
> normal <- "F:/Temp" # My normal disk, which I use for data

> # Write text 
> system.time(write.table(test,file=paste(SSD,"test.txt",sep="/")))
   user  system elapsed 
   5.66    0.50    6.24 

> system.time(write.table(test,file=paste(normal,"test.txt",sep="/")))
   user  system elapsed 
   5.68    0.39    6.08 

> # Write binary
> system.time(save(test,file=paste(SSD,"test.RData",sep="/")))
   user  system elapsed 
      0       0       0 

> system.time(save(test,file=paste(normal,"test.RData",sep="/")))
   user  system elapsed 
      0       0       0 

> # Read text 
> system.time(read.table(file=paste(SSD,"test.txt",sep="/"),header=T))
   user  system elapsed 
   8.57    0.05    8.61 

> system.time(read.table(file=paste(normal,"test.txt",sep="/"),header=T))
   user  system elapsed 
   8.53    0.09    8.63 

> # Read binary
> system.time(load(file=paste(SSD,"test.RData",sep="/")))
   user  system elapsed 
      0       0       0 

> system.time(load(file=paste(normal,"test.RData",sep="/")))
   user  system elapsed 
      0       0       0 

OTHER TIPS

http://www.codinghorror.com/blog/2010/09/revisiting-solid-state-hard-drives.html has a good article on SSDs; the comments offer a lot of insight.

It depends on the type of analysis you're doing: whether it's CPU bound or I/O bound. Personal experience with regression modelling tells me the former is more often the case, and SSDs wouldn't be of much use then.

In short, it's best to profile your application first.
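
In R, a minimal way to do that is the built-in sampling profiler; a sketch, where the loading and analysis steps are placeholders for your own code:

Rprof("profile.out")                            # start R's sampling profiler
dat <- read.table("mydata.txt", header = TRUE)  # placeholder: your actual loading step
# ... your analysis here ...
Rprof(NULL)                                     # stop profiling
summaryRprof("profile.out")                     # see whether reading or computing dominates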

Sorry, but I have to disagree with the top-rated answer by @joris. It's true that if you run that code, the binary version takes almost zero time to be written. But that's because the test set is weird: the big column 'longtext' is the same for every row. Data frames in R are smart enough not to store duplicate values more than once (via factors).

So in the end we finish with a text file of 745 MB versus a binary file of 335 KB (of course binary is much faster xD):

-rw-r--r-- 1 carlos carlos 335K Jun  4 08:46 test.RData
-rw-rw-r-- 1 carlos carlos 745M Jun  4 08:46 test.txt

However, if we try with random data:

> longtext<-paste(sample(c(0:9, letters, LETTERS),1000*nchar('dqsdgfmqslkfdjiehsmlsdfkjqsefr'), replace=TRUE),collapse="")
> test$X3<-rep(longtext,26*tt)
> 
> system.time(write.table(test,file='test.txt'))
   user  system elapsed 
  2.119   0.476   4.723 
> system.time(save(test,file='test.RData'))
   user  system elapsed 
  0.229   0.879   3.069 

the files are not that different:

-rw-r--r-- 1 carlos carlos 745M Jun  4 08:52 test.RData
-rw-rw-r-- 1 carlos carlos 745M Jun  4 08:52 test.txt

As you can see, elapsed time is not the sum of user + system, so the disk is the bottleneck in both cases. Yes, binary storage will always be faster, since you don't have to write semicolons, quotes, or stuff like that, but just dump the memory object to disk.

BUT there is always a point where the disk becomes the bottleneck. My test was run on a research server where, via a NAS solution, we get disk read/write rates over 600 MB/s. If you do the same on your laptop, where it is hard to go over 50 MB/s, you'll notice the difference.

So, if you actually have to deal with real big data (and repeating the same thousand-character string a million times is not big data), when the binary dump of the data is over 1 GB you'll appreciate having a good disk (an SSD is a good choice) for reading input data and writing results back to disk.
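
If you want a rough idea of what your own disk delivers, you can time a raw sequential read and divide by the file size; a sketch (the path is a placeholder, and note that the OS file cache can inflate the number on a repeated run):

f <- "test.RData"                                    # placeholder: any large file on the disk to test
size_mb <- file.size(f) / 2^20
t <- system.time(invisible(readBin(f, what = "raw", n = file.size(f))))
cat(round(size_mb / t["elapsed"]), "MB/s\n")         # approximate sequential read rate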

I have to second John's suggestion to profile your application. My experience is that it isn't the actual data reads that are the slow part, it's the overhead of creating the programming objects to contain the data, casting from strings, memory allocation, etc.
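
One way to see that overhead in R is to tell read.csv the column types up front so it can skip the per-column guessing and conversion; a self-contained sketch with toy data (the column names and sizes are made up for illustration):

n <- 1e5
toy <- data.frame(id = seq_len(n), value = runif(n), label = sample(letters, n, replace = TRUE))
write.csv(toy, "toy.csv", row.names = FALSE)

system.time(read.csv("toy.csv"))                     # types guessed column by column
system.time(read.csv("toy.csv",
                     colClasses = c("integer", "numeric", "character")))  # types given up front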

I would strongly suggest you profile your code first, and consider using alternative libraries (like numpy) to see what improvements you can get before you invest in hardware.
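
In R, one such alternative for reading large text files is data.table's fread, which is usually much faster than read.table; a sketch, assuming the data.table package is installed and "mydata.csv" stands in for your file:

library(data.table)
dt <- fread("mydata.csv")     # multi-threaded text reader; returns a data.table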

Read and write speeds for SSDs are significantly higher than for standard 7200 RPM disks (it's still worth it over a 10k RPM disk; I'm not sure how much of an improvement it is over a 15k). So, yes, you'd get much faster data access times.

The performance improvement is undeniable. Then it becomes a question of economics: 2 TB 7200 RPM disks are $170 apiece, while 100 GB SSDs cost $210. So if you have a lot of data, you may run into a problem.

If you read/write a lot of data, get an SSD. If the application is CPU intensive, however, you'd benefit much more from getting a better processor.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow