Question

I have a dataset with > 1000K rows and 5 columns (material and price being the relevant columns).

I have written a reactive Shiny app which uses ggplot2 to create boxplots of the price of various materials. For example, the user selects 4-5 materials from a list and Shiny then draws a boxplot of the price of each material:

Price spread of: Made of Cotton, Made of Paper, Made of Wood

It also creates a material combination plot showing the price spread of all the selected materials combined,

e.g. Boxplot of Price spread of: Made of Cotton & Paper & Wood

It runs reasonably quickly on the sample dataset (~5000 rows), but I am worried about scaling it effectively.

The dataset is static, so I have looked at the following solutions:

  1. Calculate the quartile ranges of the various materials (data <- summary(data)) and then use googleVis to create a candlestick chart.

    However, I run into problems when trying to calculate the material combination plot: there are over 100 materials, so calculating all the possible combinations offline is not feasible.

  2. Calculate the quartile ranges of the various materials (data <- summary(data)) and then create a matrix which stores the row number of the summary data (min, median, max, 1st and 3rd quartile) for each material. I can then use some rough calculations to establish the summary() data for the material combination plot and plot it using googleVis. However, I have little experience with this type of calculation in Shiny.
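One way to avoid precomputing every combination is to split the prices by material once and compute the combined summary only for the materials actually selected. The following is a minimal base R sketch, not a tested solution for the real data: the sample data, the column names `material` and `price`, and the helpers `box_stats()` and `combo_stats()` are all hypothetical, and it assumes that "combination" means pooling the rows of every selected material.

```r
# Hypothetical sample data; `material` and `price` are assumed column names.
set.seed(1)
data <- data.frame(
  material = sample(c("Cotton", "Paper", "Wood"), 5000, replace = TRUE),
  price    = round(runif(5000, 1, 100), 2),
  stringsAsFactors = FALSE
)

# Done once (e.g. at app start-up): one numeric vector of prices per material.
prices_by_material <- split(data$price, data$material)

# Min, quartiles and max of a numeric vector (hypothetical helper).
box_stats <- function(x) {
  setNames(
    quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), names = FALSE),
    c("min", "q1", "median", "q3", "max")
  )
}

# Done on demand: pool the prices of the selected materials, then summarise.
# Assumption: the combination plot covers the pooled rows of all selections.
combo_stats <- function(selected) {
  pooled <- unlist(prices_by_material[selected], use.names = FALSE)
  box_stats(pooled)
}

selected     <- c("Cotton", "Wood")
per_material <- t(sapply(prices_by_material[selected], box_stats))
combined     <- combo_stats(selected)
```

The pooled quartiles are computed from the pooled prices rather than from the per-material summaries, because the median and quartiles of a combined group cannot in general be derived exactly from the quartiles of its parts. Only the minimum and maximum carry over directly.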

Can anyone suggest the most robust and scalable way to calculate & boxplot reactive subsets using Shiny?

I understand this is a question about method rather than code, but I am new to the capabilities of R and am still digesting the different class capabilities, and don't want to 'miss a trick', so to speak.

As always thanks!

Please see below for methods reviewed.

Quartile Clustering: A quartile based technique for Generating Meaningful Clusters http://arxiv.org/ftp/arxiv/papers/1203/1203.4157.pdf

Conditionally subsetting and calculating a new variable in dataframe in shiny


Solution

If you really have a dataset with more than 1000K (that is, 1M) rows, it is probably stored in a flat file or in a database. You can always do some precalculation, store the result in a database table, and have the Shiny app query that table instead of loading everything into R every time someone opens the app.

I have built several Shiny apps for internal use, and the lesson I have learned is this: before you build your app, think carefully about how to minimize the calculations R has to do while still delivering the information to the app user. Some of our data has 10 billion+ rows, and a Hive query on it takes more than an hour, so I ended up precalculating the result and using a crontab entry to update the result table every midnight.

I would lean towards something like your method 2, or storing the precalculated results in a MySQL database (perhaps with a Python script that updates the table once a day, if you need something closer to real time later).
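As a rough illustration of the precalculation idea, the sketch below computes the per-material summary offline, saves it, and has the app side load only that small object. It is a sketch under assumptions, not a description of any production setup: the sample data, the `material` and `price` column names, the file name `material_stats.rds`, and the helper `bxp_input()` are hypothetical, and an RDS file stands in for the database table.

```r
# --- Offline step (run once, or on a schedule) ---
# Hypothetical stand-in for the full dataset.
set.seed(42)
full_data <- data.frame(
  material = sample(c("Cotton", "Paper", "Wood"), 10000, replace = TRUE),
  price    = round(rlnorm(10000, meanlog = 3), 2),
  stringsAsFactors = FALSE
)

probs       <- c(0, 0.25, 0.5, 0.75, 1)
by_material <- split(full_data$price, full_data$material)

# One column per material; rows are min, Q1, median, Q3, max.
precomputed <- list(
  stats = sapply(by_material, quantile, probs = probs, names = FALSE),
  n     = lengths(by_material)
)

# Hypothetical location; a database table would serve the same purpose.
stats_file <- file.path(tempdir(), "material_stats.rds")
saveRDS(precomputed, stats_file)

# --- App side ---
# Read the small summary object once at start-up instead of the raw rows.
material_summary <- readRDS(stats_file)

# Build the list that graphics::bxp() takes for the selected materials.
# Whiskers here run to the min and max; outliers are not stored.
bxp_input <- function(summary, selected) {
  list(
    stats = summary$stats[, selected, drop = FALSE],
    n     = unname(summary$n[selected]),
    names = selected
  )
}

z <- bxp_input(material_summary, c("Cotton", "Wood"))

# Inside a Shiny server this could sit in renderPlot(), for example:
#   output$plot <- renderPlot(bxp(bxp_input(material_summary, input$materials)))
# where `input$materials` is a hypothetical selectInput id.
```

With this layout the app never touches the raw rows for the per-material plots. Only a combination plot would still need either the pooled prices or an approximation.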

Licensed under: CC-BY-SA with attribution