Creating queries on the fly and general manipulation for dataset of half a million data records

https://datascience.stackexchange.com/questions/10877

16-10-2019

Question

What would one use to manipulate data of the kind below?

a) The data are biomarkers of different globs:

GlobA 3 4 5 ....
GlobB 2 1 1 ....
GlobC 3 2 1 ....

b) Manipulations are queries like:
- Show me the Globs whose average in the Efficiency file is greater than 50%.
- Show me a sorted Glob list where the first and second sort criteria are xx and yy.
- Construct a chart of Globs that differ in criterion x by the integer 3 (for example, show me Globs whose averages over 5 runs for Structure and Efficiency differ from their nearest neighbors by 3).

Currently, this data is stored in a 100MB Excel file that is painfully slow to load on the speediest computer our lab can afford.

Ideally, there would be an open source program that accepts CSV files of this data and lets the user construct queries that can be stored in a library and pulled up easily. Charting abilities would be great too.
Here are 2 sample files (the real data would be around 40 such files, each containing 20K rows):

Efficiency File:

Glob,Run1,Run2,Run3,Run4,Run5
SigX,6.2,4.8,2.4,4.32,5.59
SigY,8.44,8.16,5.99,0.98,9.6
SigZ,0.00,0.00,0.00,0.01,0.20

Structure File:

Glob,Run1,Run2,Run3,Run4,Run5
SigX,3.2,3.8,2.4,7.32,6.32
SigY,2.4,5.16,6.99,0.98,9.6
SigZ,1.02,0.00,2.23,0.01,0.20

Solution

You can do this in pandas, since your data set is small enough to fit in memory. For "big" data that does not fit in memory you would want to use a database; PostgreSQL with the PostGIS extension would be ideal, since it handles the nearest neighbor part, which is the most challenging aspect. Here are some sample queries in Python.

Show me the Globs where average of each Glob in file Efficiency is greater than 50%

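import pandas

# The first column (Glob) becomes the row index.
efficiency = pandas.read_csv('efficiency.csv', index_col=0)
structure = pandas.read_csv('structure.csv', index_col=0)

# Keep the rows whose mean across the run columns exceeds the threshold.
# 0.5 assumes the values are stored as fractions; use 50 if they are percentages.
efficiency[efficiency.mean(axis=1) > 0.5]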
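
To try the filter without any files on disk, here is a minimal self-contained sketch that reads the sample Efficiency rows from the question out of a string. The helper name globs_with_mean_above is made up for illustration, and the 0.5 threshold is an assumption about how "50%" maps onto the stored values.

```python
import io

import pandas as pd

# Sample Efficiency rows copied from the question.
EFFICIENCY_CSV = """Glob,Run1,Run2,Run3,Run4,Run5
SigX,6.2,4.8,2.4,4.32,5.59
SigY,8.44,8.16,5.99,0.98,9.6
SigZ,0.00,0.00,0.00,0.01,0.20
"""


def globs_with_mean_above(df, threshold):
    """Return the rows whose mean across all run columns exceeds threshold.

    Hypothetical helper, not part of pandas.
    """
    return df[df.mean(axis=1) > threshold]


efficiency = pd.read_csv(io.StringIO(EFFICIENCY_CSV), index_col=0)
# Guard against stray whitespace around the column names in the header row.
efficiency.columns = efficiency.columns.str.strip()

# Assumption: 0.5 stands for "50%"; change it to match your units.
selected = globs_with_mean_above(efficiency, 0.5)
print(selected.index.tolist())
```

With the sample rows this keeps SigX and SigY and drops SigZ, whose mean is well below 0.5.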

Sort the Structure list descending on Run3, then ascending on Run2.

structure.sort_values(by=["Run3", "Run2"], ascending=[False, True])
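
The question also asks for queries that can be stored in a library and pulled up later. One possible way to do that in plain Python is a dictionary of named functions, sketched below. The names QUERIES and run_query are hypothetical, and this is only one of several reasonable designs.

```python
import io

import pandas as pd

# Sample Structure rows copied from the question.
STRUCTURE_CSV = """Glob,Run1,Run2,Run3,Run4,Run5
SigX,3.2,3.8,2.4,7.32,6.32
SigY,2.4,5.16,6.99,0.98,9.6
SigZ,1.02,0.00,2.23,0.01,0.20
"""

# A tiny "query library": each entry maps a name to a function that
# takes a DataFrame and returns a DataFrame. (Hypothetical structure.)
QUERIES = {
    "sort_run3_desc_run2_asc": lambda df: df.sort_values(
        by=["Run3", "Run2"], ascending=[False, True]
    ),
    "mean_above_half": lambda df: df[df.mean(axis=1) > 0.5],
}


def run_query(name, df):
    """Look up a stored query by name and apply it to df."""
    return QUERIES[name](df)


structure = pd.read_csv(io.StringIO(STRUCTURE_CSV), index_col=0)
result = run_query("sort_run3_desc_run2_asc", structure)
print(result.index.tolist())
```

Keeping the queries in a module of their own would let every file in the lab reuse them with a single import.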

I'm not sure how you're defining the distance so I am unable to demonstrate the last part.
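
If a concrete definition helps to get started, here is a rough sketch under one assumed definition, which may not be what you intend: each Glob is reduced to its mean over the runs, its nearest neighbor is the Glob with the closest mean in the same file, and a Glob is flagged when that gap is at least 3. The function name nearest_neighbor_gap and the threshold handling are illustrative only.

```python
import io

import pandas as pd

# Sample Efficiency rows copied from the question.
EFFICIENCY_CSV = """Glob,Run1,Run2,Run3,Run4,Run5
SigX,6.2,4.8,2.4,4.32,5.59
SigY,8.44,8.16,5.99,0.98,9.6
SigZ,0.00,0.00,0.00,0.01,0.20
"""


def nearest_neighbor_gap(df):
    """Gap between each row's mean and the closest other row's mean.

    ASSUMPTION: "distance" is the absolute difference of row means.
    After sorting, the closest other mean is always an adjacent entry.
    """
    means = df.mean(axis=1).sort_values()
    gap_to_previous = means.diff().abs()
    gap_to_next = means.diff(-1).abs()
    # min() skips the NaN at either end of the sorted list.
    return pd.concat([gap_to_previous, gap_to_next], axis=1).min(axis=1)


efficiency = pd.read_csv(io.StringIO(EFFICIENCY_CSV), index_col=0)
gaps = nearest_neighbor_gap(efficiency)

# Assumption: "differ by 3" means a gap of at least 3.
isolated = gaps[gaps >= 3].index.tolist()
print(isolated)
```

The same function can be applied to the Structure file, and the resulting gaps can be charted with the plot method on the returned Series if matplotlib is installed.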

OTHER TIPS

Check out the R programming language, specifically the dplyr and ggplot2 packages.

Try QlikView. It's not open source, but it is free as in beer, does everything you described with awesome speed, and is very convenient.

Licensed under: CC-BY-SA with attribution
