Creating queries on the fly and general manipulation for a dataset of half a million data records
16-10-2019
Question
What would one use to manipulate data of the kind below?
a) The data are bio-markers of different Globs:
- GlobA 3 4 5 ....
- GlobB 2 1 1 ....
- GlobC 3 2 1 ....
b) Manipulations are queries like:
- Show me the Globs whose average in the Efficiency file is greater than 50%.
- Show me a sorted Glob list where the first and second sort criteria are xx and yy.
- Construct a chart of Globs that differ in criterion x by the integer 3 (for example, show me Globs whose averages over 5 runs for Structure and Efficiency differ from those of their nearest neighbors by 3).
Currently, this data is stored in a 100MB Excel file that is painfully slow to load, even on the fastest computer our lab can afford.
Ideally, there would be an open source program that accepts CSV files of this data and lets the user construct queries that can be stored in a library and pulled up easily. Charting abilities would be great too.
Here are 2 sample files (the real data would be around 40 such files, each containing 20K rows):
Efficiency File:
Glob,Run1,Run2,Run3,Run4,Run5
SigX,6.2,4.8,2.4,4.32,5.59
SigY,8.44,8.16,5.99,0.98,9.6
SigZ,0.00,0.00,0.00,0.01,0.20
Structure File:
Glob,Run1,Run2,Run3,Run4,Run5
SigX,3.2,3.8,2.4,7.32,6.32
SigY,2.4,5.16,6.99,0.98,9.6
SigZ,1.02,0.00,2.23,0.01,0.20
Solution
You can do this in pandas, since your data set is small enough to fit in memory. For "big" data that does not fit in memory you would want to use a database; PostgreSQL with the PostGIS extension would be ideal, since it handles the nearest-neighbor part, which is the most challenging aspect. Here are some sample queries in Python.
Show me the Globs where average of each Glob in file Efficiency is greater than 50%
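import pandas
# The first column (Glob) becomes the index, leaving only the run columns.
efficiency = pandas.read_csv('efficiency.csv', sep=',', index_col=0)
structure = pandas.read_csv('structure.csv', sep=',', index_col=0)
# Keep the Globs whose mean across runs exceeds 0.5 (50% expressed as a fraction).
efficiency[efficiency.mean(axis=1) > 0.5]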
Sort the Structure list descending on Run3, then ascending on Run2.
structure.sort_values(by=["Run3", "Run2"], ascending=[False, True])
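The question also asks for queries that can be stored in a library and pulled up later. One possible way to do that in plain Python is to wrap each query in a function and keep the functions in a dictionary keyed by name. The sketch below is only an illustration: `mean_above`, `sorted_by` and `QUERIES` are hypothetical names, not part of pandas, and the sample data from the question is inlined with `io.StringIO` so the snippet runs without any files on disk.

```python
import io

import pandas as pd

# Sample data copied from the question's Efficiency file.
EFFICIENCY_CSV = """Glob,Run1,Run2,Run3,Run4,Run5
SigX,6.2,4.8,2.4,4.32,5.59
SigY,8.44,8.16,5.99,0.98,9.6
SigZ,0.00,0.00,0.00,0.01,0.20
"""

efficiency = pd.read_csv(io.StringIO(EFFICIENCY_CSV), index_col=0)


def mean_above(df, threshold):
    """Return the rows whose mean across all run columns exceeds `threshold`."""
    return df[df.mean(axis=1) > threshold]


def sorted_by(df, columns, ascending):
    """Return the rows sorted by several columns, each with its own direction."""
    return df.sort_values(by=columns, ascending=ascending)


# Hypothetical query library: each name maps to a function of one DataFrame.
QUERIES = {
    "mean_above_half": lambda df: mean_above(df, 0.5),
    "run3_desc_run2_asc": lambda df: sorted_by(df, ["Run3", "Run2"], [False, True]),
}

# Look a stored query up by name and apply it to any table with the same layout.
high = QUERIES["mean_above_half"](efficiency)
ordered = QUERIES["run3_desc_run2_asc"](efficiency)
print(list(high.index))
print(list(ordered.index))
```

Because every stored query takes a DataFrame and returns a DataFrame, the same entry can be applied to the Efficiency table, the Structure table, or any of the other files that share the Glob/Run layout.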
I'm not sure how you're defining the distance so I am unable to demonstrate the last part.
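That said, here is a minimal sketch under one assumed definition, which may not be the one you have in mind: each Glob is treated as a point (mean Efficiency, mean Structure) over its five runs, the distance is Euclidean, and a Glob is selected when its nearest neighbor is at least 3 away. The names `points`, `nearest` and `isolated` and the threshold of 3 are illustrative assumptions, and the question's sample data is inlined so the snippet runs on its own.

```python
import io

import numpy as np
import pandas as pd

# Sample data copied from the question.
EFFICIENCY_CSV = """Glob,Run1,Run2,Run3,Run4,Run5
SigX,6.2,4.8,2.4,4.32,5.59
SigY,8.44,8.16,5.99,0.98,9.6
SigZ,0.00,0.00,0.00,0.01,0.20
"""
STRUCTURE_CSV = """Glob,Run1,Run2,Run3,Run4,Run5
SigX,3.2,3.8,2.4,7.32,6.32
SigY,2.4,5.16,6.99,0.98,9.6
SigZ,1.02,0.00,2.23,0.01,0.20
"""

efficiency = pd.read_csv(io.StringIO(EFFICIENCY_CSV), index_col=0)
structure = pd.read_csv(io.StringIO(STRUCTURE_CSV), index_col=0)

# Assumption: one point per Glob, (mean Efficiency, mean Structure) over the runs.
points = pd.DataFrame({
    "efficiency": efficiency.mean(axis=1),
    "structure": structure.mean(axis=1),
}).dropna()

coords = points.to_numpy()

# Pairwise Euclidean distances between all Globs (an n-by-n matrix).
diff = coords[:, None, :] - coords[None, :, :]
dist = np.sqrt((diff ** 2).sum(axis=2))

# A Glob must not count as its own nearest neighbor.
np.fill_diagonal(dist, np.inf)

nearest = pd.DataFrame({
    "neighbor": points.index.to_numpy()[dist.argmin(axis=1)],
    "distance": dist.min(axis=1),
}, index=points.index)

# Assumption: "differ by 3" means the nearest neighbor is at least 3 away.
isolated = nearest[nearest["distance"] >= 3]
print(nearest)
print(list(isolated.index))
```

The full pairwise matrix is fine for a handful of rows but grows quadratically, so for 20K Globs a spatial index such as `scipy.spatial.cKDTree`, or the PostGIS route mentioned above, would probably be a better fit. The resulting `isolated` table could then be passed to whatever charting tool you prefer.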