Accessing large data sets and/or storing them
30-04-2021
Question
At the moment I am dealing with large float/double datasets used for calculation. I have a set of files and need to compare Data A to Data B by computing the Euclidean distance / cosine similarity, i.e. each point in Data A iterates through the points in Data B to find its nearest neighbour.
The data is given in a text file - no issues with that. What would be an ideal way to go about storing/reading the information?
I would have to iterate over all of Data B for every point in Data A. The data is to be stored as floats. Each data point may have multiple dimensions. A file may contain up to about 2 million floats.
Should I go about using:
- Constantly reading Data B's file and parsing the string (I feel that this is highly inefficient)
- Storing the data in a List (an array of floats)
- Using memory-mapped I/O?
- A HashMap? (I am relatively new to HashMap; I have read that the iteration order of the collection may change over time. If I am just iterating through it with no modifications, will the order change?)
Solution
2M floats is not that much at all; it will be perfectly fine to put them all in a list. One list for A, one for B. If A and B are multidimensional, float[][] is just fine. If you find you are running out of memory, load all of B first, but read A one data point at a time.
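A minimal sketch of this approach: load both sets into float[][] arrays and run a brute-force nearest-neighbour scan of B for each point in A. The file parser assumes one whitespace-separated point per line, which may differ from your actual format; the class and method names are illustrative.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

public class NearestNeighbour {

    // Parse a text file into a float[][], one point per line,
    // dimensions separated by whitespace (format is an assumption).
    static float[][] load(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file);
        float[][] data = new float[lines.size()][];
        for (int i = 0; i < lines.size(); i++) {
            String[] tokens = lines.get(i).trim().split("\\s+");
            data[i] = new float[tokens.length];
            for (int j = 0; j < tokens.length; j++) {
                data[i][j] = Float.parseFloat(tokens[j]);
            }
        }
        return data;
    }

    // Squared Euclidean distance; the sqrt can be skipped
    // because it does not change which neighbour is nearest.
    static float squaredDistance(float[] p, float[] q) {
        float sum = 0f;
        for (int d = 0; d < p.length; d++) {
            float diff = p[d] - q[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the point in b nearest to the query point a.
    static int nearest(float[] a, float[][] b) {
        int best = -1;
        float bestDist = Float.POSITIVE_INFINITY;
        for (int i = 0; i < b.length; i++) {
            float dist = squaredDistance(a, b[i]);
            if (dist < bestDist) {
                bestDist = dist;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Tiny in-memory example instead of real files.
        float[][] a = {{0f, 0f}, {5f, 5f}};
        float[][] b = {{1f, 1f}, {4f, 6f}};
        for (float[] point : a) {
            System.out.println(nearest(point, b));
        }
    }
}
```

If memory becomes tight, keep only B resident and stream A line by line through the same nearest() call instead of loading it fully.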
Other tips
The basic solution is the best one: just a float[][]. That's almost certainly the most memory-efficient and the fastest solution, and very simple.
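Since the question also mentions cosine similarity, here is a hedged sketch of that metric over the same float[] rows; unlike a distance, higher values mean the vectors are more alike, so a nearest-neighbour loop would keep the maximum rather than the minimum. The class name is illustrative.

```java
public class CosineSimilarity {

    // Cosine similarity between two equal-length vectors:
    // dot(p, q) / (|p| * |q|), in the range [-1, 1].
    static double cosine(float[] p, float[] q) {
        double dot = 0, normP = 0, normQ = 0;
        for (int d = 0; d < p.length; d++) {
            dot += p[d] * q[d];
            normP += p[d] * p[d];
            normQ += q[d] * q[d];
        }
        return dot / (Math.sqrt(normP) * Math.sqrt(normQ));
    }

    public static void main(String[] args) {
        float[] u = {1f, 0f};
        float[] v = {0f, 1f};
        System.out.println(cosine(u, u)); // identical vectors -> 1.0
        System.out.println(cosine(u, v)); // orthogonal vectors -> 0.0
    }
}
```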