Question

At the moment I am dealing with large float/double datasets used for calculation. I have a set of files where I compare Data A to Data B, and I would like to compute the Euclidean distance / cosine similarity, i.e. each point in Data A iterates through the points in Data B to find its nearest neighbour.

The data is given in a text file - no issues with that. What would be an ideal way to go about storing/reading the information?

I would have to iterate over all of Data B for each point in Data A. The data is to be stored as floats. Each data point may have multiple dimensions. A file may contain up to about 2 million floats.

Should I go about using:

  1. Constantly re-reading Data B's file and parsing the strings (I feel this is highly inefficient)
  2. Storing the data in a List (an array of floats)
  3. Using memory-mapped I/O?
  4. A HashMap (I am relatively new to HashMap; they say the positions in the collection may change over time. If I am just iterating through with no modifications, will the positions change?)

Solution

2M floats is not that much at all; it will be perfectly fine to put them all in a list. One list for A, one for B. If A and B are multidimensional, a float[][] is just fine. If you find you are running out of memory, load all of B up front and stream A one data point at a time.
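A minimal loading sketch, assuming one data point per line with whitespace-separated values (the class name `DataLoader` and the exact file layout are assumptions; adjust the parsing to your actual format):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class DataLoader {

    // Reads a text file where each line is one data point:
    // whitespace-separated float values, one per dimension.
    static float[][] load(String path) throws IOException {
        List<float[]> points = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;   // skip blank lines
                String[] tokens = line.split("\\s+");
                float[] point = new float[tokens.length];
                for (int i = 0; i < tokens.length; i++) {
                    point[i] = Float.parseFloat(tokens[i]);
                }
                points.add(point);
            }
        }
        return points.toArray(new float[0][]);
    }
}
```

Reading each file once like this is far cheaper than re-parsing Data B's file for every point in A.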

OTHER TIPS

The basic solution is the best one: just a float[][]. That's almost certainly the most memory-efficient and the fastest solution, and very simple.
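For illustration, a brute-force nearest-neighbour scan over two float[][] arrays might look like the sketch below (the file names and the `DataLoader` helper from above are assumptions). It compares squared Euclidean distances; since the square root is monotonic, skipping it does not change which neighbour is nearest:

```java
import java.io.IOException;

public class NearestNeighbour {

    // For one query point from A, scan every point in B and return the
    // index of the closest one by (squared) Euclidean distance.
    static int nearest(float[] query, float[][] b) {
        int best = -1;
        float bestDist = Float.MAX_VALUE;
        for (int j = 0; j < b.length; j++) {
            float dist = 0f;
            for (int d = 0; d < query.length; d++) {
                float diff = query[d] - b[j][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = j;
            }
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file names; both datasets fit in memory as noted above.
        float[][] a = DataLoader.load("dataA.txt");
        float[][] b = DataLoader.load("dataB.txt");
        for (int i = 0; i < a.length; i++) {
            System.out.println("A[" + i + "] -> B[" + nearest(a[i], b) + "]");
        }
    }
}
```

The scan costs O(|A| * |B| * dimensions), which is entirely practical at this scale.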
