Question

At the moment I am dealing with large float/double datasets used for calculation. I have a set of files where I compare Data A to Data B, and I would like to compute the Euclidean distance / cosine similarity, i.e. each point in Data A iterates through the points in Data B to find its nearest neighbour.

The data is given in a text file - no issues with that. What would be an ideal way to go about storing/reading the information?

I would have to iterate over all of Data B for each point in Data A. The data is to be stored as floats. Each data point may have multiple dimensions. A file may contain up to about 2 million floats.

Should I go about using:

  1. Constantly re-reading Data B's file and parsing the strings (I feel this is highly inefficient)
  2. Storing the data in a List (an array of floats)
  3. Using memory-mapped I/O?
  4. A HashMap (I am relatively new to HashMap; they say the positions in the collection may change over time. If I am just iterating through with no modifications, will the positions change?)

Solution

2M floats is not that much at all; it will be perfectly fine to put them all in a list. One list for A, one for B. If A and B are multidimensional, a float[][] is just fine. If you find you are running out of memory, load all of B up front and stream A one data point at a time.
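A minimal loading sketch, assuming one data point per line with whitespace-separated values (the class name `DataLoader` and the exact file layout are assumptions; adjust the parsing to your actual format):

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class DataLoader {

    // Reads a text file where each line is one data point:
    // whitespace-separated float values, one per dimension.
    static float[][] load(String path) throws IOException {
        List<float[]> points = new ArrayList<>();
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;   // skip blank lines
                String[] tokens = line.split("\\s+");
                float[] point = new float[tokens.length];
                for (int i = 0; i < tokens.length; i++) {
                    point[i] = Float.parseFloat(tokens[i]);
                }
                points.add(point);
            }
        }
        return points.toArray(new float[0][]);
    }
}
```

Reading each file once like this is far cheaper than re-parsing Data B's file for every point in A.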

OTHER TIPS

The basic solution is the best one: just a float[][]. That's almost certainly the most memory-efficient and the fastest solution, and very simple.
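For illustration, a brute-force nearest-neighbour scan over two float[][] arrays might look like the sketch below (the file names and the `DataLoader` helper from above are assumptions). It compares squared Euclidean distances; since the square root is monotonic, skipping it does not change which neighbour is nearest:

```java
import java.io.IOException;

public class NearestNeighbour {

    // For one query point from A, scan every point in B and return the
    // index of the closest one by (squared) Euclidean distance.
    static int nearest(float[] query, float[][] b) {
        int best = -1;
        float bestDist = Float.MAX_VALUE;
        for (int j = 0; j < b.length; j++) {
            float dist = 0f;
            for (int d = 0; d < query.length; d++) {
                float diff = query[d] - b[j][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = j;
            }
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical file names; both datasets fit in memory as noted above.
        float[][] a = DataLoader.load("dataA.txt");
        float[][] b = DataLoader.load("dataB.txt");
        for (int i = 0; i < a.length; i++) {
            System.out.println("A[" + i + "] -> B[" + nearest(a[i], b) + "]");
        }
    }
}
```

The scan costs O(|A| * |B| * dimensions), which is entirely practical at this scale.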
