Question

At the moment I am dealing with large float/double datasets used for calculation. I have a set of files in which Data A is compared to Data B, and I would like to compute the Euclidean distance / cosine similarity between them, i.e. each point in Data A iterates through all of Data B's points to find its nearest neighbour.

The data is given in a text file; that part is no problem. What would be an ideal way to store and read the information?

I would have to scan all of Data B once for every point in Data A. The data is to be stored as floats. Each data point may have multiple dimensions, and a file may contain up to about 2 million floats.

Should I go about using:

  1. Repeatedly reading Data B's file and parsing the strings (I feel that this is highly inefficient)
  2. Storing the data in a List (or an array of floats)
  3. Using memory-mapped IO? (see the sketch after this list)
  4. A HashMap? (I am relatively new to HashMap; I have read that the iteration order of the collection may change over time. If I am just iterating through it with no modifications, will the order change?)
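On option 3: a memory map exposes raw bytes, so it mainly pays off if the text data is converted once into a binary file of raw floats; after that, later runs can map the file and view it as a FloatBuffer with no parsing at all. A minimal sketch, assuming a hypothetical dataB.bin written in DataOutputStream's big-endian format:

```java
import java.io.BufferedOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedFloats {

    // One-off conversion: write floats as raw big-endian bytes (DataOutputStream's format).
    static void writeBinary(float[] values, Path out) throws IOException {
        try (DataOutputStream dos = new DataOutputStream(
                new BufferedOutputStream(Files.newOutputStream(out)))) {
            for (float v : values) {
                dos.writeFloat(v);
            }
        }
    }

    // Map the binary file into memory and view it as floats; no parsing, no copying.
    static FloatBuffer mapFloats(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size())
                     .order(ByteOrder.BIG_ENDIAN) // must match how the file was written
                     .asFloatBuffer();
        }
    }

    public static void main(String[] args) throws IOException {
        Path bin = Path.of("dataB.bin"); // hypothetical converted file
        writeBinary(new float[] {1.0f, 2.5f, 3.25f}, bin);
        FloatBuffer floats = mapFloats(bin);
        System.out.println("first float: " + floats.get(0));
    }
}
```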

The solution

2M floats is not that much at all; it will be perfectly fine to put them all in a list, one list for A and one for B. If A and B are multidimensional, a float[][] is just fine. If you find you are running out of memory, try loading the whole of B first and then processing A one data point at a time.
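A minimal sketch of that approach, assuming one whitespace-separated point per line (the file names and format are assumptions, not taken from the question):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

public class NearestNeighbour {

    // Parse a text file with one data point per line, values separated by whitespace.
    static float[][] load(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file).stream()
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .collect(Collectors.toList());
        float[][] points = new float[lines.size()][];
        for (int i = 0; i < lines.size(); i++) {
            String[] tokens = lines.get(i).split("\\s+");
            points[i] = new float[tokens.length];
            for (int j = 0; j < tokens.length; j++) {
                points[i][j] = Float.parseFloat(tokens[j]);
            }
        }
        return points;
    }

    // Squared Euclidean distance; skipping the sqrt does not change which point is nearest.
    static float distSq(float[] p, float[] q) {
        float sum = 0f;
        for (int d = 0; d < p.length; d++) {
            float diff = p[d] - q[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Brute-force scan: for one point of A, return the index of the closest point in B.
    static int nearest(float[] a, float[][] b) {
        int best = -1;
        float bestDist = Float.POSITIVE_INFINITY;
        for (int i = 0; i < b.length; i++) {
            float d = distSq(a, b[i]);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) throws IOException {
        // File names are placeholders for the asker's actual data files.
        float[][] a = load(Path.of("dataA.txt"));
        float[][] b = load(Path.of("dataB.txt"));
        for (float[] point : a) {
            System.out.println("nearest in B: index " + nearest(point, b));
        }
    }
}
```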

Other tips

The basic solution is the best one: just a float[][]. That's almost certainly the most memory-efficient and the fastest solution, and very simple.
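Since the question also asks about cosine similarity, here is a minimal sketch of that measure over the same float[] rows; the class and method names are illustrative:

```java
public class CosineSimilarity {

    // Cosine similarity: dot(p, q) / (|p| * |q|), accumulated in double for accuracy.
    static double cosine(float[] p, float[] q) {
        double dot = 0, normP = 0, normQ = 0;
        for (int d = 0; d < p.length; d++) {
            dot += (double) p[d] * q[d];
            normP += (double) p[d] * p[d];
            normQ += (double) q[d] * q[d];
        }
        return dot / (Math.sqrt(normP) * Math.sqrt(normQ));
    }
}
```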

Licensed under: CC-BY-SA with attribution
Not affiliated with Stack Overflow