Question

At the moment I am dealing with large float/double datasets used for calculation. I have a set of files in which Data A has to be compared to Data B, and I would like to compute the Euclidean distance / cosine similarity; i.e., each point in Data A iterates through the points in Data B to find its nearest neighbour.

The data is given in a text file, which poses no issues. What would be an ideal way to store and read the information?

I would have to iterate over all of Data B for each point in Data A. The data is to be stored as floats. Each data point may have multiple dimensions, and a file may contain up to about 2 million floats.

Should I go about using:

  1. Constantly re-reading Data B's file and parsing the strings? (I feel this is highly inefficient.)
  2. Storing the data in a List (an array of floats)?
  3. Memory-mapped I/O? (See the sketch after this list.)
  4. A HashMap? (I am relatively new to HashMaps; I have read that the positions of elements in the collection may change over time. If I am just iterating with no modifications, will the positions change?)
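
For reference, option 3 is only natural when the floats sit in a binary file; with a text file you still have to parse strings, so mapping buys little. A minimal sketch under that binary-file assumption (the class and method names are hypothetical):

```java
import java.io.IOException;
import java.nio.FloatBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class MappedRead {
    // Maps the whole file and bulk-copies it into a float[].
    // Assumes the file's byte order matches the buffer's default
    // (big-endian); call buf.order(...) first if it does not.
    static float[] readAll(String path) throws IOException {
        try (FileChannel ch = FileChannel.open(Paths.get(path), StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            FloatBuffer floats = buf.asFloatBuffer();
            float[] out = new float[floats.remaining()];
            floats.get(out);
            return out;
        }
    }
}
```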

Solution

2M floats is not that much at all; it will be perfectly fine to put them all in a list. One list for A, one for B. If A and B are multidimensional, a float[][] is just fine. If you find you are running out of memory, try loading the whole of B first and then one data point from A at a time.
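
A minimal loading sketch, assuming one data point per line with whitespace-separated values (the class name and the delimiter are assumptions; adjust the split pattern to the actual file format):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class DataLoader {
    // Reads a text file with one data point per line into a float[][].
    static float[][] load(String path) throws IOException {
        List<String> lines = Files.readAllLines(Paths.get(path));
        float[][] data = new float[lines.size()][];
        for (int i = 0; i < lines.size(); i++) {
            String[] tokens = lines.get(i).trim().split("\\s+");
            data[i] = new float[tokens.length];
            for (int j = 0; j < tokens.length; j++) {
                data[i][j] = Float.parseFloat(tokens[j]);
            }
        }
        return data;
    }
}
```

If memory gets tight, keep B loaded this way and read A line by line with a BufferedReader instead of materialising both arrays.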

Other tips

The basic solution is the best one: just a float[][]. That is almost certainly the most memory-efficient and the fastest option, and it is very simple.
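
To make that concrete, a brute-force nearest-neighbour scan over two float[][] arrays might look like the sketch below (method names are mine; comparing squared Euclidean distances preserves the ordering and skips millions of Math.sqrt calls):

```java
public class NearestNeighbour {
    // Squared Euclidean distance between two points of equal dimension.
    static float squaredDistance(float[] p, float[] q) {
        float sum = 0f;
        for (int i = 0; i < p.length; i++) {
            float d = p[i] - q[i];
            sum += d * d;
        }
        return sum;
    }

    // For each point in a, scans all of b and records the index of
    // the closest point.
    static int[] nearest(float[][] a, float[][] b) {
        int[] result = new int[a.length];
        for (int i = 0; i < a.length; i++) {
            float best = Float.MAX_VALUE;
            for (int j = 0; j < b.length; j++) {
                float d = squaredDistance(a[i], b[j]);
                if (d < best) {
                    best = d;
                    result[i] = j;
                }
            }
        }
        return result;
    }
}
```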
