Question

I have a space-separated data file with 4.5 million entries in the following format:

CO_1 A 0 0 0 0 1

CO_2 A 0 0 0 0 1

This data file is used as input to the Self-Organizing Map (SOM) algorithm, which iterates through the file 100 times (in my case).

I use the following readFile function to copy the entire file into the temp string and pass the string on to the SOM algorithm.

public String readFile()
{
    String temp = "";

    try
    {
        FileReader file = new FileReader(FILE_LOCATION);
        BufferedReader BR = new BufferedReader(file);
        String strLine = null;

        while((strLine = BR.readLine()) != null)
        {
            // Repeated String concatenation: every pass copies the
            // whole accumulated string so far.
            temp += strLine + "\n";
        }
        BR.close();
    }
    catch(Exception e)
    {
        e.printStackTrace();
    }

    return temp;
}

However, I feel the above method puts a heavy burden on memory and slows down the iterations, which could result in memory overruns. Currently I'm running this code on a cluster with a 30 GB memory allocation, and the execution has not completed even a single iteration in about 36 hours.

I cannot read the file partially (as in blocks of lines), since the SOM would have to poll for data once the initial block is done, which could lead to even further complications.

Any ideas how I could improve this so I can successfully iterate over 4.5 million entries 100 times?

EDIT

The whole file is read into the string using the above method only once. Then the string variable is used throughout the 100 iterations. However, on every iteration a string tokenizer is used to process each line, so the tokenizing work is repeated (lines in the file × number of iterations).

Was it helpful?

Solution

Never use repeated string concatenation for this kind of task.
Use the StringBuffer class instead of String.
Consider the following example:

public StringBuffer readFile()
{
    StringBuffer tempSB = new StringBuffer();

    try
    {
        FileReader file = new FileReader(FILE_LOCATION);
        BufferedReader BR = new BufferedReader(file);
        String strLine = null;

        while((strLine = BR.readLine()) != null)
        {
            // append() writes into the buffer's internal array instead
            // of allocating a new String on every line.
            tempSB.append(strLine);
            tempSB.append("\n");
        }
        BR.close();
    }
    catch(Exception e)
    {
        e.printStackTrace();
    }

    return tempSB;
}

This avoids allocating a new String for every line (repeated concatenation is quadratic in the total length) and will save your heap memory.
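
A small refinement of my own, not part of the original answer: since the method is single-threaded, StringBuilder works just as well and skips StringBuffer's synchronization, and presizing the buffer near the expected file size avoids repeated internal array copies. The capacity below assumes the roughly 100 MB file size estimated in another answer:

// StringBuilder has no per-call synchronization; the initial capacity
// (~110 MB here, an assumed file size) avoids repeated regrowth.
StringBuilder tempSB = new StringBuilder(110 * 1024 * 1024);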

Other suggestions

I'd like to complement the other answers. Even though I think you should store your data in a more efficient data structure than just a string, there might be another reason your code is slow.

Since your file size seems to be around 100 MB, your code might be slowing down because Eclipse has not allocated enough heap space for it. Try adding the following flag:

-Xmx4G

This will give your code 4 GB of heap space to work with. To do this, in Eclipse go to:

// Run -> Run Configurations -> <Select your main class on the left>
// -> <Select the 'Arguments' tab>
// -> <Add the string "-Xmx4G" to the 'VM arguments' text area>

This might speed it up!
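
The question mentions running on a cluster rather than in Eclipse; in that case the same flag can be passed directly to the java command. A hypothetical invocation (the jar name is just a placeholder):

java -Xmx4G -jar som-training.jar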

Reading a file with String += is very expensive. I suggest you parse the entries into a data structure; that should take about 1-10 seconds. Iterating over that structure once should then take less than a second. 4.5 million entries at, say, 110 bytes per entry come to about 0.5 GB, perhaps 1 GB for a more complex structure, which shouldn't be enough to worry about.
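As an illustration, here is a minimal sketch of that approach, assuming the line format shown in the question (an ID, a letter, then five numeric columns); the class and field names are my own, not from the original code:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class DataSet
{
    // One float[] per entry: the numeric columns, parsed exactly once.
    private final List<float[]> rows = new ArrayList<>();

    public void load(String fileLocation) throws IOException
    {
        try (BufferedReader br = new BufferedReader(new FileReader(fileLocation)))
        {
            String line;
            while ((line = br.readLine()) != null)
            {
                String[] tokens = line.trim().split("\\s+");
                // tokens[0] is the ID (e.g. "CO_1"), tokens[1] the letter;
                // the remaining columns are the numeric vector.
                float[] vector = new float[tokens.length - 2];
                for (int i = 0; i < vector.length; i++)
                {
                    vector[i] = Float.parseFloat(tokens[i + 2]);
                }
                rows.add(vector);
            }
        }
    }

    public List<float[]> getRows() { return rows; }
}

The 100 SOM passes can then loop over getRows() directly, so each line is tokenized once in total rather than once per iteration.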

If you need to parse the serial text file and also be able to read it randomly, use persistent storage, such as a SQL database, a NoSQL one, or even the Lucene search engine. This will give you benefits like:

  • you don't have to load the whole file into RAM
  • you can use stream processing: read the file line by line and keep only the current line in RAM (see the sketch after this list)
  • parsing and persisting the source files costs a bit more time, but random access afterwards is way faster
  • you can even parse and read your data in several threads independently
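
A minimal sketch of the stream-processing idea from the list above; processLine is a placeholder for whatever per-entry parsing or persisting you do:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class StreamingReader
{
    public static void streamFile(String fileLocation) throws IOException
    {
        // try-with-resources closes the reader even if parsing fails.
        try (BufferedReader br = new BufferedReader(new FileReader(fileLocation)))
        {
            String line;
            while ((line = br.readLine()) != null)
            {
                processLine(line); // only the current line is held in RAM
            }
        }
    }

    private static void processLine(String line)
    {
        // Placeholder: parse the line and persist it (DB, index, etc.).
    }
}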