Question

I am trying to run a program that analyzes a bunch of text files containing numbers. The total size of the text files is ~12 MB; I take 1,000 doubles from each of 360 text files and put them into a vector. My problem is that I get about halfway through the list of text files and then my computer slows down until it isn't processing any more files. The program is not stuck in an infinite loop, but I think I have a problem with using too much memory. Is there a better way to store this data that won't use as much memory?

Other possibly relevant system information:

Running Linux

8 GB memory

CERN ROOT framework installed (I don't know how to reduce my memory footprint with this though)

Intel Xeon Quad Core Processor

If you need other information, I will update this list

EDIT: I ran top, and my program does use increasingly more memory; once it got above 80% I killed it. There's a lot of code, so I'll pick out the bits where memory is being allocated and such to share.

EDIT 2: My code:

void FileAnalysis::doWork(std::string opath, std::string oName)
{
    //Sets the output filepath and the name of the file to contain the results
    outpath = opath;
    outname = oName;
    //Reads the data source and writes it to a text file before pushing the filenames into a vector
    setInput();
    //Goes through the files queue and analyzes each file
    while(!files.empty())
    {
        //Puts all of the data points from the next file onto the points vector then deletes the file from the files queue
        readNext();
        //Places all of the min or max points into their respective vectors
        analyze();
        //Calculates the averages and the offset and pushes those into their respective vectors
        calcAvg();
    }
    makeGraph();
}

//Creates the vector of files to be read
void FileAnalysis::setInput()
{
    string sysCall = "", filepath="", temp;
    filepath = outpath+"filenames.txt";
    sysCall = "ls "+dataFolder+" > "+filepath;
    system(sysCall.c_str());
    ifstream allfiles(filepath.c_str());
    while (!allfiles.eof())
    {
        getline(allfiles, temp);
        files.push(temp);
    }
}
//Places the data from the next file into the points vector, then removes that filename from the files queue
void FileAnalysis::readNext()
{
    cout<<"Reading from "<<dataFolder<<files.front()<<endl;
    ifstream curfile((dataFolder+files.front()).c_str());
    string temp, temptodouble;
    double tempval;
    getline(curfile, temp);
    while (!curfile.eof())
    {
        if (temp.size()>0)
        {
            unsigned long pos = temp.find_first_of("\t");
            temptodouble = temp.substr(pos, pos);
            tempval = atof(temptodouble.c_str());
            points.push_back(tempval);
        }
        getline(curfile, temp);
    }
    setTime();
    files.pop();
}
//Sets the maxpoints and minpoints vectors from the points vector and adds the vectors to the allmax and allmin vectors
void FileAnalysis::analyze()
{
    for (unsigned int i = 1; i<points.size()-1; i++)
    {
        if (points[i]>points[i-1]&&points[i]>points[i+1])
        {
            maxpoints.push_back(points[i]);
        }
        if (points[i]<points[i-1]&&points[i]<points[i+1])
        {
            minpoints.push_back(points[i]);
        }
    }
    allmax.push_back(maxpoints);
    allmin.push_back(minpoints);
}
//Calculates the average max and min points from the maxpoints and minpoints vector and adds those averages to the avgmax and avgmin vectors, and adds the offset to the offset vector
void FileAnalysis::calcAvg()
{
    double maxtotal = 0, mintotal = 0;
    for (unsigned int i = 0; i<maxpoints.size(); i++)
    {
        maxtotal+=maxpoints[i];
    }
    for (unsigned int i = 0; i<minpoints.size(); i++)
    {
        mintotal+=minpoints[i];
    }
    avgmax.push_back(maxtotal/maxpoints.size());
    avgmin.push_back(mintotal/minpoints.size());
    offset.push_back((maxtotal+mintotal)/2);
}

EDIT 3: I added the code to reserve vector space and added code to close the files, but my memory still fills to 96% before the program stops...


Solution

This could be optimized endlessly, but my immediate reaction would be to use a container other than vector. Remember that a vector's storage is allocated contiguously in memory, which means that adding elements forces a reallocation and copy of the entire vector whenever there isn't enough spare capacity to hold the new elements.

Try a container with cheaper insertion at the end, such as std::deque or std::list.
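As a minimal sketch (assuming points is currently a std::vector<double> member of FileAnalysis; the example() wrapper is just there to make the snippet compile), the swap could look like this; push_back, operator[] and size() keep the same syntax:

#include <deque>

// Hypothetical replacement for the points member:
// std::deque grows in fixed-size blocks, so push_back never reallocates
// and copies the entire container the way std::vector can.
std::deque<double> points;

void example()
{
    points.push_back(3.14);    // same interface as before
    double first = points[0];  // random access still works
    (void)first;
}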

Alternatively, if a vector is required, you could try reserving the expected capacity up front to avoid repeated reallocation; see std::vector::reserve(). Note that the reserved capacity is in terms of elements, not bytes.

int numberOfItems = 1000;
int numberOfFiles = 360;

std::size_t totalExpectedSize = numberOfItems * numberOfFiles;
myVector.reserve( totalExpectedSize );

---------- EDIT FOLLOWING CODE POST ----------

My immediate concern would be the following logic in analyze():

for (unsigned int i = 1; i<points.size()-1; i++)
{
    if (points[i]>points[i-1]&&points[i]>points[i+1])
    {
        maxpoints.push_back(points[i]);
    }
    if (points[i]<points[i-1]&&points[i]<points[i+1])
    {
        minpoints.push_back(points[i]);
    }
}
allmax.push_back(maxpoints);
allmin.push_back(minpoints);

Specifically, my concern is the allmax and allmin containers, onto which you are pushing copies of the maxpoints and minpoints containers. The maxpoints and minpoints containers themselves can grow quite large with this logic, depending on the datasets.

You're incurring the cost of container copies several times. Is it really necessary to copy the minpoints/maxpoints containers into allmax/allmin? Without knowing a bit more, it's hard to optimize your storage design.

I don't see anywhere that minpoints and maxpoints are actually emptied, which means that over time they can grow very large, and their corresponding copies to the allmin/allmax containers will grow very large. Are minpoints/maxpoints supposed to represent the min/max points for just one file?

As an example, let's look at a simplified minpoints and allmin scenario (but keep in mind that this applies to max just as well, and both are on a larger scale than shown here). This is, obviously, a dataset engineered to show my point:

File 1: 2 1 2 1 2 1 2 1 2 1 2
minpoints: [1 1 1 1 1]
allmin:    [1 1 1 1 1]

File 2: 3 2 3 2 3 2 3 2 3 2 3
minpoints: [1 1 1 1 1 2 2 2 2 2]
allmin:    [1 1 1 1 1 1 1 1 1 1 2 2 2 2 2]

File 3: 4 3 4 3 4 3 4 3 4 3 4
minpoints: [1 1 1 1 1 2 2 2 2 2 3 3 3 3 3]
allmin:    [1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3]
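If the intent is for maxpoints and minpoints to hold only one file's extrema (an assumption on my part), a minimal sketch of a fix is to empty the working containers on each pass:

void FileAnalysis::analyze()
{
    // Assumed intent: start each file with empty per-file containers.
    maxpoints.clear();
    minpoints.clear();

    // i + 1 < size() also avoids the unsigned wrap-around that
    // points.size() - 1 causes when points is empty.
    for (unsigned int i = 1; i + 1 < points.size(); i++)
    {
        if (points[i] > points[i-1] && points[i] > points[i+1])
            maxpoints.push_back(points[i]);
        if (points[i] < points[i-1] && points[i] < points[i+1])
            minpoints.push_back(points[i]);
    }

    allmax.push_back(maxpoints);
    allmin.push_back(minpoints);

    // points is not emptied anywhere in the code shown either; if it is
    // only meant to hold the current file's data, clear it here too.
    points.clear();
}

With that change, minpoints and maxpoints (and their copies in allmin/allmax) stay proportional to a single file's data rather than to everything read so far.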

There are other optimizations and critiques to be made, but for now I'm limiting this to trying to solve your immediate problem. Can you post the makeGraph() function, as well as the definitions of all containers involved (points, minpoints, maxpoints, allmin, allmax)?

OTHER TIPS

A few things to try:

  1. Run top to see how much memory your program is actually using.
  2. Run a smaller version of the problem (e.g. read 10 floats from 1 file) under valgrind and check for memory leaks.
  3. Pre-allocate the required vector size (over-estimate) using reserve().
  4. Check that the memory usage is what you expect, i.e. that you're not leaking resources (do you fail to free any memory, or fail to close any files?).
  5. Try reserving the vector to the full size you need up front and see whether it allocates correctly.
  6. Do you need all the results in memory at once? Could you write them to a file instead? (A sketch follows this list.)
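On that last point, a small sketch of what writing results out could look like (the helper name, output path and field layout are my own invention, not part of the original code):

#include <fstream>
#include <string>

// Hypothetical helper: append one file's summary to disk so the per-file
// results never have to accumulate in memory.
void appendResult(const std::string& path,
                  double avgMax, double avgMin, double offsetValue)
{
    std::ofstream out(path.c_str(), std::ios::app);
    out << avgMax << '\t' << avgMin << '\t' << offsetValue << '\n';
}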

If necessary, you can try:

  • using a smaller datatype than double
  • using an array (if you're worried about overhead) instead of a vector
  • using a linked list of vectors, if you're worried about memory fragmentation

But that shouldn't be necessary (and could even be counterproductive), as I agree that what you're doing sounds like it should work.

Look at your code and the number of iterations. Your process might consume a lot of CPU if it runs that many iterations without sleeping or any event-based programming.

OR

Pre-allocate the number of elements for the vector so that it doesn't need to reallocate, though this may be overkill.

Since it is most likely your program that is consuming the CPU, run it in the background and use the top command to watch its CPU and memory usage.

You may be running into an issue with using eof() in your readNext() method. For example, see this SO question and sections 15.4/15.5 in the C++ FAQ. If that is indeed the issue, then fixing the read loop to check the return status of getline() should fix the problem.
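As a sketch, here is the read loop in readNext() restructured so that getline() itself controls the loop (variable names are taken from your code; the parsing body stays as it is):

std::ifstream curfile((dataFolder + files.front()).c_str());
std::string temp;

// Testing getline() directly means every successfully read line is
// processed (including a final line with no trailing newline), and the
// loop stops as soon as a read fails.
while (std::getline(curfile, temp))
{
    if (!temp.empty())
    {
        // ... existing parsing and points.push_back(tempval) ...
    }
}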

If it doesn't, I would start debugging to see where and how the program is "crashing". In a case like this I would probably start with simple logging via printf() to the console or a log file, outputting the current file and status every 1000 lines. Let it run a few times and check the log output for any obvious signs of trouble (e.g., it never gets past reading file #3).
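A minimal sketch of that kind of progress logging (the counter and the message format are placeholders, to be dropped into readNext()):

#include <cstdio>

// Hypothetical counter, declared at the top of readNext().
unsigned long lineCount = 0;

// Inside the read loop, after each successful getline():
++lineCount;
if (lineCount % 1000 == 0)
{
    std::printf("file %s: %lu lines read, %lu points stored\n",
                files.front().c_str(), lineCount,
                static_cast<unsigned long>(points.size()));
}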

If that is not enough to expose the problem, then add more detailed logging in the necessary spots and/or break into the debugger and start tracing (useful when your mental model of the code differs from the computer's; we often read what we think the code should be doing instead of what it actually says).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow