Memory issue when Reading HUGE csv file, STORE as Person objects, Write into multiple cleaner/smaller CSV files

StackOverflow https://stackoverflow.com/questions/21637293

Question

I have two text files with comma delimited values. One is 150MB and the other is 370MB, so these guys have three million+ rows of data.

One document holds information about, let's say soft drink preferences, and the next might have information about, let's say hair colors.

Example soft drinks data file, though in the real file the UniqueNames are NOT in order, nor are the dates:

"UniqueName","softDrinkBrand","year"
"001","diet pepsi","2004"
"001","diet coke","2006"
"001","diet pepsi","2004"
"002","diet pepsi","2005"
"003","coca cola","2004"

Essentially, there are too many lines of data to use Excel, so I want to create Person objects using a Person class to hold the data about each person.

Each Person object holds twenty array lists, two for each of ten years 2004-2013, e.g.,

...
private ArrayList<String> sodas2013= new ArrayList<String>();
private ArrayList<String> hairColors2013= new ArrayList<String>();
private ArrayList<String> sodas2014= new ArrayList<String>();
private ArrayList<String> hairColors2014= new ArrayList<String>();
...

I wrote a program to read the rows of a data file, one at a time, using a BufferedReader. For each row, I clean up the data (split on the commas, delete quote marks...), and then, if that particular uniqueID isn't in a Hashtable yet, I add it, as well as create a new Person object from my Person class, and then I store the data I want into the Person class' ArrayList as above. If the unique ID is already present, I just call a Person method to see if the soda, or hair color, is already in the array list for that particular year (as written in the csv file).
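The read-and-clean loop described above might look roughly like this (a minimal sketch only; the Person class, its field names, and the single hard-coded year are placeholders, not the asker's actual code):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Hashtable;
import java.util.List;

public class CsvLoadSketch {
    static class Person {
        final List<String> sodas2004 = new ArrayList<>();
    }

    static Hashtable<String, Person> load(BufferedReader reader) throws IOException {
        Hashtable<String, Person> people = new Hashtable<>();
        reader.readLine();                            // skip the header row
        String line;
        while ((line = reader.readLine()) != null) {
            String[] cols = line.split(",");
            String uid   = cols[0].replace("\"", ""); // strip quote marks
            String brand = cols[1].replace("\"", "");
            String year  = cols[2].replace("\"", "");
            Person p = people.computeIfAbsent(uid, k -> new Person());
            // only record the first occurrence of a brand per year
            if (year.equals("2004") && !p.sodas2004.contains(brand)) {
                p.sodas2004.add(brand);
            }
        }
        return people;
    }
}
```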

The goal is to output twenty different csv files in the end, one tying people to sodas drunk in each year, one to hair colors for that year. They would look like this...

2004 file using above example input file:

UID    pepsi    coca cola    diet pepsi    diet coke    etc
001    false    false    true    false    etc
002    false    false    false    false    etc
003    false    true    false    false    etc

Now, when I have test files of only like 100 lines each, this works beautifully. I save all the data in my Person objects, and then I use methods to match Hashtable uniqueNames to uniqueSoftDrinkNames by year stored in the Person objects to write files with rows of personID, then true/false for every possible soda that any uniqueID had tried in any year. The data looks like the above info.

So, I know the code works and does what I want it to. The problem, now is...

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOfRange(Unknown Source)
at java.lang.String.<init>(Unknown Source)
at java.lang.StringBuffer.toString(Unknown Source)
at java.util.regex.Matcher.appendReplacement(Unknown Source)
at java.util.regex.Matcher.replaceAll(Unknown Source)
at java.lang.String.replaceAll(Unknown Source)
at CleanDataFiles.main(CleanDataFiles.java:43)

Where line 43 is:

temp = temp.replaceAll("\"", "");

...which is just a simple point of getting rid of quote marks in a given substring after having split a line by the commas.
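As an aside on that line: replaceAll treats its first argument as a regex and runs a Matcher on every call, allocating intermediate buffers (which is exactly where the stack trace points). For a literal substitution like stripping quote marks, String.replace does the same job without regex interpretation, and on recent JDKs without the regex engine at all. That alone won't fix a heap that is already full, but it reduces per-line garbage in a tight loop:

```java
public class QuoteStrip {
    // String.replace performs a literal (non-regex) substitution;
    // on recent JDKs it bypasses the regex engine entirely, so a
    // tight parsing loop produces fewer temporary objects.
    static String stripQuotes(String s) {
        return s.replace("\"", "");
    }
}
```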

It takes about ten minutes of the computer running this program to reach this error, and both times I ran the program, it gave me the same error and the same line.

I'm reading the CSV document line by line, so I'm not storing huge amounts of data in a giant string or anything as I read the file. The only place I'm storing tons of data is in my Hashtables in my main class where I store personIDs and personObjects, and two more hashtables where I store all possible hair colors and all possible sodas, and in all of those person objects, each with twenty arraylists of all the soda and hair color info by year.

My supposition is that the memory issue is in storing these tens of thousands of unique person objects with all the data associated with them. That said, I got the error in the same place in a part of my program where I'm merely reading the csv file and cleaning up individual entries...

In any case, MY QUESTION (you were all waiting for this!)

Are there better ways to do this? Instead of tens of thousands or low hundreds of thousands of Person objects holding all this data... should I be creating tens of thousands of Person text files and opening and closing them each time I read a new line of the CSV file and query whether the information is duplicate or new, and if new, add it to the Person file? And then when all is said and done, open each person file to read the information, interpret, and then write it into my growing output file one line at a time, closing that person file, then opening the next one for the next line, etc.?

Or, HOPEFULLY, is there a sillier and easier to solve issue elsewhere in this whole mess do you think, in order to not run out of memory while cleaning up and organizing my data files for further analysis?

I appreciate any help or suggestions! Thank you.


Solution

Here are a couple of thoughts. First, it may be that you have plenty of memory free on your machine but are just not allocating enough for the JVM. Try something like this:

java -Xms2048M -Xmx4096M YourProgram

Of course, the values will depend on how much memory your machine has.

Also, why are you using an ArrayList of Strings in each Person object? If you can determine the possible sodas (or hair colors, or whatever) ahead of time, then you could use an array of ints instead; that should save some memory.
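For example, if you index the brand names once, each person-year can be a BitSet keyed by those small integer ids instead of an ArrayList holding the full strings (the class and method names here are illustrative, not from the asker's code):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class SodaIndex {
    private final Map<String, Integer> ids = new HashMap<>();

    // Map each distinct brand name to a small integer id,
    // assigned in order of first appearance.
    int idFor(String brand) {
        return ids.computeIfAbsent(brand, k -> ids.size());
    }

    // One BitSet per person-year: bit i set means soda i was tried.
    // Far cheaper than storing the brand strings themselves.
    BitSet newPersonYear() {
        return new BitSet();
    }
}
```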

Another option would be to process the data piecewise: first do sodas, and when you are done, do hair colors, et cetera.
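A sketch of one such per-attribute, per-year pass, assuming the rows have already been filtered down to (uid, brand) pairs for a single year and the full brand list is known up front (all names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class YearTable {
    // Build the "UID,pepsi,coca cola,..." true/false rows for one year,
    // given (uid, brand) pairs already filtered to that year.
    static List<String> rows(List<String[]> pairs, List<String> brands) {
        Map<String, Set<String>> byUid = new TreeMap<>(); // sorted by UID
        for (String[] p : pairs) {
            byUid.computeIfAbsent(p[0], k -> new HashSet<>()).add(p[1]);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Set<String>> e : byUid.entrySet()) {
            StringBuilder sb = new StringBuilder(e.getKey());
            for (String b : brands) {
                sb.append(',').append(e.getValue().contains(b));
            }
            out.add(sb.toString());
        }
        return out;
    }
}
```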

OTHER TIPS

I'd say that your problem calls for a relational database. You'll be able to:

  • Store data on disk
  • Query the data, joining specified attributes.

You might even use an embedded database (http://www.h2database.com/ --- this database is contained in a single jar file, so no external server program is needed).

You can try importing the data into a lightweight database and using SQL to query for the information you need.

You could replace your Hashtable with a java.util.Properties. You can write the contents to a file using store(OutputStream out, String comments). From the javadocs:
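A minimal sketch of that approach (the key/value layout here is just an illustration -- Properties holds flat string pairs, so each person's list would have to be encoded into a single value):

```java
import java.io.StringWriter;
import java.util.Properties;

public class PropsDemo {
    static String dump() throws Exception {
        Properties props = new Properties();
        // encode each person's soda list into one flat, delimited value
        props.setProperty("001", "diet pepsi|diet coke");
        StringWriter out = new StringWriter();
        props.store(out, "person -> sodas"); // comment line, timestamp, then entries
        return out.toString();
    }
}
```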

After the entries have been written, the output stream is flushed. The output stream remains open after this method returns.

Or you could try out a disk-backed HashMap like JDBM2. From its webpage:

JDBM2 was developed to support astronomical calculation with data which does not fit into memory. It also provides storage for astronomical planetarium Asterope.

One optimization for reducing memory usage would be, instead of storing each drink type as a String in the ArrayList, to store just an integer id for it. So you would replace each ArrayList of Strings with an ArrayList of integers, and keep the drink-name-to-id mapping in a separate HashMap. You might also look at the Trove library (http://trove.starlight-systems.com/) for primitive collections. Also, when you detect that the information for one person is complete, that person becomes a candidate to be flushed out to the output files and dropped from memory; you could simply mark that person as "done" in another HashMap.

But finally, a database is a better option for this problem. An embedded DB like JavaDB should suffice. An external cache such as Memcached or Redis could also be used.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow