Question

I am trying to write an application that will take a very large SQL text file (~60 GB, 257 million lines) and split each of its COPY statements into a separate text file.

However, the code I am currently using causes an OutOfMemoryError because the delimited tokens exceed what the Scanner buffer can hold. The first statement alone is going to be ~40 million lines long.

public static void readFileByDelimiter(String fileName, String requestType, String output) throws FileNotFoundException {

    // create file instance
    File file = new File(fileName);

    // create scanner instance
    Scanner scanner = new Scanner(file, "latin1");

    // set custom delimiter
    scanner.useDelimiter("COPY");

    int number = 0;
    System.out.println("Running......");
    while (scanner.hasNext()) {
        String line = scanner.next();
        if (line.length() > 20) {
            // save statements to separate SQL files
            PrintWriter out = new PrintWriter("statement" + number + ".sql");
            out.println("COPY" + line.trim());
            out.close();
        }
        number++;
    }

    scanner.close();
    System.out.println("Completed");
}

Please recommend whether this is the wrong approach altogether, or suggest alterations to the existing method.

Thanks


Solution

First, ask why you (or some other process) are creating a 60 GB file at all! It may be worth fixing that upstream process so it generates smaller SQL text files, rather than writing a new process to split them. If this is a one-time task, though, that may be fine. To address your question: for a file this large, I would use a BufferedReader to read and process the records one line at a time.

BufferedReader br = new BufferedReader(new FileReader(file));
String line;
while ((line = br.readLine()) != null) {
    // process the line and write it into your output file
}
br.close();
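
To make that concrete, here is a minimal sketch of how the line-by-line splitting could look. It assumes each COPY statement starts on a line beginning with "COPY" (an assumption about the dump format, not something stated in the question), and the input path and output file names are placeholders:

import java.io.*;
import java.nio.charset.StandardCharsets;

public class SqlSplitter {

    public static void main(String[] args) throws IOException {
        // placeholder input path; ISO-8859-1 matches the "latin1" encoding from the question
        BufferedReader br = new BufferedReader(new InputStreamReader(
                new FileInputStream("dump.sql"), StandardCharsets.ISO_8859_1));

        BufferedWriter out = null;
        int number = 0;
        String line;
        while ((line = br.readLine()) != null) {
            // assumption: every statement of interest starts on a line beginning with "COPY"
            if (line.startsWith("COPY")) {
                if (out != null) {
                    out.close(); // finish the previous statement's file
                }
                out = new BufferedWriter(new FileWriter("statement" + number + ".sql"));
                number++;
            }
            if (out != null) { // skip anything before the first COPY
                out.write(line);
                out.newLine();
            }
        }
        if (out != null) {
            out.close();
        }
        br.close();
    }
}

Because only one line is held in memory at a time, this never buffers a whole 40-million-line statement the way the Scanner token approach does.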

OTHER TIPS

Me personally: I use BufferedReader instead of Scanner. It also has a convenient readLine() method, and I've never had any performance issues with it. The only thing is that you need to check manually whether a read line is one you want to process, but that's usually as simple as applying the String class methods.

That's not an answer to your actual question, but I consider it a decent, easy-to-use alternative.

Try something like this (but prettier):

Scanner sc = new Scanner(new BufferedReader(new FileReader(file)));

This decorates the file reader with a BufferedReader, so the file is read from disk in chunks rather than loaded all at once, and you can use the Scanner in the same way. Note, though, that each delimited token still has to fit in memory when next() returns it.
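
A minimal usage sketch, reusing the delimiter from the question; the file name is a placeholder:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Scanner;

public class ScannerOnBufferedReader {
    public static void main(String[] args) throws IOException {
        // the BufferedReader keeps disk reads chunked; "dump.sql" is a placeholder path
        Scanner sc = new Scanner(new BufferedReader(new FileReader("dump.sql")));
        sc.useDelimiter("COPY");
        while (sc.hasNext()) {
            // each token must still fit in memory when next() returns it
            String token = sc.next();
            System.out.println("statement length: " + token.length());
        }
        sc.close();
    }
}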

Try to use a BufferedReader. Direct use of Scanner on a file or raw file stream would load the data into memory and won't flush it out on GC. The best approach is to use a BufferedReader, read one line at a time, and do manual string checks and splitting. If done correctly, this gives the GC enough opportunity to reclaim memory when needed.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow