Question

I have around 100 files in a folder. Each file contains data like this, with one user ID per line.

960904056
6624084
1096552020
750160020
1776024
211592064
1044872088
166720020
1098616092
551384052
113184096
136704072

I am trying to merge the files from that folder into a new big file until the total number of user IDs in that big file reaches 10 million.

I am able to read all the files from a particular folder and add the user IDs from those files to a LinkedHashSet. I was then planning to check whether the size of the set has reached 10 million and, if so, write all those user IDs to a new text file. Is that a feasible solution?

That 10 million number should be configurable. In future, if I need to change it from 10 million to 50 million, I should be able to do that.

Below is the code I have so far

public static void main(String args[]) {

    File folder = new File("C:\\userids-20130501");
    File[] listOfFiles = folder.listFiles();

    Set<String> userIdSet = new LinkedHashSet<String>();
    for (int i = 0; i < listOfFiles.length; i++) {
        File file = listOfFiles[i];
        if (file.isFile() && file.getName().endsWith(".txt")) {
            try {
                List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
                userIdSet.addAll(content);
                if(userIdSet.size() >= 10000000) { // 10 million
                    break;
                }
                System.out.println(userIdSet);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

Any help with this would be appreciated. Is there a better way to do the same thing?


Solution

Continuing from where we left off. ;)

You can use FileUtils to write the file with its writeLines() method.

Try this -

public static void main(String args[]) {

    File folder = new File("C:\\userids-20130501");
    // Configurable limit ("10Million" is not a valid Java literal)
    final int maxUserIds = 10000000;

    Set<String> userIdSet = new LinkedHashSet<String>();
    int count = 1;
    for (File file : folder.listFiles()) {
        if (file.isFile() && file.getName().endsWith(".txt")) {
            try {
                List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
                userIdSet.addAll(content);
                if (userIdSet.size() >= maxUserIds) {
                    // Write the collected IDs to the next numbered file and start over
                    File bigFile = new File("<path>" + count + ".txt");
                    FileUtils.writeLines(bigFile, userIdSet);
                    count++;
                    userIdSet = new LinkedHashSet<String>();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
    // Note: IDs still in userIdSet at this point (fewer than maxUserIds) are not written
}

If the purpose of saving the data in the LinkedHashSet is just to write it out again to another file, then I have another solution. Note that, unlike the set, it does not remove duplicate IDs.

EDIT: to avoid an OutOfMemoryError

public static void main(String args[]) {
    File folder = new File("C:\\userids-20130501");
    final int maxUserIds = 10000000; // configurable limit per big file

    int fileNameCount = 1;
    int contentCounter = 0;
    File bigFile = new File("<path>" + fileNameCount + ".txt");
    for (File file : folder.listFiles()) {
        if (file.isFile() && file.getName().endsWith(".txt")) {
            try {
                List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
                contentCounter += content.size();
                if (contentCounter < maxUserIds) {
                    // Append to the current big file
                    FileUtils.writeLines(bigFile, content, true);
                } else {
                    // Limit reached: start a new big file with this input's lines
                    fileNameCount++;
                    bigFile = new File("<path>" + fileNameCount + ".txt");
                    FileUtils.writeLines(bigFile, content);
                    // The new file already holds this input's lines
                    contentCounter = content.size();
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}
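
Both versions above still load each input file fully into memory and only check the limit once per file, so a big file can end up somewhat above or below the limit. A minimal sketch of a line-by-line variant using only java.nio follows; it is an illustration under assumptions, not part of the original answer. The class name ChunkedMerger, the output name pattern bigfile-N.txt, and the command-line arguments are all hypothetical, and it assumes the output directory is different from the input directory so that output files are not picked up as inputs.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ChunkedMerger {

    // Streams every *.txt file in inputDir into numbered files in outputDir,
    // starting a new output file whenever the current one holds maxPerFile lines.
    // Returns the number of output files created.
    static int merge(Path inputDir, Path outputDir, long maxPerFile) throws IOException {
        int fileCount = 0;
        long linesInCurrent = 0;
        BufferedWriter writer = null;
        try (DirectoryStream<Path> inputs = Files.newDirectoryStream(inputDir, "*.txt")) {
            for (Path input : inputs) {
                try (BufferedReader reader =
                         Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = reader.readLine()) != null) {
                        if (writer == null || linesInCurrent == maxPerFile) {
                            // Open the first output file, or roll over to the next one
                            if (writer != null) {
                                writer.close();
                            }
                            fileCount++;
                            writer = Files.newBufferedWriter(
                                outputDir.resolve("bigfile-" + fileCount + ".txt"),
                                StandardCharsets.UTF_8);
                            linesInCurrent = 0;
                        }
                        writer.write(line);
                        writer.newLine();
                        linesInCurrent++;
                    }
                }
            }
        } finally {
            if (writer != null) {
                writer.close();
            }
        }
        return fileCount;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical usage: <inputDir> <outputDir> [maxPerFile]; default is 10 million
        long maxPerFile = args.length > 2 ? Long.parseLong(args[2]) : 10000000L;
        int created = merge(Paths.get(args[0]), Paths.get(args[1]), maxPerFile);
        System.out.println("Created " + created + " file(s)");
    }
}
```

Passing the limit as an argument covers the "configurable" requirement from the question. Like the EDIT version, this sketch does not remove duplicate IDs, and the order in which the directory stream returns files is not guaranteed.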

OTHER TIPS

You can avoid using the Set as intermediate storage if you write at the same time as you read from the files. You could do something like this:

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.FileReader;
import java.io.IOException;
import java.io.PrintWriter;


public class AppMain {
  private static final int NUMBER_REGISTERS = 10000000;

  private static String[] filePaths = {"filePath1", "filePath2", "filePathN"};
  private static String mergedFile = "mergedFile";


  public static void main(String[] args) throws IOException {
    mergeFiles(filePaths, mergedFile);
  }

  private static void mergeFiles(String[] filePaths, String mergedFile) throws IOException{
    BufferedReader[] readerArray = createReaderArray(filePaths);
    boolean[] closedReaderFlag = new boolean[readerArray.length];

    PrintWriter writer = createWriter(mergedFile);

    int currentReaderIndex = 0;
    int numberLinesInMergedFile = 0;

    BufferedReader currentReader = null;
    String currentLine = null;
    while(numberLinesInMergedFile < NUMBER_REGISTERS && getNumberReaderClosed(closedReaderFlag) < readerArray.length){
      currentReaderIndex = (currentReaderIndex + 1) % readerArray.length; 

      if(closedReaderFlag[currentReaderIndex]){
        continue;
      }

      currentReader = readerArray[currentReaderIndex];

      currentLine = currentReader.readLine();
      if(currentLine == null){
        currentReader.close();
        closedReaderFlag[currentReaderIndex] = true;
        continue;
      }

      writer.println(currentLine);
      numberLinesInMergedFile++;
    }

    writer.close();
    for(int index = 0; index < readerArray.length; index++){
      if(!closedReaderFlag[index]){
       readerArray[index].close();
      }
    }

  }

  private static BufferedReader[] createReaderArray(String[] filePaths) throws FileNotFoundException{
    BufferedReader[] readerArray = new BufferedReader[filePaths.length];

    for (int index = 0; index < readerArray.length; index++) {
      readerArray[index] = createReader(filePaths[index]);
    }

    return readerArray;
  }

  private static BufferedReader createReader(String path) throws FileNotFoundException{
    BufferedReader reader = new BufferedReader(new FileReader(path));

    return reader;
  }

  private static PrintWriter createWriter(String path) throws FileNotFoundException{
    PrintWriter writer = new PrintWriter(path);

    return writer;
  }

  private static int getNumberReaderClosed(boolean[] closedReaderFlag){
    int count = 0;

    for (boolean currentFlag : closedReaderFlag) {
      if(currentFlag){
        count++;
      }
    }

    return count;
  }
}
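
The filePaths array in the class above holds placeholder values. As a rough sketch, it could be filled from the folder in the question using the same "regular file ending in .txt" check the question's code uses. The class and method names here (FilePathCollector, listTxtFiles) are hypothetical and not part of the original answer.

```java
import java.io.File;
import java.io.FileFilter;
import java.util.Arrays;

class FilePathCollector {

    // Returns the paths of all regular *.txt files directly inside folder, sorted by path.
    static String[] listTxtFiles(File folder) {
        File[] files = folder.listFiles(new FileFilter() {
            @Override
            public boolean accept(File candidate) {
                return candidate.isFile() && candidate.getName().endsWith(".txt");
            }
        });
        if (files == null) {
            // listFiles returns null when folder is not a directory or cannot be read
            return new String[0];
        }
        String[] paths = new String[files.length];
        for (int i = 0; i < files.length; i++) {
            paths[i] = files[i].getPath();
        }
        // Sort so that the merge order is the same on every run
        Arrays.sort(paths);
        return paths;
    }
}
```

The result could then be passed in place of the hard-coded array, for example mergeFiles(FilePathCollector.listTxtFiles(new File("C:\\userids-20130501")), mergedFile). Keep in mind that the round-robin merge opens one reader per input file at the same time.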

The way you're going, you are likely to run out of memory, because you are keeping an unnecessary copy of every ID in userIdSet.

A slight modification that can improve your code is as follows:

public static void main(String args[]) {

    File folder = new File("C:\\userids-20130501");
    File[] listOfFiles = folder.listFiles();

    // there's no need for the userIdSet!
    //Set<String> userIdSet = new LinkedHashSet<String>();

    // Instead I'd go for a counter ;)
    long userIdCount = 0;

    for (int i = 0; i < listOfFiles.length; i++) {
        File file = listOfFiles[i];
        if (file.isFile() && file.getName().endsWith(".txt")) {
            try {
                List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));
                // I just want to know how many lines there are...
                userIdCount += content.size();

                // my guess is you'd probably want to print what you've got
                // before a possible break?? - You know better!
                System.out.println(content);

                if(userIdCount >= 10000000) { // 10 million
                    break;
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

As I noted, this is just a slight modification. It was not my intention to run a very detailed analysis of your code. I just pointed out a glaring design flaw.

Finally, where you call System.out.println(content), you might consider writing to the file at that point instead.

If you write to the file one line at a time, your try-catch block may look like this:

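try {
    List<String> content = FileUtils.readLines(file, Charset.forName("UTF-8"));

    for(int lineNumber = 0; lineNumber < content.size(); lineNumber++){
        // check before writing, so that exactly 10 million lines are written
        if(userIdCount >= 10000000){
           break;
        }
        // here, write to file... But I will use simple System.out.println for example
        System.out.println(content.get(lineNumber));
        userIdCount++;
    }
    // this break only leaves the inner loop; the outer loop over the files
    // should also stop once userIdCount has reached the limit
} catch (IOException e) {
    e.printStackTrace();
}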
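
Putting the line-at-a-time idea together with real file output, a minimal sketch could look like the following. It reads with a BufferedReader instead of FileUtils.readLines, so no input file is held in memory, and it stops across all files once the limit is reached. The names LimitedMerge and copyUpTo are hypothetical, and passing the limit as a parameter is just one assumed way to keep it configurable.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class LimitedMerge {

    // Copies lines from the inputs, in order, into output until limit lines
    // have been written or the inputs are exhausted. Returns the lines written.
    static long copyUpTo(List<Path> inputs, Path output, long limit) throws IOException {
        long written = 0;
        try (BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path input : inputs) {
                if (written >= limit) {
                    break; // limit reached: do not open any further files
                }
                try (BufferedReader reader =
                         Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                    String line;
                    // The limit is checked before each read, so no extra line is consumed
                    while (written < limit && (line = reader.readLine()) != null) {
                        writer.write(line);
                        writer.newLine();
                        written++;
                    }
                }
            }
        }
        return written;
    }
}
```

As with the counter-based code above, this does not remove duplicate IDs; if uniqueness matters, some form of set or a sort-based deduplication step is still needed.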

Your code can be improved in many ways, but I don't have time to do that. I hope my suggestion puts you on the right track. Cheers!

Licensed under: CC-BY-SA with attribution