Java BufferedWriter Creating Null Characters

Question 1

EDIT: Looking at the hexdump diff you gave, the only difference is that one has LF line endings (0A) and the other has CRLF line endings (0D 0A). All the other data in your diff is shifted ahead to accomodate the extra byte.

The CRLF is the default line ending on the OS you're using. If you want a specific line ending in your output, write the string "\n" or "\r\n".

Previously I noted that the Scanner doesn't specify a charset. It should specify the appropriate one that the input is known to be encoded in. However, this isn't the source of the unexpected output.

Question 2

Scanner.nextLine() is eating the existing line endings.
The javadoc for nextLine states:

This method returns the rest of the current line, excluding any line separator at the end.

The javadoc for BufferedWriter.newLine explains:

Writes a line separator. The line separator string is defined by the system property line.separator, and is not necessarily a single newline ('\n') character.

In your case your system's default newline seperator is "\n". The EDI file you are parsing uses "\r\n".

Using the system defined newLine seperator isn't the appropriate thing to do in this case. The newline separator to use is dictated by the file format and should be put in a format specific static constant somewhere.

Change "out.newLine();" to "out.write("\r\n");"

Question 3

I think what is going on is the following

All lines that contain ^L (ff) get modified to remove everything before the ^L but in addition you have the side effect in 1 that all \r (cr) also get removed. However, if cr appears before ^L nextLine() is treating that as a line too. Note how, in the output file below, the number of cr + nl is 6 in the input file and the number of cr + nl is also 6 but they're all nl, so the line with c gets preserved because it's being treated on a different line than ^L. Probably not what you want. See below.

Some observations

The source file is being generated on a system that uses \r\n to define a new line, and your program is being run on a system that does not. Because of this all occurrences of 0xd are going to be removed. This will make the two files different sizes even if there are no ^L.
But you probably overlooked #1 because vim will operate in DOS mode (recognize \r\n as a newline separator) or non-DOS mode (only \n) depending on what it reads when it opens the file and hides the fact from the user if it can. In fact to test I had to brute force in \r using ^v^m because I was editing on Linux using vim more here.
Your means to test is probably using od -x (for hex right)? But that outputs ints which is not what you want. Consider the following input file and output file. After your program runs. As viewed in vi

Input file

a
b^M
c^M^M ^L
d^L

Output file

a
b
c

Well maybe that's right, lets see what od has to say

od -x of input File

0a61    0d62    630a    0d0d    0c20    640a    0a0c

od -x of output File

0a61    0a62    0a63    0a0a    000a

Huh, what where did that null come from? But wait from the man page of od

-t type     Specify the output format.  type is a string containing one or more of the following kinds of type specifiers:

   q          a       Named characters (ASCII).  Control characters are displayed using the following names:
-h, -x      Output hexadecimal shorts.  Equivalent to -t x2.
-a          Output named characters.  Equivalent to -t a.

Oh, ok so instead use the -a option

od -a of input

a  nl   b  cr  nl   c  cr  cr  sp  ff  nl   d  ff  nl

od -a of output

a  nl   b  nl   c  nl  nl  nl  nl

Forcing java to ignore \r

And finally, all that being said, you really have to overcome the implicit understanding of java that \r delimits a line, even contrary to the documentation. Even when explicitly setting the scanner to use a \r ignoring pattern, it still operates contrary to the documentation and you must override that again by setting the delimiter (see below). I've found the following will probably do what you want by insisting on Unix line semantics. I also added in some logic to not output a blank line.

public static void repl(File original,File file) throws IOException
{
   Scanner fileScanner = new Scanner(original);
   Pattern pattern1 = Pattern.compile("(?d).*");

   fileScanner.useDelimiter("(?d)\\n");

   BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "UTF8"));

   while(fileScanner.hasNext(pattern1))
   {
      String next = fileScanner.next(pattern1);

      next = next.replaceAll("(?d)(.*\\x0C)|(\\x0D)","");
      if(next.length() != 0)
      {
         out.write(next);
         out.newLine();
      }
   }
   out.flush();
   out.close();
}

With this change, the output above changes to.

od -a of input

a  nl   b  cr  nl   c  cr  cr  sp  ff  nl   d  ff  nl

od -a of output

a  nl   b  nl

Question 4

Stuart Caie provided the answer. if you are looking for an code to avoid these characters.

Basic issue is , Org file using different line separator and the new file using different line separator character.

One easy way, find the Org file Separator character and use the same in new file.

    try(BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)));
            Scanner fileScanner = new Scanner(original);) {
        String lineSep = null;
        boolean lineSepFound = false;
        while(fileScanner.hasNextLine())
        {

            if (!lineSepFound){
                MatchResult matchResult = fileScanner.match();
                if (matchResult != null){
                    lineSep = matchResult.group(1);
                    if (lineSep != null){
                        lineSepFound = true;
                    }
                }
            }else{
                out.write(lineSep);
            }
            String next = fileScanner.nextLine();
            next = next.replaceAll(".*\\x0C", ""); //remove up to ^L
            out.write(next);

        }
    } catch ( IOException e) {
        e.printStackTrace();
    }

Note ** MatchResult matchResult = fileScanner.match(); would provide the matchResult for the last Match performed. And in our case we have used hasNextLine() - Scanner used linePattern to find the next line .. Scanner.hasNextLine Source code finding the line Separator ,

but unfortunately no way to get the line separator back. So i have used thier code to get the lineSep only once. and used that lineSep for creating new file.

Also per your code , you would be having extra line separator at the end of file. Corrected here.

Let me know if that works.