Question

I've been using Java's BufferedWriter to write to a file while parsing some input. When I open the file afterwards, however, null characters seem to have been added. I tried specifying the encoding as "US-ASCII" and "UTF8", but I get the same result. Here's my code snippet:

Scanner fileScanner = new Scanner(original);
BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "US-ASCII"));
while (fileScanner.hasNextLine())
{
    String next = fileScanner.nextLine();
    next = next.replaceAll(".*\\x0C", ""); // remove up to ^L
    out.write(next);
    out.newLine();
}
out.flush();
out.close();

Maybe the issue isn't even with the BufferedWriter?

I've narrowed it down to this code block because if I comment it out, there are no null-characters in the output file. If I do a regex replace in VIM the file is null-character free (:%s/.*^L//g).

Let me know if you need more information.

Thanks!

EDIT: hexdump of a normal line looks like: 0000000 5349 2a41 3030 202a

But when this code is run the hexdump looks like: 0000000 5330 2a49 4130 202a

I'm not sure why things are getting mixed up.

EDIT: Also, even if the file doesn't match the regex and runs through that block of code, it comes out with null characters.

EDIT: Here's a hexdump of the first few lines of a diff: http://pastie.org/pastes/8964701/text

command was: diff -y testfile.hexdump expectedoutput.hexdump

The rest of the lines are different like the last two.


Solution

EDIT: Looking at the hexdump diff you gave, the only difference is that one has LF line endings (0A) and the other has CRLF line endings (0D 0A). All the other data in your diff is shifted ahead to accommodate the extra byte.

The CRLF is the default line ending on the OS you're using. If you want a specific line ending in your output, write the string "\n" or "\r\n".

Previously I noted that the Scanner doesn't specify a charset. It should specify the appropriate one that the input is known to be encoded in. However, this isn't the source of the unexpected output.
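To make the two points above concrete, here is a minimal sketch that gives the Scanner an explicit charset and writes an explicit line terminator instead of calling newLine(). The class name ExplicitLineEnding and the helper rewrite are made up for illustration, "US-ASCII" is only assumed to be the right input encoding, and the sketch works on byte arrays in memory so it can be tried without any files.

```java
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

class ExplicitLineEnding {
    // Hypothetical helper: drops everything up to and including ^L on each line
    // and ends every line with the terminator the caller asks for.
    static byte[] rewrite(byte[] input, String lineEnding) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        // Assumption: the input is US-ASCII; pass whatever charset the data really uses.
        try (Scanner scanner = new Scanner(new ByteArrayInputStream(input), "US-ASCII");
             BufferedWriter writer = new BufferedWriter(
                     new OutputStreamWriter(sink, StandardCharsets.US_ASCII))) {
            while (scanner.hasNextLine()) {
                String line = scanner.nextLine().replaceAll(".*\\x0C", "");
                writer.write(line);
                writer.write(lineEnding); // explicit terminator instead of newLine()
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }

    public static void main(String[] args) {
        byte[] input = "x\fA\r\nB\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] output = rewrite(input, "\r\n");
        System.out.println(output.length + " bytes written");
    }
}
```

With "\r\n" passed in, the output keeps CRLF endings no matter which OS the program runs on; passing "\n" would give LF endings instead.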

OTHER TIPS

Scanner.nextLine() is eating the existing line endings.
The javadoc for nextLine states:

This method returns the rest of the current line, excluding any line separator at the end.

The javadoc for BufferedWriter.newLine explains:

Writes a line separator. The line separator string is defined by the system property line.separator, and is not necessarily a single newline ('\n') character.

In your case, your system's default line separator is "\n". The EDI file you are parsing uses "\r\n".

Using the system-defined line separator isn't appropriate in this case. The separator to use is dictated by the file format and should be defined in a format-specific static constant somewhere.

Change "out.newLine();" to "out.write("\r\n");"
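A small sketch of the behaviour described by the two javadoc excerpts, in case it helps to see it in isolation. The class LineSeparatorDemo, its method names and the constant FORMAT_LINE_SEPARATOR are hypothetical; only Scanner.nextLine(), BufferedWriter.newLine() and System.lineSeparator() are real JDK calls.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

class LineSeparatorDemo {
    // Hypothetical format-specific constant; "\r\n" is the value suggested above.
    static final String FORMAT_LINE_SEPARATOR = "\r\n";

    // nextLine() returns each line without its terminator, so the original
    // "\r\n" or "\n" is lost at this point.
    static List<String> readLines(String text) {
        List<String> lines = new ArrayList<>();
        try (Scanner scanner = new Scanner(text)) {
            while (scanner.hasNextLine()) {
                lines.add(scanner.nextLine());
            }
        }
        return lines;
    }

    // newLine() writes the platform separator (System.lineSeparator()).
    static String joinWithNewLine(List<String> lines) {
        return join(lines, null);
    }

    // Writes the fixed, format-specific separator instead.
    static String joinWithFormatSeparator(List<String> lines) {
        return join(lines, FORMAT_LINE_SEPARATOR);
    }

    private static String join(List<String> lines, String separator) {
        StringWriter sink = new StringWriter();
        try (BufferedWriter writer = new BufferedWriter(sink)) {
            for (String line : lines) {
                writer.write(line);
                if (separator == null) {
                    writer.newLine();
                } else {
                    writer.write(separator);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString();
    }

    public static void main(String[] args) {
        List<String> lines = readLines("a\r\nb\n");
        System.out.println(lines);
        System.out.println(joinWithNewLine(lines).equals(joinWithFormatSeparator(lines))
                ? "platform separator happens to match the format"
                : "platform separator differs from the format");
    }
}
```

Which message main prints depends on the platform, which is exactly the problem: only the variant with the fixed constant gives the same bytes everywhere.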

I think what is going on is the following.

All lines that contain ^L (ff) are modified to remove everything before the ^L, but in addition, as a side effect of observation 1 below, every \r (cr) is removed as well. However, if a cr appears before the ^L, nextLine() treats that cr as the end of a line too. Note how, in the example below, the input file contains 6 line terminators (counting cr nl as one) and the output file also contains 6, but in the output they are all nl. The line with c is therefore preserved, because it is treated as a separate line from the one containing ^L. That is probably not what you want. See below.

Some observations

  1. The source file is being generated on a system that uses \r\n to define a new line, and your program is being run on a system that does not. Because of this all occurrences of 0xd are going to be removed. This will make the two files different sizes even if there are no ^L.

  2. But you probably overlooked #1 because vim operates in DOS mode (recognizing \r\n as the line separator) or non-DOS mode (only \n) depending on what it reads when it opens the file, and hides that fact from the user if it can. In fact, to test this I had to force in the \r characters using ^v^m, because I was editing on Linux using vim.

  3. You are probably testing with od -x (for hex, right?). But that outputs 16-bit integers, which is not what you want. Consider the following input file and the output file produced after your program runs, as viewed in vi.

Input file

a
b^M
c^M^M ^L
d^L

Output file

a
b
c

Well, maybe that's right. Let's see what od has to say.

od -x of input File

0a61    0d62    630a    0d0d    0c20    640a    0a0c 

od -x of output File

0a61    0a62    0a63    0a0a    000a

Huh, where did that null come from? But wait, from the man page of od:

-t type     Specify the output format.  type is a string containing one or more of the following kinds of type specifiers:

            a       Named characters (ASCII).  Control characters are displayed using the following names:
-h, -x      Output hexadecimal shorts.  Equivalent to -t x2.
-a          Output named characters.  Equivalent to -t a.

Oh, OK, so use the -a option instead.

od -a of input

a  nl   b  cr  nl   c  cr  cr  sp  ff  nl   d  ff  nl

od -a of output

a  nl   b  nl   c  nl  nl  nl  nl 
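To see why od -x appeared to produce a null, here is a rough Java imitation of the two views. It is only a sketch: the class OdStyleDump and its methods are made up, and it assumes od was run on a little-endian machine, which is what the byte order in the dumps above suggests. Each pair of bytes is printed as one 16-bit word with the second byte first, and an odd-length input is padded with a zero byte.

```java
import java.nio.charset.StandardCharsets;

class OdStyleDump {
    // Imitates "od -x" on a little-endian machine: 16-bit words, second byte
    // printed first, odd-length input padded with 0x00.
    static String hexWords(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < data.length; i += 2) {
            int low = data[i] & 0xff;
            int high = (i + 1 < data.length) ? (data[i + 1] & 0xff) : 0;
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x%02x", high, low));
        }
        return sb.toString();
    }

    // Plain byte-by-byte view, in file order, with no padding.
    static String hexBytes(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The 9 bytes of the output file shown above: a nl b nl c nl nl nl nl
        byte[] output = "a\nb\nc\n\n\n\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(hexWords(output)); // 0a61 0a62 0a63 0a0a 000a
        System.out.println(hexBytes(output)); // 61 0a 62 0a 63 0a 0a 0a 0a
    }
}
```

The trailing 000a in the word view is just the last nl padded out to 16 bits; the file itself contains no null byte. The same grouping explains why the hexdump lines in the question look scrambled once the data shifts by one byte.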

Forcing java to ignore \r

And finally, all that being said, you really have to override Java's implicit assumption that \r delimits a line, even where that seems contrary to the documentation. Even when the scanner is explicitly given a pattern that ignores \r, it still behaves contrary to the documentation, and you must override that again by setting the delimiter (see below). I've found that the following will probably do what you want by insisting on Unix line semantics. I also added some logic to avoid writing blank lines.

public static void repl(File original,File file) throws IOException
{
   Scanner fileScanner = new Scanner(original);
   Pattern pattern1 = Pattern.compile("(?d).*");

   fileScanner.useDelimiter("(?d)\\n");

   BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file), "UTF8"));

   while(fileScanner.hasNext(pattern1))
   {
      String next = fileScanner.next(pattern1);

      next = next.replaceAll("(?d)(.*\\x0C)|(\\x0D)","");
      if(next.length() != 0)
      {
         out.write(next);
         out.newLine();
      }
   }
   out.flush();
   out.close();
}

With this change, the output above changes to the following.

od -a of input

a  nl   b  cr  nl   c  cr  cr  sp  ff  nl   d  ff  nl

od -a of output

a  nl   b  nl

Stuart Caie provided the answer. This is for anyone looking for code that avoids these characters.

The basic issue is that the original file and the new file use different line separators.

One easy way around it is to find the line separator used by the original file and use the same one in the new file.

    try(BufferedWriter out = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(file)));
            Scanner fileScanner = new Scanner(original);) {
        String lineSep = null;
        boolean lineSepFound = false;
        while(fileScanner.hasNextLine())
        {

            if (!lineSepFound){
                MatchResult matchResult = fileScanner.match();
                if (matchResult != null){
                    lineSep = matchResult.group(1);
                    if (lineSep != null){
                        lineSepFound = true;
                    }
                }
            }else{
                out.write(lineSep);
            }
            String next = fileScanner.nextLine();
            next = next.replaceAll(".*\\x0C", ""); //remove up to ^L
            out.write(next);

        }
    } catch ( IOException e) {
        e.printStackTrace();
    }

Note: MatchResult matchResult = fileScanner.match(); provides the MatchResult of the last match performed. In our case that is the match made by hasNextLine(), where Scanner uses its linePattern to find the next line. The Scanner.hasNextLine source code finds the line separator,

but unfortunately offers no way to get it back. So I have used their code to get lineSep only once, and used that lineSep when creating the new file.

Also, your code would write an extra line separator at the end of the file. That is corrected here.
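If relying on the state that hasNextLine() leaves behind in the Scanner feels too fragile, the separator could also be detected from the raw text before any scanning. This is a different technique from the one above, shown only as a sketch: the class LineSeparatorSniffer and its method detect are made up, and it only distinguishes "\r\n" from "\n", ignoring a lone \r and the Unicode separators that Scanner also recognizes.

```java
class LineSeparatorSniffer {
    // Returns "\r\n" or "\n" according to the first line break found in the
    // text, or null when the text contains no '\n' at all.
    static String detect(String text) {
        int lf = text.indexOf('\n');
        if (lf < 0) {
            return null;
        }
        boolean precededByCr = lf > 0 && text.charAt(lf - 1) == '\r';
        return precededByCr ? "\r\n" : "\n";
    }

    public static void main(String[] args) {
        String separator = detect("first\r\nsecond\r\n");
        System.out.println("\r\n".equals(separator) ? "CRLF" : "LF or none");
    }
}
```

The detected value can then be written between lines in place of out.newLine(), the same way lineSep is used in the code above.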

Let me know if that works.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow