Question

I am opening a file and placing it's contents into a string buffer to do some lexical analysis on a per-character basis. Doing it this way enables parsing to finish faster than using a subsequent number of fread() calls, and since the source file will always be no larger than a couple MBs, I can rest assured that the entire contents of the file will always be read.

However, there seems to be some trouble in detecting when there is no more data to be parsed, because ftell() often gives me an integer value higher than the actual number of characters within the file. This wouldn't be a problem with the use of the EOF (-1) macro, if the trailing characters were always -1... But this is not always the case...


Here's how I am opening the file, and reading it into the string buffer:

FILE *fp = NULL;
errno_t err = _wfopen_s(&fp, m_sourceFile, L"rb, ccs=UNICODE");
if(fp == NULL || err != 0) return FALSE;
if(fseek(fp, 0, SEEK_END) != 0) {
    fclose(fp);
    fp = NULL;
    return FALSE;
}

LONG fileSize = ftell(fp);
if(fileSize == -1L) {
    fclose(fp);
    fp = NULL;
    return FALSE;
}
rewind(fp);

LPSTR s = new char[fileSize];
RtlZeroMemory(s, sizeof(char) * fileSize);
DWORD dwBytesRead = 0;
if(fread(s, sizeof(char), fileSize, fp) != fileSize) {
    fclose(fp);
    fp = NULL;
    return FALSE;
}

This always appears to work perfectly fine. Following this is a simple loop, which checks the contents of the string buffer one character at a time, like so:

char c = 0;
LONG nPos = 0;
while(c != EOF && nPos <= fileSize)
{
    c = s[nPos];
    // do something with 'c' here...
    nPos++;
}

The trailing bytes of the file are usually a series of ý (-3) and « (-85) characters, and therefore EOF is never detected. Instead, the loop simply continues onward until nPos ends up being of higher value than fileSize -- Which is not desirable for proper lexical analysis, because you often end up skipping the final token in a stream which omits a newline character at the end.


In a Basic Latin character set, would it be safe to assume that an EOF char is any character with a negative value? Or perhaps there is just a better way to go about this?


#EDIT: I have just tried to implement the feof() function into my loop, and all the same, it doesn't seem to detect EOF either.

Was it helpful?

Solution

Assembling comments into an answer...

  • You leak memory (potentially a lot of memory) when you fail to read.

  • You haven't allowed for a null terminator at the end of the string read.

  • There's no point in zeroing the memory when it is all about to be overwritten by the data from the file.

  • Your test loop is accessing memory out of bounds; nPos == fileSize is one beyond the end of the memory you allocated.

    char c = 0;
    LONG nPos = 0;
    while(c != EOF && nPos <= fileSize)
    {
        c = s[nPos];
        // do something with 'c' here...
        nPos++;
    }
    
  • There are other problems, not previously mentioned, with this. You did ask if it is 'safe to assume that an EOF char is any character with a negative value', to which I responded No. There are several issues here, that affect both C and C++ code. The first is that plain char may be a signed type or an unsigned type. If the type is unsigned, then you can never store a negative value in it (or, more accurately, if you attempt to store a negative integer into an unsigned char, it will be truncated to the least significant 8* bits and will be treated as positive.

  • In the loop above, one of two problems can occur. If char is a signed type, then there is a character (ÿ, y-umlaut, U+00FF, LATIN SMALL LETTER Y WITH DIAERESIS, 0xFF in the Latin-1 code set) that has the same value as EOF (which is always negative and usually -1). Thus, you might detect EOF prematurely. If char is an unsigned type, then there will never be any character equal to EOF. But the test for EOF on a character string is fundamentally flawed; EOF is a status indicator from I/O operations and not a character.

  • During I/O operations, you will only detect EOF when you've attempted to read data that isn't there. The fread() won't report EOF; you asked to read what was in the file. If you tried getc(fp) after the fread(), you'd get EOF unless the file had grown since you measured how long it is. Since _wfopen_s() is a non-standard function, it might be affecting how ftell() behaves and the value it reports. (But you later established that wasn't the case.)

  • Note that functions such as fgetc() or getchar() are defined to return characters as positive integers and EOF as a distinct negative value.

    If the end-of-file indicator for the input stream pointed to by stream is not set and a next character is present, the fgetc function obtains that character as an unsigned char converted to an int.

    If the end-of-file indicator for the stream is set, or if the stream is at end-of-file, the end-of- file indicator for the stream is set and the fgetc function returns EOF. Otherwise, the fgetc function returns the next character from the input stream pointed to by stream. If a read error occurs, the error indicator for the stream is set and the fgetc function returns EOF.289)

    289) An end-of-file and a read error can be distinguished by use of the feof and ferror functions.

    This indicates how EOF is separate from any valid character in the context of I/O operations.

You comment:

As for any potential memory leakage... At this stage in my project, memory leaks are one of many problems with my code which, as of yet, are of no concern to me. Even if it didn't leak memory, it doesn't even work to begin with, so what's the point? Functionality comes first.

It is easier to head off memory leaks in error paths at the initial coding stage than to go back later and fix them — because you may not spot them because you may not trigger the error condition. However, the extent to which that matters depends on the intended audience for the program. If it is a one-off for a coding course, you may be fine. If you're the only person who'll use it, you may be fine. But if it will be installed by millions, you'll have problems retrofitting the checks everywhere.

I have swapped _wfopen_s() with fopen() and the result from ftell() is the same. However, after changing the corresponding lines to LPSTR s = new char[fileSize + 1], RtlZeroMemory(s, sizeof(char) * fileSize + 1); (which should also null-terminate it, btw), and adding if(nPos == fileSize) to the top of the loop, it now comes out cleanly.

OK. You could use just s[fileSize] = '\0'; to null terminate the data too, but using RtlZeroMemory() achieves the same effect (but would be slower if the file is many megabytes in size). But I'm glad the various comments and suggestions helped get you back on track.


* In theory, CHAR_BITS might be larger than 8; in practice it is almost always 8 and for simplicity, I'm assuming it is 8 bits here. The discussion has to be more nuanced if CHAR_BITS is 9 or more, but the net effect is much the same.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top