Question

I am trying to work with a file, and convert it into some kind of data structure (Text is an "array" of paragraphs, paragraph is an "array" of sentences and sentence is an "array" of words, which are char*).

To make things easier for myself I am using file streams (ifstream, to be exact), but one problem I ran into is detecting where paragraphs end (two consecutive '\n' characters mark the end of a paragraph). The simple way is to go through the text char by char and check whether each one is a space or '\n', but that's long and kind of painful.

The code looks something like this:

    std::ifstream fd(filename);
    char buffer[128];

    while(fd >> buffer)
    {
        /* Some code in here that does things with buffer */
    }

And - well, it works, but it ignores paragraph breaks completely. fd.get(buffer, 128, '\n') doesn't work as needed either - it reads one line and then returns nothing more.

So - is there an easier way to do this than reading char by char? I can't use std::getline(), since the task forbids us to use vectors or strings.

UPDATE

So it seems that std::istream::getline may do the trick for me, but it still doesn't behave quite as I expected. It reads the first line fine, and after that something weird happens.

The code looks like this:

    std::ifstream fd(fl);
    char buffer[128];
    fd.getline(buffer, 128);
    std::cout << "555 - [" << buffer << "]" << std::endl;
    std::cout << fd.gcount() << std::endl;
    fd.getline(buffer, 128);
    std::cout << "777 - [" << buffer << "]" << std::endl;
    std::cout << fd.gcount() << std::endl;

And the output looks like this:

    ]55 - [text from file
    23
    ]77 - [
    2

And - yeah, I don't think I understand what's going on.

Solution

From what I understood, you may not use any of the std containers.

So here is what I think is possible:

  1. Read the entire file to a buffer
  2. Tokenize the buffer for paragraphs
  3. Tokenize each paragraph for sentences
  4. Tokenize each sentence for words

For the first part, you may use:

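    #include <cstring>
    #include <fstream>
    #include <iostream>
    #include <new>

    //! Reads a whole file into a NUL-terminated buffer; the caller must delete[] it
    char* readFile(const char *filename) {
      std::ifstream ifs(filename, std::ifstream::binary);

      if (!ifs.good()) // The stream, not the file name, reports whether the open succeeded
        return NULL;

      ifs.seekg(0, ifs.end);
      std::streamoff len = ifs.tellg();
      ifs.seekg(0, ifs.beg);
      if (len < 0) // tellg() returns -1 on failure
        return NULL;

      // One extra byte for the terminator that the str* functions rely on
      char* buffer = new (std::nothrow) char[static_cast<size_t>(len) + 1];
      if (!buffer) // Check for failed allocation (plain new would throw instead)
        return NULL;

      ifs.read(buffer, len);
      if (ifs.gcount() != len) { // Check if the entire file was read
        delete[] buffer;
        return NULL;
      }
      buffer[len] = '\0';
      return buffer; // ifs closes itself when it goes out of scope
    }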

With that function ready, all we need now is to use it and tokenize the string. For that, we must define our types (based on linked lists, in a C-like style):

    struct Word {
      char *contents;
      Word *next;
    };

    struct Sentence {
      Word *first;
      Sentence *next;
    };

    struct Paragraph {
      Sentence *first;
      Paragraph *next;
    };

    struct Text {
      Paragraph *first;
    };

With the types defined, we can now start reading our text:

    //! Splits a sentence into as many Word elements as possible
    void readSentence(char *buffer, size_t len, Word **target) {
        *target = NULL; // Leave the list empty unless a word is found
        if (!buffer || len == 0) return;

        buffer += strspn(buffer, " \t\r\n"); // Skip leading whitespace so no empty words are stored
        if (*buffer == '\0') return;

        size_t wordLen = strcspn(buffer, " \t\r\n"); // Characters up to the next whitespace or the end

        *target = new Word;
        (*target)->next = NULL;
        (*target)->contents = new char[wordLen + 1]; // new[] (not _strdup) so delete[] can release it
        strncpy((*target)->contents, buffer, wordLen);
        (*target)->contents[wordLen] = '\0';

        char *rest = buffer + wordLen;
        readSentence(rest, strlen(rest), &(*target)->next);
    }

    //! Splits a paragraph from a text buffer into as many Sentence elements as possible
    void readParagraph(char *buffer, size_t len, Sentence **target) {
        *target = NULL;
        if (!buffer || *buffer == '\0' || len == 0) return;

        *target = new Sentence;
        (*target)->next = NULL;

        char *end = strpbrk(buffer, ".;:?!");

        if (end != NULL) {
            // Copy the sentence, including its closing punctuation mark
            char *t = new char[end - buffer + 2];
            strncpy(t, buffer, end - buffer + 1);
            t[end - buffer + 1] = '\0';
            readSentence(t, (size_t)(end - buffer + 1), &(*target)->first);
            delete[] t;

            readParagraph(end + 1, len - (end - buffer + 1), &(*target)->next);
        }
        else {
            readSentence(buffer, len, &(*target)->first);
        }
    }

    //! Splits as many Paragraph elements as possible from a text buffer
    void readText(char *buffer, Paragraph **target) {
        *target = NULL;
        if (!buffer || *buffer == '\0') return;

        *target = new Paragraph;
        (*target)->next = NULL;

        char *end = strstr(buffer, "\n\n"); // Pointer to the end of a paragraph; pass the paragraph to our sentence parser
        if (end != NULL) {
            char *t = new char[end - buffer + 1];
            strncpy(t, buffer, end - buffer);
            t[end - buffer] = '\0';
            readParagraph(t, (size_t)(end - buffer), &(*target)->first);
            delete[] t;

            readText(end + 2, &(*target)->next);
        }
        else
            readParagraph(buffer, strlen(buffer), &(*target)->first);
    }

    //! Builds a Text from a NUL-terminated buffer; an empty or NULL buffer gives an empty Text
    Text* createText(char *contents) {
        Text *text = new Text;
        readText(contents, &text->first);
        return text;
    }

As an example, you may use it like this:

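    int main(int argc, char **argv) {
        char *buffer = readFile("mytext.txt");
        if (!buffer) { // readFile returns NULL when the file cannot be read
            std::cerr << "Could not read mytext.txt" << std::endl;
            return 1;
        }
        Text *text = createText(buffer);
        delete[] buffer; // The words were copied, so the raw buffer is no longer needed

        for (Paragraph* p = text->first; p != NULL; p = p->next) {
            for (Sentence* s = p->first; s != NULL; s = s->next) {
                for (Word* w = s->first; w != NULL; w = w->next) {
                    std::cout << w->contents << " ";
                }
            }
            std::cout << std::endl << std::endl;
        }

        // Note: the nodes allocated by createText are not freed in this example
        return 0;
    }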

Please keep in mind that this code might or might not work, since I did not test this.

Licensed under: CC-BY-SA with attribution