Pybrain Text Classification: data and input

https://stackoverflow.com/questions/18070368

23-06-2022

Question

I have 3 sets of sentences (varying in word counts), but I don't know how to extract features from the text such that the input dimension will remain the same.

For example, I've tried bag-of-words but, since the word-count variation causes input-dimension variation, I eventually get errors.

I would much appreciate it if you could show me an approach to preparing the string data for the neural network.

Thank you!

(Python 2.7 in Windows 7)

Solution

How to format the input

The following is an excerpt from wikipedia.org, describing the bag-of-words representation:

Here are two simple text documents:

John likes to watch movies. Mary likes too.
John also likes to watch football games.

Based on these two text documents, a dictionary is constructed as:

{
    "John": 1,
    "likes": 2,
    "to": 3,
    "watch": 4,
    "movies": 5,
    "also": 6,
    "football": 7,
    "games": 8,
    "Mary": 9,
    "too": 10
}

which has 10 distinct words. Using the indexes of the dictionary, each document is represented by a 10-entry vector in which entry i counts how many times word i occurs in that document:

[1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
[1, 1, 1, 1, 0, 1, 1, 1, 0, 0]

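Because the vector length equals the number of words in the dictionary (10 here), not the number of words in the document, your input will remain the same size regardless of the length of your document. I hope this helps.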
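As a rough illustration of the idea above, here is a minimal sketch in plain Python (no PyBrain calls). The helper names `tokenize`, `build_vocabulary` and `vectorize` are made up for this example, and the tokenizer is a deliberately naive assumption: it keeps runs of letters and preserves case. The dictionary is built in order of first appearance and indexed from 0, so the columns come out in a different order than in the excerpt. Any order works as long as it stays fixed.

```python
import re


def tokenize(text):
    # Naive tokenizer (an assumption for this sketch): runs of letters or
    # apostrophes count as words; punctuation is dropped, case is kept.
    return re.findall(r"[A-Za-z']+", text)


def build_vocabulary(documents):
    # Map each distinct word to a fixed column index, in order of first appearance.
    vocabulary = {}
    for document in documents:
        for word in tokenize(document):
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def vectorize(document, vocabulary):
    # Count occurrences per vocabulary word; the result always has
    # len(vocabulary) entries. Words not in the vocabulary are ignored.
    vector = [0] * len(vocabulary)
    for word in tokenize(document):
        index = vocabulary.get(word)
        if index is not None:
            vector[index] += 1
    return vector


documents = [
    "John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.",
]

vocabulary = build_vocabulary(documents)
vectors = [vectorize(document, vocabulary) for document in documents]

for vector in vectors:
    print(vector)
```

Build the dictionary once from your training sentences and reuse it for every later sentence. The network's input layer then needs `len(vocabulary)` units, and a new sentence of any length still maps to a vector of that size.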

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow