I understand that ANN input must be normalized, standardized, etc.
Leaving the peculiarities of particular ANN models aside, how can I preprocess UTF-8 encoded text into the range [0, 1] (or alternatively [-1, 1]) before it is given as input to a neural network?
I have been searching Google for this but can't find any information (I may be using the wrong term).
- Does that make sense?
- Isn't that how text is preprocessed for neural networks?
- Are there any alternatives?
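To make the question concrete, here is a minimal sketch of the naive scaling I have in mind: encode the text as UTF-8 bytes and scale each byte value into the target range. The function names are my own illustrations, not from any library.

```python
def text_to_unit_interval(text: str) -> list[float]:
    """Map each UTF-8 byte of `text` to a float in [0, 1]."""
    return [b / 255.0 for b in text.encode("utf-8")]

def text_to_signed_interval(text: str) -> list[float]:
    """Map each UTF-8 byte of `text` to a float in [-1, 1]."""
    return [2.0 * (b / 255.0) - 1.0 for b in text.encode("utf-8")]

# 'a' is byte 0x61 = 97, so the first component of "abc" is 97 / 255.
vec = text_to_unit_interval("abc")
```

This is exactly the kind of direct numeric mapping I am asking about; whether it makes sense as ANN input is the question.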
Update on November 2013
I long accepted Pete's answer as correct.
However, I now have serious doubts, mostly due to recent research I have been doing on Symbolic Knowledge and ANNs.
Dario Floreano and Claudio Mattiussi explain in their book that such processing is indeed possible, by using distributed encoding.
Indeed, a Google Scholar search turns up a plethora of neuroscience articles and papers on how distributed encoding is hypothesized to be used by brains to encode Symbolic Knowledge.
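As a toy contrast (my own illustration, not taken from the book): a local one-hot code gives each symbol its own unit and no notion of similarity, while a distributed code represents each symbol as a dense pattern over shared units, so similarity can emerge as distance between patterns. The random vectors below stand in for a trained representation.

```python
import numpy as np

symbols = ["cat", "dog", "car"]

# Local (one-hot) encoding: one unit per symbol, all symbols equidistant.
one_hot = {s: np.eye(len(symbols))[i] for i, s in enumerate(symbols)}

# Distributed encoding: each symbol is a dense pattern over 8 shared units.
# In practice these patterns would be learned, not sampled at random.
rng = np.random.default_rng(0)
distributed = {s: rng.normal(size=8) for s in symbols}
```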
Teuvo Kohonen, in his paper "Self Organizing Maps" explains:
> One might think that applying the neural adaptation laws to a symbol set (regarded as a set of vectorial variables) might create a topographic map that displays the "logical distances" between the symbols. However, there occurs a problem which lies in the different nature of symbols as compared with continuous data. For the latter, similarity always shows up in a natural way, as the metric differences between their continuous encodings. This is no longer true for discrete, symbolic items, such as words, for which no metric has been defined. It is in the very nature of a symbol that its meaning is dissociated from its encoding.
However, Kohonen did manage to deal with Symbolic Information in SOMs!
Furthermore, Prof. Dr. Alfred Ultsch, in his paper "The Integration of Neural Networks with Symbolic Knowledge Processing", deals exactly with how to process Symbolic Knowledge (such as text) in ANNs. Ultsch offers the following methodologies for processing Symbolic Knowledge: Neural Approximative Reasoning, Neural Unification, Introspection, and Integrated Knowledge Acquisition. That said, little information on these can be found in Google Scholar, or anywhere else for that matter.
Pete is right in his answer about semantics: semantics in ANNs are usually disconnected from the encoding. However, the following references provide insight into how researchers have trained RBMs to recognize similarity in the semantics of different word inputs. It should therefore not be impossible to capture semantics, but doing so would require a layered approach, or a secondary ANN.
Natural Language Processing With Subsymbolic Neural Networks, Risto Miikkulainen, 1997
Training Restricted Boltzmann Machines on Word Observations, G. E. Dahl, R. P. Adams, H. Larochelle, 2012
Update on January 2021
The field of NLP and Deep Learning has seen a resurgence of research in the years since I asked this question. There are now machine-learning models which address what I was trying to achieve, in many different ways.
For anyone arriving at this question wondering how to pre-process text for Deep Learning or Neural Networks: none of the helpful material on the topics below is academic, it is simple to understand, and it should get you started on solving similar tasks.
At the time I asked this question, RNNs, CNNs and VSMs were only just starting to be used; nowadays most Deep Learning frameworks ship with extensive NLP support. Hope the above helps.
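For completeness, here is a minimal sketch of the pipeline those frameworks wrap for you: tokenize text into integer ids, then look the ids up in a trainable embedding matrix. The tiny corpus, vocabulary, and dimensions are illustrative only.

```python
import numpy as np

corpus = ["the cat sat", "the dog sat"]

# Build an integer vocabulary from the corpus (sorted for determinism).
vocab = {w: i for i, w in enumerate(sorted({w for s in corpus for w in s.split()}))}

def tokenize(sentence: str) -> list[int]:
    """Map a whitespace-split sentence to integer token ids."""
    return [vocab[w] for w in sentence.split()]

# In a real model this matrix is a trainable layer; here it is random.
embedding_dim = 4
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), embedding_dim))

ids = tokenize("the cat sat")
inputs = embeddings[ids]  # shape (3, embedding_dim), fed to the network
```

This is the modern answer to my original question: instead of squeezing raw bytes into [0, 1], map discrete tokens to learned dense vectors.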