I understand that ANN input must be normalized, standardized, etc.
Leaving the peculiarities of particular ANN models aside, how can I preprocess UTF-8 encoded text into the range [0, 1] (or alternatively [-1, 1]) before it is given as input to a neural network?
I have been searching Google for this but can't find any information (I may be using the wrong term).
- Does that make sense?
- Isn't that how text is preprocessed for neural networks?
- Are there any alternatives?
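To make the question concrete, here is a minimal sketch of the naive scaling I have in mind: encode the text as UTF-8 bytes and scale each byte value into the target range. The function names are my own illustrations, not from any library.

```python
def text_to_unit_interval(text: str) -> list[float]:
    """Map each UTF-8 byte of `text` to a float in [0, 1]."""
    return [b / 255.0 for b in text.encode("utf-8")]

def text_to_signed_interval(text: str) -> list[float]:
    """Map each UTF-8 byte of `text` to a float in [-1, 1]."""
    return [2.0 * (b / 255.0) - 1.0 for b in text.encode("utf-8")]

# 'a' is byte 0x61 = 97, so the first component of "abc" is 97 / 255.
vec = text_to_unit_interval("abc")
```

This is exactly the kind of direct numeric mapping I am asking about; whether it makes sense as ANN input is the question.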
Update on November 2013
I long accepted Pete's answer as correct.
However, I now have serious doubts, mostly due to recent research I have been doing on Symbolic Knowledge and ANNs.
Dario Floreano and Claudio Mattiussi explain in their book that such processing is indeed possible, by using distributed encoding.
Indeed, a Google Scholar search turns up a plethora of neuroscience articles and papers on how distributed encoding is hypothesized to be used by brains to encode Symbolic Knowledge.
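As a toy contrast (my own illustration, not taken from the book): a local one-hot code gives each symbol its own unit and no notion of similarity, while a distributed code represents each symbol as a dense pattern over shared units, so similarity can emerge as distance between patterns. The random vectors below stand in for a trained representation.

```python
import numpy as np

symbols = ["cat", "dog", "car"]

# Local (one-hot) encoding: one unit per symbol, all symbols equidistant.
one_hot = {s: np.eye(len(symbols))[i] for i, s in enumerate(symbols)}

# Distributed encoding: each symbol is a dense pattern over 8 shared units.
# In practice these patterns would be learned, not sampled at random.
rng = np.random.default_rng(0)
distributed = {s: rng.normal(size=8) for s in symbols}
```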
Teuvo Kohonen, in his paper "Self Organizing Maps" explains:
> One might think that applying the neural adaptation laws to a symbol set (regarded as a set of vectorial variables) might create a topographic map that displays the "logical distances" between the symbols. However, there occurs a problem which lies in the different nature of symbols as compared with continuous data. For the latter, similarity always shows up in a natural way, as the metric differences between their continuous encodings. This is no longer true for discrete, symbolic items, such as words, for which no metric has been defined. It is in the very nature of a symbol that its meaning is dissociated from its encoding.
However, Kohonen did manage to deal with Symbolic Information in SOMs!
Furthermore, Prof. Dr. Alfred Ultsch, in his paper "The Integration of Neural Networks with Symbolic Knowledge Processing", deals exactly with how to process Symbolic Knowledge (such as text) in ANNs. Ultsch offers the following methodologies for processing Symbolic Knowledge: Neural Approximative Reasoning, Neural Unification, Introspection, and Integrated Knowledge Acquisition. That said, little information on these can be found in Google Scholar, or anywhere else for that matter.
Pete is right in his answer about semantics: semantics in ANNs are usually disconnected from the encoding. However, the following references provide insight into how researchers have trained RBMs to recognize similarity in the semantics of different word inputs. It should therefore not be impossible to capture semantics, but doing so would require a layered approach, or a secondary ANN.
Natural Language Processing With Subsymbolic Neural Networks, Risto Miikkulainen, 1997
Training Restricted Boltzmann Machines on Word Observations, G. E. Dahl, R. P. Adams, H. Larochelle, 2012
Update on January 2021
The field of NLP and Deep Learning has seen a resurgence of research in the years since I asked this question. There are now machine-learning models which address what I was trying to achieve, in many different ways.
For anyone arriving at this question wondering how to pre-process text for Deep Learning or Neural Networks: none of the helpful material on the topics below is academic, it is simple to understand, and it should get you started on solving similar tasks.
At the time I asked this question, RNNs, CNNs and VSMs were only just starting to be used; nowadays most Deep Learning frameworks ship with extensive NLP support. Hope the above helps.
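For completeness, here is a minimal sketch of the pipeline those frameworks wrap for you: tokenize text into integer ids, then look the ids up in a trainable embedding matrix. The tiny corpus, vocabulary, and dimensions are illustrative only.

```python
import numpy as np

corpus = ["the cat sat", "the dog sat"]

# Build an integer vocabulary from the corpus (sorted for determinism).
vocab = {w: i for i, w in enumerate(sorted({w for s in corpus for w in s.split()}))}

def tokenize(sentence: str) -> list[int]:
    """Map a whitespace-split sentence to integer token ids."""
    return [vocab[w] for w in sentence.split()]

# In a real model this matrix is a trainable layer; here it is random.
embedding_dim = 4
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), embedding_dim))

ids = tokenize("the cat sat")
inputs = embeddings[ids]  # shape (3, embedding_dim), fed to the network
```

This is the modern answer to my original question: instead of squeezing raw bytes into [0, 1], map discrete tokens to learned dense vectors.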