Natural Language Processing - Converting Text Features Into Feature Vectors

Question

I'm not sure what values your attributes can take on, but perhaps this example will help you:

Suppose we are conducting a supervised learning experiment to try to determine if a period marks the end of a sentence or not, EOS and NEOS respectively. The training data came from normal sentences in a paragraph style format, but were transformed to the following vector model:

Column 1: Class: End-of-Sentence or Not-End-of-Sentence
Columns 2-8: The +/- 3 words surrounding the period in question
Columns 9,10: The number of words to the left/right, respectively, of the period before the next reliable sentence delimiter (e.g. ?, ! or a paragraph marker).
Column 11: The number of spaces following the period.

Of course, this is not a very complicated problem to solve, but it's a nice little introduction to Weka. We can't just use the words as features (really high dimensional space), but we can take their POS (part of speech) tags. We can also extract the length of words, whether or not the word was capitalized, etc.

So, you could feed anything as testing data, so long as you're able to transform it into the vector model above and extract the features used in the .arff.

The following (very small portion of) .arff file was used for determining whether a period in a sentence marked the end of or not:

@relation period

@attribute minus_three {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute minus_three_length real
@attribute minus_three_case {'UC','LC','NA'}
@attribute minus_two {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute minus_two_length real
@attribute minus_two_case {'UC','LC','NA'}
@attribute minus_one {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute minus_one_length real
@attribute minus_one_case {'UC','LC','NA'}
@attribute plus_one {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute plus_one_length real
@attribute plus_one_case {'UC','LC','NA'}
@attribute plus_two {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute plus_two_length real
@attribute plus_two_case {'UC','LC','NA'}
@attribute plus_three {'CC', 'CD', 'DT', 'FW', 'IN', 'JJ', 'JJR', 'JJS', 'LS', 'MD', 'NN', 'NNPS', 'NNS', 'NP', 'PDT', 'PRP', 'PRP$', 'RB', 'RBR', 'RBS', 'RP', 'UH', 'VB', 'VBD', 'VBG', 'VBN', 'VBP', 'VBZ', 'WDT', 'WP','WRB', 'NUM', 'PUNC', 'NEND', 'RAND'}
@attribute plus_three_length real
@attribute plus_three_case {'UC','LC','NA'}
@attribute left_before_reliable real
@attribute right_before_reliable real
@attribute spaces_follow_period real
@attribute class  {'EOS','NEOS'}

@data

VBP, 2, LC,NP, 4, UC,NN, 1, UC,NP, 6, UC,NEND, 1, NA,NN, 7, LC,31,47,1,NEOS
NNS, 10, LC,RBR, 4, LC,VBN, 5, LC,?, 3, NA,NP, 6, UC,NP, 6, UC,93,0,0,EOS
VBD, 4, LC,RB, 2, LC,RP, 4, LC,CC, 3, UC,UH, 5, LC,VBP, 2, LC,19,17,2,EOS

As you can see, each attribute can take on whatever you want it to:

real denotes a real number
I made up LC and UC to denote upper case and lower case, respectively
Most of the other values are POS tags

You need to figure out exactly what your features are, and what values you'll use to represent/classify them. Then, you need to transform your data into the format defined by your .arff.

To touch on your punctuation question, let's suppose that we have sentences that all end in . or ?. You can have an attribute called punc, which takes two values:

@attribute punc {'p','q'}

I didn't use ? because that is what is (conventionally) assigned when a data point is missing. Our you could have boolean attributes that indicates whether a character or what have you was present (with 0, 1 or false, true). Another example, but for quality:

@attribute quality {'great','good', 'poor'}

How you determine said classification is up to you, but the above should get you started. Good luck.