Question

I am trying to use a Perceptron to perform supervised classification and thereby do POS tagging of a sentence. For now I am assuming that the tag of each word is independent of the others (i.e. I am using just the word itself as a feature). I am fairly new to machine learning algorithms, so I am unable to figure out how to represent the feature function for each word.

I have a training set of 100 sentences, where each word is given a particular tag (say N, V, J (adjective), and so on). For instance,

Jack(N) and(&) Jill(N) went(V) to(PRP) Peru(N)

where the tags are in parentheses. Let's say I have a total of 10 possible tags. Now my question is: what does the feature vector for the word Jack look like?

I am very much interested in implementing it as a vector, since my code will match the notation better. Once I figure out what the feature function looks like, I will be able to implement the Perceptron algorithm!

Also, say I want to add features like (a) is the first letter capitalized? (b) is the word hyphenated? etc. How do I incorporate those into my feature vector?

Intuitively I can see that the vector needs to have only binary values, but I am unable to proceed beyond that.

Kindly try to explain with concrete examples if possible!


Solution

Use a dictionary which maps words to numeric ids. If your vocabulary has 10,000 items in it, your dictionary maps each word to a number in the range 0-9999 and every word is represented as a binary vector of length 10,000 where only one element is active: that corresponding to the word's id in the dictionary.
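As a minimal sketch of this in Python (the toy vocabulary here is invented, not from your data):

```python
# One-hot word encoding via a dictionary of word -> numeric id (toy vocabulary).
vocab = ["jack", "and", "jill", "went", "to", "peru"]
word_to_id = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Binary vector of length len(vocab) with a single 1 at the word's id."""
    vec = [0] * len(vocab)
    idx = word_to_id.get(word.lower())
    if idx is not None:  # unseen words simply get an all-zero vector
        vec[idx] = 1
    return vec

print(one_hot("Jack"))  # [1, 0, 0, 0, 0, 0]
```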

If you want extra features besides word ids, you can just tack them onto the end of the feature vector: that is, feature 10,000 onwards can be the capitalisation feature, the previous-tag feature (which will need binary coding as above), and so on.
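For instance, reusing the one_hot encoder from the sketch above, appending two extra binary features might look like this (the particular features are just illustrative):

```python
def features(word):
    vec = one_hot(word)                        # features 0..len(vocab)-1: word id
    vec.append(1 if word[0].isupper() else 0)  # next feature: is capitalised?
    vec.append(1 if "-" in word else 0)        # next feature: is hyphenated?
    return vec

print(features("Jack"))  # [1, 0, 0, 0, 0, 0, 1, 0]
```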

As a final point, POS tagging is an instance of a structured prediction problem, rather than a series of independent classifications. If this becomes more than an academic exercise, you'll want to move to the structured perceptron, or another structured learning method like a CRF or struct-SVM.

EDIT: a simple example

Imagine I have a five-word vocabulary, {the, cat, sat, on, mat}, and a reduced tagset {DET, N, V, PREP}. My sentence is thus:

(The,DET) (cat,N) (sat,V) (on,PREP) (the,DET) (mat,N).

Now I want a feature vector for each word, from which I would like to be able to predict the tag. I am going to use features 0-4 as my word id indicator functions, so that feature 0 corresponds to 'the', feature 1 to 'cat' and so on. This gives me the following feature vectors (with the intended 'class' or tag assignment following the ->):

[1 0 0 0 0] -> DET
[0 1 0 0 0] -> N
[0 0 1 0 0] -> V
...

I could treat these as instances and apply my learning algorithm of choice to this task. However, word id functions alone won't buy me much, so I decide to incorporate some HMM-like intuition into my classifications by also adding feature functions which indicate what the previous tag was. I use features 5-8 as indicators for this, with 5 corresponding to DET, 6 to N, and so on. Now I have the following:

[1 0 0 0 0 0 0 0 0] -> DET (because this is the first word, there's no previous tag)
[0 1 0 0 0 1 0 0 0] -> N
[0 0 1 0 0 0 1 0 0] -> V
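One possible way to build these vectors programmatically (a sketch, using the toy vocabulary and tagset above):

```python
vocab = ["the", "cat", "sat", "on", "mat"]
tags = ["DET", "N", "V", "PREP"]
word_id = {w: i for i, w in enumerate(vocab)}
tag_id = {t: i for i, t in enumerate(tags)}

def features(word, prev_tag):
    """Features 0-4: word id indicators; features 5-8: previous-tag indicators."""
    vec = [0] * (len(vocab) + len(tags))
    vec[word_id[word.lower()]] = 1
    if prev_tag is not None:  # the first word has no previous tag
        vec[len(vocab) + tag_id[prev_tag]] = 1
    return vec

sentence = [("The", "DET"), ("cat", "N"), ("sat", "V"),
            ("on", "PREP"), ("the", "DET"), ("mat", "N")]
prev = None
for word, tag in sentence:
    print(features(word, prev), "->", tag)
    prev = tag
```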

Now I can keep adding features to my heart's content: for example, feature 9 could indicate whether the word is capitalised, feature 10 could indicate whether the word matches a list of known proper nouns, and so on. If you read a little about structured prediction tasks and methods, you should see why using a model customised for this task (the easiest is an HMM, but I'd want to progress to a CRF/structured perceptron/struct-SVM for state-of-the-art performance) is superior to treating this as a bunch of independent decisions.
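To connect this back to the original question, a bare-bones multiclass perceptron over these vectors might look like the sketch below (reusing features, sentence, vocab, tags and tag_id from the previous sketch). Note that it trains on the gold previous tag, which is exactly the independence simplification the structured methods above remove; at prediction time you'd have to feed in your own previous prediction instead:

```python
# A bare-bones multiclass perceptron: one weight vector per tag (sketch only).
import random

def train_perceptron(examples, n_features, n_classes, epochs=10):
    W = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        random.shuffle(examples)
        for vec, gold in examples:
            # score each tag by the dot product of its weights with the features
            scores = [sum(w * x for w, x in zip(W[c], vec)) for c in range(n_classes)]
            pred = max(range(n_classes), key=lambda c: scores[c])
            if pred != gold:  # standard perceptron update on a mistake
                for i, x in enumerate(vec):
                    W[gold][i] += x
                    W[pred][i] -= x
    return W

examples = []
prev = None
for word, tag in sentence:
    examples.append((features(word, prev), tag_id[tag]))
    prev = tag

W = train_perceptron(examples, n_features=len(vocab) + len(tags),
                     n_classes=len(tags))
```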

Licensed under: CC-BY-SA with attribution