Question

This script compiles without errors on play.golang.org: http://play.golang.org/p/Hlr-IAc_1f

But when I run it on my machine, it takes much longer than I expect, with nothing happening in the terminal.

What I am trying to build is a part-of-speech (POS) tagger.

I think the slowest part is loading lexicon.txt into a map and then comparing each word with every word there to see if it has already been tagged in the lexicon. The lexicon only contains verbs. But doesn't every word need to be checked to see if it is a verb?

The larger problem is that I don't know how to determine whether a word is a verb with an easy heuristic, as I can for adverbs, adjectives, etc.


Solution 2

You've got a large array argument in this function:

func stringInArray(a string, list [214]string) bool {
    for _, b := range list {
        if b == a {
            return true
        }
    }
    return false
}

The array of stopwords gets copied each time you call this function.
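To make the copying concrete, here is a minimal sketch (not from the original code) showing that Go passes arrays by value but slices by reference to a shared backing array:

```go
package main

import "fmt"

// takesArray receives a copy of the entire array: Go arrays are values.
func takesArray(a [3]int) {
	a[0] = 99 // modifies the copy only
}

// takesSlice receives a small slice header; the backing array is shared.
func takesSlice(s []int) {
	s[0] = 99 // visible to the caller
}

func main() {
	arr := [3]int{1, 2, 3}
	takesArray(arr)
	fmt.Println(arr[0]) // still 1: the whole array was copied

	sl := []int{1, 2, 3}
	takesSlice(sl)
	fmt.Println(sl[0]) // now 99: the slice shares its backing array
}
```

With a 214-element string array, that per-call copy is exactly the overhead described above.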

In Go, you should generally use slices rather than arrays. Change the parameter to list []string and define stopWords as a slice rather than an array:

stopWords := []string{
    "and", "or", ...
}

Probably an even better approach would be to build a map of the stopWords:

isStopWord := map[string]bool{}
for _, sw := range stopWords {
    isStopWord[sw] = true
}

and then you can check if a word is a stopword quickly:

if isStopWord[word] { ... }
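Putting the pieces together, a small runnable sketch of this approach might look like the following (the function name buildStopWordSet and the sample words are illustrative, not from the original code):

```go
package main

import "fmt"

// buildStopWordSet converts a slice of stopwords into a set,
// so membership checks are O(1) map lookups instead of a linear scan.
func buildStopWordSet(words []string) map[string]bool {
	set := make(map[string]bool, len(words))
	for _, w := range words {
		set[w] = true
	}
	return set
}

func main() {
	isStopWord := buildStopWordSet([]string{"and", "or", "the", "a"})
	for _, word := range []string{"the", "fish", "and", "swims"} {
		if isStopWord[word] {
			fmt.Println(word, "-> stopword")
		} else {
			fmt.Println(word, "-> keep")
		}
	}
}
```

The set is built once up front; every lookup afterwards is constant time, which matters when you check every word of the input text.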

OTHER TIPS

(Quoting):

I don't know how to determine if a word is a verb with an easy heuristic like adverbs, adjectives, etc.

I can't speak to any issues in your Go implementation, but I'll address the larger problem of POS tagging in general. It sounds like you're attempting to build a rule-based unigram tagger. To elaborate a bit on those terms:

  • "unigram" means you're considering each word in the sentence separately. Note that a unigram tagger is inherently limited, in that it cannot disambiguate words which can take on multiple POS tags. E.g., should you tag 'fish' as a noun or a verb? Is 'last' a verb or an adverb?
  • "rule-based" means exactly what it sounds like: a set of rules to determine the tag for each word. Rule-based tagging is limited in a different way - it requires considerable development effort to assemble a ruleset that will handle a reasonable portion of the ambiguity in common language. This effort might be appropriate if you're working in a language for which we don't have good training resources, but in most common languages, we now have enough tagged text to train high-accuracy tagging models.
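The unigram limitation above can be sketched in a few lines of Go. This is a hypothetical illustration (the lexicon entries and the NOUN fallback are invented for the example): because each word is tagged in isolation, "fish" receives the same tag no matter what surrounds it.

```go
package main

import "fmt"

// unigramTag looks each word up in a fixed lexicon and falls back to a
// default tag. Each word is tagged in isolation, so an ambiguous word
// like "fish" always gets the same tag regardless of context.
func unigramTag(words []string, lexicon map[string]string) []string {
	tags := make([]string, len(words))
	for i, w := range words {
		if tag, ok := lexicon[w]; ok {
			tags[i] = tag
		} else {
			tags[i] = "NOUN" // naive fallback for unknown words
		}
	}
	return tags
}

func main() {
	lexicon := map[string]string{"we": "PRON", "ate": "VERB"}
	fmt.Println(unigramTag([]string{"we", "fish"}, lexicon))  // fish -> NOUN
	fmt.Println(unigramTag([]string{"ate", "fish"}, lexicon)) // fish -> NOUN again
}
```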

State-of-the-art for POS tagging is above 97% accuracy on well-formed newswire text (accuracy on less formal genres is naturally lower). A rule-based tagger will probably perform considerably worse (you'll have to determine the accuracy level needed to meet your requirements). If you want to continue down the rule-based path, I'd recommend reading this tutorial. Its code is in Haskell, but it will help you learn the concepts and issues in rule-based tagging.

That said, I'd strongly recommend you look at other tagging methods. I mentioned the weaknesses of unigram tagging. Related approaches are 'bigram', meaning that we consider the previous word when tagging word n, and 'trigram' (usually the previous two words, or the previous word, the current word, and the following word); more generally, 'n-gram' refers to considering a sequence of n words (often, a sliding window around the word we're currently tagging). That context can help us disambiguate 'fish', 'last', 'flies', etc.

E.g., in

We fish

we probably want to tag fish as a verb, whereas in

ate fish

it's certainly a noun.
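A toy bigram tagger makes this disambiguation concrete. Everything here is a hand-built sketch, not a trained model: the unigram and bigram tables are invented for the two examples above, whereas a real tagger would learn them from tagged text.

```go
package main

import "fmt"

// bigramTag chooses a tag from the (previous tag, word) pair when it can,
// and falls back to a unigram lexicon, then to NOUN. The tables below are
// hand-written toy data, standing in for counts learned from a corpus.
func bigramTag(words []string) []string {
	unigram := map[string]string{"we": "PRON", "ate": "VERB"}
	// After a pronoun, "fish" is more likely a verb ("We fish");
	// after a verb, it is more likely a noun ("ate fish").
	bigram := map[string]string{
		"PRON fish": "VERB",
		"VERB fish": "NOUN",
	}
	tags := make([]string, len(words))
	prev := "START"
	for i, w := range words {
		if t, ok := bigram[prev+" "+w]; ok {
			tags[i] = t
		} else if t, ok := unigram[w]; ok {
			tags[i] = t
		} else {
			tags[i] = "NOUN"
		}
		prev = tags[i]
	}
	return tags
}

func main() {
	fmt.Println(bigramTag([]string{"we", "fish"}))  // context makes fish a VERB
	fmt.Println(bigramTag([]string{"ate", "fish"})) // context makes fish a NOUN
}
```

Unlike the unigram version, the same word now receives different tags depending on what precedes it.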

The NLTK tutorial might be a good reference here. A solid n-gram tagger should get you above 90% accuracy; likely above 95% (again, on newswire text).

More sophisticated methods (known as 'structured inference') consider the entire tag sequence as a whole. That is, instead of trying to predict the most probable tag for each word separately, they attempt to predict the most probable sequence of tags for the entire input sequence. Structured inference is of course more difficult to implement and train, but will usually improve accuracy vs. n-gram approaches. If you want to read up on this area, I suggest Sutton and McCallum's excellent introduction.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow