Question

I'm a little bit confused how to determine part-of-speech tagging in English. In this case, I assume that one word in English has one type, for example word "book" is recognized as NOUN, not as VERB. I want to recognize English sentences based on tenses. For example, "I sent the book" is recognized as past tense.

Description:

I have a number of database (*.txt) files: NounList.txt, verbList.txt, adjectiveList.txt, adverbList.txt, conjunctionList.txt, prepositionList.txt, articleList.txt. And if input words are available in the database, I assume that type of those words can be concluded. But, how to begin lookup in the databases? For example, "I sent the book": how to begin a search in the databases for every word, "I" as Noun, "sent" as verb, "the" as article, "book" as noun? Any better approach than searching every word in every database? I doubt that every databases has unique element.

I enclose my perspective here.

private List<string> ParseInput(String allInput)
{
    List<string> listSentence = new List<string>();

    char[] delimiter = ".?!;".ToCharArray();
    var sentences = allInput.Split(delimiter, StringSplitOptions.RemoveEmptyEntries).Select(s => s.Trim());

    foreach (var s in sentences)
        listSentence.Add(s);

        return listSentence;
}

private void tenseReviewMenu_Click(object sender, EventArgs e)
    {
        string allInput = rtbInput.Text;

        List<string> listWord = new List<string>();
        List<string> listSentence = new List<string>();

        HashSet<string> nounList = new HashSet<string>(getDBList("nounList.txt"));
        HashSet<string> verbList = new HashSet<string>(getDBList("verbList.txt"));
        HashSet<string> adjectiveList = new HashSet<string>(getDBList("adjectiveList.txt"));
        HashSet<string> adverbList = new HashSet<string>(getDBList("adverbList.txt"));

        char[] separator = new char[] { ' ', '\t', '\n', ',' etc... };         

        listSentence = ParseInput(allInput);

        foreach (string sentence in listSentence)
        {
            foreach (string word in sentence.Split(separator))
                if (word.Trim() != "")
                    listWord.Add(word);               
        }

        string testPOS = "";

        foreach (string word in listWord)
        {
            if (nounList.Contains(word.ToLowerInvariant()))
                testPOS += "noun ";
            else if (verbList.Contains(word.ToLowerInvariant()))
                testPOS += "verb ";
            else if (adjectiveList.Contains(word.ToLowerInvariant()))
                testPOS += "adj ";
            else if (adverbList.Contains(word.ToLowerInvariant()))
                testPOS += "adv ";

        }
        tbTest.Text = testPOS;
    }

POS tagging is my secondary explanation in my assignment. So I use a simple approach to determine POS tagging that is based on database. But, if there's a simpler approach: easy to use, easy to understand, easy to get pseudocode, easy to design... to determine POS tagging, please let me know.

Was it helpful?

Solution

I hope the pseudocode I present below proves helpful to you. If I find time, I'd also write some code for you.

This problem can be tackled by following the steps below:

  1. Create a dictionary of all the common sentence patterns in the English language. For example, Subject + Verb is an English pattern and all the sentences like I sleep, Dog barked and Ship will arrive match the S-V pattern. You can find a list of the most common english patterns here. Please note that for some time you may need to keep revising this dictionary to enhance the accuracy of your program.

  2. Try to fit the input sentence in one of the patterns in the dictionary you created above, for example, if the input sentence is Snakes, unlike elephants, are venomous., then your code must be able to find a match with the pattern: Subject, unlike AnotherSubject, Verb Object or S-,unlike-S`-, -V-O. To successfully perform this step, you may need to write code that's good at spotting Structure Markers like the word unlike, in this example sentence.

  3. When you have found a match for your input sentence in your pattern dictionary, you can easily assign a tag to each word in the sentence. For example, in our sentence, the word Snakes would be tagged as a subject, just like the word elephants, the word are would be tagged as a verb and finally the word venomous would be tagged as an object.

  4. Once you have assigned a unique tag to each of the words in your sentence, you can go lookup the word in the appropriate text files that you already have and determine whether or not your sentence is valid.

  5. If your sentence doesn't match any sentence pattern, then you have two options:

    a) Add the pattern of this unrecognized sentence in your pattern dictionary if it is a valid English sentence.

    b) Or, discard the input sentence as an invalid English sentence.

Things like what you're trying to achieve are best solved using machine learning techniques so that the system can learn any new patterns. So, you may want to include a trainer system that would add a new pattern to your pattern dictionary whenever it finds a valid English sentence not matching any of the existing patterns. I haven't thought much about how this can be done, but for now, you may manually revise your Sentence Pattern dictionary.

I'd be glad to hear your opinion about this pseudocode and would be available to brainstorm it further.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top