Question

I have been asked to build what I would call an index.

Basically, the user should be able to write some text in an empty text box. At the click of a button, the output should show an alphabetically sorted list of the words entered, along with the line numbers on which each word appears.

So, for example, the input below should produce the index that follows it:

One fish
two fish
red fish
blue fish.

Black fish
blue fish
old fish
new fish.

This one has
a little star.

This one has a little car.
Say! What a lot
of fish there are.

A 12, 14, 15
ARE 16
BLACK 6
BLUE 4, 7
CAR 14
FISH 1, 2, 3, 4,
HAS 11, 14
LITTLE 12, 14
LOT 15
NEW 9
OF 16
OLD 8
ONE 1, 11, 14
RED 3
SAY 15
STAR 12
THERE 16
THIS 11, 14
TWO 2
WHAT 15

This example comes from a Java document on creating an index. I followed it through, and its requirements are the same as mine, just in another language.

I'm working on paper at the minute to work out an algorithm but I'm getting a little frustrated with my efforts!

A couple more requirements:

The maximum number of line numbers listed per word is 4, so even if a word occurs on 10 different lines, only 4 of them should be referenced.

Punctuation must be ignored, so the characters !.,? must be removed from words.

Case must be normalised, so a word spelt HeLlO must become hello.

Thank you in advance for the help.


Solution

A Hashtable does not enumerate its entries in any guaranteed order, so if you need the words displayed alphabetically, as in your example, change from a Hashtable to a SortedList, which keeps its entries sorted by key.

    Dim hshResults As New Hashtable()

    Dim lstLinesOfText As List(Of String) = IO.File.ReadAllLines("C:\YourFile.txt").ToList()

    Dim intLineCursor As Integer = 0

    For Each strLine As String In lstLinesOfText

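        'Split on spaces, dropping empty entries so blank lines add no empty "word".
        Dim lstWords As List(Of String) = strLine.Split(New Char() {" "c}, StringSplitOptions.RemoveEmptyEntries).ToList()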

        For Each strWord As String In lstWords

            ProcessWord(strWord, hshResults, intLineCursor)

        Next

        intLineCursor += 1

    Next

    Dim strOutput As String = String.Empty

    For Each o As DictionaryEntry In hshResults

        strOutput += CStr(o.Key) & " "

        Dim lstLinesWhereWordIsFound As List(Of Integer) = CType(o.Value, List(Of Integer))

        For Each i As Integer In lstLinesWhereWordIsFound

            strOutput += CStr(i) & " "

        Next

        'Some cleanup of extra spaces.
        strOutput = Trim(strOutput) & ControlChars.NewLine

    Next

Private Sub ProcessWord(ByVal strWord As String, ByRef hshResults As Hashtable, ByVal intLineIndex As Integer)

    Dim lstLinesWhereWordIsFound As List(Of Integer) = (From o As DictionaryEntry In hshResults _
                                                        Where CStr(o.Key) = strWord _
                                                        Select CType(o.Value, List(Of Integer))).FirstOrDefault()

    If lstLinesWhereWordIsFound Is Nothing Then

        'Add this word.
        Dim lstNewHashTableValue As New List(Of Integer)
        lstNewHashTableValue.Add(intLineIndex + 1) 'Indexes in the programming world start at 0.

        hshResults.Add(CObj(strWord), CObj(lstNewHashTableValue))

    Else

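        'Add the line number for this word, keeping at most 4 line numbers per word.
        If lstLinesWhereWordIsFound.Count < 4 Then

            'Make sure we're not duplicating a line number for this word.
            'Stored line numbers are 1-based, so compare against intLineIndex + 1.
            If (From i As Integer In lstLinesWhereWordIsFound _
                Where i = intLineIndex + 1).Count = 0 Then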

                lstLinesWhereWordIsFound.Add(intLineIndex + 1)

                hshResults(strWord) = CObj(lstLinesWhereWordIsFound)

            End If

        End If

    End If

End Sub
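
The question also asks for punctuation to be dropped and case to be normalised, and the code above passes each word through unchanged. Below is a minimal sketch of one way to handle that. NormalizeWord is a hypothetical helper name, not something from the question, and keeping only letters and digits is an assumption that goes a little further than the four characters the question lists.

```vbnet
Imports System
Imports System.Linq

Module WordNormalizer

    ' Hypothetical helper: keeps only letters and digits, then lower-cases the result.
    Function NormalizeWord(ByVal strWord As String) As String
        Dim chars As Char() = strWord.Where(Function(c) Char.IsLetterOrDigit(c)).ToArray()
        Return New String(chars).ToLowerInvariant()
    End Function

End Module
```

With this in place, NormalizeWord("Say!") returns "say". The result could be passed to ProcessWord in place of strWord, skipping any word that comes back empty. If the upper-case form shown in the sample output is preferred, ToUpperInvariant could be used instead.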

Edit: Explanation of Code

First, I instantiate a Hashtable to store the words and the line numbers on which they are found. Then I read each line of the text file into a List(Of String) object. Iterating through the lines of the text file, I split each line into its words, stored in another List(Of String) variable, using the Split method. I send each word of the line through a method (ProcessWord) that updates the Hashtable appropriately. Finally, I iterate through all of the key/value pairs in the Hashtable to generate the desired output.

ProcessWord first determines whether the word already exists in the Hashtable. If it does not, it adds the word and the line number. If it does, it checks that the word has fewer than 4 line numbers recorded (the limit requested in your question) and that the current line number is not already in the list (in case a word appears on the same line more than once). If both conditions are met, it adds the line number and updates the Hashtable.
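
For comparison, here is a compact sketch of the same steps using a generic SortedDictionary, which enumerates its keys in sorted order and so gives the alphabetical listing directly. It is an illustration under assumptions rather than a drop-in replacement: BuildIndex and maxRefs are hypothetical names, the input is a small in-memory array instead of a file, punctuation handling is limited to trimming the four characters named in the question from the ends of each word, and Visual Basic 2010 or later is assumed for the array literal and implicit line continuation.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq

Module IndexSketch

    ' Hypothetical helper: maps each word to the (1-based) lines it appears on,
    ' keeping at most maxRefs line numbers per word.
    Function BuildIndex(ByVal lines As IEnumerable(Of String),
                        ByVal maxRefs As Integer) As SortedDictionary(Of String, List(Of Integer))

        ' SortedDictionary enumerates its keys in sorted order.
        Dim index As New SortedDictionary(Of String, List(Of Integer))(StringComparer.Ordinal)
        Dim lineNumber As Integer = 0

        For Each line As String In lines
            lineNumber += 1
            For Each rawWord As String In line.Split({" "c}, StringSplitOptions.RemoveEmptyEntries)
                ' Trim the punctuation named in the question and normalise case.
                Dim word As String = rawWord.Trim("!"c, "."c, ","c, "?"c).ToLowerInvariant()
                If word.Length = 0 Then Continue For

                Dim refs As List(Of Integer) = Nothing
                If Not index.TryGetValue(word, refs) Then
                    refs = New List(Of Integer)()
                    index.Add(word, refs)
                End If

                ' Skip repeats on the same line and stop once the cap is reached.
                If refs.Count < maxRefs AndAlso Not refs.Contains(lineNumber) Then
                    refs.Add(lineNumber)
                End If
            Next
        Next

        Return index
    End Function

    Sub Main()
        Dim lines As String() = {"One fish", "two fish", "red fish", "blue fish."}
        For Each entry As KeyValuePair(Of String, List(Of Integer)) In BuildIndex(lines, 4)
            Dim refs As IEnumerable(Of String) = entry.Value.Select(Function(n) n.ToString())
            Console.WriteLine(entry.Key & " " & String.Join(", ", refs))
        Next
    End Sub

End Module
```

Run on the four sample lines, Main prints "blue 4", "fish 1, 2, 3, 4", "one 1", "red 3" and "two 2", one entry per line. Blank lines in the input still advance the line counter, as they do in the original loop.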

Licensed under: CC-BY-SA with attribution