Domanda

I am creating an inverted index to get a dictionary of words with an associated list of the line numbers that the word appears on (starting the line numbers and a list of words that appear in a given cell within that line).

I have managed to get some code working for this, but I found dealing with adding to the arrays (the values in the dictionary) to be a little cumbersome and I wonder is there is a more efficient or more elegant way to handle this.

I am open to using arrays, collections or any other data type that can be easily searched to store the list of line numbers in the values of the dictionary. I have pasted a cut down version of my code to demonstrate the core problem below, the question is really just about the BuildInvertedIndex procedure, but the rest is included to try to make it easier to recreate the scenario:

Sub Test()
' minimum included here to demonstrate use of buildInvertedIndex procedure

    Dim vRange As Range
    Dim vDict As Dictionary

    Set vRange = ActiveSheet.Range("F2:F20585")
    Set vDict = New Dictionary

    BuildInvertedIndex vDict, vRange

    ' test values returned in dictionary (word: [line 1, ..., line n])
    Dim k As Variant, vCounter As Long
    vCounter = 0
    For Each k In vDict.Keys
        Debug.Print k & ": " & ArrayToString(vDict.Item(k))
        vCounter = vCounter + 1
        If vCounter >= 10 Then
            Exit For
        End If
    Next


End Sub


Sub BuildInvertedIndex(pDict As Dictionary, pRange As Range)

    Dim cell As Range
    Dim words As Variant, word As Variant, val As Variant
    Dim tmpArr() As Long
    Dim newLen As Long, i As Long

    ' loop through cells (one col wide so same as looping through lines)
    For Each cell In pRange.Cells

        ' loop through words in line
        words = Split(cell.Value)
        For Each word In words

            If Not pDict.exists(word) Then
                ' start line array with first row number
                pDict.Add word, Array(cell.Row())
            Else
                i = 0
                If Not InArray(cell.Row(), pDict.Item(word)) Then
                    newLen = UBound(pDict.Item(word)) + 1
                    ReDim tmpArr(newLen)
                    For Each val In tmpArr
                        If i < newLen Then
                            tmpArr(i) = pDict.Item(word)(i)
                        Else
                            tmpArr(i) = cell.Row()
                        End If
                        i = i + 1
                    Next val
                    pDict.Item(word) = tmpArr
                End If
            End If
        Next word
    Next cell

End Sub


Function ArrayToString(vArray As Variant, _
                       Optional vDelim As String = ",") As String
' only included to support test (be able to see what is in the arrays)

    Dim vDelimString As String
    Dim i As Long

    For i = LBound(vArray) To UBound(vArray)
        vDelimString = vDelimString & CStr(vArray(i)) & _
                       IIf(vCounter < UBound(vArray), vDelim, "")
    Next

    ArrayToString = vDelimString
End Function

To run this you will need values in column F of the active sheet (sentences), if you do not already have it you will also need to add a reference to the Microsoft Scripting Runtime in your VBA environment for the dictionary data type to be available (tools -> references -> Microsoft Scripting Runtime).

As you will see from the code this gets a bit messy where I have to insert a new line number into an existing array (that is stored as a value within the dictionary). As I do not know of a way to just extend this array (without clearing the existing values), I have used the variable tmpArr to create an array of the appropriate size and then copy the values one by one from the existing array in the dictionary and then add the current row number to the end. The temporary array is then used to replace the existing value for that key (the current word).

Any advice on this would be greatly appreciated.

È stato utile?

Soluzione

I am open to using arrays, collections or any other data type

As I see, using collection instead array would be much simplier:

Sub BuildInvertedIndex(pDict As Dictionary, pRange As Range)
    Dim cell As Range
    Dim words, word
    Dim i As Long    
    ' loop through cells (one col wide so same as looping through lines)
    For Each cell In pRange.Cells    
        ' loop through words in line
        words = Split(cell.Value)
        For Each word In words    
            If Not pDict.Exists(word) Then
                ' initialize collection
                pDict.Add word, New Collection
            End If
            'try to add to collection. If row is already in collecton, nothing happend. Storing key makes you sure there're only unique rows
            On Error Resume Next
            pDict.Item(word).Add Item:=cell.Row, Key:=CStr(cell.Row)
            On Error GoTo 0                
        Next word
    Next cell
End Sub

Next step, is to slightly modify ArrayToString to ColToString:

Function ColToString(vCol As Collection, _
                   Optional vDelim As String = ",") As String
' only included to support test (be able to see what is in the arrays)

    Dim vDelimString As String
    Dim i As Long

    For i = 1 To vCol.Count
        vDelimString = vDelimString & CStr(vCol.Item(i)) & _
                       IIf(i < vCol.Count, vDelim, "")
    Next

    ColToString = vDelimString
End Function

and the test subroutine (changed only one row - Debug.Print k & ": " & ColToString(vDict.Item(k)) and target range to "F2:F5"):

Sub Test()
' minimum included here to demonstrate use of buildInvertedIndex procedure

    Dim vRange As Range
    Dim vDict As Dictionary

    Set vRange = ActiveSheet.Range("F2:F5")
    Set vDict = New Dictionary

    BuildInvertedIndex vDict, vRange

    ' test values returned in dictionary (word: [line 1, ..., line n])
    Dim k As Variant, vCounter As Long
    vCounter = 0
    For Each k In vDict.Keys
        Debug.Print k & ": " & ColToString(vDict.Item(k))
        vCounter = vCounter + 1
        If vCounter >= 10 Then
            Exit For
        End If
    Next

    'clean up memory
    Set vDict = Nothing
End Sub

RESULT:

enter image description here


UPDATE:

to improve speed of your code you could store range in array (next approach work only with single-column range, but you could easily modify it):

Test sub:

Sub TestWirhArray()
' minimum included here to demonstrate use of buildInvertedIndex procedure

    Dim vRange As Range
    Dim vDict As Dictionary
    Dim myArr As Variant

    Set vDict = New Dictionary
    Set vRange = ActiveSheet.Range("F2:F20585")
    myArr = vRange.Value
    BuildInvertedIndexWithArr vDict, myArr, vRange.Row

    ' test values returned in dictionary (word: [line 1, ..., line n])
    Dim k As Variant, vCounter As Long
    vCounter = 0
    For Each k In vDict.Keys
        Debug.Print k & ": " & ColToString(vDict.Item(k))
        vCounter = vCounter + 1
        If vCounter >= 10 Then
            Exit For
        End If
    Next

    'clean up memory
    Set vDict = Nothing
End Sub

new version of BuildInvertedIndexWithArr:

Sub BuildInvertedIndexWithArr(pDict As Dictionary, pArr, firstRow As Long)
    Dim cell, words, word
    Dim i As Long, j As Long

    j = firstRow
    ' loop through cells (one col wide so same as looping through lines)
    For Each cell In pArr

        ' loop through words in line
        words = Split(cell)
        For Each word In words

            If Not pDict.exists(word) Then
                ' initialize collection
                pDict.Add word, New Collection
            End If

            On Error Resume Next
            pDict.Item(word).Add Item:=j, Key:=CStr(j)
            On Error GoTo 0

        Next word
        j = j + 1
    Next cell
End Sub
Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top