Create document-term matrix from dictionary
12-06-2021
Question
I'm trying to pre-process a text file in which each line lists the bigrams of one document together with their frequencies in that document. Here is an example line:
i_like 1 you_know 2 .... not_good 1
I managed to create the dictionary from the whole corpus. Now I want to read the corpus line by line and, using the dictionary, create the document-term matrix so that each element (i,j) of the matrix is the frequency of term "j" in document "i".
Solution
Create a function that generates an integer index for each word using a dictionary:
// Maps each word to its column index in the matrix.
Dictionary<string, int> m_WordIndexes = new Dictionary<string, int>();
int GetWordIndex(string word)
{
    int result;
    if (!m_WordIndexes.TryGetValue(word, out result)) {
        // First occurrence of this word: assign the next free index.
        result = m_WordIndexes.Count;
        m_WordIndexes.Add(word, result);
    }
    return result;
}
The result matrix is:
List<List<int>> m_Matrix = new List<List<int>>();
Processing each line of the text file generates one row of the matrix:
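List<int> ProcessLine(string line)
{
    List<int> result = new List<int>();
    // Tokens alternate: word, number of occurrences, word, number of occurrences, ...
    string[] tokens = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
    for (int i = 0; i + 1 < tokens.Length; i += 2) {
        string word = tokens[i];
        int numberOfOccurences = int.Parse(tokens[i + 1]);
        int index = GetWordIndex(word);
        // Pad with zeros until position "index" exists in the row.
        while (index >= result.Count) {
            result.Add(0);
        }
        // Store the count in the word's column.
        result[index] = numberOfOccurences;
    }
    return result;
}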
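You read the text file one line at a time, calling ProcessLine() on each line and adding the resulting list to m_Matrix.
Because the dictionary grows as new words appear, rows can have different lengths; any entry missing at the end of a row counts as zero.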
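Putting the pieces together, a minimal sketch could look like the following. The class name DocumentTermMatrixBuilder and its methods AddLine, AddAll and ToArray are assumptions made up for this example, not part of the answer above. It also assumes that tokens are separated by whitespace and that every count is a valid integer. Once all lines are read, ToArray copies the rows of unequal length into a rectangular array, leaving missing entries at zero.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper class; the name and API are assumptions for this sketch.
public class DocumentTermMatrixBuilder
{
    private readonly Dictionary<string, int> m_WordIndexes = new Dictionary<string, int>();
    private readonly List<List<int>> m_Matrix = new List<List<int>>();

    private int GetWordIndex(string word)
    {
        int result;
        if (!m_WordIndexes.TryGetValue(word, out result)) {
            // First occurrence of this word: assign the next free index.
            result = m_WordIndexes.Count;
            m_WordIndexes.Add(word, result);
        }
        return result;
    }

    // Parses one "word count word count ..." line into a row and stores it.
    public void AddLine(string line)
    {
        List<int> row = new List<int>();
        // A null separator array makes Split break on any whitespace.
        string[] tokens = line.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i + 1 < tokens.Length; i += 2) {
            int index = GetWordIndex(tokens[i]);
            // Pad with zeros until position "index" exists in the row.
            while (index >= row.Count) {
                row.Add(0);
            }
            row[index] = int.Parse(tokens[i + 1]);
        }
        m_Matrix.Add(row);
    }

    // Reads every line from the reader, treating each line as one document.
    public void AddAll(TextReader reader)
    {
        string line;
        while ((line = reader.ReadLine()) != null) {
            AddLine(line);
        }
    }

    // Copies the rows into a documents x terms array; missing entries stay 0.
    public int[,] ToArray()
    {
        int[,] matrix = new int[m_Matrix.Count, m_WordIndexes.Count];
        for (int i = 0; i < m_Matrix.Count; i++) {
            for (int j = 0; j < m_Matrix[i].Count; j++) {
                matrix[i, j] = m_Matrix[i][j];
            }
        }
        return matrix;
    }
}
```

To feed it a file, you could pass a StreamReader opened on the corpus to AddAll and then call ToArray. Note that a blank line in the input yields an all-zero row in this sketch.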