Create document-term matrix from dictionary

https://stackoverflow.com/questions/10897580

12-06-2021

Question

I'm trying to pre-process a text file in which each line lists the bigrams of one document together with their frequencies in that document. Here is an example line:

i_like 1 you_know 2 .... not_good 1

I have already built the dictionary from the whole corpus. Now I want to read the corpus line by line and, using the dictionary, create the document-term matrix, so that element (i,j) of the matrix is the frequency of term "j" in document "i".

Solution

Create a function that generates an integer index for each word using a dictionary:

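// Maps each word to its column index in the matrix.
Dictionary<string, int> m_WordIndexes = new Dictionary<string, int>();

// Returns the column index of a word; a word not seen before gets the next free index.
int GetWordIndex(string word)
{
  int result;
  if (!m_WordIndexes.TryGetValue(word, out result)) {
    result = m_WordIndexes.Count;
    m_WordIndexes.Add(word, result);
  }
  return result;
}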
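Since the question mentions that the dictionary for the whole corpus already exists, the index map could also be filled once up front, which fixes the number of columns before any line is read. A minimal sketch, assuming the vocabulary is available as a sequence of strings; the VocabularyIndex class and the BuildWordIndexes method are hypothetical names, not part of the original answer:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, for illustration only.
public static class VocabularyIndex
{
    // Assigns consecutive column indices to the words of an existing vocabulary,
    // in the order they are enumerated. A repeated word keeps its first index.
    public static Dictionary<string, int> BuildWordIndexes(IEnumerable<string> vocabulary)
    {
        var wordIndexes = new Dictionary<string, int>();
        foreach (string word in vocabulary)
        {
            if (!wordIndexes.ContainsKey(word))
            {
                wordIndexes.Add(word, wordIndexes.Count);
            }
        }
        return wordIndexes;
    }
}
```

With the sample bigrams from the question, BuildWordIndexes(new[] { "i_like", "you_know", "not_good" }) maps i_like to 0, you_know to 1 and not_good to 2, and every row of the matrix can then be allocated with wordIndexes.Count entries.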

The resulting matrix is a list of rows, one row per document:

List<List<int>> m_Matrix = new List<List<int>>();

Processing each line of the text file generates one row of the matrix:

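List<int> ProcessLine(string line)
{
  List<int> result = new List<int>();
  // Assumes whitespace-separated word / count pairs, as in the example line.
  string[] tokens = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
  for (int i = 0; i + 1 < tokens.Length; i += 2) {
    string word = tokens[i];
    int numberOfOccurrences = int.Parse(tokens[i + 1]);
    int index = GetWordIndex(word);
    // Pad with zeros until position index exists in the row.
    while (index >= result.Count) {
      result.Add(0);
    }
    // Assign instead of Insert: Insert would shift the counts already stored.
    result[index] = numberOfOccurrences;
  }
  return result;
}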

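You read the text file one line at a time, calling ProcessLine() on each line and adding the resulting list to m_Matrix. Rows can have different lengths, because each row extends only as far as the highest word index it contains; treat any missing trailing entries as zero.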
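Putting the steps together, the loop could look like the following sketch. It assumes whitespace-separated word / count pairs as in the example line; the DocumentTermMatrixBuilder class and its Build and ToRectangular methods are hypothetical names introduced here for illustration. ToRectangular copies the rows into a rectangular array sized to the final dictionary, so that all rows end up with the same number of columns.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical wrapper around the steps above, for illustration only.
public class DocumentTermMatrixBuilder
{
    private readonly Dictionary<string, int> m_WordIndexes = new Dictionary<string, int>();
    private readonly List<List<int>> m_Matrix = new List<List<int>>();

    private int GetWordIndex(string word)
    {
        int result;
        if (!m_WordIndexes.TryGetValue(word, out result))
        {
            result = m_WordIndexes.Count;
            m_WordIndexes.Add(word, result);
        }
        return result;
    }

    // Assumes the line alternates between a word and its integer count.
    private List<int> ProcessLine(string line)
    {
        var result = new List<int>();
        string[] tokens = line.Split(new[] { ' ', '\t' }, StringSplitOptions.RemoveEmptyEntries);
        for (int i = 0; i + 1 < tokens.Length; i += 2)
        {
            int index = GetWordIndex(tokens[i]);
            while (index >= result.Count)
            {
                result.Add(0);
            }
            result[index] = int.Parse(tokens[i + 1]);
        }
        return result;
    }

    // Adds one row per input line (one line per document).
    public void Build(IEnumerable<string> lines)
    {
        foreach (string line in lines)
        {
            m_Matrix.Add(ProcessLine(line));
        }
    }

    // Copies the ragged rows into a rectangular array; entries never set stay 0.
    public int[,] ToRectangular()
    {
        var matrix = new int[m_Matrix.Count, m_WordIndexes.Count];
        for (int i = 0; i < m_Matrix.Count; i++)
        {
            for (int j = 0; j < m_Matrix[i].Count; j++)
            {
                matrix[i, j] = m_Matrix[i][j];
            }
        }
        return matrix;
    }
}
```

For a corpus stored on disk, the lines could be supplied with File.ReadLines from System.IO, which enumerates the file lazily, for example builder.Build(File.ReadLines(path)) with path pointing at your own file.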

Licensed under: CC-BY-SA with attribution
