How to match multi-line strings in Ruby using Regular Expressions to be used in an Inverted Index?

StackOverflow https://stackoverflow.com/questions/19148795

  •  30-06-2022

Assignment instructions: http://pastebin.com/pxJS4gfR

Objective: Take a collection of documents and generate its inverted index.

My plan

  1. Grab the relevant strings from the collections file
  2. Tokenize them and place them into a Hash to be used later.

I am using the regular expression /\.I(.*?)\.B/m to grab the text I need from the collections file, as shown here: http://rubular.com/r/mOpfuvRT12

Edit: I have used mudasobwa's suggestion

content = File.read('test.txt')
# deal with content
content.scan(/\.T(.*?)\.B/m) { |mtch| 
  puts mtch 
}

This grabs the text I need. However, I need to place the grabbed text into a Hash to be used later, and I am not sure how to work with the result of String#scan, because with a capture group it returns an Array of Arrays.

I am basically trying to replicate this example:

puts "Enter something: "
text = gets.chomp
words = text.split(" ")
frequencies = Hash.new(0)
words.each do |word|
    frequencies[word] += 1
end
frequencies = frequencies.sort_by { |k, v| v }
frequencies.reverse!
frequencies.each do |word, freq|
    puts word + " " + freq.to_s
end

Solution

You are trying to read the file line by line. In that case the /m (multiline) modifier makes no sense: each line is matched on its own, so the pattern never sees text that spans several lines. Read the entire file first and then parse it for whatever you want:

content = File.read('test.txt')
content.scan(/\.T(.*?)\.B/m) { |mtch| 
  puts mtch 
}
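
To see what /m changes, here is a minimal sketch on a made-up string. The sample text is an assumption that only imitates the .T/.B markers from the question, not the real collections file. In Ruby, /m makes `.` match newlines as well, which is what lets the capture run across several lines:

```ruby
# Hypothetical sample imitating the .T ... .B layout; not the real file.
sample = ".I 1\n.T\nPreliminary Report\nInternational\n.B\n"

# Without /m, `.` stops at a newline, so a title that spans lines never matches.
without_m = sample.scan(/\.T(.*?)\.B/)

# With /m, `.` also matches newlines, so the capture covers everything
# between .T and .B.
with_m = sample.scan(/\.T(.*?)\.B/m)

p without_m  # prints []
p with_m     # prints [["\nPreliminary Report\nInternational\n"]]
```

Note that the captured text keeps its leading and trailing newlines, so you will probably want to call strip on it before using it as a key.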

UPD: To put the scan results into a hash as in your example, either call flatten on the resulting array:

content = File.read('test.txt')
# flatten the array                  ⇓⇓⇓⇓⇓⇓⇓
words = content.scan(/\.T(.*?)\.B/m).flatten
words.each …

or pass a block to scan:

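content = File.read('test.txt')
# Hash.new(0) makes missing keys start at 0, so += 1 works right away
freqs = Hash.new(0)
content.scan(/\.T(.*?)\.B/m) { |mtch|
  # mtch is an array of captures; the first element is the matched text
  freqs[mtch.first] += 1
}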
…
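
The two forms can be compared side by side. This is only a sketch: the content string is made up so that it runs without test.txt, and calling strip on each capture is my assumption about what you want as a key.

```ruby
# Hypothetical sample; the real input would come from File.read('test.txt').
content = ".T\nFingers or Fists\n.B\n" \
          ".T\nPreliminary Report\n.B\n" \
          ".T\nFingers or Fists\n.B\n"

# Form 1, no block: scan returns one inner array per match, so flatten it.
titles = content.scan(/\.T(.*?)\.B/m).flatten.map(&:strip)

# Form 2, with a block: each match arrives as an array of captures,
# and |(title)| unpacks its single element.
freqs = Hash.new(0) # missing keys start at 0
content.scan(/\.T(.*?)\.B/m) { |(title)| freqs[title.strip] += 1 }

p titles
# prints ["Fingers or Fists", "Preliminary Report", "Fingers or Fists"]

freqs.each { |title, count| puts "#{title}: #{count}" }
# prints:
# Fingers or Fists: 2
# Preliminary Report: 1
```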

UPD2: To split the resulting array of sentences into an array of words:

arr = ["Preliminary Report International", "Fingers or Fists"]   
arr.map {|e| e.split(' ')}.flatten.map(&:downcase)
# ⇒  ["preliminary", "report", "international", "fingers", "or", "fists"]

Here the first map iterates over the array elements and turns each one into an array of words, flatten turns the resulting array of arrays into a plain array, and finally downcase is applied because your example uses lowercase words.
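
Putting the pieces together, an inverted index could look roughly like the sketch below. It rests on assumptions I cannot check against your file: that each document starts with `.I` followed by a numeric id, that the title sits between `.T` and `.B`, and that only titles are indexed. The sample text and the name `index` are made up for illustration.

```ruby
require 'set'

# Hypothetical collection text in the assumed .I / .T / .B layout.
content = <<~DOCS
  .I 1
  .T
  Preliminary Report International
  .B
  .I 2
  .T
  Fingers or Fists
  .B
  .I 3
  .T
  Report on Fingers
  .B
DOCS

# term => set of ids of the documents whose title contains that term
index = Hash.new { |hash, term| hash[term] = Set.new }

# Two capture groups, so the block receives the id and the title text.
content.scan(/\.I\s+(\d+)\s*\.T(.*?)\.B/m) do |id, title|
  # Same tokenization as above: split on whitespace, then downcase.
  title.split.map(&:downcase).each { |term| index[term] << id.to_i }
end

index.sort_by { |term, _ids| term }.each do |term, ids|
  puts "#{term}: #{ids.to_a.join(', ')}"
end
```

A Set is used so that a word repeated within one title does not list the same document id twice. If you need term frequencies per document as well, a nested Hash would be the natural replacement.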

Hope it helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow