Question

My text file data looks like this (protein-protein interaction data):

transcription_factor protein

Myc Rilpl1

Mycn Rilpl1

Mycn "Wdhd1,Socs4"

Sox2 Rilpl1

Sox2 "Wdhd1,Socs4"

Nanog "Wdhd1,Socs4"

I want it to look like this (to see how many transcription factors interact with each protein):

protein transcription_factor

Rilpl1 Myc, Mycn, Sox2

Wdhd1 Mycn, Sox2, Nanog

Socs4 Mycn, Sox2, Nanog

After running my code, this is what I got (how can I get rid of the quotation marks and put the two proteins on separate lines?):

protein transcription_factor

Rilpl1 Myc, Mycn, Sox2

"Wdhd1,Socs4" Mycn, Nanog, Sox2

Here is my code:

input_file = ARGV[0]
hash = {}
File.readlines(input_file, "\r").each do |line|
  transcription_factor, protein = line.chomp.split("\t")

  if hash.has_key? protein
    hash[protein] << transcription_factor
  else
    hash[protein] = [transcription_factor]
  end
end

hash.each do |key, value|
  if value.count > 2
    string = value.join(', ')
    puts "#{key}\t#{string}"
  end
end

Solution

Here is a quick way to fix your problem:

...
transcription_factor, proteins = line.chomp.split("\t")
proteins.to_s.gsub(/"/,'').split(',').each do |protein|
  if hash.has_key? protein
    hash[protein] << transcription_factor
  else
    hash[protein] = [transcription_factor]
  end
end
...

The snippet above strips any double quotes from the protein field, splits it on commas, and then runs your existing logic once for each protein it finds.
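
To see what the quote-stripping and splitting step does in isolation, here is a minimal sketch. The helper name split_proteins is hypothetical (it is not part of the original code), it uses String#delete in place of gsub, and the extra strip is an assumption in case a field contains spaces after the commas.

```ruby
# Hypothetical helper (not in the original code): turn one protein field
# into a list of individual protein names.
def split_proteins(field)
  # to_s guards against a nil field (e.g. a line with no tab),
  # delete('"') removes the surrounding quotes,
  # split(',') separates the names, and strip trims stray spaces (assumption).
  field.to_s.delete('"').split(',').map(&:strip)
end

p split_proteins('"Wdhd1,Socs4"') # prints ["Wdhd1", "Socs4"]
p split_proteins('Rilpl1')        # prints ["Rilpl1"]
p split_proteins(nil)             # prints []
```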

Also, if you would like to eliminate the if/else, you can define the hash with a default block:

hash = Hash.new { |h, key| h[key] = [] }

This means that the first time a missing key is accessed, the block stores a new empty array under that key and returns it. So now you can skip the if/else and write

hash[protein] << transcription_factor
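
Putting both changes together, the whole grouping step might look like the sketch below. The method name group_by_protein and the in-memory sample array are assumptions made for illustration; the sample leaves out the header row, and in the real script the lines would come from File.readlines as in the question.

```ruby
# Hypothetical helper: build a protein => [transcription factors] map
# from tab-separated lines.
def group_by_protein(lines)
  # The default block stores a fresh array the first time a key is read.
  by_protein = Hash.new { |h, key| h[key] = [] }

  lines.each do |line|
    transcription_factor, proteins = line.chomp.split("\t")
    # Skip blank or malformed lines (assumption: such lines can be ignored).
    next if transcription_factor.nil? || proteins.nil?

    # Remove the quotes, then record the factor under every listed protein.
    proteins.delete('"').split(',').each do |protein|
      by_protein[protein] << transcription_factor
    end
  end

  by_protein
end

# Sample rows taken from the question, without the header line.
sample = [
  "Myc\tRilpl1",
  "Mycn\tRilpl1",
  "Mycn\t\"Wdhd1,Socs4\"",
  "Sox2\tRilpl1",
  "Sox2\t\"Wdhd1,Socs4\"",
  "Nanog\t\"Wdhd1,Socs4\""
]

group_by_protein(sample).each do |protein, factors|
  puts "#{protein}\t#{factors.join(', ')}"
end
```

Returning the hash from a method keeps the parsing separate from the printing, so the filter from the question (value.count > 2) can still be applied in the output loop if needed.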
Licensed under: CC-BY-SA with attribution