Question

I am working on a project to extract keywords from blocks of text. Below are the first three items of the input list (apologies for the length).

descriptest<-c("Columbia University is one of the world's most important centers of research and at the same time a distinctive and distinguished learning environment for undergraduates and graduate students in many scholarly and professional fields. The University recognizes the importance of its location in New York City and seeks to link its research and teaching to the vast resources of a great metropolis. It seeks to attract a diverse and international faculty and student body, to support research and teaching on global issues, and to create academic relationships with many countries and regions. It expects all areas of the university to advance knowledge and learning at the highest level and to convey the products of its efforts to the world.", 
"", "UMass Amherst was born in 1863 as a land-grant agricultural college set on 310 rural acres with four faculty members, four wooden buildings, 56 students and a curriculum combining modern farming, science, technical courses, and liberal arts.\n\nOver time, the curriculum, facilities, and student body outgrew the institution's original mission. In 1892 the first female student enrolled and graduate degrees were authorized. By 1931, to reflect a broader curriculum, \"Mass Aggie\" had become Massachusetts State College. In 1947, \"Mass State\" became the University of Massachusetts at Amherst.\n\nImmediately after World War II, the university experienced rapid growth in facilities, programs and enrollment, with 4000 students in 1954. By 1964, undergraduate enrollment jumped to 10,500, as Baby Boomers came of age. The turbulent political environment also brought a \"sit-in\" to the newly constructed Whitmore Administration Building. By the end of the decade, the completion of Southwest Residential Complex, the Alumni Stadium and the establishment of many new academic departments gave UMass Amherst much of its modern stature.\n\nIn the 1970s continued growth gave rise to a shuttle bus service on campus as well as several important architectural additions: the Murray D. Lincoln Campus Center, with a hotel, office space, fine dining restaurant, campus store and passageway to a multi-level parking garage; the W.E.B. Du Bois Library, named \"tallest library in the world\" upon its completion in 1973; and the Fine Arts Center, with performance space for world-class music, dance and theater.\n\nThe next two decades saw the emergence of UMass Amherst as a major research facility with the construction of the Lederle Graduate Research Center and the Conte National Polymer Research Center. Other programs excelled as well. In 1996 UMass Basketball became Atlantic 10 Conference champs and went to the NCAA Final Four. Before the millennium, both the William D. 
Mullins Center, a multi-purpose sports and convocation facility, and the Paul Robsham Visitors Center bustled with activity, welcoming thousands of visitors to the campus each year.\n\nUMass Amherst entered the 21st century as the flagship campus of the state's five-campus University system, and enrollment of nearly 24,000 students and a national and international reputation for excellence.")

I was hoping to do this in R with the tm package, since its DocumentTermMatrix gives a clear matrix representation even for large data. I also use TfIdf weighting to rank each entry's keywords relative to the corpus as a whole.

I am stuck: max.col gives me the maximum keyword for each row, but my matrix has rows with several equal maxima. Furthermore, I do not just want the maximum; I would like the ten highest values in a list. Below is sample code:

library(RWeka)
library(tm)
library(koRpus)
library(RKEA)
library(corpora)
library(wordcloud)
library(plyr)

changeindextoname <- function(indexnumber) {
  name <- colnames(z2[indexnumber])
  return(name)
}

removestuff <- function(d) {
  d <- tm_map(d, tolower)
  d <- tm_map(d, removePunctuation)
  d <- tm_map(d, removeNumbers)
  d <- tm_map(d, stripWhitespace)
  d <- tm_map(d, skipWords)
  d <- tm_map(d, removeWords, stopwords('english'))
}

descripcorpora<-Corpus(VectorSource(descriptest))
descripcorpora<-removestuff(descripcorpora)
ddtm <- DocumentTermMatrix(descripcorpora, control = list(weighting=weightTfIdf, stopwords=T))
f2<-as.data.frame(inspect(ddtm))
z2<-f2
z3<-max.col(z2)
dfwithmax<-cbind(z3, z2)
dfwithmax$word<-lapply(dfwithmax$z3, changeindextoname)
finaldf<-subset(dfwithmax, select=c("z3", "word", "learning", "tallest", "center", "seeks", "teaching"))

The finaldf looks as follows:

finaldf
  z3     word   learning     tallest     center      seeks   teaching
1 106 learning 0.04953008 0.000000000 0.00000000 0.04953008 0.04953008
2 183  tallest 0.00000000 0.000000000 0.00000000 0.00000000 0.00000000
3  35   center 0.00000000 0.007204375 0.04322625 0.00000000 0.00000000

This method seems to work, but in row 1 it cannot account for the fact that "seeks", "learning" and "teaching" all have the same value.

Additionally, max.col returns an index even when all the columns are zero (as in row 2). How would I get rid of this as well?

I am trying to stay away from looping through columns or rows as it will take a long time, because the matrix is quite large.

I would greatly appreciate any advice or ideas on how to write a function that I could apply over the matrix, collecting the results in a list, so that I can then apply changeindextoname to that list and get back the column names.

Thank you in advance!

Solution

To get the five highest-weighted terms for each document:

m <- as.matrix(ddtm)  # convert once instead of once per row
apply(m, 1, function(x)
         colnames(m)[order(x, decreasing = TRUE)[1:5]])

  Docs
       1            2            3        
  [1,] "teaching"   "york"       "center" 
  [2,] "seeks"      "year"       "umass"  
  [3,] "learning"   "worlds"     "campus" 
  [4,] "university" "worldclass" "amherst"
  [5,] "research"   "world"      "four"   
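
The output above still lists five terms for document 2, which is empty, and it cuts through tied scores by position. Here is a minimal base R sketch of one way to handle both, assuming a dense numeric matrix with terms as column names. The helper `top_terms` and the toy scores are hypothetical and not part of tm:

```r
# Toy TF-IDF-like scores (illustrative values, not real output from the corpus).
m <- matrix(
  c(0.05, 0.05, 0.05, 0.00,
    0.00, 0.00, 0.00, 0.00,
    0.00, 0.00, 0.01, 0.04),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("1", "2", "3"),
                  Terms = c("learning", "seeks", "teaching", "center"))
)

# Hypothetical helper: names of the n highest-scoring terms in one row,
# keeping only strictly positive scores.
top_terms <- function(x, n = 10) {
  x <- x[x > 0]                        # drop zero scores
  if (length(x) == 0) return(character(0))
  ord <- order(x, decreasing = TRUE)   # tied scores keep their column order
  names(x)[head(ord, n)]
}

# lapply over row indices rather than apply(), so that rows may return
# results of different lengths (an empty document returns nothing).
result <- lapply(seq_len(nrow(m)), function(i) top_terms(m[i, ], n = 2))
names(result) <- rownames(m)
```

Returning a list instead of a matrix is a deliberate choice here: an all-zero row yields an empty character vector rather than arbitrary terms. To keep every term tied with the n-th score, the `head()` call could be replaced by a comparison against the n-th largest value.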

Note that you don't provide code for skipWords, so I use this one:

skipWords <- function(x) removeWords(x, c(stopwords("english")))

Also see tm_reduce for a way to rewrite the removestuff function:

removestuff <- tm_reduce(x, list(tolower, removePunctuation, ...))
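
The idea of folding several cleaning steps into a single function can be sketched without tm. This is only an analogy in base R operating on a plain character vector. The names `compose_steps`, `strip_digits`, `strip_punct` and `squeeze_ws` are hypothetical, and the left-to-right order used here is an assumption, so check the tm documentation for the order in which tm_reduce applies its functions:

```r
# Hypothetical helper: combine a list of single-argument functions into one
# function that applies them left to right.
compose_steps <- function(steps) {
  function(x) Reduce(function(acc, f) f(acc), steps, x)
}

# Illustrative cleaning steps built from base R only.
strip_digits <- function(x) gsub("[0-9]+", "", x)
strip_punct  <- function(x) gsub("[[:punct:]]+", "", x)
squeeze_ws   <- function(x) trimws(gsub("[[:space:]]+", " ", x))

# One function that lowercases, strips punctuation and digits, then
# collapses the leftover runs of whitespace.
clean <- compose_steps(list(tolower, strip_punct, strip_digits, squeeze_ws))
cleaned <- clean("UMass Amherst was born in 1863, on 310 rural acres.")
```

Keeping the steps in a list makes it easy to add, drop or reorder a step without editing a chain of reassignments.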
Licensed under: CC-BY-SA with attribution