Question

I am trying to get the top 'n' companies from a data frame. Here is my code:

data("Forbes2000", package = "HSAUR")
sort(Forbes2000$profits,decreasing=TRUE)

Now I would like to get the top 50 observations from this sorted vector.

Solution

head and tail are really useful functions!

head(sort(Forbes2000$profits,decreasing=TRUE), n = 50)
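As a minimal sketch of how head and tail behave on a sorted vector, here is the same idea on a small made-up vector (the values are hypothetical, not taken from Forbes2000). One thing worth knowing is that sort() drops NA values by default.

```r
# Hypothetical toy profits, not the Forbes2000 data
profits <- c(2.5, NA, 10.1, -3.0, 7.2, 0.4)

# sort() removes NA values by default (na.last = NA)
sorted <- sort(profits, decreasing = TRUE)

top3 <- head(sorted, n = 3)     # the three largest values
bottom2 <- tail(sorted, n = 2)  # the two smallest values

print(top3)
print(bottom2)
```

Note that this returns only the profit values, not the companies they belong to.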

If you want the first 50 rows of the data.frame, you can use the arrange function from plyr to sort the data.frame and then use head:

library(plyr)

head(arrange(Forbes2000,desc(profits)), n = 50)

Notice that I wrapped profits in a call to desc, which means it will sort in decreasing order.

To work without plyr:

head(Forbes2000[order(Forbes2000$profits, decreasing = TRUE), ], n = 50)
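To make the order approach concrete, here is a small sketch on a made-up data frame (the companies data frame and its values are hypothetical). It shows that whole rows are kept, and that order() places NA profits last by default, so they only show up in the result if n is larger than the number of non-missing rows.

```r
# Hypothetical toy data frame standing in for Forbes2000
companies <- data.frame(
  name = c("A", "B", "C", "D", "E"),
  profits = c(2.5, NA, 10.1, -3.0, 7.2),
  stringsAsFactors = FALSE
)

# order() returns row indices; NA values go last by default (na.last = TRUE)
ranked <- companies[order(companies$profits, decreasing = TRUE), ]

# Keep the three most profitable companies, with all their columns
top3 <- head(ranked, n = 3)
print(top3)
```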

OTHER TIPS

Use order to sort the data.frame, then use head to get only the first 50 rows.

data("Forbes2000", package = "HSAUR")
head(Forbes2000[order(Forbes2000$profits, decreasing=TRUE), ], 50)

You can use rank (from base R) together with filter and desc from dplyr.

    library(dplyr)
    top_fifty <- Forbes2000 %>%
         filter(rank(desc(profits))<=50)

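This ranks the rows by profits in descending order and only keeps rows where the rank is less than or equal to 50 (i.e. the top 50). Note that filter does not reorder the rows, and tied profits can make the result slightly larger or smaller than 50 rows.
dplyr is very useful. The commands and chaining syntax are very easy to understand. 10/10 would recommend.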
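The effect of ties on a rank-based cutoff can be sketched in base R. This uses a made-up vector and negation (-profits) as a stand-in for desc(), which is an assumption for illustration; the point is only how rank()'s ties.method changes the number of values that pass the cutoff.

```r
# Hypothetical profits with a three-way tie for second place
profits <- c(10, 8, 8, 8, 5, 1)
n <- 2

# Default ties.method = "average": the tied values all get rank 3
kept_average <- sum(rank(-profits) <= n)

# "min": the tied values all get rank 2, so all of them pass
kept_min <- sum(rank(-profits, ties.method = "min") <= n)

# "first": ties are broken by position, giving exactly n values
kept_first <- sum(rank(-profits, ties.method = "first") <= n)

print(c(average = kept_average, min = kept_min, first = kept_first))
```

If exactly n rows are required, sorting and then taking head is the more predictable choice.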

Mnel is right that, in general, you want to use the head() and tail() functions along with a sorting function. I should mention, though, that for medium-sized data sets Vince's method works faster. If you didn't use head() or tail(), you could use the basic subsetting operator []:

 library(plyr)
 # Sort with plyr's arrange(), then take the first 50 rows
 x <- arrange(Forbes2000, desc(profits))
 x <- x[1:50, ]
 # Or using order()
 x <- Forbes2000[order(Forbes2000$profits, decreasing = TRUE), ]
 x <- x[1:50, ]

However, I really do recommend the head(), tail(), or filter() functions, because the plain [] operator assumes your data has a regular array- or matrix-like layout. (Hopefully, this answers Teja's question.)
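One concrete difference between head() and [] indexing shows up when the data has fewer rows than requested. A small sketch with a made-up three-row data frame (hypothetical names and values):

```r
# Hypothetical data frame with only three rows
small <- data.frame(
  name = c("A", "B", "C"),
  profits = c(3, 1, 2),
  stringsAsFactors = FALSE
)
ranked <- small[order(small$profits, decreasing = TRUE), ]

# head() returns at most the rows that exist
via_head <- head(ranked, n = 5)

# [] indexing pads the missing rows with NA
via_bracket <- ranked[1:5, ]

print(nrow(via_head))
print(nrow(via_bracket))
```

Here head() gives back the three available rows, while ranked[1:5, ] gives five rows, the last two filled with NA.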

Now, which package you choose is largely subjective. However, reading people's comments, I will say that the choice between plyr's arrange(), {base}'s order() with {utils}'s head() and tail(), or dplyr largely depends on the memory size and row count of your dataset. I could go into more detail about how plyr and sometimes dplyr have problems with large, complex datasets, but I don't want to get off topic.

P.S. This is one of my first times answering so feedback is appreciated.

Licensed under: CC-BY-SA with attribution