Question

I'm trying to download the following URL into an R data frame:

http://www.fantasypros.com/nfl/rankings/qb.php/?export=xls

(It's the 'Export' link on the public page: http://www.fantasypros.com/nfl/rankings/qb.php/)

However, I'm not sure how to 'parse' the data. I'm also looking to automate this and run it weekly, so any thoughts on how to build this into a weekly workflow would be greatly appreciated! I have been searching Google and scouring Stack Overflow for a couple of hours now, to no avail... :-)

Thank you,

Justin

Attempted Code:

library(RCurl)  # getURL() comes from the RCurl package
getURL("http://www.fantasypros.com/nfl/rankings/qb.php?export=xls")

This just gives me a string that starts like:

[1] "FantasyPros.com \t \nWeek 8 - QB Rankings \t \nExpert Consensus Rankings (ECR) \t \n\n Rank \t Player Name \tTeam \t Matchup \tBest Rank \t Worst Rank \t Ave Rank \t Std Dev \t\n1\tPeyton Manning\tDEN\t vs. WAS\t1\t5\t1.2105263157895\t0.58877509625419\t\t\n2\tDrew Brees\tNO\t vs. BUF\t1\t7\t2.6287878787879\t1.0899353819483\t\t\n3\tA...


Solution

Welcome to R. It sounds like you love to do your analysis in Excel. That's completely fine, but given that you are asking to crawl data from the web AND asking about R, I think it's safe to assume that you will soon find that programming your analyses is the way to go.

That said, what you really want to do is crawl the web. There are tons of examples of how to do this with R, right here on SO. Look for things like "web scraping", "crawling", and "screen scraping".

OK, dialogue aside. Don't worry about grabbing the data in Excel format. You can parse the data directly with R. Most websites use a consistent naming convention, so building the URLs for your datasets in a for loop will be easy.
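
As a rough sketch of the URL-building idea: the snippet below assumes the other position pages follow the same pattern as the qb.php page from the question. The "rb" and "wr" codes, the dated file names, and the download_all() helper are illustrative assumptions, not something confirmed by the site.

```r
## Position codes: only "qb" appears in the question; "rb" and "wr" are assumed
positions <- c("qb", "rb", "wr")

## Build one export URL per position (sprintf is vectorized over positions)
urls <- sprintf("http://www.fantasypros.com/nfl/rankings/%s.php?export=xls", positions)
names(urls) <- positions

## Date-stamped file names, so each weekly run keeps its own copy
dest_files <- sprintf("%s_rankings_%s.xls", positions, format(Sys.Date(), "%Y-%m-%d"))

## Hypothetical helper: download every file; nothing is fetched until it is called
download_all <- function() {
  for (i in seq_along(urls)) {
    download.file(urls[[i]], destfile = dest_files[i])
  }
}

urls
```

For the weekly part, one option is to put this in a script, call download_all() at the end, and let the operating system's scheduler (cron, Task Scheduler) run it with Rscript once a week.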

Below is an example of parsing your page, directly with R, into a data.frame, which acts very much like tabular data in Excel.

## load the packages you will need
# install.packages("XML")
library(XML)

## Define the URL -- you could dynamically build this
URL = "http://www.fantasypros.com/nfl/rankings/qb.php"

## Read the tables from the page into R
tables = readHTMLTable(URL)

## how many do we have
length(tables)

## look at the first one
tables[[1]]
## that's not it

## let's look at the 2nd table
tables[[2]]

## bring it into a data frame; [[2]] extracts the table itself,
## whereas [2] would return a one-element list
df = as.data.frame(tables[[2]])

If you are using R for the first time, you can install external packages pretty easily with the command install.packages("PackageNameHere"). However, if you are serious about learning R, I would look into using the RStudio IDE. It really flattened the learning curve for me on a ton of levels.

OTHER TIPS

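You can probably just use download.file and read.xls from the gdata library. I don't think you can skip a fixed number of lines when reading .xls files, but you can supply a pattern argument so that read.xls skips every line before the first one containing that pattern. Note that "Rank" also occurs in the title lines shown in the question's output (for example "Week 8 - QB Rankings"), so a more specific pattern such as "Player Name" may be needed.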

library(gdata)
download.file("http://www.fantasypros.com/nfl/rankings/qb.php?export=xls", destfile="file.xls")

ffdata <- read.xls("file.xls", header=TRUE, pattern="Rank")
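
Since the string returned by getURL in the question is already tab-delimited text, another option may be to parse it directly. The sketch below uses the first lines of that output, copied from the question, as sample input. Locating the header by "Player Name" and keeping only the named columns are assumptions based on that sample, not on any documented format.

```r
## Sample input: the start of the getURL() output shown in the question
raw <- paste0(
  "FantasyPros.com \t \nWeek 8 - QB Rankings \t \n",
  "Expert Consensus Rankings (ECR) \t \n\n",
  " Rank \t Player Name \tTeam \t Matchup \tBest Rank \t Worst Rank \t Ave Rank \t Std Dev \t\n",
  "1\tPeyton Manning\tDEN\t vs. WAS\t1\t5\t1.2105263157895\t0.58877509625419\t\t\n",
  "2\tDrew Brees\tNO\t vs. BUF\t1\t7\t2.6287878787879\t1.0899353819483\t\t\n"
)

## Split into lines and find the header row (assumed to contain "Player Name")
lines <- strsplit(raw, "\n", fixed = TRUE)[[1]]
header_idx <- grep("Player Name", lines, fixed = TRUE)[1]

## Clean column names: trim the padding and drop empty trailing fields
col_names <- trimws(strsplit(lines[header_idx], "\t", fixed = TRUE)[[1]])
col_names <- col_names[nzchar(col_names)]

## Read the rows after the header; the trailing tabs create extra empty columns
body_lines <- lines[-seq_len(header_idx)]
body_lines <- body_lines[nzchar(trimws(body_lines))]
rankings <- read.delim(text = paste(body_lines, collapse = "\n"),
                       header = FALSE, strip.white = TRUE,
                       stringsAsFactors = FALSE)

## Keep only the named columns and attach the header names
rankings <- rankings[, seq_along(col_names)]
names(rankings) <- col_names
rankings
```

With live data, raw would instead be the string returned by getURL; columns whose names contain spaces are then accessed as rankings[["Player Name"]].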
Licensed under: CC-BY-SA with attribution