Question

I download and scrape a webpage for some data in TSV format. Around the TSV data is HTML that I don't want.

I download the HTML for the webpage and scrape out the data I want using BeautifulSoup. That leaves me with the TSV data as a string in memory.

How can I use this TSV data in memory with pandas? Every method I can find seems to want to read from file or URI rather than from data I've already scraped in.

I don't want to download text, write it to file, and then rescrape it.

#!/usr/bin/env python2

import pandas as p
from BeautifulSoup import BeautifulSoup
import urllib2

def main():
    url = "URL"
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html)
    # pre is the tag that the data is within
    tab_sepd_vals = soup.pre.string

    data = p.LOAD_CSV(tab_sepd_vals)  # placeholder: which pandas call goes here?
    process(data)

Solution

If you wrap the string version of the data in a StringIO.StringIO object (io.StringIO in Python 3), you can pass that file-like object to the pandas parser in place of a filename. So your code becomes:

#!/usr/bin/env python2

import pandas as p
from BeautifulSoup import BeautifulSoup
import urllib2
import StringIO

def main():
    url = "URL"
    html = urllib2.urlopen(url)
    soup = BeautifulSoup(html)
    # pre is the tag that the data is within
    tab_sepd_vals = soup.pre.string

    # make the StringIO object
    tsv = StringIO.StringIO(tab_sepd_vals)

    # parse the in-memory text as tab-separated values
    data = p.read_csv(tsv, sep='\t')

    # then what you had
    process(data)
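
For Python 3, the same idea might look like the sketch below. It skips the download step and uses a made-up string (tab_sepd_vals, with hypothetical column names) in place of the text scraped from the pre tag; the helper name tsv_to_frame is also just for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for the text scraped out of the <pre> tag.
tab_sepd_vals = "name\tcount\nalpha\t1\nbeta\t2\n"


def tsv_to_frame(text):
    """Parse tab-separated text held in memory into a DataFrame."""
    # io.StringIO wraps the string in a file-like object,
    # which read_csv accepts in place of a path or URL.
    return pd.read_csv(io.StringIO(text), sep="\t")


data = tsv_to_frame(tab_sepd_vals)
print(data)
```

Because read_csv does the parsing, the first line is taken as the header and the numeric column is inferred as integers, with nothing written to disk.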

OTHER TIPS

Methods like read_csv do two things: they parse the CSV text and they construct a DataFrame object. If your scraping code already leaves you with parsed rows, you can skip the parser and construct the DataFrame directly:

>>> import pandas as pd
>>> df = pd.DataFrame([['a', 1], ['b', 2], ['c', 3]])
>>> print(df)
   0  1
0  a  1
1  b  2
2  c  3

The constructor accepts a variety of data structures.
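
Applied to tab-separated text, that approach could be sketched as follows (Python 3). The sample string, its column names, and the helper tsv_to_frame_manual are assumptions for illustration, and the splitting is naive: it does not handle quoted fields or embedded tabs the way read_csv does.

```python
import pandas as pd

# Hypothetical stand-in for the text scraped out of the <pre> tag.
tab_sepd_vals = "name\tcount\nalpha\t1\nbeta\t2\n"


def tsv_to_frame_manual(text):
    """Split TSV text by hand and hand the rows to the DataFrame constructor."""
    # Drop blank lines, then split each remaining line on tabs.
    lines = [line for line in text.splitlines() if line.strip()]
    header, *rows = [line.split("\t") for line in lines]
    return pd.DataFrame(rows, columns=header)


df = tsv_to_frame_manual(tab_sepd_vals)

# Unlike read_csv, the constructor does no type inference on strings,
# so numeric columns have to be converted explicitly.
df["count"] = pd.to_numeric(df["count"])
print(df)
```

For anything beyond simple, well-formed data, passing a StringIO object to read_csv is likely the less error-prone route.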

Licensed under: CC-BY-SA with attribution