Question

I am new to Python and am trying to work with Pattern. My goal is to write code that returns the synonyms of an input word by looking it up in IndoWordnet. The language must be Bengali. I already have a list of words, but I am not sure how exactly I can use Pattern to look up an input word on the web. I have tried following http://arunrocks.com/easy-practical-web-scraping-in-python/ , but it didn't help much. I wanted to start with a parsed web page, and this is what I did. It also makes the links absolute.

from lxml.html import fromstring
from urllib2 import urlopen


def get_page(url):
    html = urlopen(url).read()
    dom = fromstring(html)
    dom.make_links_absolute(url)
    return dom

dom = get_page('http://www.cfilt.iitb.ac.in/indowordnet/first?langno=3&queryword=%E0%A6%97%E0%A6%BE%E0%A6%A7%E0%A6%BE')

<Element html at 0x50b4840>

But I am stuck after that, as I do not know how to do a specific search with Pattern. Please help.


Solution

It's a bit trickier than it seems, because the data you want to scrape is loaded by a separate AJAX request. So do it in two steps:

  • get the special sid value corresponding to the word you are looking for (it is inside a label whose id attribute is equal to sid)
  • make another request to http://www.cfilt.iitb.ac.in/indowordnet/ajax/onto.jsp , passing the sid you grabbed in the first step. For example, see how it looks for sid=4827: http://www.cfilt.iitb.ac.in/indowordnet/ajax/onto.jsp?sid=4827

Here's the code. It prints all ontology labels:

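from lxml.html import parse
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

SID_URL = 'http://www.cfilt.iitb.ac.in/indowordnet/ajax/onto.jsp?sid=%s'

url = 'http://www.cfilt.iitb.ac.in/indowordnet/first?langno=3&queryword=%E0%A6%97%E0%A6%BE%E0%A6%A7%E0%A6%BE'

# Step 1: fetch the word page and read the sid from <label id="sid">.
tree = parse(urlopen(url))
sid = tree.find('.//label[@id="sid"]').text

# Step 2: fetch the AJAX response for that sid and print every ontology label.
tree = parse(urlopen(SID_URL % sid))
for record in tree.xpath('//ontorecord'):
    print(record.find('onto_label').text)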
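
One thing to watch in the first step: if the page has no label with id sid (I assume that can happen for a word the site doesn't know, but I haven't checked), tree.find(...) returns None and reading .text fails with an AttributeError. Below is a rough offline sketch of a guarded version of that step. It is written for Python 3 and uses the standard library's html.parser instead of lxml, purely so it runs without extra packages. SAMPLE_PAGE is a made-up fragment for illustration, not the real page markup, and LabelTextParser and extract_sid are hypothetical helper names.

```python
from html.parser import HTMLParser

SID_URL = 'http://www.cfilt.iitb.ac.in/indowordnet/ajax/onto.jsp?sid=%s'

# Made-up fragment for illustration only; the real page is much larger.
SAMPLE_PAGE = (
    '<html><body>'
    '<label id="sid">4827</label>'
    '<label id="words"><a>x</a></label>'
    '</body></html>'
)


class LabelTextParser(HTMLParser):
    """Collect the text of the first <label> whose id equals target_id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self._inside = False
        self.text = None  # stays None if no matching label is seen

    def handle_starttag(self, tag, attrs):
        if tag == 'label' and self.text is None and dict(attrs).get('id') == self.target_id:
            self._inside = True
            self.text = ''

    def handle_endtag(self, tag):
        if tag == 'label':
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.text += data


def extract_sid(html):
    """Return the sid as a string, or None if the label is missing or empty."""
    parser = LabelTextParser('sid')
    parser.feed(html)
    parser.close()
    if parser.text is None:
        return None
    return parser.text.strip() or None


sid = extract_sid(SAMPLE_PAGE)
if sid is None:
    print('no sid found, skipping the second request')
else:
    print(SID_URL % sid)
```

With lxml the same guard is just a check that the result of tree.find(...) is not None before touching .text.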

UPD (getting synonyms):

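from lxml.html import parse
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2

url = 'http://www.cfilt.iitb.ac.in/indowordnet/first?langno=3&queryword=%E0%A6%97%E0%A6%BE%E0%A6%A7%E0%A6%BE'
tree = parse(urlopen(url))

# The synonyms are the links inside <label id="words">.
for label in tree.xpath('.//label[@id="words"]/a'):
    print(label.text)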
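
Since you already have a list of words: the queryword part of the URL is just the word encoded as UTF-8 and percent-escaped, so you can build one URL per word and run the loop above on each. Here is a minimal sketch for Python 3. build_query_url and BENGALI_LANGNO are names I made up, and I am assuming that langno=3 (taken from the URL above) is what selects Bengali.

```python
from urllib.parse import quote

BASE_URL = 'http://www.cfilt.iitb.ac.in/indowordnet/first'
BENGALI_LANGNO = 3  # assumption: copied from langno=3 in the URL above


def build_query_url(word, langno=BENGALI_LANGNO):
    """Return the lookup URL for one word, percent-encoding it as UTF-8."""
    return '%s?langno=%d&queryword=%s' % (BASE_URL, langno, quote(word, safe=''))


words = ['গাধা']  # replace with your own list of words
for word in words:
    print(build_query_url(word))
```

For the word গাধা this gives the same URL that is hard-coded in the snippets above, so each result can be passed straight to urlopen.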
Licensed under: CC-BY-SA with attribution