It's a bit trickier than it seems, because the data you want to scrape is loaded by a separate AJAX request. So do it in two steps:
- get the special `sid` value corresponding to the word you are looking for (it is inside a `label` element whose `id` attribute equals `sid`)
- make another request to http://www.cfilt.iitb.ac.in/indowordnet/ajax/onto.jsp, passing the `sid` you grabbed in the first step. For example, see how it looks for `sid=4827`: http://www.cfilt.iitb.ac.in/indowordnet/ajax/onto.jsp?sid=4827
Here's the code. It prints all ontology labels:
# Python 2 (urllib2, print statement)
from lxml.html import parse
from urllib2 import urlopen

SID_URL = 'http://www.cfilt.iitb.ac.in/indowordnet/ajax/onto.jsp?sid=%s'
url = 'http://www.cfilt.iitb.ac.in/indowordnet/first?langno=3&queryword=%E0%A6%97%E0%A6%BE%E0%A6%A7%E0%A6%BE'

# step 1: load the search page and read the sid from <label id="sid">
tree = parse(urlopen(url))
sid = tree.find('.//label[@id="sid"]').text

# step 2: request the ontology data for that sid and print each label
tree = parse(urlopen(SID_URL % sid))
for record in tree.xpath('//ontorecord'):
    print record.find('onto_label').text
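If you are on Python 3, the same two steps could be sketched roughly like this. It is only a sketch: the function names (`extract_sid`, `extract_onto_labels`, `fetch_onto_labels`) are made up for illustration, and it assumes the pages are still structured the way the code above expects. The parsing is split from the fetching so it can be tried on saved markup without hitting the site.

```python
from urllib.request import urlopen

import lxml.html

# Endpoint taken from the answer above; %s is filled with the sid.
ONTO_URL = 'http://www.cfilt.iitb.ac.in/indowordnet/ajax/onto.jsp?sid=%s'


def extract_sid(page_markup):
    """Step 1: return the text of <label id="sid">, or None if it is missing.

    Hypothetical helper; page_markup may be str or bytes.
    """
    doc = lxml.html.document_fromstring(page_markup)
    label = doc.find('.//label[@id="sid"]')
    if label is None or label.text is None:
        return None
    return label.text.strip()


def extract_onto_labels(onto_markup):
    """Step 2: return the text of every <onto_label> inside an <ontorecord>.

    Hypothetical helper; assumes the response uses the element names
    that the original code looks for.
    """
    doc = lxml.html.document_fromstring(onto_markup)
    return [record.findtext('onto_label') for record in doc.xpath('//ontorecord')]


def fetch_onto_labels(search_url):
    """Run both steps against the live site (needs network access)."""
    sid = extract_sid(urlopen(search_url).read())
    if sid is None:
        raise ValueError('no <label id="sid"> found on the search page')
    return extract_onto_labels(urlopen(ONTO_URL % sid).read())
```

Calling `fetch_onto_labels(url)` with the search URL from the code above should give back the same labels as a list instead of printing them, provided the site still responds the same way.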
UPD (getting synonyms):
from lxml.html import parse
from urllib2 import urlopen

url = 'http://www.cfilt.iitb.ac.in/indowordnet/first?langno=3&queryword=%E0%A6%97%E0%A6%BE%E0%A6%A7%E0%A6%BE'
tree = parse(urlopen(url))

# the synonyms are the links inside <label id="words">
for label in tree.xpath('.//label[@id="words"]/a'):
    print label.text
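The `queryword` in the URL above is just the search word, UTF-8 encoded and percent-escaped. If you want to look up other words, something like the following Python 3 sketch could build the URL for you. `build_search_url` is a made-up helper name, and `langno=3` is simply copied from the URL above (I'm assuming it selects the language, so change it if you need a different one).

```python
from urllib.parse import urlencode

# Search page taken from the URLs used above.
SEARCH_URL = 'http://www.cfilt.iitb.ac.in/indowordnet/first'


def build_search_url(word, langno=3):
    """Return the search URL for `word` (hypothetical helper).

    urlencode() encodes the word as UTF-8 and percent-escapes it.
    langno defaults to the value seen in the example URL; its exact
    meaning is an assumption.
    """
    query = urlencode({'langno': langno, 'queryword': word})
    return SEARCH_URL + '?' + query


# The word from the example URL, written with escapes to avoid encoding issues.
example_word = '\u0997\u09be\u09a7\u09be'
example_url = build_search_url(example_word)
```

For the example word this reproduces the `url` used in the snippets above, so it can be passed straight to `urlopen`.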