Is there a way to use readability (text extraction algorithm) and a custom algorithm in python to extract links from text?

StackOverflow https://stackoverflow.com/questions/4589323

Question

I'd like to figure out a way of extracting links that are in the body of text.

1.) I use readability in python https://github.com/gfxmonk/python-readability

2.) I'd like to somehow compare the extracted text to the original html text in order to extract links in the actual body of an article.

Solution

It looks like summary() returns a BeautifulSoup tree, so you should be able to search the extracted article directly for anchor tags, something like:

article = page.summary()   # Extract article using readability
article.findAll("a")       # Return a list of all links in the article
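
If summary() turns out to return an HTML string instead of a tree (this may differ between versions and forks of readability, so check what you actually get back), the same idea can be sketched with only the standard library. The names LinkCollector and extract_links and the sample HTML below are made up for illustration; they are not part of readability.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; anchors without href are skipped
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(article_html):
    """Return the href values of all links in article_html, in document order."""
    collector = LinkCollector()
    collector.feed(article_html)
    collector.close()
    return collector.links


# Hypothetical article HTML, standing in for the string returned by page.summary()
sample = (
    '<div><p>See <a href="http://example.com/a">this</a> '
    'and <a href="/b">that</a>.</p></div>'
)
print(extract_links(sample))
```

Because readability has already stripped navigation, sidebars and footers, whatever links remain in its output are the ones in the article body, so there is no need to compare against the original HTML.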
Licensed under: CC-BY-SA with attribution