Question

I am using Python Goose. You can find it at this link.

I want to extract the published date, but when I run the following:

from goose import Goose

g = Goose()
entity = g.extract(url="mylink")
date = entity.publish_date

the result is None.

I have tried it on many sites, and the result was None every time.

Any advice?


Solution

I have just checked the relevant part of the source, crawler.py. The publish_date extraction is currently commented out:

# TODO
# article.publish_date = config.publishDateExtractor.extract(doc)

Further examination revealed that if you uncomment the line above, you'll be able to define your custom date extractor. However, there is no default date extractor implemented in Goose. See this method: set_publishdate_extractor in https://github.com/grangier/python-goose/blob/master/goose/configuration.py
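
As a rough illustration of what such a custom extractor might look like, here is a minimal sketch. The only thing taken from the source above is that the commented-out line calls extract(doc) on the configured extractor. The class name TimeTagDateExtractor, the choice of reading a <time datetime="..."> element, and the assumption that doc behaves like an element tree are all hypothetical. The sample document is built with the standard library so the sketch runs without Goose installed.

```python
import xml.etree.ElementTree as ET


class TimeTagDateExtractor(object):
    """Hypothetical extractor: returns the datetime attribute of the first <time> element."""

    def extract(self, doc):
        # Assumption: doc is an element-tree node that supports iter() and get().
        for node in doc.iter("time"):
            value = node.get("datetime")
            if value:
                return value
        # No usable <time> element was found.
        return None


# Stand-in document so the sketch is runnable on its own.
sample = ET.fromstring(
    "<html><body><time datetime='2014-03-01T10:00:00Z'>March 1</time></body></html>"
)
found = TimeTagDateExtractor().extract(sample)
```

With Goose itself, an instance of such a class would presumably be handed to set_publishdate_extractor on the configuration object, after uncommenting the line in crawler.py. That wiring is not shown here because it has not been tested.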

OTHER TIPS

Since 2014, this feature has been implemented in python-goose, in extractors/publishdate.py, so article.publish_date now returns a date, but only if the page provides one in one of the following metadata fields:

rnews:datePublished
article:published_time
OriginalPublicationDate
datePublished
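
To check whether a given page exposes any of the fields listed above, and therefore whether publish_date has a chance of being filled, a small standalone check along these lines may help. It is a sketch using only the Python 3 standard library, not Goose's own implementation. The names PublishDateFinder and find_publish_date are made up for illustration, and which HTML attribute carries each field name (property, name or itemprop) is an assumption, so all three are checked.

```python
from html.parser import HTMLParser

# Field names from the list above, in the order they are tried.
DATE_FIELDS = (
    "rnews:datePublished",
    "article:published_time",
    "OriginalPublicationDate",
    "datePublished",
)


class PublishDateFinder(HTMLParser):
    """Collects the first value seen for each known date field."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumption: the field name may sit in any of these attributes.
        for key in ("property", "name", "itemprop"):
            field = attrs.get(key)
            if field in DATE_FIELDS and field not in self.found:
                # The value is assumed to be in content or datetime.
                value = attrs.get("content") or attrs.get("datetime")
                if value:
                    self.found[field] = value


def find_publish_date(html):
    """Return the raw date string of the first matching field, or None."""
    finder = PublishDateFinder()
    finder.feed(html)
    for field in DATE_FIELDS:
        if field in finder.found:
            return finder.found[field]
    return None
```

If find_publish_date returns None for a page's HTML, the page most likely carries none of these fields, which would explain a publish_date of None.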
Licensed under: CC-BY-SA with attribution