I am using Python Goose. You can find it in this link

I want to extract the published date, but when I run the following:

from goose import Goose

g = Goose()
entity = g.extract(url="mylink")
date = entity.publish_date

the result is None.

I have tried it on many sites, and the result was always None.

Any advice?


Solution

I have just checked the relevant part of the source, crawler.py. The publish_date extraction is currently commented out:

# TODO
# article.publish_date = config.publishDateExtractor.extract(doc)

Further examination revealed that if you uncomment the line above, you'll be able to define your custom date extractor. However, there is no default date extractor implemented in Goose. See this method: set_publishdate_extractor in https://github.com/grangier/python-goose/blob/master/goose/configuration.py

Other tips

Since 2014, this feature has been implemented in python-goose in extractors/publishdate.py, so article.publish_date now returns a date, but only if one is available in one of the following metadata fields:

rnews:datePublished
article:published_time
OriginalPublicationDate
datePublished
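
To check whether a page exposes any of these fields at all, something like the following may help. It is only a rough sketch using the Python 3 standard library, not Goose's own code. The names MetaDateParser and find_publish_date are made up for illustration, and it assumes the date sits in the content attribute of a <meta> tag whose property, name, or itemprop attribute matches one of the keys above.

```python
from html.parser import HTMLParser

# Metadata keys from the list above, in the order this sketch tries them.
DATE_KEYS = (
    "rnews:datePublished",
    "article:published_time",
    "OriginalPublicationDate",
    "datePublished",
)


class MetaDateParser(HTMLParser):
    """Hypothetical helper: records <meta> tags that carry a known date key."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        content = attrs.get("content")
        if not content:
            return
        # Assumption: the key may appear in any of these three attributes.
        for attr_name in ("property", "name", "itemprop"):
            key = attrs.get(attr_name)
            if key in DATE_KEYS and key not in self.found:
                self.found[key] = content


def find_publish_date(html):
    """Return the raw date string of the first matching key, or None."""
    parser = MetaDateParser()
    parser.feed(html)
    for key in DATE_KEYS:
        if key in parser.found:
            return parser.found[key]
    return None
```

If find_publish_date returns None for the HTML of a page, none of these fields was found in the form assumed here, in which case a None publish_date from Goose for that page would not be surprising.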