Python Goose cannot extract date

https://stackoverflow.com/questions/18846540

python
goose

28-06-2022

Question

I am using Python Goose. You can find it in this link

I want to extract the published date, but when I run the following code:

from goose import Goose

g = Goose()
entity = g.extract(url="mylink")
date = entity.publish_date

the result is None.

I have tried it on many sites, and the result was always None.

Any advice?

Solution

I have just checked the relevant part of the source, crawler.py. The publish_date extraction is currently commented out:

# TODO
# article.publish_date = config.publishDateExtractor.extract(doc)

Further examination revealed that if you uncomment the line above, you can plug in your own custom date extractor. However, Goose does not implement a default date extractor. See the set_publishdate_extractor method in https://github.com/grangier/python-goose/blob/master/goose/configuration.py
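To make the idea of a custom date extractor more concrete, here is a minimal sketch. The only thing taken from the Goose source quoted above is that the extractor is called as extract(doc). The class name TimeTagDateExtractor, the choice to read a <time datetime="..."> tag, and the assumption that doc is an element tree exposing iter() and get() (as lxml and xml.etree elements both do) are mine, for illustration only.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


class TimeTagDateExtractor(object):
    """Hypothetical extractor: reads the first <time datetime="..."> tag."""

    def extract(self, doc):
        # Assumption: `doc` is a parsed element tree with iter() and get().
        for node in doc.iter("time"):
            raw = node.get("datetime")
            if not raw:
                continue
            try:
                # Keep only the leading YYYY-MM-DD part of the value.
                return datetime.strptime(raw[:10], "%Y-%m-%d")
            except ValueError:
                # Not a date in the expected format; try the next tag.
                continue
        return None


# Stand-in document for demonstration; Goose would pass its own parsed page.
sample = ET.fromstring(
    '<html><body><time datetime="2013-09-17T10:00:00Z">Sep 17</time></body></html>'
)
extracted = TimeTagDateExtractor().extract(sample)
print(extracted)
```

Registering such an object would presumably go through the set_publishdate_extractor method mentioned above, after uncommenting the line in crawler.py. That wiring is left out here because I have not verified it against a particular Goose version.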

Other tips

Since 2014, python-goose has implemented this feature in extractors/publishdate.py, so article.publish_date now returns a date, but only if the page provides one in any of the following metadata fields:

rnews:datePublished
article:published_time
OriginalPublicationDate
datePublished
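If you want to check whether a page exposes any of the fields listed above before blaming Goose, a rough sketch using only the Python 3 standard library could look like this. It is not Goose's own implementation: the names DATE_FIELDS, PublishDateFinder and find_publish_date are hypothetical, and looking in the property, name and itemprop attributes (and reading content or datetime for the value) is my assumption about where such fields usually appear.

```python
from html.parser import HTMLParser

# The four metadata fields listed above.
DATE_FIELDS = {
    "rnews:datePublished",
    "article:published_time",
    "OriginalPublicationDate",
    "datePublished",
}


class PublishDateFinder(HTMLParser):
    """Hypothetical helper: remembers the first matching date value."""

    def __init__(self):
        super().__init__()
        self.found = None

    def handle_starttag(self, tag, attrs):
        if self.found is not None:
            return
        attr_map = dict(attrs)
        # Assumption: the field name sits in one of these three attributes.
        keys = (
            attr_map.get("property"),
            attr_map.get("name"),
            attr_map.get("itemprop"),
        )
        if any(key in DATE_FIELDS for key in keys):
            # Assumption: the value is in `content`, or `datetime` as a fallback.
            self.found = attr_map.get("content") or attr_map.get("datetime")


def find_publish_date(html):
    """Return the raw date string from the first known field, or None."""
    finder = PublishDateFinder()
    finder.feed(html)
    return finder.found


page = (
    '<html><head>'
    '<meta property="article:published_time" content="2013-09-17T08:30:00Z">'
    '</head><body></body></html>'
)
print(find_publish_date(page))
```

If this returns None for your pages, article.publish_date will most likely be None as well, because none of the listed fields is present.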

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow