This repository has been archived by the owner on Aug 14, 2021. It is now read-only.
Some articles have their first paragraph omitted, regardless of the configuration values of CleanConditionally and StripUnlikelyCandidates. Examples below:
https://www.engadget.com/2019/05/10/lyft-just-started-experimenting-with-car-rentals-in-san-francisc/?utm_campaign=homepage&utm_medium=internal&utm_source=dl
The opening paragraph of the article, which begins with "Between offering on-demand rides...", is completely absent from the cleaned content.
https://www.cnn.com/2019/04/12/us/andrew-chael-katie-bouman-black-hole-image-trnd/index.html
The short paragraph beginning with "When internet trolls tried..." is omitted from the article content. This section gets assigned to the article excerpt and is recoverable that way, but the first example's lost paragraph is not available in the excerpt.
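As a possible stopgap for cases like this second example, where the missing paragraph still ends up in the excerpt, the excerpt could be prepended whenever the cleaned content does not already contain it. The sketch below is in Python and uses only the standard library. The function names (`restore_lead_paragraph`, `strip_tags`, `normalize`) are hypothetical and not part of this library, and the tag stripping is deliberately crude. It does not help with the first example, where the paragraph is missing from the excerpt as well, and it will add a duplicate if the excerpt is a truncated or reworded version of text already in the content.

```python
import html
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so comparisons ignore formatting."""
    return re.sub(r"\s+", " ", text).strip()


def strip_tags(markup: str) -> str:
    """Crudely remove tags and decode entities; enough for a containment check."""
    return normalize(html.unescape(re.sub(r"<[^>]+>", "", markup)))


def restore_lead_paragraph(content_html: str, excerpt: str) -> str:
    """Prepend the excerpt as a <p> if the cleaned content does not contain it.

    Hypothetical helper: `content_html` and `excerpt` stand for whatever
    the extraction library returned for the article body and excerpt.
    """
    lead = normalize(excerpt)
    if not lead or lead in strip_tags(content_html):
        # Nothing to restore: excerpt is empty or already present.
        return content_html
    return f"<p>{html.escape(lead)}</p>\n{content_html}"
```

Calling `restore_lead_paragraph(content, excerpt)` on output that already includes the lead paragraph returns the content unchanged, so it should be safe to apply to every article.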
I've looked at several pages where this behavior happens and I'm unable to determine why it is happening. HTML parsing is not my strong suit so I'm hoping someone can take a look. I really appreciate all the work on this library.
The code I'm using to scrape these pages: