This repository has been archived by the owner on Aug 14, 2021. It is now read-only.

First paragraph of many articles omitted #84

Open
nate-anderson opened this issue May 10, 2019 · 1 comment

Comments

nate-anderson commented May 10, 2019

Some articles have their first paragraph omitted, regardless of the configuration values of CleanConditionally and StripUnlikelyCandidates. Examples below:

https://www.engadget.com/2019/05/10/lyft-just-started-experimenting-with-car-rentals-in-san-francisc/?utm_campaign=homepage&utm_medium=internal&utm_source=dl

The paragraph beginning with "Between offering on-demand rides..." at the start of the article is completely absent from the cleaned content.

https://www.cnn.com/2019/04/12/us/andrew-chael-katie-bouman-black-hole-image-trnd/index.html

The short paragraph beginning with "When internet trolls tried..." is omitted from the article content. It is assigned to the article excerpt and can be recovered that way, but the paragraph lost in the first example does not appear in the excerpt either.
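A possible stopgap for the CNN case only, sketched under assumptions: if the excerpt text is not found in the cleaned content, prepend it as a paragraph. The helper name `prependExcerptIfMissing` is hypothetical and is not part of the library. It works on plain strings, so it does not depend on any particular parser API.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (not part of the library): prepend the excerpt as a
 * paragraph when its text does not already appear in the cleaned HTML.
 */
function prependExcerptIfMissing(string $contentHtml, ?string $excerpt): string
{
    // Reduce a string to plain text with collapsed whitespace, so that
    // markup and spacing differences do not cause false mismatches.
    $normalize = static function (string $s): string {
        $text = html_entity_decode(strip_tags($s), ENT_QUOTES | ENT_HTML5, 'UTF-8');
        // preg_replace returns null on failure (e.g. invalid UTF-8); fall back to $text.
        return trim(preg_replace('/\s+/u', ' ', $text) ?? $text);
    };

    $excerptText = $normalize($excerpt ?? '');
    if ($excerptText === '') {
        return $contentHtml; // nothing to recover
    }

    if (strpos($normalize($contentHtml), $excerptText) !== false) {
        return $contentHtml; // excerpt text is already present in the content
    }

    // Escape the excerpt before wrapping it, since it is treated as plain text here.
    $escaped = htmlspecialchars($excerptText, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    return '<p>' . $escaped . '</p>' . $contentHtml;
}
```

Assuming the parsed object exposes the cleaned HTML and the excerpt as strings, the two values would be passed in as `$contentHtml` and `$excerpt`. This does nothing for the Engadget example, where the missing paragraph is not in the excerpt.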

I've looked at several pages where this happens and I'm unable to determine the cause. HTML parsing is not my strong suit, so I'm hoping someone can take a look. I really appreciate all the work on this library.

The code I'm using to scrape these pages:

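    // Namespace assumed to be andreskrey\Readability; adjust if the package differs.
    use andreskrey\Readability\Readability;
    use andreskrey\Readability\Configuration;
    use andreskrey\Readability\ParseException;

    // Fetch the raw HTML of the page.
    $pageContent = file_get_contents($this->url);

    // Empty configuration, i.e. the library defaults.
    $readabilityConfig = new Configuration([]);
    $readability = new Readability($readabilityConfig);

    try {
        $readability->parse($pageContent);
    } catch (ParseException $e) {
        echo 'parse failed';
    }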

swash13 commented Oct 25, 2019

Same problem.
