bug in scoring #1

Open
Hikari opened this issue Jun 13, 2010 · 1 comment

Hikari commented Jun 13, 2010

I think I found a bug in the way the scoring is done.
For example, for an article page from CNN:
DEBUG:root:Candidate p#cnnContentContainer.cnn_storyarea with score 163.5
DEBUG:root:Candidate p#.cnn_contentarea with score 138.0
DEBUG:root:Candidate p#cnnContainer. with score 118.5
DEBUG:root:Candidate body#. with score 113.5
DEBUG:root:Candidate p#.cnn_strycntntlft with score 111.0

All five of those candidates are nested inside one another (body#->p.*), so the result ends up including too much text that is not needed.
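One way to confirm that the top candidates really are nested inside one another is sketched below. This is only an illustration, not code from the library: it uses the standard library's `xml.etree.ElementTree` on a made-up, well-formed snippet, and the helper names `ancestors` and `is_nested` are hypothetical. Only the id and class names are taken from the debug output above.

```python
import xml.etree.ElementTree as ET

# Made-up markup that reuses the id/class names from the debug output.
HTML = """
<body>
  <div id="cnnContainer">
    <div class="cnn_contentarea">
      <div id="cnnContentContainer" class="cnn_storyarea">
        <p>Article text.</p>
      </div>
    </div>
    <div id="sidebar"><p>Links</p></div>
  </div>
</body>
"""

root = ET.fromstring(HTML)

# ElementTree has no parent pointers, so build a child -> parent map.
parent_of = {child: parent for parent in root.iter() for child in parent}


def ancestors(elem, parent_of):
    """Return the chain of ancestors of elem, nearest first."""
    chain = []
    while elem in parent_of:
        elem = parent_of[elem]
        chain.append(elem)
    return chain


def is_nested(a, b, parent_of):
    """True if one element is an ancestor of the other."""
    return a in ancestors(b, parent_of) or b in ancestors(a, parent_of)


story = root.find(".//div[@class='cnn_storyarea']")
content = root.find(".//div[@class='cnn_contentarea']")
sidebar = root.find(".//div[@id='sidebar']")

print(is_nested(story, content, parent_of))  # story area sits inside content area
print(is_nested(story, sidebar, parent_of))  # siblings' subtrees are not nested
```

If every pair among the top candidates is nested like this, the high scores are most likely the same text being counted again at each level.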

An idea would be to remove child nodes from the parent before calculating the score.

@timbertson
Owner

Thanks for the report. To be honest, I haven't looked too closely into the scoring of nodes (I didn't write this library, I merely ported it to Python).

I do know that it's unfortunately not as simple as disregarding children in the scoring calculation, because then you lose good content candidates that are composed of multiple children. Imagine a "body" div which has very little text directly inside it, but contains 5 large <p> tags comprising the article. You'd want to select the containing div, rather than any individual <p>.
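The trade-off can be seen on a toy document. The sketch below is not the library's scoring code. It compares two hypothetical schemes: `own_text_scores` counts only the text directly inside each element (children disregarded), while `propagated_scores` gives each paragraph's text length to its parent and half of it to its grandparent. The markup, the function names and the weights are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

PARA = "word " * 40  # one "large" paragraph of filler text

# A container div with no text of its own, holding five large paragraphs.
HTML = (
    "<body><div id='article'>"
    + "".join(f"<p>{PARA}</p>" for _ in range(5))
    + "</div><div id='sidebar'><p>short</p></div></body>"
)


def own_text_length(elem):
    """Length of the text directly inside elem, ignoring descendants."""
    parts = [elem.text or ""]
    parts += [child.tail or "" for child in elem]
    return len("".join(parts).strip())


def own_text_scores(root):
    """Hypothetical scheme 1: children are disregarded entirely."""
    return {elem: own_text_length(elem) for elem in root.iter()}


def propagated_scores(root):
    """Hypothetical scheme 2: paragraphs credit their parent and grandparent."""
    parent_of = {child: parent for parent in root.iter() for child in parent}
    scores = {elem: 0.0 for elem in root.iter()}
    for p in root.iter("p"):
        points = own_text_length(p)
        parent = parent_of.get(p)
        if parent is not None:
            scores[parent] += points
            grandparent = parent_of.get(parent)
            if grandparent is not None:
                scores[grandparent] += points / 2  # assumed weight
    return scores


root = ET.fromstring(HTML)
article = root.find(".//div[@id='article']")

own = own_text_scores(root)
prop = propagated_scores(root)

best_own = max(own, key=own.get)
best_prop = max(prop, key=prop.get)

print(best_own.tag, own[article])   # a single <p> wins; the container scores 0
print(best_prop.get("id"))          # the container div wins
```

With children disregarded, the container scores zero and one lone paragraph wins, which is the failure described above. With propagation the container wins, but its own parent also collects a sizeable score, which matches the nested candidates in the original report.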

dustincannon referenced this issue in dustincannon/python-readability Dec 28, 2012
Tweaking Readability to provide cleaner articles.