bug in scoring #1

Hikari · 2010-06-13T21:41:27Z

I think I found a bug in the way scoring is done.
For example, on an article page from CNN:
DEBUG:root:Candidate p#cnnContentContainer.cnn_storyarea with score 163.5
DEBUG:root:Candidate p#.cnn_contentarea with score 138.0
DEBUG:root:Candidate p#cnnContainer. with score 118.5
DEBUG:root:Candidate body#. with score 113.5
DEBUG:root:Candidate p#.cnn_strycntntlft with score 111.0

All five of those candidates are nested inside each other (body# -> p.*), so the result ends up including too much text that is not needed.

An idea would be to remove child nodes from the parent before calculating the score.

timbertson · 2010-06-14T02:31:36Z

Thanks for the report. To be honest, I haven't looked too closely into the scoring of nodes (I didn't write this library, I merely ported it to Python).

I do know that it's unfortunately not as simple as disregarding children in the scoring calculation, because then you lose good content candidates which are composed of multiple children. Imagine a "body" div which has very little text inside it, but contains 5 large <p> tags comprising the article. You'd want to select the containing div, rather than any individual <p>
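
To make the trade-off concrete, here is a minimal toy sketch. It is not the library's actual scoring code: the `Node` class, the two scoring functions, and the "score = text length" rule are all assumptions made up for illustration. It shows that counting only a node's own text gives the article container a score of zero, while counting the whole subtree lets every ancestor outscore it, which matches the nested candidates in the original report.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical stand-in for a DOM element (not an lxml type)."""
    tag: str
    own_text: str = ""
    children: list = field(default_factory=list)


def own_text_score(node):
    # Proposal from the report: ignore children entirely.
    return len(node.own_text)


def subtree_score(node):
    # Opposite extreme: a node gets credit for all text below it.
    return len(node.own_text) + sum(subtree_score(c) for c in node.children)


def walk(node):
    # Depth-first traversal over the toy tree.
    yield node
    for child in node.children:
        yield from walk(child)


def best(root, score):
    return max(walk(root), key=score)


# An article div with no text of its own, holding five large paragraphs,
# next to a sidebar div that has one slightly longer run of text.
paragraphs = [Node("p", "x" * 200) for _ in range(5)]
article = Node("div", "", paragraphs)
sidebar = Node("div", "y" * 250)
body = Node("body", "", [article, sidebar])

print(best(body, own_text_score).tag, own_text_score(article))
print(best(body, subtree_score).tag, subtree_score(article))
```

In this toy model the own-text rule picks the sidebar and never considers the article div, while the subtree rule picks `body` itself. A workable scoring scheme presumably has to sit somewhere between the two, for example by giving ancestors only a reduced share of their descendants' scores.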

dustincannon referenced this issue in dustincannon/python-readability Dec 28, 2012

Merge pull request akimboio#1 from akimboio/develop

aa4c025

Tweaking Readability to provide cleaner articles.
