Some websites load their text using javascript, so it isn't in the HTML source #3

enterprisey · 2022-07-30T18:13:27Z

steps:

enter https://www.basketball-reference.com/players/b/brousel01.html as the URL, click "verify" and then "next"
select any text in the article (it doesn't matter which), click "next"
enter the quote "Drafted by the Portland Trail Blazers in the 8th round (128th pick) of the 1974 NBA Draft." and click "next"

expected:
quote is accepted (because that text is on the page), shows next step

actual:
quote is rejected (because the text is not in the HTML source, but is instead loaded by JavaScript)

siddharthvp · 2022-08-07T06:16:30Z

@Ankit-Gupta18 Can you take a shot at evaluating the feasibility of using a headless browser on the server to load ref pages?

Process:

load the URL in a headless browser (via Selenium, Playwright, etc.)
wait until the page has fully loaded (we might have to wait a little longer for JS to execute?)
extract text content from loaded page
close the tab
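
The process above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the function name `extractRenderedText`, the `settleMs` parameter, and its default value are hypothetical. The `browser` argument is assumed to be a Playwright `Browser` object; the sketch only relies on `newPage`, `goto`, `waitForTimeout`, `innerText`, and `close`.

```javascript
// Sketch: load a URL in an already-launched headless browser and return the
// text a reader would see once the page's scripts have run.
// `browser` is assumed to be a Playwright Browser; `settleMs` is a
// hypothetical extra delay for late-running scripts.
async function extractRenderedText(browser, url, settleMs = 1000) {
  const page = await browser.newPage(); // open a new tab
  try {
    // Wait for the page's load event ...
    await page.goto(url, { waitUntil: 'load' });
    // ... then give client-side JS a little more time to insert text.
    if (settleMs > 0) {
      await page.waitForTimeout(settleMs);
    }
    // Rendered text of <body>, including JS-inserted content.
    return await page.innerText('body');
  } finally {
    await page.close(); // close the tab even if loading failed
  }
}
```

With Playwright installed, the `browser` would typically come from `chromium.launch()`. A fixed delay is the simplest way to "wait a while more"; waiting for a specific element to appear would likely be more reliable where the page structure is known.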

Possible optimisations / concurrency handling (for future iterations - NOT needed for first pass):

Use the same browser process, with multiple tabs, if we're getting concurrent requests.
Concurrently with loading the page in the browser, also do a normal fetch. If we are able to verify the quote that way, we can abort the browser load, which can be slower.
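
The second optimisation might look something like the sketch below. Everything named here is hypothetical: `verifyQuote`, and the caller-supplied helpers `fetchText(url)` and `renderText(url, signal)`, which are assumed to be async functions resolving to the page's text. The sketch also assumes a plain substring check is enough to "verify" a quote, and that `renderText` stops its browser load when the `AbortSignal` fires.

```javascript
// Sketch: run a plain fetch and a headless-browser load concurrently.
// `fetchText(url)` and `renderText(url, signal)` are hypothetical helpers
// supplied by the caller; both resolve to the page's text.
async function verifyQuote(url, quote, fetchText, renderText) {
  const controller = new AbortController();

  // Start the (slower) browser load first so both run at the same time.
  // A failed or aborted browser load counts as "not found" here.
  const viaBrowser = renderText(url, controller.signal)
    .then((text) => text.includes(quote))
    .catch(() => false);

  // The plain fetch only sees the initial HTML, but is usually faster.
  const viaFetch = fetchText(url)
    .then((text) => text.includes(quote))
    .catch(() => false);

  if (await viaFetch) {
    controller.abort(); // quote already verified: stop the browser load
    return true;
  }
  return viaBrowser; // otherwise fall back to the rendered page
}
```

One design choice to note: this version always waits for the fetch result before consulting the browser result, which keeps the logic simple on the assumption that the fetch finishes first.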

Ankit-Gupta18 · 2022-08-08T04:22:40Z

@enterprisey @siddharthvp
Can you please point me to some resources before I try this? I can't tell what we are trying to achieve here, and I have never worked on this kind of issue before.

siddharthvp · 2022-08-09T17:47:02Z

I'm not sure what's unclear. The intent is to use a headless browser to load the page and examine its content. This mimics a human opening the page in a browser and ensures that any JavaScript used by the page gets run. That does not happen with a fetch request, which merely fetches the initial HTML of the page before any JS modification.
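
The difference between the two views of a page can be illustrated with a small sketch. The HTML and text strings below are made-up stand-ins, not the actual source of the page from the report, and the helper names `normalize` and `containsQuote` are hypothetical. Collapsing whitespace before matching is also an assumption of this sketch.

```javascript
// Collapse runs of whitespace so line breaks in the page don't break matching.
// (Whitespace normalisation is an assumption of this sketch.)
function normalize(text) {
  return text.replace(/\s+/g, ' ').trim();
}

function containsQuote(pageText, quote) {
  return normalize(pageText).includes(normalize(quote));
}

// Made-up stand-ins for the two views of one page:
// what a plain fetch returns (initial HTML, placeholder still empty) ...
const initialHtml = '<div id="info"></div><script src="fill-info.js"></script>';
// ... and the text visible after the page's script has filled the placeholder.
const renderedText =
  'Drafted by the Portland Trail Blazers\n  in the 8th round (128th pick) of the 1974 NBA Draft.';

const quote =
  'Drafted by the Portland Trail Blazers in the 8th round (128th pick) of the 1974 NBA Draft.';

console.log(containsQuote(initialHtml, quote));  // false
console.log(containsQuote(renderedText, quote)); // true
```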

I suggest using the Playwright library: https://playwright.dev/docs/library. First try running it on your local machine. The browser can also be launched in headful mode for debugging, so you can see what's going on (by passing { headless: false } in the options to .launch, as the documentation says).

enterprisey · 2022-08-17T18:28:38Z

try example code from https://playwright.dev/docs/library

enterprisey · 2022-08-17T18:30:25Z

https://playwright.dev/docs/api/class-page#page-text-content

Vaiofficial · 2023-01-16T06:22:55Z

I want to work on this issue. Please assign it to me.

moonLight-7k · 2023-01-27T18:07:12Z

Hello, I am interested in contributing to open source projects and would love to participate in any that you have available. I have experience in web development and am eager to learn and grow my skills by working on these projects. Please let me know if there are any opportunities for me to get involved. I would greatly appreciate it. Thank you!

enterprisey added the bug label Jul 30, 2022

enterprisey mentioned this issue Aug 17, 2022

Replace character entities before matching quote #22

Open
