Some websites load their text using javascript, so it isn't in the HTML source #3

Open
enterprisey opened this issue Jul 30, 2022 · 7 comments
Labels
bug Something isn't working

Comments

@enterprisey
Collaborator

enterprisey commented Jul 30, 2022

steps:

  1. enter https://www.basketball-reference.com/players/b/brousel01.html as the URL, click "verify" and then "next"
  2. select any text in the article (which text doesn't matter) and click "next"
  3. enter the quote "Drafted by the Portland Trail Blazers in the 8th round (128th pick) of the 1974 NBA Draft." and click "next"

expected:
quote is accepted (because that text is on the page), shows next step

actual:
quote is rejected (because the text is not in the HTML source, but is instead loaded by JavaScript)
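
To illustrate the failure mode, here is a minimal sketch of a substring check against page text. The helper names (`normalize`, `containsQuote`) and the sample markup are made up for illustration; the project's real check and the site's real HTML may look quite different.

```typescript
// Hypothetical helpers, not taken from the project's code.

/** Collapse runs of whitespace so line breaks in the page do not break matching. */
function normalize(text: string): string {
  return text.replace(/\s+/g, " ").trim();
}

/** True when the quote appears in the text, ignoring whitespace differences. */
function containsQuote(pageText: string, quote: string): boolean {
  return normalize(pageText).indexOf(normalize(quote)) !== -1;
}

// Made-up initial HTML, as a plain fetch would see it: the sentence is
// only added later by a script, so it is absent from the source.
const initialHtml = `<div id="meta"></div><script src="/fill-meta.js"></script>`;

// Made-up text as a browser would show it after the script has run.
const renderedText =
  "Drafted by the Portland Trail Blazers\n in the 8th round (128th pick) of the 1974 NBA Draft.";

const quote =
  "Drafted by the Portland Trail Blazers in the 8th round (128th pick) of the 1974 NBA Draft.";

console.log(containsQuote(initialHtml, quote));  // false
console.log(containsQuote(renderedText, quote)); // true
```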

@enterprisey enterprisey added the bug Something isn't working label Jul 30, 2022
@siddharthvp
Member

@Ankit-Gupta18 Can you take a shot at evaluating the feasibility of using a headless browser on the server to load ref pages?

Process:

  • load the URL in headless browser (via selenium/playwright etc)
  • wait until the page has fully loaded (we may need to wait a little longer for JS to execute?)
  • extract text content from loaded page
  • close the tab
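
The four steps above could be sketched roughly as follows. This is only a sketch under assumptions: `BrowserLike`, `PageLike` and `extractRenderedText` are hypothetical names, and the two interfaces list just the calls used here (`newPage`, `goto`, `innerText`, `close`), which Playwright's `Browser` and `Page` objects also provide. Waiting for `"networkidle"` is one possible answer to the "wait a while more" question, not necessarily the right one for every site.

```typescript
// Minimal structural types covering only the calls used below (hypothetical names).
interface PageLike {
  goto(
    url: string,
    options?: { waitUntil?: "load" | "domcontentloaded" | "networkidle" },
  ): Promise<unknown>;
  innerText(selector: string): Promise<string>;
  close(): Promise<void>;
}

interface BrowserLike {
  newPage(): Promise<PageLike>;
}

/** Load `url` in a fresh tab, wait for it to settle, and return the visible text. */
async function extractRenderedText(browser: BrowserLike, url: string): Promise<string> {
  // Step 1: open a tab in the (headless) browser.
  const page = await browser.newPage();
  try {
    // Step 2: navigate and wait until network activity has gone quiet,
    // a rough proxy for "the page's scripts have finished loading content".
    await page.goto(url, { waitUntil: "networkidle" });
    // Step 3: extract the text a reader would see in the page body.
    return await page.innerText("body");
  } finally {
    // Step 4: close the tab, even if navigation or extraction failed.
    await page.close();
  }
}
```

With Playwright installed, the `browser` argument would presumably come from `chromium.launch()` (imported from `playwright`), and should be closed with `browser.close()` once it is no longer needed.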

Possible optimisations / concurrency handling (for future iterations - NOT needed for first pass):

  • Use same browser process, with multiple tabs if we're getting concurrent requests.
  • While the page loads in the browser, also do a normal fetch. If we can verify the quote from that, we can abort the browser load, which is likely to be slower.
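
The second optimisation could look something like the sketch below: start every text source at once, accept the quote as soon as any source contains it, and signal the rest to stop. `TextSource` and `verifyQuoteFromAnySource` are hypothetical names, and the plain substring match stands in for whatever matching the project actually does.

```typescript
// A source receives an AbortSignal and eventually yields page text (hypothetical type).
type TextSource = (signal: AbortSignal) => Promise<string>;

/**
 * Resolve to true as soon as any source yields text containing the quote,
 * and to false once every source has finished (or failed) without a match.
 */
function verifyQuoteFromAnySource(quote: string, sources: TextSource[]): Promise<boolean> {
  const controller = new AbortController();
  return new Promise<boolean>((resolve) => {
    let pending = sources.length;
    if (pending === 0) {
      resolve(false);
      return;
    }
    const settle = (found: boolean): void => {
      pending -= 1;
      if (found) {
        controller.abort(); // tell the slower sources to stop
        resolve(true);
      } else if (pending === 0) {
        resolve(false); // every source finished without a match
      }
    };
    for (const source of sources) {
      source(controller.signal).then(
        (text) => settle(text.indexOf(quote) !== -1),
        () => settle(false), // a failed source counts as "no match"
      );
    }
  });
}
```

One source could wrap a normal `fetch(url, { signal })` and the other a headless-browser load that closes its tab when the signal fires; how each source reacts to the abort is left to that source.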

@Ankit-Gupta18
Contributor

@enterprisey @siddharthvp
Can you please point me to some resources before I try this? I can't tell what we are trying to achieve here, and I have never worked on this kind of issue before.

@siddharthvp
Member

I'm not sure what's not clear. The intent is to use a headless browser to load the page and examine its content. This mimics a human opening a browser, and ensures that any JavaScript used by the page gets run (which does not happen with a fetch request, which merely fetches the initial HTML of the page before any JS modification).

I suggest using the Playwright library: https://playwright.dev/docs/library. First try running it locally. The browser can also be launched in headful mode for debugging so you can see what's going on (by passing { headless: false } in the options to .launch, as the documentation says).

@enterprisey
Collaborator Author

Try the example code from https://playwright.dev/docs/library

@enterprisey
Collaborator Author

@Vaiofficial

I want to work on this issue. Please assign it to me.

@moonLight-7k

Hello, I am interested in contributing to open source projects and would love to participate in any that you have available. I have experience in web Development and am eager to learn and grow my skills through working on these projects. Please let me know if there are any opportunities for me to get involved, I would greatly appreciate it. Thank you!


5 participants