Some websites are raising HTTP2 errors on sisyphus worker #329

Open
rgaudin opened this issue Jun 24, 2024 · 5 comments

Comments

@rgaudin
Member

rgaudin commented Jun 24, 2024

This zimit run failed and the error message mentions HTTP2. Can we not scrape websites served over HTTP/2?

{"timestamp":"2024-06-22T00:53:43.504Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://www.carrefouruae.com/mafuae/en/c/beauty-half-price","errorText":"net::ERR_HTTP2_PROTOCOL_ERROR","page":"https://www.carrefouruae.com/mafuae/en/c/beauty-half-price","workerid":0}}
{"timestamp":"2024-06-22T00:53:43.566Z","logLevel":"fatal","context":"general","message":"Page Load Timeout, failing crawl. Quitting","details":{"msg":"net::ERR_HTTP2_PROTOCOL_ERROR at https://www.carrefouruae.com/mafuae/en/c/beauty-half-price","page":"https://www.carrefouruae.com/mafuae/en/c/beauty-half-price","workerid":0}}
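To see quickly which requests failed and why, log lines like the two above can be filtered by their `errorText` field. The following is only a minimal sketch: it assumes the crawler emits one JSON object per line, as in the excerpt, and the helper names `failed_requests` and `count_errors` are hypothetical, not part of zimit.

```python
import json
from collections import Counter

# The two log lines quoted in this issue, copied verbatim.
SAMPLE_LOG = [
    '{"timestamp":"2024-06-22T00:53:43.504Z","logLevel":"warn","context":"recorder","message":"Request failed","details":{"url":"https://www.carrefouruae.com/mafuae/en/c/beauty-half-price","errorText":"net::ERR_HTTP2_PROTOCOL_ERROR","page":"https://www.carrefouruae.com/mafuae/en/c/beauty-half-price","workerid":0}}',
    '{"timestamp":"2024-06-22T00:53:43.566Z","logLevel":"fatal","context":"general","message":"Page Load Timeout, failing crawl. Quitting","details":{"msg":"net::ERR_HTTP2_PROTOCOL_ERROR at https://www.carrefouruae.com/mafuae/en/c/beauty-half-price","page":"https://www.carrefouruae.com/mafuae/en/c/beauty-half-price","workerid":0}}',
]


def failed_requests(log_lines):
    """Yield (url, errorText) for every JSON log entry that has an errorText."""
    for line in log_lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # skip anything that is not a JSON log entry
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip truncated or malformed lines
        details = entry.get("details")
        if isinstance(details, dict) and "errorText" in details:
            yield details.get("url", ""), details["errorText"]


def count_errors(log_lines):
    """Count failed requests per error string."""
    return Counter(error for _url, error in failed_requests(log_lines))
```

On the two lines above, `count_errors(SAMPLE_LOG)` counts a single `net::ERR_HTTP2_PROTOCOL_ERROR`: only the `warn` entry has an `errorText` field, while the `fatal` entry repeats the error inside `msg`.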
@benoit74
Collaborator

Thank you!

@benoit74
Collaborator

benoit74 commented Jul 1, 2024

@benoit74
Collaborator

benoit74 commented Jul 8, 2024

@benoit74
Collaborator

benoit74 commented Jul 8, 2024

ERR_HTTP2_PROTOCOL_ERROR only means that something went wrong at the HTTP/2 protocol level; HTTP/2 itself is supported.

I've tried https://www.carrefouruae.com/mafuae/en/c/beauty-half-price, https://www.cbc.ca/news/canada and https://www.americanas.com.br/ locally, and all three work well.

I've also tried these 3 URLs on the zimit farm, and they all failed again with the same error.

Running curl directly on the worker gives the same error:

root@worker:~# curl -v https://www.carrefouruae.com/mafuae/en/c/beauty-half-price
*   Trying 2a02:26f0:9100:4::1748:f8cd:443...
* Connected to www.carrefouruae.com (2a02:26f0:9100:4::1748:f8cd) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=www.carrefouruae.com
*  start date: Apr 17 23:44:55 2024 GMT
*  expire date: Jul 16 23:44:54 2024 GMT
*  subjectAltName: host "www.carrefouruae.com" matched cert's "www.carrefouruae.com"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x562ce0019b20)
> GET /mafuae/en/c/beauty-half-price HTTP/2
> Host: www.carrefouruae.com
> user-agent: curl/7.74.0
> accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
* HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
* stopped the pause stream!
* Connection #0 to host www.carrefouruae.com left intact
curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)

Forcing IPv4 does not help:

root@worker:~# curl -v4 https://www.carrefouruae.com/mafuae/en/c/beauty-half-price
*   Trying 96.16.248.141:443...
* Connected to www.carrefouruae.com (96.16.248.141) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=www.carrefouruae.com
*  start date: Apr 17 23:44:55 2024 GMT
*  expire date: Jul 16 23:44:54 2024 GMT
*  subjectAltName: host "www.carrefouruae.com" matched cert's "www.carrefouruae.com"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55c86ce2eb20)
> GET /mafuae/en/c/beauty-half-price HTTP/2
> Host: www.carrefouruae.com
> user-agent: curl/7.74.0
> accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
* HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
* stopped the pause stream!
* Connection #0 to host www.carrefouruae.com left intact
curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)

This really looks like a problem on our worker, but I have no clue where to look.
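One way to narrow this down, which has not been tried in this thread, would be to repeat the same request from the worker with HTTP/1.1 forced and compare the curl exit codes. The sketch below only builds and runs the curl command lines. The helper names `build_curl_args` and `probe` are hypothetical. The flags `--http1.1`, `--http2`, `-4`, `-6` and `--max-time` are standard curl options, and the exit-code meanings are taken from the curl man page.

```python
import subprocess

# Meaning of a few curl exit codes, as documented in the curl man page.
CURL_EXIT_MEANING = {
    0: "ok",
    16: "HTTP/2 framing layer error",
    28: "operation timeout",
    92: "HTTP/2 stream error",  # the code seen in the output above
}


def build_curl_args(url, http_version="2", ip_version=None, max_time=30):
    """Build a curl command line that forces a given HTTP (and optionally IP) version."""
    if http_version not in ("1.1", "2"):
        raise ValueError("http_version must be '1.1' or '2'")
    args = ["curl", "-sS", "-o", "/dev/null", "--max-time", str(max_time)]
    args.append("--http1.1" if http_version == "1.1" else "--http2")
    if ip_version in (4, 6):
        args.append(f"-{ip_version}")  # -4 or -6
    args.append(url)
    return args


def probe(url, **kwargs):
    """Run curl and return (exit code, short meaning, stderr text)."""
    result = subprocess.run(
        build_curl_args(url, **kwargs), capture_output=True, text=True
    )
    meaning = CURL_EXIT_MEANING.get(result.returncode, "other")
    return result.returncode, meaning, result.stderr.strip()
```

Calling `probe(url, http_version="2", ip_version=4)` and then `probe(url, http_version="1.1", ip_version=4)` on the worker would show whether the failure is specific to HTTP/2 on that network path. If the HTTP/1.1 request also fails, the HTTP/2 error is probably only a symptom.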

@benoit74
Collaborator

benoit74 commented Jul 8, 2024

Many HTTP/2 websites work well on our worker, e.g. https://http2.github.io/ is ok: https://farm.zimit.kiwix.org/pipeline/0cc1c105-b5c8-4e7e-b1cf-e72b4adf37a2

@benoit74 benoit74 changed the title Is HTTP2 an issue? Dome websites are raising HTTP2 errors on sisyphus worker Jul 8, 2024
@benoit74 benoit74 added the bug label Jul 8, 2024
@benoit74 benoit74 changed the title Dome websites are raising HTTP2 errors on sisyphus worker Some websites are raising HTTP2 errors on sisyphus worker Jul 8, 2024