Handle non-ascii characters in url #193

Open
kolesar-andras opened this issue Nov 11, 2019 · 0 comments · May be fixed by #194

The Zombie driver fails when a URL contains "high bytes", that is, non-ASCII characters. The following example is a valid URL containing a Hungarian word with accented characters.

https://hu.wikipedia.org/wiki/Műemlék

Desktop browsers and Mink Goutte driver translate the high bytes correctly:

https://hu.wikipedia.org/wiki/M%C5%B1eml%C3%A9k

The Zombie driver sends the string as-is to JavaScript, and bytes above 0x7f then get corrupted somewhere in Zombie:

https://hu.wikipedia.org/wiki/Mqeml\xe9k

The way the characters are mangled is a bit strange:

  • the letter é becomes \xe9, which is its character code in ISO-8859-1
  • the letter ű becomes q, because this character does not exist in that code page

Characters that don't exist in ISO-8859-1 are replaced with regular letters such as q, so the damage is irreversible.
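One possible reading of the output, offered as an observation rather than a confirmed diagnosis of Zombie's internals: keeping only the low 8 bits of each character's Unicode code point reproduces both results. The letter ű is U+0171, whose low byte 0x71 is ASCII q, while é is U+00E9 and already fits in one byte. The sketch below is in Python purely for illustration, and the helper name `truncate_to_low_byte` is hypothetical.

```python
def truncate_to_low_byte(text: str) -> bytes:
    """Keep only the low 8 bits of every code point (hypothetical helper).

    This mimics a lossy "one character = one byte" conversion; it is an
    assumption about what happens, not code taken from Zombie.
    """
    return bytes(ord(ch) & 0xFF for ch in text)


damaged = truncate_to_low_byte("Műemlék")
print(damaged)  # b'Mqeml\xe9k'
```

Because U+0171 and U+0071 both map to the byte 0x71, the original letter cannot be recovered from the result.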

The example shows that desktop browsers translate non-ASCII characters to percent-encoded bytes using their UTF-8 encoding:

  • letter é becomes %C3%A9
  • letter ű becomes %C5%B1

That's correct: web servers expect URLs in this form.
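A minimal sketch of the expected translation, assuming only the path component needs encoding. It is written in Python for illustration and is not the change proposed in the linked pull request; the helper name `percent_encode_path` is hypothetical. `%` is kept in the safe set so that a URL that is already percent-encoded is not encoded twice.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def percent_encode_path(url: str) -> str:
    """Percent-encode non-ASCII characters in the URL path as UTF-8 bytes.

    Hypothetical helper: query strings and host names are left untouched.
    """
    parts = urlsplit(url)
    # quote() encodes str input as UTF-8 by default; "/" and "%" stay as-is.
    encoded_path = quote(parts.path, safe="/%")
    return urlunsplit(parts._replace(path=encoded_path))


print(percent_encode_path("https://hu.wikipedia.org/wiki/Műemlék"))
# https://hu.wikipedia.org/wiki/M%C5%B1eml%C3%A9k
```

Applying the same function to the already encoded URL returns it unchanged, so it could be called before handing any URL to the JavaScript side.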

@kolesar-andras kolesar-andras linked a pull request Nov 11, 2019 that will close this issue