more intelligent parsing of websites & URLs / URL fields -- check for illegal characters; strip those before inputting #64

Open
jwmh opened this issue Feb 2, 2014 · 2 comments

jwmh commented Feb 2, 2014

SUMMARY:
When website URLs are imported (or even copied and pasted) from other sources, there are sometimes character-set issues: the URL can include invalid but invisible characters.
This makes the URL invalid, but the average user will have no clue what's wrong; they'll just think the business's website is having problems.

STEPS TO REPRODUCE:
0. DON'T change the venue below; keep it as an example use case (until we've fully documented the issue and its causes, and kept a good example of the 'bad' URL string elsewhere).

  1. Take a look at this venue:
    http://portland.activatehub.org/venues/922
  2. Note it was imported via a FB page.
  3. Click the 'website' link:
    Actual URL: http://www.thewaypost.com/%E2%80%8E
    User-visible URL: http://www.thewaypost.com/
  4. Get a 404 Page Not Found error: "The requested URL /‎ was not found on this server."
  5. Delete the trailing slash (and accompanying invisible character(s)) from the address bar, and load the domain name again.
  6. Watch it work perfectly this time.
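
To make the invisible character in step 3 visible, the path can be percent-decoded and each code point listed by name. This is only a minimal sketch for inspection; `describe_path_chars` is a hypothetical helper name, not something from the project's codebase.

```python
import unicodedata
from urllib.parse import unquote, urlsplit


def describe_path_chars(url):
    """Return (code point, Unicode name, category) for each character in the decoded path.

    Hypothetical helper, for diagnosis only.
    """
    path = unquote(urlsplit(url).path)  # percent-decode the path as UTF-8
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"), unicodedata.category(ch))
        for ch in path
    ]


for row in describe_path_chars("http://www.thewaypost.com/%E2%80%8E"):
    print(row)
# ('U+002F', 'SOLIDUS', 'Po')
# ('U+200E', 'LEFT-TO-RIGHT MARK', 'Cf')
```

The second row is the culprit: a character in Unicode category `Cf` (format), which takes up no space on screen.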

dhedlund (Contributor) commented Feb 2, 2014

Google has started adding a Unicode "left-to-right mark" (U+200E; http://en.wikipedia.org/wiki/Left-to-right_mark) at the end of some of the URLs displayed on their properties/sites. I know that Google search results in particular do this; if you copy and paste the displayed URL instead of clicking on the link and copying from the URL bar/omnibox, it'll include the mark (0xE2808E when encoded as UTF-8). Pasting that URL into another program, text field, or URL bar will result in a bad URL exactly as encountered in this ticket. In fact, the URL in this ticket includes the same left-to-right mark, indicating that someone probably copied and pasted it from Google. sigh
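
One possible way to strip such marks before a URL field is saved is sketched below. It assumes that removing every Unicode format character (category `Cf`, which includes U+200E) is acceptable for this application; that is an assumption, since a few `Cf` characters such as the zero-width joiner can be legitimate in some text. `strip_invisible` and the other names are hypothetical and not taken from the project. The sketch handles both the raw character and its percent-encoded form, and leaves other percent-encoded sequences (for example `%2F`) untouched.

```python
import re
import unicodedata
from urllib.parse import quote, unquote

# Runs of consecutive percent-encoded bytes, e.g. "%E2%80%8E".
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _drop_format_chars(text):
    """Remove characters in Unicode category Cf (assumed unwanted here)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")


def _clean_run(match):
    """Decode one percent-encoded run and re-encode it only if something was removed."""
    decoded = unquote(match.group(0))
    cleaned = _drop_format_chars(decoded)
    if cleaned == decoded:
        return match.group(0)  # nothing invisible inside; keep the original bytes
    return quote(cleaned, safe="")


def strip_invisible(url):
    """Hypothetical sanitizer: strip whitespace and Cf characters, raw or percent-encoded."""
    url = _drop_format_chars(url.strip())
    return _PCT_RUN.sub(_clean_run, url)


print(strip_invisible("http://www.thewaypost.com/%E2%80%8E"))
# http://www.thewaypost.com/
```

Running this at import or save time would cover both imported and pasted URLs, at the cost of silently changing what the user entered.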

jwmh (Author) commented Feb 4, 2014

Hm... good to know. Is there a way for me to easily see the entire history of changes to a particular entity (e.g. all changes to venue #922)? (I tried https://google.com/search?q=site:portland.activatehub.org%2Fchanges+AND+waypost , but that doesn't seem to work, alas. Actually, it seems "portland.activatehub.org/changes" isn't indexed at all by Google; maybe that's by design?)
