Change original URLs for archived ones #1
Comments
Absolutely, I agree. The problem is that automating it might be tricky: some of the links are completely dead, some have been redirected, some return a 404, some show irrelevant data, and some are still alive! Unless we automatically take a copy of them all from the Wayback Machine, it can be really hard (perhaps we can save both the Wayback Machine copy and the page itself if it returns a 200 status). We should be able to use some algorithm to choose an appropriate snapshot (for example, for a 2010 link we might take the first snapshot between 2010 and 2015). I'm not sure how the Wayback Machine APIs work or whether there is a rate limit. Can you contribute to this perhaps? We could even publish the tool in this repository so we can use it in the future too!
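A rough sketch of the snapshot-selection idea, assuming the Wayback Machine CDX server endpoint and its JSON output (a header row followed by one row per capture). The function names `build_cdx_query` and `first_snapshot_url` are made up for illustration, the 2010 to 2015 window is only the example from the comment above, and rate limits are not handled here because they are not known in this thread.

```python
from urllib.parse import urlencode

# Assumed endpoint of the Wayback Machine CDX server.
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(url, first_year, last_year):
    """Build a CDX query for captures of `url` that returned 200 within a year range."""
    params = {
        "url": url,
        "from": str(first_year),
        "to": str(last_year),
        "output": "json",
        "filter": "statuscode:200",  # skip captured redirects and error pages
        "limit": "1",                # ask only for the first matching capture
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def first_snapshot_url(rows):
    """Return the replay URL of the earliest capture in a CDX JSON response, or None."""
    if len(rows) < 2:
        return None  # empty response or header only: nothing was archived
    header, captures = rows[0], rows[1:]
    ts = header.index("timestamp")
    original = header.index("original")
    # Timestamps are fixed-width digit strings, so string order is time order.
    earliest = min(captures, key=lambda row: row[ts])
    return f"https://web.archive.org/web/{earliest[ts]}/{earliest[original]}"
```

Fetching the query URL (for example with `urllib.request.urlopen`) and passing the decoded JSON to `first_snapshot_url` would give a candidate replacement link; a link with no capture in the window would still need to be handled by hand.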
Another solution would be to do this manually, but that could take serious time... I may do it as a hobby, but I will probably need help, as categorising them can be a chore too (perhaps saving them all as PDF if they are not already in PDF?).
@irsdl I started doing something half manually and half automatically. Here's the first test I made to see how that would work: I'll start a PR here ASAP so we can gradually tweak things as necessary. What do you think? I'll also change a few things in the tool I've used and upload it so we can work on that too.
@irsdl We can decide which way is best for archiving purposes, but I think it'll be OK if we just archive them in their original format (be it HTML, PDF, etc.). Tell me what you think.
For 2019 it is easy to do this because the links are still live, and we should be able to just save the endpoints if they are not hosted on SlideShare or something like that. I guess the ultimate approach would be to manually hunt them down one by one and save them in an appropriate format. It is a chore but can become very valuable. I may start doing this in my spare time ;)
What do you think of archiving the original URLs and replacing them with their archived versions? I think it'd make this repo more future-proof.