
URL Links #109

Open
akul-goyal opened this issue Mar 13, 2023 · 2 comments

Comments

@akul-goyal

Is it possible to get access to the URLs (or any other website information) from which the data was scraped to generate the Pile?

@dboggs95

@akul-goyal Read their paper, page 14.
https://arxiv.org/abs/2101.00027
https://arxiv.org/pdf/2101.00027.pdf

If I understand correctly, due to the copyrighted nature of some of their datasets, they don't host direct links to all of them.

However, many of the links in the readme point to scripts that will download them. I have only used Project Gutenberg so far, but I assume that if you run pile.py with the --force-download flag it will download all 1.2 TB of data, minus the Books3 data source from Bibliotik, which must be commented out of the code in order for it to work.
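For anyone who wants to try it, a minimal sketch of that invocation (assuming the pile.py entry point and --force-download flag mentioned above; treat it as illustrative rather than an exact recipe):

    # From a checkout of this repository, with its Python dependencies installed:
    # 1. comment out the Books3 (Bibliotik) dataset entry in pile.py
    # 2. then kick off the full download (~1.2 TB as noted above)
    python pile.py --force-download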

@akul-goyal
Author

Hi @dboggs95, thanks for the response. I was interested in more fine-grained website information rather than links to the actual datasets. For example, for the YouTube captions dataset, I am interested in the URL of each YouTube video used to collect the data. This GitHub repo currently contains scraping scripts to collect data, but it does not list the specific links used to create the Pile. Furthermore, even if those URLs do exist somewhere, it is not clear to me that there is a mapping from each URL to the scraped text.
