
URL Links #109

Open
akul-goyal opened this issue Mar 13, 2023 · 2 comments

Comments

@akul-goyal

Is it possible to get access to the URLs (or any other website information) from which the data was scraped to generate the Pile?

@dboggs95

@akul-goyal Read their paper, page 14.
https://arxiv.org/abs/2101.00027
https://arxiv.org/pdf/2101.00027.pdf

If I understand correctly, due to the copyrighted nature of some of their datasets, they don't host direct links to all of them.

However, many of the links in the readme point to scripts that will download them. I have only used Project Gutenberg so far, but I assume that if you run pile.py with the --force-download flag it will download all 1.2 TB of data, minus the Books3 data source from Bibliotik, which must be commented out of the code in order for it to work.
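For anyone who wants to try it, a minimal sketch of that invocation (assuming the pile.py entry point and --force-download flag mentioned above; treat it as illustrative rather than an exact recipe):

    # From a checkout of this repository, with its Python dependencies installed:
    # 1. comment out the Books3 (Bibliotik) dataset entry in pile.py
    # 2. then kick off the full download (~1.2 TB as noted above)
    python pile.py --force-download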

@akul-goyal
Author

Hi @dboggs95, thanks for the response. I was interested in more fine-grained website information rather than links to the actual datasets. For example, for the YouTube captions dataset, I am interested in the URL of each YouTube video used to collect the data. This GitHub repo currently contains scraping scripts to collect data, but it does not list the specific links used to create the Pile. Furthermore, even if those URLs do exist somewhere, it is not clear to me that there is a mapping from each URL to the scraped text.
