Appending data to the Pile. #99

Open
shankerabhigyan opened this issue Jun 6, 2022 · 1 comment

Comments

@shankerabhigyan

Hi,

I wanted to know whether the Pile will integrate multilingual data anytime soon.
Some organisations in India hold archives of scholarly articles and research work that haven't received the exposure they deserve because of language barriers in international research.

I'd also like more clarity on the key steps that follow once the data has been converted to the jsonlines format.
It's also been mentioned that new data has to follow the lm_dataset format in order to be appended. Could you please explain the key attributes of that format, and how and at what point in the overall process it relates to the final formation of GPT-J?
Thank you.

@dboggs95

@shankerabhigyan Read their paper, page 9.
https://arxiv.org/abs/2101.00027
https://arxiv.org/pdf/2101.00027.pdf

A fully multi-lingual expansion of the Pile is in their future plans.
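
On the jsonlines question, here is a minimal sketch of what one-document-per-line records can look like. It assumes each record is a JSON object with a `text` string and a free-form `meta` object, which is my understanding of how the Pile's files are laid out, not something confirmed in this thread. The helper names `write_jsonl` and `read_jsonl` and the `lang`/`source` metadata keys are hypothetical, and this uses only the Python standard library rather than the project's own tooling.

```python
import io
import json


def write_jsonl(docs, fh):
    """Write (text, meta) pairs as one JSON object per line."""
    for text, meta in docs:
        # ensure_ascii=False keeps non-Latin scripts readable in the file.
        record = {"text": text, "meta": meta}
        fh.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(fh):
    """Yield (text, meta) pairs back from a jsonlines stream."""
    for line in fh:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        yield record["text"], record.get("meta", {})


# Hypothetical example documents; the metadata keys are illustrative only.
docs = [
    ("नमस्ते दुनिया", {"lang": "hi", "source": "example-archive"}),
    ("Hello world", {"lang": "en"}),
]

buf = io.StringIO()  # stands in for a real file opened with encoding="utf-8"
write_jsonl(docs, buf)
buf.seek(0)
roundtrip = list(read_jsonl(buf))
```

Keeping per-document metadata next to the text is what would let a multilingual source carry its language tag through to later filtering steps, assuming the downstream pipeline reads that field.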
