Suggested corpus: Adult stories #107

Open
johnflux opened this issue Feb 19, 2023 · 1 comment

Comments

@johnflux

I have a corpus of ~10GB of adult stories in English, in plain text, taken primarily from asstr.org and Literotica.
I think it would be interesting to incorporate these into the training set as well.

@dboggs95

@johnflux I would look at the Pile paper, page 22, under excluded datasets.
https://arxiv.org/abs/2101.00027
https://arxiv.org/pdf/2101.00027.pdf

One of your data sources is named directly and excluded there, and the other probably falls under the same rationale. Their reasons for excluding these differ from the ones I would have used had it been my choice (my rationale is x in, x out, where x = {copyright infringement, NSFW content}), but they give a more scientific rationale that you can read there.


2 participants