Suggested corpus: Adult stories #107

johnflux · 2023-02-19T19:31:08Z

I have a corpus of ~10GB of adult stories, in English, in plain text, taken primarily from asstr.org and literotica.
I think it would be interesting to incorporate these into the training set as well.

dboggs95 · 2023-03-21T13:13:58Z

@johnflux I would look at the Pile paper, page 22, under excluded datasets.
https://arxiv.org/abs/2101.00027
https://arxiv.org/pdf/2101.00027.pdf

One of your data sources is directly named and excluded there, and the other probably follows the same rationale. Their reasons for excluding these were quite different from the reasons I would have given had it been my choice (my rationale is x in, x out, where x = {copyright infringement, NSFW content}), but they give a more scientific rationale that you can read there.
