This directory contains the 20 Newsgroups dataset, pre-converted into Annif vocabulary and document corpus format.

The script used for the conversion is also included. It uses the scikit-learn `fetch_20newsgroups` function, which is a convenient way of accessing the dataset.

This is the bydate flavor of the dataset, which has been split into train (n=11314) and test (n=7532) subsets by date. All header information as well as quote headers, which could provide non-topical hints about the newsgroup a message was posted in, have been stripped.
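
The conversion described above could look roughly like the sketch below. It is not the bundled script: the base URI, the helper names, and the `remove=("headers", "quotes")` argument are assumptions made for illustration. The output follows Annif's TSV conventions, where a vocabulary line is `<URI>`, a tab, then the label, and a corpus line is the document text, a tab, then `<URI>`.

```python
import re
from typing import List, Sequence, Tuple

# Hypothetical namespace for subject URIs; the bundled script may use another.
BASE_URI = "http://example.org/20newsgroups/"


def subject_uri(newsgroup: str) -> str:
    """Build a subject URI from a newsgroup name such as 'sci.space'."""
    return BASE_URI + newsgroup


def vocabulary_lines(newsgroups: Sequence[str]) -> List[str]:
    """One TSV vocabulary line per newsgroup: <URI><TAB>label."""
    return [f"<{subject_uri(name)}>\t{name}" for name in newsgroups]


def corpus_line(text: str, newsgroup: str) -> str:
    """One TSV corpus line: text<TAB><URI>.

    Whitespace runs (including tabs and newlines) are collapsed to single
    spaces so that each document occupies exactly one line.
    """
    flat = re.sub(r"\s+", " ", text).strip()
    return f"{flat}\t<{subject_uri(newsgroup)}>"


def load_subset(subset: str) -> List[Tuple[str, str]]:
    """Fetch the 'train' or 'test' subset as (text, newsgroup) pairs.

    Downloads the data on first use. The remove argument is an assumption
    about which parts of each message are stripped.
    """
    from sklearn.datasets import fetch_20newsgroups

    data = fetch_20newsgroups(subset=subset, remove=("headers", "quotes"))
    return [(text, data.target_names[label])
            for text, label in zip(data.data, data.target)]
```

Calling `load_subset("train")` and passing each pair to `corpus_line` would produce the lines of a training corpus file; `vocabulary_lines` only needs the list of newsgroup names.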
