This directory contains the 20 Newsgroups dataset, pre-converted into Annif vocabulary and document corpus format.

The script used for the conversion is also included. It uses the scikit-learn `fetch_20newsgroups` function, which is a convenient way of accessing the dataset.

This is the `bydate` flavor of the dataset, which has been split by date into train (n=11314) and test (n=7532) subsets. All header information, as well as quote headers, has been stripped, because it could provide non-topical hints about the newsgroup a message was posted in.
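
For illustration, a conversion along these lines could look roughly like the sketch below. It is not the bundled script. It assumes Annif's TSV conventions (one `<URI>`, a tab, and a label per vocabulary line; document text, a tab, and a `<URI>` per corpus line), a made-up `BASE_URI`, and `remove=("headers", "quotes")` as the stripping option; the helper names are hypothetical as well.

```python
import re

# Hypothetical URI namespace for the newsgroup subjects (an assumption,
# not necessarily the one used in the files in this directory).
BASE_URI = "http://example.org/20newsgroups/"


def subject_uri(name):
    """Wrap a newsgroup name in an angle-bracketed URI."""
    return f"<{BASE_URI}{name}>"


def vocab_line(name):
    """One vocabulary TSV line: <URI>, tab, label."""
    return f"{subject_uri(name)}\t{name}"


def corpus_line(text, name):
    """One corpus TSV line: text on a single line, tab, <URI>."""
    # Collapse newlines and tabs so the text cannot break the TSV layout.
    flat = re.sub(r"\s+", " ", text).strip()
    return f"{flat}\t{subject_uri(name)}"


def convert_subset(subset):
    """Fetch one subset ("train" or "test") and return its corpus lines.

    scikit-learn is imported here so the helpers above can be used
    without it; the first call downloads the dataset.
    """
    from sklearn.datasets import fetch_20newsgroups

    # Assumed stripping options; the actual script may use different ones.
    data = fetch_20newsgroups(subset=subset, remove=("headers", "quotes"))
    return [
        corpus_line(text, data.target_names[label])
        for text, label in zip(data.data, data.target)
    ]
```

Calling `convert_subset("train")` and `convert_subset("test")` and writing the returned lines to two files would give one corpus file per subset, while `vocab_line` applied to each newsgroup name would give the vocabulary file.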