In this brief project we're gonna explore a few NLP tools using a Sklearn dataset and the following modelling techniques: bag of words, Hashing and TF-IDF vectorizer.

gonzaferreiro/NLP_with_20newsgroups

Introduction to Natural Language Processing

Introduction

In this brief project we're gonna explore a few NLP tools using a Sklearn dataset. The 20 newsgroups dataset comprises around 18,000 newsgroup posts on 20 topics, split into two subsets: one for training (or development) and the other for testing (or performance evaluation). The split between the train and test sets is based on whether a message was posted before or after a specific date.

In this project we'll only try to predict four of the dataset's categories:

  • alt.atheism
  • talk.religion.misc
  • comp.graphics
  • sci.space

Feel free to check the dataset documentation to know more about it.
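
As a rough idea of how those four categories can be pulled out of the dataset, here is a minimal sketch using scikit-learn's `fetch_20newsgroups`. The names `CATEGORIES` and `load_subset` are made up for this example and are not necessarily what the notebook uses.

```python
from sklearn.datasets import fetch_20newsgroups

# The four categories this project focuses on.
CATEGORIES = ["alt.atheism", "talk.religion.misc", "comp.graphics", "sci.space"]


def load_subset(subset):
    """Fetch the posts for the four categories.

    `subset` is "train" or "test". The first call downloads the
    dataset, so it needs an internet connection.
    """
    return fetch_20newsgroups(subset=subset, categories=CATEGORIES)
```

Calling `load_subset("train")` returns an object whose `data` attribute holds the raw posts, `target` holds the integer labels and `target_names` holds the category names.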

What you'll find in this repository

  • Introduction to the dataset and its exploration
  • Bag of words model: what it is and application
  • Exploring most common words in several ways
  • Looking at the confusion matrix of our model
  • Using Hashing and TF-IDF: theoretical introduction and application
  • A comparison of classifiers
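
To give a feel for the three vectorizing techniques before opening the notebook, here is a small sketch on a toy corpus. The sentences are made up for illustration and are not taken from the dataset, and `n_features=16` is an arbitrary small value chosen so the output is easy to inspect.

```python
from sklearn.feature_extraction.text import (
    CountVectorizer,
    HashingVectorizer,
    TfidfVectorizer,
)

# Toy corpus, invented for this example (not part of 20 newsgroups).
docs = [
    "the rocket reached orbit",
    "the shader renders the image",
    "orbit of the moon",
]

# Bag of words: one column per vocabulary word, values are raw counts.
bow = CountVectorizer()
X_bow = bow.fit_transform(docs)

# Hashing: words are hashed into a fixed number of columns,
# so no vocabulary has to be stored or fitted.
hasher = HashingVectorizer(n_features=16)
X_hash = hasher.transform(docs)

# TF-IDF: counts are reweighted so that words appearing in
# every document (like "the") weigh less than rarer words.
tfidf = TfidfVectorizer()
X_tfidf = tfidf.fit_transform(docs)

print(X_bow.shape, X_hash.shape, X_tfidf.shape)
```

All three return sparse matrices with one row per document. The bag of words and TF-IDF matrices share the same columns (the fitted vocabulary), while the hashing matrix always has `n_features` columns regardless of the text.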
