spamdetection

Project repository for the spam detection Security Analytics (SECANA) project.

description

The goal of our project was to classify an email as spam or ham (not spam) by analysing its content with natural language processing (NLP). After processing and exploring the datasets, we compared different features and classifiers. We decided to use the RandomForest classifier with TFIDF (computed on the parsed and preprocessed texts of the email bodies) plus the number of chars and tokens as features.
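To make the feature layout more concrete, here is a minimal sketch in plain Python. It is not the code from the notebooks: the function names `tokenize` and `build_features` are made up for this example, the tokenizer is a bare whitespace split, and the weighting is one common smoothed TF-IDF variant without normalisation. The preprocessing actually used in the project is described in preprocessing.ipynb.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: a rough tokenizer that lowercases and splits on whitespace.
    return text.lower().split()


def build_features(bodies):
    """Return (vocabulary, rows); each row is TF-IDF weights + [n_chars, n_tokens]."""
    tokenized = [tokenize(body) for body in bodies]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(bodies)
    # Document frequency: in how many bodies each term occurs at least once.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    rows = []
    for body, toks in zip(bodies, tokenized):
        counts = Counter(toks)
        # Term count times a smoothed inverse document frequency.
        tfidf = [
            counts[term] * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term in vocabulary
        ]
        # Append the two extra features: number of chars and number of tokens.
        rows.append(tfidf + [len(body), len(toks)])
    return vocabulary, rows


vocabulary, rows = build_features(["win money now", "meeting at noon"])
```

Each row could then be handed to a classifier such as a random forest together with its spam/ham label.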

Besides the model, we've also developed a web interface for checking your emails (simply upload the mail as an .eml file) and a Mozilla Thunderbird plugin that uses this web interface and provides in-app feedback. More information about these projects can be found in their linked repositories.
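As an illustration of what reading an uploaded .eml file can look like, the following sketch pulls the body text out of a raw message using only Python's standard `email` package. The helper name `extract_body` is hypothetical and this is not necessarily how the web interface does it.

```python
from email import message_from_bytes, policy


def extract_body(raw_eml: bytes) -> str:
    """Return the plain-text (or, failing that, HTML) body of a raw .eml message."""
    # policy.default yields an EmailMessage, which offers get_body().
    msg = message_from_bytes(raw_eml, policy=policy.default)
    part = msg.get_body(preferencelist=("plain", "html"))
    return part.get_content() if part is not None else ""


# Made-up sample message for demonstration purposes.
sample = (
    b"From: a@example.com\r\n"
    b"To: b@example.com\r\n"
    b"Subject: Hi\r\n"
    b"Content-Type: text/plain; charset=utf-8\r\n"
    b"\r\n"
    b"Cheap pills, click now!\r\n"
)
body = extract_body(sample)
```

The returned string is the text that would go through preprocessing before classification.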

dataset info

We've used the following datasets: the SpamAssassin corpus and the TREC 2007 corpus (trec07p).

At least the TREC dataset cannot be uploaded due to licensing, so please download the datasets yourself.

The directories should look like this:

datasets/
├── spamassassin (files not provided)
│   ├── ham [6951 entries exceeds filelimit, not opening dir]
│   └── spam [2397 entries exceeds filelimit, not opening dir]
├── trec07p (files not provided)
│   ├── ham [25220 entries exceeds filelimit, not opening dir]
│   └── spam [50199 entries exceeds filelimit, not opening dir]
└── splitTRECFiles.sh - helper for splitting the trec dataset files

Keep in mind that there might be some files in the datasets that have to be removed, e.g. READMEs or files that contain the copy-commands.
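A small helper can list such candidates before you remove anything. This is only a sketch: the file names in `SUSPECT_NAMES` are assumptions and the function name `find_stray_files` is made up, so check your own download and adjust the set accordingly.

```python
from pathlib import Path

# Assumed names of non-mail files; adjust after inspecting your download.
SUSPECT_NAMES = {"readme", "readme.txt", "cmds"}


def find_stray_files(dataset_root):
    """List files under dataset_root whose name looks like a README or command list."""
    root = Path(dataset_root)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.name.lower() in SUSPECT_NAMES
    )


if __name__ == "__main__":
    # Only prints the candidates; deleting them is left to you.
    for path in find_stray_files("datasets"):
        print(path)
```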

project structure

To train the model from the datasets alone (spam and ham files), run spam-detection.ipynb. Design decisions and plots can be found in preprocessing.ipynb, features.ipynb, vocabulary.ipynb and model.ipynb.

spam-detection/
├── datasets
│   ├── spamassassin (files not provided)
│   ├── trec07p (files not provided)
│   └── splitTRECFiles.sh - helper for splitting the trec dataset files
├── exports
│   ├── docu-de.pdf - documentation in German
│   ├── model.sav - export of the created model
│   └── vocab.sav - export of the vocabulary for the model
├── markdown - markdown exports of the jupyter notebook files
│   ├── features/
│   ├── preprocessing/
│   ├── model.md
│   ├── spam-detection.md
│   └── vocabulary.md
├── plugin - submodule containing code and info about the thunderbird plugin
├── resources
│   ├── bad_domains.txt - downloaded content of https://dbl.oisd.nl/ , remove headers (not provided)
│   ├── bad_domains.db - created via preprocessing.ipynb from bad_domains.txt (not provided)
│   └── words-dwyl-github.txt - https://github.com/dwyl/english-words/blob/master/words.txt (not provided)
├── webinterface - submodule containing code and info about the webinterface
├── LICENSE
├── README.md - **you are here**
├── features.ipynb - details about features (ideas and importances) and outliers
├── model.ipynb - decisions for using the final classifier
├── preprocessing.ipynb - preprocessing incl. design thoughts and canceled ideas
├── spam-detection.ipynb - main jupyter notebook that contains only the necessary steps to train the model
└── vocabulary.ipynb - design decision for the vocabulary (using intersections)

konstantingoretzki/spamdetection
