Inserter

Overview

The inserter repository is designed to perform asynchronous reading and processing of files containing online conversations, such as posts, comments, and tweets. This repository primarily complements real-time data collection systems by expanding and enriching our dataset using sources from various online datasets. The code execution is optimized using Google Cloud’s GPU infrastructure, which enables fine-grain parallelism. This setup allows us to scale our data quickly and efficiently.

Key Features

Fine-grain parallelism, both process- and thread-based, executed on Google Cloud GPUs allows for rapid dataset expansion.
Efficiently reads and processes large datasets in the background, without interrupting the real-time data collection processes.
Enriches existing datasets by adding valuable historical data from offline sources.
Works with a variety of online conversation datasets, including those from Reddit, Wikipedia, and more.

Datasets

These are some of the datasets we use with inserter and insert into our own dataset after processing:

Plain Text Wikipedia (Simple English) A dataset consisting of plain text from Wikipedia articles written in simple English.
Reddit Russia-Ukraine Conflict A collection of posts from Reddit covering discussions about the Russia-Ukraine conflict.
Political Reddit Dataset A dataset containing political discussions from Reddit.

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
.idea		.idea
.gitignore		.gitignore
Dockerfile		Dockerfile
README.md		README.md
database.py		database.py
generator.py		generator.py
main.py		main.py
requirements.txt		requirements.txt
tagger.py		tagger.py
text_processor.py		text_processor.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Inserter

Overview

Key Features

Datasets

About

Contributors 2

Languages

cmpe492-text-processing/inserter

Folders and files

Latest commit

History

Repository files navigation

Inserter

Overview

Key Features

Datasets

About

Resources

Stars

Watchers

Forks

Contributors 2

Languages