Parallelized Corpus Inserter - GCS Run

cmpe492-text-processing/inserter

Inserter

Overview

The inserter repository asynchronously reads and processes files containing online conversations, such as posts, comments, and tweets. It complements our real-time data collection systems by expanding and enriching the dataset with material from existing online datasets. The code runs on Google Cloud's GPU infrastructure with fine-grained parallelism, which lets us grow the dataset quickly and efficiently.

Key Features

  • Fine-grained parallelism, both process- and thread-based, runs on Google Cloud GPUs and allows rapid dataset expansion.
  • Efficiently reads and processes large datasets in the background, without interrupting the real-time data collection processes.
  • Enriches existing datasets by adding valuable historical data from offline sources.
  • Works with a variety of online conversation datasets, including those from Reddit, Wikipedia, and more.
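
As a rough illustration of combining process- and thread-based parallelism, the sketch below fans input files out across worker processes and parses the lines of each file on a thread pool. It is not taken from this repository: the function names (`parse_line`, `process_file`, `process_corpus`), the JSON Lines input format, and the `source`/`text` fields are all assumptions made for the example, and it uses only the Python standard library.

```python
import json
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def parse_line(line):
    """Turn one JSON line into a normalized record (hypothetical schema)."""
    raw = json.loads(line)
    return {
        "source": raw.get("source", "unknown"),
        "text": raw.get("text", "").strip(),
    }


def process_file(path, thread_count=4):
    """Parse every non-empty line of one file on a small thread pool."""
    with open(path, encoding="utf-8") as handle:
        lines = [line for line in handle if line.strip()]
    with ThreadPoolExecutor(max_workers=thread_count) as pool:
        # map() yields results in the same order as the input lines.
        return list(pool.map(parse_line, lines))


def process_corpus(paths, process_count=2, executor_cls=ProcessPoolExecutor):
    """Fan files out across workers and flatten the records in input order.

    executor_cls is injectable so the same code can run on threads only,
    for example in tests or on platforms where processes are costly.
    """
    with executor_cls(max_workers=process_count) as pool:
        per_file = pool.map(process_file, paths)
        return [record for records in per_file for record in records]
```

When calling `process_corpus` with the default `ProcessPoolExecutor` from a script, wrap the call in an `if __name__ == "__main__":` guard so that worker processes can import the module safely on platforms that spawn rather than fork.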

Datasets

These are some of the datasets we process with inserter and then add to our own dataset: