
Timelark Data Pipeline

This basic data pipeline, built in Python, is part of the Timelark project. It reads unstructured text from plain-text files, extracts named entities using spaCy, queries the Aleph API to enrich those entities, and saves the enriched data to an SQLite database, from where it can be visualized.

Prerequisites

  • Python 3.x
  • spaCy and a spaCy model (e.g., en_core_web_lg)
  • dataset (SQLite wrapper library)
  • Aleph API access and an API key (for example, OCCRP's Aleph)
  • confection (for configuration management)

Installation

Clone this repository:

git clone https://github.com/jlstro/timelark-pipeline.git
cd timelark-pipeline

Create a virtual environment and install the required Python packages:

python3 -m venv venv
source venv/bin/activate  
# On Windows: venv\Scripts\activate
python3 -m pip install spacy confection dataset

Download and install the spaCy model (e.g., "en_core_web_lg"):

python3 -m spacy download en_core_web_lg
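As a quick check that the model installed correctly, entity extraction can be sketched like this (the label filter and sample text are illustrative, not part of the pipeline itself):

```python
import spacy

# Entity labels of likely interest to an investigation (illustrative choice)
RELEVANT_LABELS = {"PERSON", "ORG", "GPE"}

def extract_entities(nlp, text):
    """Return (text, label) pairs for entities with relevant labels."""
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents if ent.label_ in RELEVANT_LABELS]

if __name__ == "__main__":
    try:
        nlp = spacy.load("en_core_web_lg")
        print(extract_entities(nlp, "OCCRP published a report on Acme Corp in Vienna."))
    except OSError:
        print("en_core_web_lg is not installed; run the download command above.")
```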

Configuration

  1. Create a configuration file named config.cfg in the root directory of the repository. Define the paths to your database, text files, and other configuration values as needed. Refer to the confection documentation for more information on writing the configuration.

Example config.cfg:

[paths]
db = "./db/data.db"
files = "./text_files"

[aleph]
host = "https://aleph.occrp.org"
collections = 25, 55, 90

The pipeline script expects .txt files in the folder set under files in the config. It will read each file, extract the entities, enrich them, and then store them.
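Reading the configuration back with confection might look like this sketch; the section and key names follow the example config above, and the default path is an assumption:

```python
from confection import Config

def load_config(path="config.cfg"):
    """Parse the pipeline configuration (path relative to the repo root)."""
    return Config().from_disk(path)

if __name__ == "__main__":
    try:
        config = load_config()
        print(config["paths"]["files"], config["aleph"]["host"])
    except FileNotFoundError:
        print("config.cfg not found; create it as described above.")
```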

Make sure you set your Aleph API key as an environment variable named ALEPH_API_KEY.
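A minimal enrichment call might look like the sketch below. The `/api/2/entities` endpoint and `ApiKey` authorization scheme follow Aleph's HTTP API, but the query parameters and host constant here are illustrative:

```python
import os
import requests

ALEPH_HOST = "https://aleph.occrp.org"  # matches the example config above

def aleph_headers():
    """Build the auth header from the ALEPH_API_KEY environment variable."""
    api_key = os.environ["ALEPH_API_KEY"]
    return {"Authorization": f"ApiKey {api_key}"}

def search_entities(query, host=ALEPH_HOST):
    """Search Aleph for entities matching a free-text query (illustrative parameters)."""
    resp = requests.get(
        f"{host}/api/2/entities",
        params={"q": query},
        headers=aleph_headers(),
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("results", [])

if __name__ == "__main__":
    for result in search_entities("Acme Corp"):
        print(result.get("id"), result.get("schema"))
```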

Running the Pipeline

Run the main script to start the pipeline:

python3 main.py

The pipeline will read text files from the specified directory, extract entities, enrich them using the API, and save the enriched data to the SQLite database.

To-Do

  • Add support for events
  • Add relationship extraction, for example using spacy-llm
  • De-duplicate entities and fuzzy match
  • Convert enriched entities into FollowTheMoney (FtM) format
  • Improve the extractor to work with other types of structured information from news articles, for example a person's death
  • Add blacklist/whitelist support to define a clearer scope of which entities may be interesting for a given investigation
