This pipeline is part of a DSSG Berlin e.V. volunteering project that ran from 2022 to 2023. The project supported an aid organisation in uncovering the different services that sub-organisations offer to help homeless people. This public repository is for demonstration purposes and does not include all project outputs.

Pipeline

[Image: pipeline diagram]

Thanks for the Contributions

Contribute to add your name to the list of contributors.

Instructions to run the Pipeline

Create a GitHub Codespace for this Repository

Create and open a codespace with 8 GB RAM on the repository's main branch.

This will open a new tab in your browser

Go to https://github.com/codespaces

Change the codespace machine type to one with 8 GB of RAM.

For the machine type change to take effect, stop the codespace:

https://docs.github.com/en/codespaces/developing-in-codespaces/stopping-and-starting-a-codespace#stopping-a-codespace

Then restart it:

https://docs.github.com/en/codespaces/developing-in-codespaces/stopping-and-starting-a-codespace#restarting-a-codespace

Open the codespace tab in your browser and enter the following commands in the codespace terminal.

Run the Pipeline

Install dependencies

pip install -r pipeline/tfidf-fasttext-pipe-codespace/pipeline_requirements.txt

Download the fastText vector model provided by deepset.ai

wget https://s3.eu-central-1.amazonaws.com/int-emb-fasttext-de-wiki/20180917/model.bin

Unzip the keyword vectorizer

unzip data-assets/vectorizer.zip -d data-assets

Create and save keywords for each document using TF-IDF. This step also downloads the sample anonymized dataset and the sklearn TF-IDF vectorizer.

python -W ignore pipeline/tfidf-fasttext-pipe-codespace/02_extract_keywords.py
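To illustrate what TF-IDF keyword extraction does, here is a simplified, standard-library-only sketch. It is not the code in `02_extract_keywords.py`, which uses a fitted sklearn vectorizer. The function name `top_keywords`, the whitespace tokenisation, and the sample documents are assumptions for illustration. The IDF term follows the smoothed form `ln((1 + n) / (1 + df)) + 1`.

```python
import math
from collections import Counter


def top_keywords(docs, k=3):
    """Return the k highest-scoring TF-IDF terms for each document."""
    # Naive tokenisation: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results


if __name__ == "__main__":
    sample = ["shelter beds shelter", "food bank food", "shelter food advice"]
    print(top_keywords(sample, k=1))
```

Terms that are frequent in one document but rare across the collection score highest, which is why they serve as keywords for that document.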

Run the text search with predefined terms and a cosine similarity cutoff

python pipeline/tfidf-fasttext-pipe-codespace/03_search_documents_for_topic.py
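The idea behind a search with a cosine similarity cutoff can be sketched as follows. This is an illustrative example under stated assumptions, not the logic of `03_search_documents_for_topic.py`. The function names, the cutoff value of 0.6, and the two-dimensional toy vectors (standing in for fastText embeddings of the search term and the document keywords) are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat zero vectors as dissimilar to everything.
    return dot / (norm_a * norm_b)


def search(topic_vector, keyword_vectors, cutoff=0.6):
    """Return (doc_id, best_similarity) for documents at or above the cutoff.

    keyword_vectors maps a document id to a list of keyword vectors.
    A document matches if its best keyword reaches the cutoff.
    """
    hits = []
    for doc_id, vectors in keyword_vectors.items():
        best = max(
            (cosine_similarity(topic_vector, v) for v in vectors),
            default=0.0,
        )
        if best >= cutoff:
            hits.append((doc_id, best))
    # Most similar documents first.
    return sorted(hits, key=lambda hit: -hit[1])


if __name__ == "__main__":
    toy = {"a": [[1, 0], [0, 1]], "b": [[0, 1]], "c": [[1, 1]]}
    print(search([1, 0], toy))
```

Raising the cutoff returns fewer, more closely related documents, while lowering it widens the search at the cost of more false matches.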

Licenses

The FastText word vectors are distributed under the Creative Commons Attribution-Share-Alike License 3.0. They are described in P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, "Enriching Word Vectors with Subword Information", and are supplied by https://www.deepset.ai/german-word-embeddings.