This pipeline is part of a DSSG Berlin e.V. volunteering project that ran from 2022 to 2023 to support an aid organisation in uncovering the different services that sub-organisations offer to help homeless people. This public repository is for demonstration purposes and does not include all project outputs.

Pipeline

Thanks for the Contributions

Contribute to add your name to the list of contributors.

your GitHub name
SamerEssa-IT
Saif-Mandour
mediasittich
JOPloume
bsenst

Instructions to run the Pipeline

Create a GitHub Codespace for this Repository

Create and open a codespace with 8 GB RAM on the repository's main branch

This will open a new tab in your browser

Go to https://github.com/codespaces

Change the codespace machine type to a machine with 8 GB RAM

For the machine type change to take effect, stop the codespace ...

https://docs.github.com/en/codespaces/developing-in-codespaces/stopping-and-starting-a-codespace#stopping-a-codespace

... and restart it

https://docs.github.com/en/codespaces/developing-in-codespaces/stopping-and-starting-a-codespace#restarting-a-codespace

Open the codespace tab in your browser and enter the following commands in the codespace terminal

Run the Pipeline

Install dependencies

pip install -r pipeline/tfidf-fasttext-pipe-codespace/pipeline_requirements.txt

Download the fastText vector model provided by deepset.ai

wget https://s3.eu-central-1.amazonaws.com/int-emb-fasttext-de-wiki/20180917/model.bin

Unzip the keyword vectorizer

unzip data-assets/vectorizer.zip -d data-assets

Create and save keywords for each document using TF-IDF. This step also downloads the sample anonymized dataset and the sklearn TF-IDF vectorizer

python -W ignore pipeline/tfidf-fasttext-pipe-codespace/02_extract_keywords.py
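To illustrate what TF-IDF keyword extraction does, here is a minimal sketch in plain Python. It is not the code in `02_extract_keywords.py`, which uses the sklearn vectorizer. The function name `top_keywords`, the whitespace tokenizer, the smoothed IDF formula, and the toy documents are all assumptions made for this example.

```python
import math
from collections import Counter


def top_keywords(docs, k=3):
    """Return the k highest-scoring TF-IDF terms for each document.

    Hypothetical helper for illustration only; the real pipeline
    relies on a fitted sklearn TF-IDF vectorizer instead.
    """
    # Naive tokenization: lowercase and split on whitespace (assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))

    results = []
    for tokens in tokenized:
        if not tokens:
            results.append([])
            continue
        tf = Counter(tokens)
        scores = {
            # Term frequency times a smoothed inverse document frequency.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results


# Toy documents invented for this example, not taken from the dataset.
docs = [
    "emergency shelter for homeless people",
    "debt counselling for people",
    "medical care for homeless",
]
keywords = top_keywords(docs, k=2)
print(keywords)
```

Terms that occur in every document, such as "for" in the toy data, receive the lowest weight and drop out of the keyword list, which is the property that makes TF-IDF useful for picking document-specific keywords.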

Run the text search with predefined terms and a cosine similarity cutoff

python pipeline/tfidf-fasttext-pipe-codespace/03_search_documents_for_topic.py
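The idea of a search with a cosine similarity cutoff can be sketched as follows. This is an illustration under stated assumptions, not the logic of `03_search_documents_for_topic.py`: the function names, the cutoff value of 0.8, and the three-dimensional toy vectors are hypothetical. In the pipeline the vectors would presumably come from the fastText model downloaded above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat zero vectors as dissimilar to everything.
    return dot / (norm_a * norm_b)


def search(term_vector, keyword_vectors, cutoff=0.8):
    """Return (doc_id, score) pairs whose best keyword match reaches the cutoff.

    keyword_vectors maps a document id to the vectors of its keywords.
    Hypothetical helper; the cutoff default is an assumption.
    """
    hits = []
    for doc_id, vectors in keyword_vectors.items():
        # A document is scored by its most similar keyword.
        best = max(
            (cosine_similarity(term_vector, v) for v in vectors),
            default=0.0,
        )
        if best >= cutoff:
            hits.append((doc_id, best))
    # Most similar documents first.
    return sorted(hits, key=lambda hit: -hit[1])


# Toy keyword vectors invented for this example.
toy_vectors = {
    "doc_a": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "doc_b": [[0.0, 1.0, 0.0]],
    "doc_c": [[0.7, 0.7, 0.0]],
}
hits = search([1.0, 0.0, 0.0], toy_vectors, cutoff=0.8)
print(hits)
```

Lowering the cutoff returns more documents at the cost of weaker matches, so the threshold trades recall against precision.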

Licenses

FastText is licensed under the Creative Commons Attribution-Share-Alike License 3.0, as described in P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information, and is supplied by https://www.deepset.ai/german-word-embeddings.
