This pipeline is part of a DSSG Berlin e.V. volunteering project that ran from 2022 to 2023. The project supported an aid organisation in uncovering the different services its sub-organisations offer to help homeless people. This public repository is for demonstration purposes and does not include all project outputs.

dssg-berlin/text-search-pipeline

Pipeline

[Image: pipeline overview diagram]

Thanks for the Contributions

Contribute to add your name to the list of contributors.

Instructions to run the Pipeline

Create a GitHub Codespace for this Repository

Create and open a codespace with 8 GB RAM on the repository's main branch.

This opens a new tab in your browser.

Go to https://github.com/codespaces

Change the codespace machine type to one with 8 GB of RAM.

For the new machine type to take effect, stop the codespace:

https://docs.github.com/en/codespaces/developing-in-codespaces/stopping-and-starting-a-codespace#stopping-a-codespace

Then restart it:

https://docs.github.com/en/codespaces/developing-in-codespaces/stopping-and-starting-a-codespace#restarting-a-codespace

Open the codespace tab in your browser and enter the following commands in the codespace terminal.

Run the Pipeline

Install dependencies

pip install -r pipeline/tfidf-fasttext-pipe-codespace/pipeline_requirements.txt

Download the fastText vector model provided by deepset.ai

wget https://s3.eu-central-1.amazonaws.com/int-emb-fasttext-de-wiki/20180917/model.bin

Unzip the keyword vectorizer

unzip data-assets/vectorizer.zip -d data-assets

Create and save keywords for each document using TF-IDF. This step also downloads the sample anonymized dataset and the sklearn TF-IDF vectorizer.

python -W ignore pipeline/tfidf-fasttext-pipe-codespace/02_extract_keywords.py
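
To illustrate what keyword extraction with TF-IDF means, here is a minimal, standard-library-only sketch. It is not the code in 02_extract_keywords.py, which uses a pre-fitted sklearn vectorizer; the function names `tokenize` and `top_keywords`, the whitespace tokenizer, and the smoothed IDF formula are assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on whitespace, keep alphabetic tokens.
    return [tok for tok in text.lower().split() if tok.isalpha()]


def top_keywords(documents, k=3):
    """Return the k highest-scoring TF-IDF terms for each document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Relative term frequency times a smoothed inverse document frequency.
        scores = {
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for stable output.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results
```

Terms that appear in every document get the lowest IDF weight, so they rank below terms specific to a single document unless they are repeated often within it.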

Run the text search with predefined terms and a cosine similarity cutoff

python pipeline/tfidf-fasttext-pipe-codespace/03_search_documents_for_topic.py
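
The idea of a cosine similarity cutoff can be sketched as follows. This is an illustration under stated assumptions, not the logic of 03_search_documents_for_topic.py: the function names, the data layout (a mapping from document ID to a list of keyword vectors), and the cutoff value of 0.7 are all hypothetical, and the toy vectors stand in for fastText embeddings.

```python
import math


def cosine_similarity(a, b):
    # Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search_documents(term_vector, doc_keyword_vectors, cutoff=0.7):
    """Return (doc_id, best_similarity) pairs whose best keyword match meets the cutoff."""
    hits = []
    for doc_id, keyword_vectors in doc_keyword_vectors.items():
        # A document matches if its closest keyword is similar enough to the search term.
        best = max(
            (cosine_similarity(term_vector, vec) for vec in keyword_vectors),
            default=0.0,
        )
        if best >= cutoff:
            hits.append((doc_id, best))
    # Most similar documents first.
    return sorted(hits, key=lambda hit: -hit[1])
```

Raising the cutoff returns fewer, closer matches; lowering it returns more documents at the cost of looser matches.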

Licenses

The fastText vectors are licensed under the Creative Commons Attribution-Share-Alike License 3.0, as described in P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information, and are supplied by https://www.deepset.ai/german-word-embeddings.
