Language Detection Using BERT - Base, Cased Multilingual

Overview

Using the pretrained BERT multilingual base (cased) model, a language detection model was devised. The model was fine-tuned on the Wiki-40B multilingual dataset, which contains Wikipedia entries in 41 different languages; the model was trained on 16 of them. You may find the dataset here.
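For reference, the dataset is also available through TensorFlow Datasets. A minimal sketch of loading one language's split (the wiki40b config names and the text feature are standard TFDS identifiers; the language code here is just an example):

    import tensorflow_datasets as tfds

    # Each language has its own config, e.g. "wiki40b/en" for English or
    # "wiki40b/fr" for French; splits are "train", "validation", and "test".
    ds = tfds.load("wiki40b/en", split="train")

    # Every example is a dict whose "text" field holds the article body.
    for example in ds.take(1):
        print(example["text"].numpy()[:200])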

Usage

Prerequisites

  • TensorFlow: >> pip install tensorflow
  • TensorFlow Hub: >> pip install tensorflow-hub
  • TensorFlow Datasets: >> pip install tensorflow-datasets
  • TensorFlow Text: >> pip install tensorflow-text --no-dependencies
  • scikit-learn: >> pip install scikit-learn

Please note that we are making use of the --no-dependencies flag because of an error that TensorFlow Text throws, as described in this GitHub Issue. If you have already installed TensorFlow Text, it is recommended that you uninstall and reinstall it.

Please also note that after installing TensorFlow Text with this specific flag, you will need to import the package to register a few ops, as highlighted here and sketched below.
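A minimal sketch of that registration import (importing tensorflow_text for its side effects is what registers the custom ops; nothing from the module needs to be called directly):

    import tensorflow as tf
    import tensorflow_text  # noqa: F401 -- imported only to register TF Text ops

    # After this import, models that rely on TF Text ops (such as the BERT
    # preprocessing layers used here) can be loaded and run as usual.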

If you want to perform inference, i.e. simply find out what language a given document is written in:

  • Download the complete repository
  • In the same directory as lang_finder.py, download and save the trained model from this link
  • Import the file lang_finder.py and call the function lang_finder.find_language([str]), which accepts a list of strings as input and returns a list of the languages they were written in (see the sketch below)

NOTE: If you changed the set of languages used in modelling.py for custom training, please update the list of languages specified in lang_finder.py as well for it to run correctly.
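A minimal usage sketch, assuming the repository files and the downloaded model sit in the working directory (the example strings and the printed output format are illustrative, not the module's actual output):

    import lang_finder

    # find_language takes a list of strings and returns one predicted
    # language per input string.
    docs = [
        "The quick brown fox jumps over the lazy dog.",
        "Le renard brun saute par-dessus le chien paresseux.",
    ]
    predictions = lang_finder.find_language(docs)

    for doc, lang in zip(docs, predictions):
        print(f"{lang}: {doc}")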

If you want to train a new model directly within Google Colaboratory:

Link To Google Colab

If you want to train a new model locally:

Download the whole repository and run the file modelling.py with the command:

>> python modelling.py

If you want to train on more, or different, languages:

You can find the list of languages available in the Wiki40B dataset at this link. Simply add the languages to the list list_languages in the file modelling.py, update the list in lang_finder.py as well (see the sketch below), and run

>> python modelling.py

Everything else is configured to work automatically; just make sure that lang_finder.py lists the same languages as modelling.py if you make any changes.
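A minimal sketch of what that edit might look like (the variable name list_languages comes from the instructions above; the language codes shown are examples of standard Wiki-40B identifiers, not necessarily the repository's defaults):

    # In modelling.py -- and mirrored in lang_finder.py.
    # Wiki-40B identifies each language by its code, e.g. "wiki40b/pt".
    list_languages = [
        "en",  # English
        "fr",  # French
        "de",  # German
        "pt",  # Portuguese -- an example of a newly added language
    ]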