This script performs segmentation, lemmatization, normalization and named-entity recognition (NER) on XML-TEI encoded files.
Normalization and NER are still a work in progress.
- Clone or download this repository:

    git clone https://github.com/e-ditiones/Annotator.git
    cd Annotator

- Place the XML files to be processed in the in_XML folder.
- Run the script:

    bash process.sh

- Results are in the out folder:
  - XML: contains the annotated XML files;
  - TSV: contains the annotations in TSV format.
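
The steps above can be wrapped in a small pre-flight check, so that the script is only launched when the input folder actually holds XML files. This is a minimal sketch and not part of the repository: the helper names `count_files` and `check_input` are hypothetical, and the lower-case `.xml` and `.tsv` file extensions are an assumption.

```shell
#!/usr/bin/env bash

# Hypothetical helper: count the regular files with a given extension
# located directly inside a folder. Prints 0 if the folder does not exist.
count_files() {
  local dir="$1"
  local ext="$2"
  if [ ! -d "$dir" ]; then
    echo 0
    return 0
  fi
  find "$dir" -maxdepth 1 -type f -name "*.${ext}" | wc -l | tr -d '[:space:]'
}

# Hypothetical pre-flight check: succeeds only if the input folder
# (in_XML by default) contains at least one .xml file.
check_input() {
  local dir="${1:-in_XML}"
  [ "$(count_files "$dir" xml)" -gt 0 ]
}
```

From the repository root, it could be used like this (the out/XML and out/TSV paths follow the folder layout described above):

    check_input in_XML && bash process.sh
    echo "XML: $(count_files out/XML xml), TSV: $(count_files out/TSV tsv)"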
For lemmatization, we use Pie-extended and the "freem" model.
This repository is developed by Alexandre Bartz with the help of Simon Gabay, as part of the project e-ditiones.
Our work is licensed under a Creative Commons Attribution 4.0 International License.
Pie-extended is under the Mozilla Public License 2.0.
Alexandre Bartz, Simon Gabay. 2020. Lemmatization and normalization of French modern manuscripts and printed documents. Retrieved from https://github.com/e-ditiones/Annotator.