Medieval multilingual sentence alignment and collation using sentence embeddings

matgille/mutilingual_collator

Multilingual collator

This repository contains a set of scripts to align and collate a multilingual medieval corpus. It was designed by Matthias Gille Levenson, Lucence Ing and Jean-Baptiste Camps.

It is based on a fork of the automatic multilingual sentence aligner Bertalign.

For now, the scripts rely on a prior text segmentation step at the syntagm level, which uses regular expressions to match grammatical syntagms and so produce a more precise alignment.
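As an illustration of what regex-based segmentation at the syntagm level can look like, here is a minimal sketch. It is not the project's actual code: the marker list, the `BOUNDARY` pattern and the `segment` function are hypothetical names and choices made for this example, and the real scripts may use different rules for each language.

```python
import re

# Hypothetical marker words that often open a new syntagm.
# The patterns actually used by the project are not shown in this README.
MARKERS = ["et", "que", "car", "mais"]

# A boundary is either whitespace that follows punctuation,
# or whitespace that precedes one of the marker words.
BOUNDARY = re.compile(
    r"(?<=[.,;:!?])\s+|\s+(?=(?:%s)\b)" % "|".join(map(re.escape, MARKERS)),
    flags=re.IGNORECASE,
)


def segment(text):
    """Split a text into syntagm-like units at the boundaries defined above."""
    return [s for s in (part.strip() for part in BOUNDARY.split(text)) if s]


if __name__ == "__main__":
    sample = "Li rois vint a la cort et dist que il voloit partir, car il estoit tart."
    for unit in segment(sample):
        print(unit)
```

Shorter units of this kind give the aligner more candidate boundaries to match across languages than whole sentences would, at the cost of some over-segmentation when a marker word does not actually open a new syntagm.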

Citation

Lei Liu & Min Zhu. 2022. Bertalign: Improved word embedding-based sentence alignment for Chinese–English parallel corpora of literary texts, Digital Scholarship in the Humanities. https://doi.org/10.1093/llc/fqac089.

Licence

This fork is released under the GNU General Public License v3.0.
