
Alex lemmatizer classifier 2 #1422

Open: wants to merge 10 commits into dev from alex_lemmatizer_classifier_2
Conversation

AngledLuffa (Collaborator)

Add a word classifier to cover ambiguous lemmas such as 's

@AngledLuffa force-pushed the alex_lemmatizer_classifier_2 branch 30 times, most recently from 21cb859 to f8455f4 on September 16, 2024
@AngledLuffa force-pushed the alex_lemmatizer_classifier_2 branch 20 times, most recently from 74e827f to b7f63a4 on September 19, 2024
SecroLoL and others added 10 commits September 19, 2024 14:46
… token in English or other lemmas with ambiguous resolutions

Includes data processing class for extracting sentences of interest

Has evaluation functions for single examples and for multiple examples

Adds utility functions for loading dataset from file and handling unknown tokens during embedding lookup
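A minimal sketch of what the unknown-token fallback during embedding lookup could look like (the `vocab_map` dict and `UNK_ID` constant are illustrative names, not the PR's actual identifiers):

```python
import torch

UNK_ID = 0  # illustrative: reserve one index for out-of-vocabulary words

def words_to_ids(words, vocab_map):
    """Map words to embedding row indices, falling back to UNK_ID for
    words missing from the pretrained vocabulary."""
    return torch.tensor([vocab_map.get(w.lower(), UNK_ID) for w in words])

# example: "Stanza" is in the toy vocab, "frobnicate" is not
print(words_to_ids(["Stanza", "frobnicate"], {"stanza": 7}))  # tensor([7, 0])
```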

Can use charlm models for training

Includes a baseline which uses a transformer to compare against the LSTM model
Uses AutoTokenizer and AutoModel to load the transformer - can provide a specific model name with the --bert_model flag
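The AutoTokenizer/AutoModel loading described above would look roughly like this; `roberta-base` is only a placeholder default, and the exact wiring of `--bert_model` into the loading code is my assumption:

```python
import argparse
from transformers import AutoTokenizer, AutoModel

parser = argparse.ArgumentParser()
parser.add_argument("--bert_model", default="roberta-base",
                    help="which HuggingFace transformer to use for the baseline")
args = parser.parse_args([])  # empty list so the sketch runs standalone

tokenizer = AutoTokenizer.from_pretrained(args.bert_model)
model = AutoModel.from_pretrained(args.bert_model)
```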

Includes a feature to drop certain lemmas, or rather, to only accept lemmas if they match a regex.  This will be particularly useful for a language like Farsi, where the training data has only 6 and 1 examples of the 3rd and 4th most common expansions, respectively
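A sketch of how such a regex filter might work (the function and variable names here are illustrative, not the PR's):

```python
import re

def keep_example(lemma, accept_pattern):
    """Accept a training example only if its lemma matches the regex,
    dropping expansions too rare to learn from."""
    return re.fullmatch(accept_pattern, lemma) is not None

pairs = [("'s", "be"), ("'s", "have"), ("'s", "us")]
# keep only the two common English 's expansions
filtered = [(w, l) for w, l in pairs if keep_example(l, "be|have")]
print(filtered)  # [("'s", 'be'), ("'s", 'have')]
```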

Automatically extract the label information from the dataset.
Save the label_decoder in the regular model and the transformer baseline model.
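Extracting the labels from the dataset and storing the decoder alongside the weights could be as simple as the following sketch (the checkpoint layout here is hypothetical):

```python
import torch

def build_label_decoder(lemmas):
    """Assign each distinct lemma a contiguous label id."""
    return {lemma: idx for idx, lemma in enumerate(sorted(set(lemmas)))}

label_decoder = build_label_decoder(["be", "have", "be", "us", "have"])
# saving the decoder inside the checkpoint keeps weights and labels in sync
torch.save({"params": {}, "label_decoder": label_decoder}, "classifier.pt")
```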

Word vectors are trainable in the LSTM model
Word vectors used are the ones shipped with Stanza for the language in question, not specifically GloVe.  This allows using word vectors for whichever language we are working with
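In PyTorch terms, making a shipped embedding matrix trainable is a one-liner; the random matrix below merely stands in for whatever pretrain file Stanza ships for the language:

```python
import torch
import torch.nn as nn

# stand-in for the language-specific vectors shipped with Stanza
pretrained_matrix = torch.randn(10000, 100)
# freeze=False makes the vectors trainable so they can be finetuned
embedding = nn.Embedding.from_pretrained(pretrained_matrix, freeze=False)
```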

Model selection during training loop done using eval set performance - both baseline and LSTM model
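The keep-the-best-dev-model loop presumably looks something like this sketch (all helpers here are stand-ins, not the PR's functions):

```python
import random

def train_one_epoch(model): pass                   # stand-in training step
def evaluate(model, data): return random.random()  # stand-in dev scorer
def save_checkpoint(model): pass                   # stand-in for torch.save

model, dev_set = object(), []
best_score = float("-inf")
for epoch in range(20):
    train_one_epoch(model)
    score = evaluate(model, dev_set)
    if score > best_score:  # checkpoint only when the dev score improves
        best_score = score
        save_checkpoint(model)
```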

Training/testing done via batch processing for speed

Include UPOS tags in data processing/loading for files.  We then use UPOS embeddings for the words in the LSTM model as an additional signal for the query word
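Concatenating a UPOS embedding onto each word embedding before the LSTM might look like this (all dimensions are illustrative):

```python
import torch
import torch.nn as nn

word_emb = nn.Embedding(10000, 100)  # vocab size / dims are illustrative
upos_emb = nn.Embedding(18, 20)      # 17 UPOS tags plus a padding slot

word_ids = torch.tensor([[4, 17, 256]])
upos_ids = torch.tensor([[5, 0, 11]])
# the LSTM then sees both signals for every token, including the query word
lstm_input = torch.cat([word_emb(word_ids), upos_emb(upos_ids)], dim=-1)
print(lstm_input.shape)  # torch.Size([1, 3, 120])
```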

Implement multihead attention option for LSTM model
Add positional encodings to MultiHeadAttention layer of the LSTM model.
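A sketch of that attention option: fixed sinusoidal positional encodings added before a standard multihead attention layer (the layer sizes are made up):

```python
import math
import torch
import torch.nn as nn

def sinusoidal_pe(seq_len, dim):
    """Fixed sinusoidal positional encodings, as in Vaswani et al. 2017."""
    pos = torch.arange(seq_len).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2) * (-math.log(10000.0) / dim))
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

attn = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
x = torch.randn(1, 12, 128)     # (batch, seq_len, dim)
x = x + sinusoidal_pe(12, 128)  # inject position info before attention
out, _ = attn(x, x, x)
```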

The common train() method from the two trainer classes is factored out into one parent class.  Should make it easier to update pieces and keep them in sync
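Structurally, the shared train() in a parent class would resemble this sketch (class names and hooks are illustrative):

```python
class BaseTrainer:
    """Owns the one shared training loop; subclasses only build their model."""
    def build_model(self):
        raise NotImplementedError

    def train(self, dataset):
        model = self.build_model()
        # ... the shared loop: batching, eval-based model selection, saving ...
        return model

class LSTMTrainer(BaseTrainer):
    def build_model(self):
        return "LSTM model goes here"             # placeholder

class TransformerBaselineTrainer(BaseTrainer):
    def build_model(self):
        return "transformer baseline goes here"   # placeholder
```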

Keep the dataset in a single object rather than a bunch of lists.  Makes it easier to shuffle, keeps everything in one place
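Keeping the parallel lists inside one object makes shuffling safe, since everything moves with one permutation; a minimal sketch (the field names are guesses):

```python
import random
from dataclasses import dataclass

@dataclass
class LemmaDataset:
    sentences: list
    upos_tags: list
    labels: list

    def shuffle(self):
        """Shuffle all fields with one permutation so they stay aligned."""
        order = list(range(len(self.labels)))
        random.shuffle(order)
        self.sentences = [self.sentences[i] for i in order]
        self.upos_tags = [self.upos_tags[i] for i in order]
        self.labels = [self.labels[i] for i in order]
```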

Don't save the transformer, charlm, or original word vector file in the model files.  Word vectors are finetuned and the deltas are saved.
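The delta trick could work roughly like this: only the difference between finetuned and original vectors is written to disk, so the shipped pretrain file is never duplicated (tensor shapes and checkpoint keys are illustrative):

```python
import torch

original = torch.randn(10000, 100)  # stands in for the shipped pretrain file
finetuned = original + 0.01 * torch.randn_like(original)

torch.save({"wv_delta": finetuned - original}, "model.pt")

# at load time, reapply the delta to a fresh copy of the original vectors
restored = original + torch.load("model.pt")["wv_delta"]
assert torch.allclose(restored, finetuned)
```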

import full path
… charlms if they exist

run_lemma_classifier.py now automatically tries to pick a save name and training filename appropriate for the dataset being trained.  Still need to calculate the lemmas to predict and use a language-appropriate wordvec file before we can do other languages, though

Add the ability to use run_lemma_classifier.py in --score_dev mode
Add --score_test to the lemma_classifier as well
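The flag handling for these scoring modes might look like the following argparse sketch (the actual option plumbing in run_lemma_classifier.py may differ):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--score_dev", action="store_true",
                    help="score a saved model on the dev set instead of training")
parser.add_argument("--score_test", action="store_true",
                    help="score a saved model on the test set instead of training")
args = parser.parse_args(["--score_dev"])  # example invocation

mode = "TRAIN"
if args.score_dev:
    mode = "SCORE_DEV"
elif args.score_test:
    mode = "SCORE_TEST"
print(mode)  # SCORE_DEV
```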

Connects the transformer baseline to the run_lemma_classifier script

Reports the dev & test scores when running in TRAIN mode
… dataset

fa_perdt, ja_gsd, AR, HI as current options for the lemma classifier
This requires using a target regex instead of target word to make it simpler to match multiple words at once in the data preparation code
Add a sample 9/2/2 dataset and test that it gets read in a way we might like
… LemmaClassifier model

Call evaluate_model just in case, although the expectation is that the F1 isn't going to be great
… Will be useful for integrating with the Pipeline