LocationTagger

This repository provides a Location Tagger that identifies locations using a BERT-CRF tagger. When it finds one or more consecutive location words, it groups them into a location chunk using IOB tags.
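For example, given the sentence "Any comments on Pearl Oyster Bar?", the tagger would label Pearl, Oyster, and Bar as B-GPE, I-GPE, and I-GPE respectively, and the chunk "Pearl Oyster Bar" would be returned as a single location.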

This code is based on the BiLSTM-CCM tagger repository.


Requirements

  • Python 3.4+
  • Linux-based system

Installation

Clone

Clone this repository to your local machine.

git clone "https://github.com/dair-iitd/LocationTagger.git"

Environment Setup

Please follow the instructions at the following link to set up Anaconda: https://www.digitalocean.com/community/tutorials/how-to-install-anaconda-on-ubuntu-18-04-quickstart

Set up the conda environment

$ conda env create -f environment.yml

Install the required python packages

$ conda activate location-tagger
$ pip install -r requirements.txt

Set up

Stanford CoreNLP Server

Please download the Stanford CoreNLP server from the following link: http://nlp.stanford.edu/software/stanford-corenlp-latest.zip. The server must be running (on port 9000) when generating the features or outputs.

Use the following command to run the server.

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -preload tokenize,ssplit,pos,lemma,ner,parse,depparse -status_port 9000 -port 9000 -timeout 15000
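
To quickly check that the server is reachable before generating features or outputs, you can send it a small request, for example with Python (this check is not part of the repository's scripts and assumes the requests package is installed):

import json
import requests

# Annotate a short sentence via the CoreNLP server running on port 9000.
props = {"annotators": "tokenize,ssplit,pos", "outputFormat": "json"}
response = requests.post(
    "http://localhost:9000/",
    params={"properties": json.dumps(props)},
    data="We will spend a weekend in NY.".encode("utf-8"),
)
response.raise_for_status()
print([token["word"] for token in response.json()["sentences"][0]["tokens"]])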

To shut down the server, use the following command:

wget "localhost:9000/shutdown?key=`cat /tmp/corenlp.shutdown`" -O -

BERT

Download https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt and save as "data/utils/bert/vocab.txt".

Download https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased.tar.gz and save as "data/utils/bert/bert-base-multilingual-cased.tar.gz".
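
If you prefer to script these downloads, a minimal sketch using only the Python standard library is shown below; it simply fetches the two files listed above into the expected paths:

import os
import urllib.request

# Create the directory expected by the instructions above.
os.makedirs("data/utils/bert", exist_ok=True)

downloads = {
    "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-cased-vocab.txt":
        "data/utils/bert/vocab.txt",
    "https://s3.amazonaws.com/models.huggingface.co/bert/bert-base-multilingual-cased.tar.gz":
        "data/utils/bert/bert-base-multilingual-cased.tar.gz",
}
for url, path in downloads.items():
    urllib.request.urlretrieve(url, path)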

Description

The repository can be used to tag locations in a tourism forum post (an example is shown below). It provides scripts for training and testing the model, changing model configurations, evaluating precision/recall of tagged locations against the gold data, and processing/filtering generic tagged locations such as states, countries, and acronyms.

A sample input structure is shown below:

{
	"question": "We will spend a weekend in October in NY as part of a longer trip to the US. We have both been in NY before and have a few places we wanna visit and re-visit but we are still searching for a nice seafood place for dinner. We have had a look at Luke's Lobster and Pearl Oyster Bar to name a few. . . Any comments on these two? Other recommendations for great seafood?",
	"url": "https://www.tripadvisor.in/ShowTopic-g28953-i4-k4852356-Seafood_place_in_NY-New_York.html"
}

The output file is shown below:

{
	"question": "We will spend a weekend in October in NY as part of a longer trip to the US. We have both been in NY before and have a few places we wanna visit and re-visit but we are still searching for a nice seafood place for dinner. We have had a look at Luke's Lobster and Pearl Oyster Bar to name a few. . . Any comments on these two? Other recommendations for great seafood?",
	"url": "https://www.tripadvisor.in/ShowTopic-g28953-i4-k4852356-Seafood_place_in_NY-New_York.html",
	"question_pos": "We_0 will_1 spend_2 a_3 weekend_4 in_5 October_6 in_7 NY_8 as_9 part_10 of_11 a_12 longer_13 trip_14 to_15 the_16 US_17 ._18 We_19 have_20 both_21 been_22 in_23 NY_24 before_25 and_26 have_27 a_28 few_29 places_30 we_31 wan_32 na_33 visit_34 and_35 re-visit_36 but_37 we_38 are_39 still_40 searching_41 for_42 a_43 nice_44 seafood_45 place_46 for_47 dinner_48 ._49 We_50 have_51 had_52 a_53 look_54 at_55 Luke_56 's_57 Lobster_58 and_59 Pearl_60 Oyster_61 Bar_62 to_63 name_64 a_65 few_66 ..._67 Any_68 comments_69 on_70 these_71 two_72 ?_73 Other_74 recommendations_75 for_76 great_77 seafood_78 ?_79",
	"tagged_location": [
	    "Luke_56 's_57 Lobster_58",
	    "Pearl_60 Oyster_61 Bar_62"
    ]
}

Generating Features

The feature generator supports two input file formats.

For the annotation file format, use the following command:

python -m utils.generateFeatureFileFromAnnotations --input_file_path "data/inputs/$1" --features_file_path "data/features/$2"

For the standard input format (as shown in the Description section above), use the following command:

python -m utils.generateFeatureFileForTourqueData --data_file_path "data/inputs/$1" --features_file_path "data/features/$2"

Training

$ python -m src.main --train --num_epochs 5 --data_file_path "data/features/train_annotations.features.txt" --serialization_dir "data/models" --pretrained_model_path "data/models/best.weights" --config_file "data/configs/config.jsonnet" --devices 0

The config.jsonnet file can be modified to suit your requirements. The --pretrained_model_path argument is optional.

Testing

$ python -m src.main --test --data_file_path "data/features/validation_questions.features.txt" --predictions_file_path "data/predictions/validation_questions.predictions.txt" --pretrained_model_path "data/models/best.weights" --config_file "data/configs/config.jsonnet" --devices 0

Evaluate Predictions

All feature files contain labels. For annotation files, the labels can be "O", "B-GPE", or "I-GPE"; for other files, only "O" labels are present. Use the following command to compute the precision/recall of the tagged locations against the gold data.

python -m analysis.evaluate --prediction_file_path "data/predictions/test_annotations.predictions.txt" --gold_file_path "data/features/test_annotations.features.txt"
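
For reference, the sketch below illustrates how chunk-level precision/recall can be computed from IOB label sequences. It is a simplified stand-in for analysis.evaluate, assuming the gold and predicted labels have already been read into per-token lists:

def iob_chunks(labels):
    # Return (start, end) token spans of GPE chunks in an IOB label sequence.
    chunks, start = [], None
    for i, lab in enumerate(labels):
        if lab == "B-GPE":
            if start is not None:
                chunks.append((start, i))
            start = i
        elif lab == "I-GPE":
            if start is None:  # tolerate a chunk that starts with I-GPE
                start = i
        else:  # "O" closes any open chunk
            if start is not None:
                chunks.append((start, i))
                start = None
    if start is not None:
        chunks.append((start, len(labels)))
    return chunks

def precision_recall(gold_labels, pred_labels):
    gold, pred = set(iob_chunks(gold_labels)), set(iob_chunks(pred_labels))
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

# Example: one gold chunk is recovered exactly, the other is clipped by one token.
gold = ["O", "B-GPE", "I-GPE", "I-GPE", "O", "B-GPE", "I-GPE", "I-GPE"]
pred = ["O", "B-GPE", "I-GPE", "I-GPE", "O", "O", "B-GPE", "I-GPE"]
print(precision_recall(gold, pred))  # (0.5, 0.5)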

Generate Outputs

The predictions file contains only the tags ("B-GPE", "I-GPE", "O"). The following command can be used to parse out the location tokens by generating a CoNLL tree.

python -m utils.generateSanitizedPredictions --input_file_path "data/features/train_questions.json" --features_file_path "data/features/train_questions.features.txt" --predictions_file_path "data/features/train_questions.predictions.txt" --output_file_path "data/features/train_annotations.features.txt" --process

The --process argument is optional; it is used to filter out countries, states, acronyms, etc. Note that tagged locations may contain repetitions (the same location at different token positions).
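
As an illustration of what this step produces, the following sketch (not the repository's script) groups indexed tokens into location strings of the form shown in the Description section, using the predicted B-GPE/I-GPE/O tags:

def locations_from_tags(tokens, tags):
    # Group "word_index" tokens into location strings using B-GPE/I-GPE/O tags.
    locations, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-GPE":
            if current:
                locations.append(" ".join(current))
            current = [token]
        elif tag == "I-GPE" and current:
            current.append(token)
        else:
            if current:
                locations.append(" ".join(current))
            current = []
    if current:
        locations.append(" ".join(current))
    return locations

tokens = ["at_55", "Luke_56", "'s_57", "Lobster_58", "and_59", "Pearl_60", "Oyster_61", "Bar_62"]
tags = ["O", "B-GPE", "I-GPE", "I-GPE", "O", "B-GPE", "I-GPE", "I-GPE"]
print(locations_from_tags(tokens, tags))
# ["Luke_56 's_57 Lobster_58", "Pearl_60 Oyster_61 Bar_62"]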

License

