GitHub - Caroline-Vandyck/InformationScienceProjectCarolineVandyck: This repository contains the project for the course Information Science, which is part of the MA DTA of the University of Antwerp.

Caroline-Vandyck / InformationScienceProjectCarolineVandyck Public

Notifications You must be signed in to change notification settings
Fork 0
Star 0

This repository contains the project for the course Information Science, which is part of the MA DTA of the University of Antwerp.

0 stars 0 forks Branches Tags Activity

Star

Notifications

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 4 Commits
README.txt		README.txt
VandyckCarolineInformationScience.ipynb		VandyckCarolineInformationScience.ipynb

Repository files navigation

This is the repository for the exam project of the course Information Science, part of the MA Digital Text Analysis of the University of Antwerp.

This repository contains the code (VandyckCarolineInformationScience.ipynb) for data reconciliation between a sample of the Anet SQLite export file and the CERL Thesaurus SRU/CQL service. For the code to work, the notebook and the sample need to be located in the same directory.

The purpose of the code is to retrieve the tags/identifiers (one from Anet, one from the CERL Thesaurus) that correspond for the same person. The program will query CERL on the basis of the surname obtained from Anet and will select the first 50 records. For those 50 (or less) records, it will be checked whether the first name and year of birth match. In the end, every record obtained from CERL will be returned with a statement on which metadata it matches Anet. It only uses Python's standard library.

At the end of the notebook, some observations and possible improvements are listed. However, most of these 'possible improvements' boil down to personal choices/preferences (relating to what you find important and how you want to standardize names). There are always many options to choose from and choices to make. I attempted to make the optimal ones and explained why, yet opinions can differ.