Same work twice #2740
Comments
@gcelano Here there are two separate editions, indicated both in the metadata and in Scaife. I believe there can be partial works and other differences that would negate your statement (they would not be "exactly the same work"). grc1, grc2, etc. always indicate a different edition. There is no special indication in the URN itself that suggests a work is partial or incomplete. (I do not think there is anything split in OGL any longer.) @AlisonBabeu thoughts? A glance at the word counts can show differences across editions. https://opengreekandlatin.github.io/First1KGreek/
I am trying to get only one edition per work, but at the same time I do not want to filter out works split into two files. Since the files are processed automatically, there would be no way for me to distinguish between duplicates and split works if this is not encoded somewhere (name of the file or maybe within
Hi @lcerrato and @gcelano (nice to hear from you, hope all is well!). As I've been going through the list of works in the Scaife viewer over the last year or so, I've been combining files whenever I have found a work spread across more than one TEI XML file (that has only happened once or twice, such as with the letters of Augustine!). If a file has only part of a work rather than the whole work, we have typically been indicating that in the header metadata and in the __cts__.xml file, not in the URN. For example, with Pappus of Alexandria's work Synagoge, we only have Book 1 (only the first volume was digitized).
There are a few places where URNs have been used inconsistently and 1st1K-grc2 or 1st1K-lat1 have been used to represent supplementary parts of a printed edition (say a preface, intro, index, etc.), but I've been changing those as I find them so that we are intellectually consistent across the whole collection. In general, if there is a 1st1K-grc1 and 1st1K-grc it means there are two editions of a work.
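A minimal Python sketch of the convention described above, for anyone processing the files automatically: it groups file names by their textgroup.work prefix and reports works that have more than one version identifier. It assumes the files are named textgroup.work.version.xml; the function name group_editions and the third file name in the sample list are illustrative only. It cannot tell a partial work from a complete one, since that information lives in the header metadata and the __cts__.xml file rather than in the URN.

```python
from collections import defaultdict


def group_editions(file_names):
    """Group CTS-style file names by their textgroup.work prefix.

    Assumes names of the form textgroup.work.version.xml; anything
    with fewer than three dot-separated parts (e.g. __cts__.xml) is skipped.
    """
    groups = defaultdict(list)
    for name in file_names:
        stem = name[:-4] if name.endswith(".xml") else name
        parts = stem.split(".")
        if len(parts) < 3:
            continue  # not a textgroup.work.version name
        work = ".".join(parts[:2])      # e.g. tlg0057.tlg034
        version = ".".join(parts[2:])   # e.g. 1st1K-grc1
        groups[work].append(version)
    return {work: sorted(versions) for work, versions in groups.items()}


files = [
    "tlg0057.tlg034.1st1K-grc1.xml",
    "tlg0057.tlg034.1st1K-grc2.xml",
    "tlg0057.tlg035.1st1K-grc1.xml",  # illustrative name, not checked against the repo
    "__cts__.xml",
]

editions = group_editions(files)
# Works with more than one version identifier, i.e. candidate multiple editions
multiple = {work: v for work, v in editions.items() if len(v) > 1}
print(multiple)
```

Picking "one edition per work" then reduces to choosing one entry from each list, for instance by comparing word counts, while the partial-work check would still need to read the metadata.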
Hi @AlisonBabeu, thank you for the explanation! In general, I think that the more explicit such information is, the better. Maybe isolating it in an attribute, or even better in the file name (for example,
Hi Giuseppe -- Are you doing a new lemmatization/treebanking of the Greek? P.S. It is great to see you on this thread. I know you have been active in other areas but it is always wonderful to see you! |
Hi @gregorycrane! I have tokenized and morphosyntactically annotated all texts of |
There are a number of files deriving from the same printed edition, such as tlg0057.tlg034.1st1K-grc1 and tlg0057.tlg034.1st1K-grc2, which apparently have slightly different markup: is there a reason for that? More importantly, can we assume that every time the first part of a cts:urn ("tlg0057.tlg034") is the same, the corresponding files contain exactly the same work and not, for example, part of it (e.g., one work has been split into two parts, being too long)?