New validator rule: leaf-det-clf #1059

Open
nschneid opened this issue Oct 8, 2024 · 11 comments


@nschneid
Contributor

nschneid commented Oct 8, 2024

I notice that the leaf-det-clf rule introduced in UniversalDependencies/tools@1e4debd and then revised in UniversalDependencies/tools@759c5ae has invalidated quite a lot (a majority?) of treebanks.

Is further revision necessary? For example, EWT is still experiencing some errors that look like they should be valid:

  • det + nmod, e.g. "at least some reports" (det(reports, some), nmod(some, least)). "at least" is admittedly ADV-like, so another option is to make it ExtPos=ADV and advmod.
  • "such"/det licensing an advcl, as in these results. The guidelines on sufficiency and excess for "so" and similar say the advcl should attach to the adjective or adverb, not the noun, in a case like "sufficient flour". In "such a high price that nobody could afford it", I suppose "such" should have an advcl dependent?
@mr-martian
Contributor

The errors in Hebrew are due to things like

# x- so the RTL text doesn't make this unreadable
32	x-ה	x-ה	DET	art	PronType=Art	33	det	_	Gloss=the|Ref=GEN_19.8
33	x-אֲנָשִׁ֤ים	x-אישׁ	NOUN	subs	Gender=Masc|Number=Plur	38	obl	_	Gloss=man|Ref=GEN_19.8
34-35	x-הָאֵל֙	x-_	_	_	_	_	_	_	_
34	x-הָ	x-ה	DET	art	PronType=Art	35	det	_	Gloss=the|Ref=GEN_19.8
35	x-אֵל֙	x-אל	PRON	prde	Number=Plur|PronType=Dem	33	det	_	Gloss=these|Ref=GEN_19.8

where demonstrative pronouns have their own determiners. (I'm open to other means of annotating this.)
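
To make the failure concrete, here is a minimal Python sketch of the kind of check that would flag the fragment above. It is not the validator's actual code. The allowlist is an assumption (this thread only confirms that fixed is accepted), covering clf as well as det is inferred from the rule name, and the Hebrew forms are replaced by their glosses to keep the sample readable.

```python
# Assumption: only `fixed` is confirmed in this thread as an accepted child.
ALLOWED_CHILD_RELS = frozenset({"fixed"})
# Assumption: the rule name leaf-det-clf suggests det and clf parents.
LEAF_PARENT_RELS = frozenset({"det", "clf"})

# Gloss-based stand-in for the Hebrew rows; columns are tab-joined below.
SAMPLE = "\n".join("\t".join(row.split()) for row in [
    "32 the the DET art PronType=Art 33 det _ Gloss=the",
    "33 men man NOUN subs Gender=Masc|Number=Plur 38 obl _ Gloss=man",
    "34-35 these _ _ _ _ _ _ _ _",
    "34 the the DET art PronType=Art 35 det _ Gloss=the",
    "35 these these PRON prde Number=Plur|PronType=Dem 33 det _ Gloss=these",
])


def parse_conllu(text):
    """Return (id, form, head, deprel) tuples for plain word lines only."""
    nodes = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if not cols[0].isdigit():
            # Skip multiword token ranges (34-35) and empty nodes (1.1).
            continue
        nodes.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
    return nodes


def leaf_violations(nodes, allowed=ALLOWED_CHILD_RELS):
    """List (parent_id, child_id, child_deprel) where a det/clf has a child
    whose base relation is not in the allowlist."""
    deprel_of = {node_id: deprel for node_id, _, _, deprel in nodes}
    found = []
    for node_id, _, head, deprel in nodes:
        parent_rel = deprel_of.get(head)
        if parent_rel is None:
            continue  # root, or head outside this fragment
        if (parent_rel.split(":")[0] in LEAF_PARENT_RELS
                and deprel.split(":")[0] not in allowed):
            found.append((head, node_id, deprel))
    return found


print(leaf_violations(parse_conllu(SAMPLE)))  # [(35, 34, 'det')]
```

Passing allowed={"fixed", "det"} makes the fragment pass, which is one way to picture what accepting this analysis would mean for the rule.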

@amir-zeldes
Contributor

@mr-martian this is also the analysis used in the modern Hebrew TBs, so I would be inclined to accept and keep it (it's also parallel to how adjectival modification works in Hebrew)

@mr-martian
Contributor

If I were doing Hebrew from scratch, the one alternative I'd consider is treating ה as an inflectional prefix rather than a syntactic word.

@amir-zeldes
Contributor

I would vote against that TBH, it's not how other languages with repeating articles do it either (e.g. Greek) and it complicates lemmatization, type counts, and a bunch of other things.

@colinbatchelor
Contributor

I have one remaining error:
[(in gd_arcosg-ud-train.conllu) Line 55940 Sent p01_033h Node 79]: [L3 Syntax leaf-det-clf] 'det' not expected to have children (79:a:det --> 81:h-uile:compound)

The offending tree has someone emphasising 'every' by saying a h-uile h-uile. Is there maybe a better way I should be doing this or could it be an exception?
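
For anyone triaging many of these messages at once, a small Python sketch along the following lines could tally which child relations trigger the rule. The regular expression is inferred only from the single message quoted above, so treat the pattern and the helper names (parse_leaf_error, tally_child_relations) as assumptions and not as a description of the validator's output format in general.

```python
import re
from collections import Counter

# Pattern inferred from the one message quoted in this comment (assumption).
LEAF_ERROR_RE = re.compile(
    r"\[L(?P<level>\d+) (?P<error_class>\w+) (?P<rule>[\w-]+)\]"
    r".*?\((?P<parent_id>\d+):(?P<parent_form>.+?):(?P<parent_rel>[\w:]+)"
    r" --> (?P<child_id>\d+):(?P<child_form>.+?):(?P<child_rel>[\w:]+)\)\s*$"
)

EXAMPLE = (
    "[(in gd_arcosg-ud-train.conllu) Line 55940 Sent p01_033h Node 79]: "
    "[L3 Syntax leaf-det-clf] 'det' not expected to have children "
    "(79:a:det --> 81:h-uile:compound)"
)


def parse_leaf_error(line):
    """Return the parts of a leaf-det-clf message as a dict, else None."""
    match = LEAF_ERROR_RE.search(line)
    if match is None or match.group("rule") != "leaf-det-clf":
        return None
    parts = match.groupdict()
    for key in ("level", "parent_id", "child_id"):
        parts[key] = int(parts[key])
    return parts


def tally_child_relations(lines):
    """Count how often each child relation appears in leaf-det-clf errors."""
    parsed = (parse_leaf_error(line) for line in lines)
    return Counter(p["child_rel"] for p in parsed if p is not None)


info = parse_leaf_error(EXAMPLE)
print(info["parent_form"], info["child_form"], info["child_rel"])
```

Feeding a whole validation log through tally_child_relations would show at a glance whether a treebank's errors cluster on one relation, such as compound here.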

@nschneid
Contributor Author

Repetition for emphasis: would flat be a good option instead of compound? Cf. https://universaldependencies.org/u/dep/flat.html#iconic-sequences (though I can't speak to how languages are dealing with reduplication in general).

The validator currently allows fixed, but not flat, it seems.

@LeonieWeissweiler
Contributor

LeonieWeissweiler commented Oct 10, 2024

This invalidated both HDT and GSD for German, mostly because of vor allem (mainly) and unter anderem (among others). For both, the first word is an `ADP` and the second is a `DET` that depends on it with the `case` relation.

How should we handle this better?

@nschneid
Contributor Author

unter anderem is sometimes treated as a fixed expression. Here is a case triggering the error:

[image: dependency tree for the example]

I assume this means "among other teachers"—is there a reason not to analyze it as "among [other teachers]", with unter attaching to Lehrer?

@amir-zeldes
Contributor

No, for the German case it's not "among other teachers", notice "other" is dative but "teacher" is not - it's "among others, teachers". I think the mistake is the deprel det - this is not a determiner but an oblique modifier, just like English "among others".

@FedeIure

> Repetition for emphasis: would flat be a good option instead of compound? Cf. https://universaldependencies.org/u/dep/flat.html#iconic-sequences (though I can't speak to how languages are dealing with reduplication in general).
>
> The validator currently allows fixed, but not flat, it seems.

What about flat:redup to mark repetition for emphasis?

Here are two examples in one sentence from Roman tragedies in UD Latin-CIRCSE:

[image: flat_redup_Latin_CIRCSE]

@sylvainkahane
Contributor

For spoken data, we need three relations to be added to the validator:

  • discourse, which is very common between two determiners in false starts: "a, uh, a gap", "my, uh, our friend"
  • parataxis for cases such as "a, I don't know how to call that, a kiosk, …": here we have a reparandum link between the two "a"s and we would like to attach the parenthesis to the first "a". More exactly we use parataxis:parenth in our spoken French treebanks.
  • dep for false starts such as "the last, the last day": here "the last" forms a phrase the head of which is missing and we decided to have dep(the, last). I am not against another solution, as long as "the last" is still a phrase.
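
As a rough illustration of the request above, the following Python sketch compares a child relation against an allowlist by its base relation, so that a subtype such as parataxis:parenth is covered once parataxis is allowed. Both sets are assumptions for illustration: THREAD_CONFIRMED holds only what this thread says the validator accepts, and SPOKEN_PROPOSED holds the three relations proposed here. Whether the real validator compares subtypes this way is not stated in the thread.

```python
# Assumption: the only relation this thread confirms as currently accepted.
THREAD_CONFIRMED = frozenset({"fixed"})
# The three relations proposed above for spoken data.
SPOKEN_PROPOSED = frozenset({"discourse", "parataxis", "dep"})


def base_relation(deprel):
    """Strip a language-specific subtype: 'parataxis:parenth' -> 'parataxis'."""
    return deprel.split(":", 1)[0]


def child_allowed(child_deprel, allowed):
    """True if a child of a determiner with this relation would be accepted."""
    return base_relation(child_deprel) in allowed


# Child relations from the three false-start patterns described above.
examples = {
    "a, uh, a gap": "discourse",
    "a, I don't know how to call that, a kiosk": "parataxis:parenth",
    "the last, the last day": "dep",
}

extended = THREAD_CONFIRMED | SPOKEN_PROPOSED
for text, rel in examples.items():
    print(text, rel, child_allowed(rel, THREAD_CONFIRMED), child_allowed(rel, extended))
```

Each of the three patterns is rejected under the narrower set and accepted under the extended one.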
