Identify Japanese and Korean unsimplified, canonical characters #5

Open
Transfusion opened this issue Jan 28, 2021 · 1 comment

Transfusion commented Jan 28, 2021

卫、衛、衞󠄀

Note that in Japan, https://www.kanjipedia.jp/kanji/0000403800 衞󠄀 is the 旧字 of 衛 (!!)

One cannot simply go hunting in the Unihan database for these, since the characters already exist as variants in the G sources too: https://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=U%2B885E

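The 說文解字 (Shuowen Jiezi) has 眞 (https://dict.variants.moe.edu.tw/variants/rbt/word_attribute.rbt?quote_code=QTAyNzY4) and goes on to say: 僊人變形而登天也。从𠤎从目从乚。 (roughly, "an immortal transforms and ascends to heaven", followed by the components of the character). Korea and Japan consider this variant to be the canonical traditional form.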

The case of 既 and 即 is strange in Japanese: they are 既 and 卽 respectively.

Transfusion commented

Most unsimplified, canonical Japanese variants are available in jp-old-style.txt from cjkvi-variants:
https://github.com/cjkvi/cjkvi-variants/blob/e4f1da248c9737a243f9930b5dc497cef5d5ae16/jp-old-style.txt#L64-L69
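
A minimal sketch of how a variant file like the one linked above could be loaded. It assumes each data line is comma-separated as `character,relation,variant` and that `#` starts a comment line; that layout is an assumption and should be checked against the actual file. `parseVariantLines`, the `old-style` relation label, and the sample rows are hypothetical and not taken from cjkvi-variants.

```typescript
// Hypothetical parser. ASSUMPTION: lines look like `character,relation,variant`
// and lines starting with `#` are comments. Verify against the real file.
type VariantPair = { from: string; relation: string; to: string };

function parseVariantLines(text: string): VariantPair[] {
  const pairs: VariantPair[] = [];
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    // Skip blank lines and comment lines.
    if (line === "" || line.startsWith("#")) continue;
    const [from, relation, to] = line.split(",").map((field) => field.trim());
    // Ignore lines with fewer than three fields.
    if (from === undefined || relation === undefined || to === undefined) continue;
    pairs.push({ from, relation, to });
  }
  return pairs;
}

// Illustrative sample rows only, not copied from jp-old-style.txt.
const sampleVariantText = [
  "# shinjitai,relation,kyujitai (assumed layout)",
  "衛,old-style,衞",
  "真,old-style,眞",
  "malformed line without commas",
].join("\n");

const parsedPairs = parseVariantLines(sampleVariantText);
console.log(parsedPairs);
```

Comment lines and rows with too few fields are dropped silently here; a real ETL step would probably want to log them instead.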

Korean variants of the same kind are taken from the list of 1800 Hanja for Everyday Use.

I treat variants of this nature (along with simplified / traditional Chinese, the numerals, shinjitai in the jōyō kanji, radicals, etc.) as orthographic variants, so that they are grouped together:
https://github.com/Transfusion/cjk-radical-search/blob/19d0d1b672d7a652bfcd6cc784dcd43ce7c669e1/etl/variants-fetcher.ts#L109

https://github.com/Transfusion/cjk-radical-search/blob/19d0d1b672d7a652bfcd6cc784dcd43ce7c669e1/etl/variants-fetcher.ts#L241-L260
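
To illustrate what "grouped together" could mean, here is a minimal union-find sketch that merges characters connected by any orthographic-variant relation into one group. This is not the code in variants-fetcher.ts; the class `VariantGroups` and its methods are hypothetical names, and the linked pairs are just the examples from this issue.

```typescript
// Hypothetical sketch: union-find over characters so that chains such as
// simplified -> traditional -> kyujitai end up in a single group.
class VariantGroups {
  private parent = new Map<string, string>();

  private find(ch: string): string {
    if (!this.parent.has(ch)) this.parent.set(ch, ch);
    let root = ch;
    while (this.parent.get(root) !== root) root = this.parent.get(root)!;
    // Path compression: point every node on the path directly at the root.
    let cur = ch;
    while (cur !== root) {
      const next = this.parent.get(cur)!;
      this.parent.set(cur, root);
      cur = next;
    }
    return root;
  }

  link(a: string, b: string): void {
    const rootA = this.find(a);
    const rootB = this.find(b);
    if (rootA !== rootB) this.parent.set(rootA, rootB);
  }

  sameGroup(a: string, b: string): boolean {
    return this.find(a) === this.find(b);
  }

  groupOf(ch: string): string[] {
    const root = this.find(ch);
    return Array.from(this.parent.keys())
      .filter((key) => this.find(key) === root)
      .sort();
  }
}

const variantGroups = new VariantGroups();
variantGroups.link("卫", "衛"); // simplified / traditional Chinese
variantGroups.link("衛", "衞"); // shinjitai / kyujitai
variantGroups.link("真", "眞"); // separate group

console.log(variantGroups.groupOf("衛"));
```

Because the relation is treated as transitive, 卫 and 衞 land in the same group even though no source lists them as a direct pair. Whether that transitivity is always desirable is a design question this sketch does not settle.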

TODO: investigate the 1800 Korean hanja list and check whether any of its characters fall outside the commonly used traditional Chinese set. I do not include the Korean list when computing orthographic variants, only in the expandVariantIslands function (TBD: discussion of what this does and the design issues faced).

https://github.com/Transfusion/cjk-radical-search/blob/19d0d1b672d7a652bfcd6cc784dcd43ce7c669e1/etl/genVariants.ts#L105-L116
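
The check described in the TODO boils down to a set difference. A minimal sketch, assuming both lists are already loaded as plain characters: `missingFrom` is a hypothetical helper, and the two sample lists are toy placeholders rather than the real 1800 hanja list or the real traditional Chinese set.

```typescript
// Hypothetical helper: return every candidate character that is absent
// from the reference set, preserving the candidates' original order.
function missingFrom(
  candidates: readonly string[],
  reference: ReadonlySet<string>,
): string[] {
  const missing: string[] = [];
  for (const ch of candidates) {
    if (!reference.has(ch)) missing.push(ch);
  }
  return missing;
}

// Toy placeholder data, NOT the real lists.
const koreanHanjaSample: readonly string[] = ["眞", "卽", "人"];
const traditionalChineseSample: ReadonlySet<string> = new Set(["真", "即", "人"]);

const notInTraditionalSet = missingFrom(koreanHanjaSample, traditionalChineseSample);
console.log(notInTraditionalSet);
```

Any character this reports would be a candidate for explicit handling, since it would otherwise only be picked up later by expandVariantIslands.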
