A question about the Unihan Database by Unicode


Posted

I have a question about Unicode's Unihan Database. It concerns the number of Chinese-Japanese-Korean-(Vietnamese) [CJK(V)] characters (i.e. Hanzi, Kanji, Hanja, Chữ Nôm / Hán Tự / Chữ Hán) included in the Unicode character set(s) that the Unihan Database covers.

How many CJK characters (i.e. Traditional Hanzi 繁體漢字 and Simplified Hanzi 简体汉字, Kyuujitai/Shinjitai Kanji 舊字體/新字体 漢字, Hanja 漢字, Chữ Nôm 字喃 / Hán Tự 漢字 / Chữ Hán 字漢) are included in the general Unihan Database, and how many are in the Unihan extended database(s)?

Does the Unihan Database include all the characters that have entries in the Morohashi Dai-Kan-Wa-Jiten 大漢和辭典, Cihai Zidian 辭海字典, Kangxi Zidian 康熙字典, Hanyu-Da-Zidian 漢語大字典, etc.?

Is there a Unicode Unihan expert on this forum? If so, could they please answer my questions?

Thank you

Thank you very much

希爾頓從

Posted

I'm not sure what you mean by the extended database. Unihan is just one big collection of Chinese characters. Perhaps you are confusing this with the Basic Multilingual Plane (BMP) and the Supplementary Ideographic Plane (SIP)? Altogether, though, the Unihan database contains information for 71226 unique code points (and obviously some code points have more information than others).
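To make the plane distinction concrete, here is a minimal Python sketch. It relies only on the fact that a Unicode plane is a block of 65,536 code points, so the plane number is the code point shifted right by 16 bits. The helper name `plane_of` is made up for illustration and is not part of any Unihan tooling.

```python
def plane_of(ch: str) -> int:
    """Return the Unicode plane number (0-16) of a single character.

    Each plane holds 0x10000 code points, so shifting the code point
    right by 16 bits gives the plane index. Plane 0 is the BMP.
    """
    return ord(ch) >> 16


# U+6F22 is a CJK Unified Ideograph in the BMP;
# U+20000 is the first code point of CJK Extension B.
for code_point in (0x6F22, 0x20000):
    ch = chr(code_point)
    print(f"U+{code_point:04X} is in plane {plane_of(ch)}")
```

This should report plane 0 for U+6F22 and plane 2 for U+20000, so a character being outside the BMP does not mean it lives in a separate database.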

Regarding the dictionaries you mention, I'm not sure whether Unihan contains all the entries from those dictionaries. However, it does list information about the dictionaries it uses.

If you visit this page, it lists dictionary index information for all the dictionaries used to compile or cross-check the database. Following the link to a given dictionary tells you how many dictionary indices exist in the Unihan database for that dictionary. There will therefore be at least that many characters from that dictionary in the database, and possibly more (I say at least because the information is listed as provisional and so might not be complete).

E.g. there are dictionary indices for 55812 characters from the 汉语大字典 Hanyu Da Zidian, 70205 from the Kangxi, etc.
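If you would rather count these yourself than rely on the web pages, something like the following rough Python sketch could work. It assumes the tab-separated layout of the Unihan data files (code point, field name, value, with `#` marking comment lines). The function name `count_code_points_per_field` is made up, and the sample values are placeholders rather than real Unihan data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def count_code_points_per_field(lines: Iterable[str]) -> Dict[str, int]:
    """Count how many distinct code points carry each Unihan field.

    Assumes each data line looks like "U+XXXX<TAB>kField<TAB>value";
    blank lines, '#' comments, and malformed lines are skipped.
    """
    seen: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t", 2)
        if len(parts) != 3:
            continue
        code_point, field, _value = parts
        seen[field].add(code_point)
    return {field: len(code_points) for field, code_points in seen.items()}


# Placeholder lines for illustration only (values are NOT real Unihan data).
sample = [
    "# comment line",
    "U+4E00\tkKangXi\t0000.000",
    "U+4E00\tkHanYu\t00000.000",
    "U+4E01\tkKangXi\t0000.001",
    "",
]
print(count_code_points_per_field(sample))
```

To run it against the real data you would pass an open file instead of `sample`, for example `open("Unihan_DictionaryIndices.txt", encoding="utf-8")`. That file name is my assumption about where the dictionary index fields live in the Unihan download, so check it against the release you have.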

Suffice it to say that, regardless of whether it contains all the characters in those dictionaries, Unihan is almost certainly the most comprehensive database of CJK characters.

Posted

I'm under the impression it doesn't contain all of the entries in the 康熙字典, although the missing entries are mostly variants.
