Developer question - detecting hanzi in unicode string


westmeadboy

@imron - thanks for that.

I'm still rather confused about the whole thing. Is it enough to take into account these ranges:

Unified CJK Ideographs

CJK Ideographs Ext. A

CJK Ideographs Ext. B

I just want to detect traditional and simplified chars that appear in real life (in chinese).


I would say yes. In fact, for most characters just the Unified CJK Ideographs would cover it (Extensions A and B are mostly more obscure, rarely used characters, but definitely still wanted for absolute completeness). As for the radicals and strokes, similar characters already exist in the main CJK Ideographs range, and from my experiments with a limited number of radicals like 氵, the actual Unicode output by IMEs is the character in the main range, not the one in the separate radical ranges.

Also, you may want to click the "symbols and punctuation" link and include CJK punctuation.
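
A minimal sketch of the range check discussed above, in Python. The function names `is_hanzi` and `contains_hanzi` are made up for illustration, and the range end points are taken from the Unicode block boundaries for the three blocks named in the question; CJK punctuation is deliberately left out here.

```python
# Code point ranges for the three blocks named in the thread.
HANZI_RANGES = (
    (0x4E00, 0x9FFF),    # CJK Unified Ideographs (BMP)
    (0x3400, 0x4DBF),    # CJK Unified Ideographs Extension A (BMP)
    (0x20000, 0x2A6DF),  # CJK Unified Ideographs Extension B (SIP)
)


def is_hanzi(ch):
    """Return True if the single character ch lies in one of HANZI_RANGES."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in HANZI_RANGES)


def contains_hanzi(text):
    """Return True if at least one character of text is a hanzi."""
    return any(is_hanzi(ch) for ch in text)
```

With this, `is_hanzi('\u6C35')` (the 氵 mentioned above) is true because U+6C35 sits in the main range, while a CJK full stop such as `'\u3002'` is rejected unless a punctuation range is added.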


I should probably add that I want to do this as part of the dictionary app (I mentioned in another thread) at the point where the user enters some kind of search term. I want the app to automatically detect whether the user has entered hanzi, pinyin or english.

Maybe eventually I'll allow hanzi and pinyin to be mixed together - but this is probably for a later version...

Anyway, so I may as well check more ranges rather than less, because speed is not important in this part of the execution.
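
One possible shape for that detection step, as a rough Python sketch. Everything here is an assumption for illustration: the name `classify_query`, the rule that any hanzi wins, and the heuristic that pinyin is recognised only by tone digits (ni3) or tone marks (nǐ). Toneless pinyin such as "nihao" cannot be told apart from English this way, and words like "mp3" or "café" would be misread as pinyin.

```python
import re
import unicodedata

# Unified Ideographs, Extension A and Extension B.
HANZI = re.compile('[\u3400-\u4DBF\u4E00-\u9FFF\U00020000-\U0002A6DF]')
# Letters (including u-umlaut and the "u:" spelling) followed by a tone digit.
TONE_DIGIT = re.compile(r'[a-zA-Z:\u00FC]+[1-5]\b')
# Combining grave, acute, macron and caron: the four pinyin tone marks.
TONE_MARK = re.compile('[\u0300\u0301\u0304\u030C]')


def classify_query(text):
    """Guess whether a search term is 'hanzi', 'pinyin' or 'english'."""
    if HANZI.search(text):
        return 'hanzi'
    # NFD splits a tone-marked vowel into base letter + combining mark.
    decomposed = unicodedata.normalize('NFD', text)
    if TONE_DIGIT.search(text) or TONE_MARK.search(decomposed):
        return 'pinyin'
    return 'english'
```

Since speed is not a concern at this point, checking the hanzi ranges first and falling back to the weaker pinyin heuristics keeps the logic easy to extend later, for example to mixed hanzi and pinyin input.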


Fair enough. For reference, the Unified CJK Ideographs cover almost 21,000 characters, and will cover almost anything your users will input. CJK A covers a further 6,500. Both of these ranges are in the BMP (Basic Multilingual Plane). CJK B contains a further 43,000 characters and is in the SIP (Supplementary Ideographic Plane), so you'll encounter problems if you're using buggy UTF-16 code that doesn't realise that codepoints in the SIP are represented by 4 bytes (a surrogate pair) instead of the usual 2.
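
To make the 2-byte versus 4-byte point concrete, here is a small Python 3 sketch. The helper name `utf16_units` is made up for illustration; it simply encodes a character as big-endian UTF-16 and splits the result into 16-bit code units.

```python
def utf16_units(ch):
    """Return the UTF-16 code units of ch as a list of integers."""
    data = ch.encode('utf-16-be')
    return [int.from_bytes(data[i:i + 2], 'big') for i in range(0, len(data), 2)]


bmp_char = '\u4E2D'       # a character in the main Unified Ideographs range
sip_char = '\U00020000'   # first code point of Extension B

# One code unit (2 bytes) for the BMP character.
print([hex(u) for u in utf16_units(bmp_char)])
# Two code units (4 bytes, a surrogate pair) for the SIP character.
print([hex(u) for u in utf16_units(sip_char)])
```

Code that indexes a UTF-16 string unit by unit sees the two surrogates (0xD840, 0xDC00) rather than U+20000, so a naive range check against 0x20000 to 0x2A6DF would never match.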


I just worked on a project that did this yesterday.

Python:

# http://en.wikipedia.org/wiki/CJK_Unified_Ideographs
# Each value is a fragment for use inside a regex character class [...]
cjkUnifiedIdeographs = u'\u4E00-\u9FFF'
cjkCompatibilityIdeographs = u'\uF900-\uFAFF'
cjkUnifiedIdeographsExtA = u'\u3400-\u4DBF'
# Code points above U+FFFF need the 8-digit \U escape; not used below anyway
cjkUnifiedIdeographsExtB = u'\U00020000-\U0002A6DF'
cjkEnclosedLettersAndMonths = u'\u3200-\u32FF'


# Non-CJK characters used in simplified/traditional field of CC-CEDICT
# I added these in by trial and error
# Some of these are covered in code range "Halfwidth and Fullwidth Forms". But this makes a stricter filter
cjkMiddleDot = u'\u30FB'
cjkFullwidthComma = u'\uFF0C'
cjkLingZero = u'\u3007'
cjkFullwidthLatin = u'\uFF21-\uFF3A'

# Character class matching any single character from the selected ranges
cjkRegexp = u'[%s%s%s%s%s%s%s]' % (cjkMiddleDot, cjkFullwidthComma, cjkLingZero, cjkUnifiedIdeographsExtA, cjkUnifiedIdeographs, cjkCompatibilityIdeographs, cjkFullwidthLatin)

Every simplified/traditional entry in CC-CEDICT, except for a few rare Unicode variants, is covered by

  • cjkMiddleDot
  • cjkFullwidthComma
  • cjkLingZero
  • cjkUnifiedIdeographs
  • cjkCompatibilityIdeographs (for just a few variants)
  • cjkUnifiedIdeographsExtA (for just a few variants)
  • cjkFullwidthLatin

Extension B and Enclosed Letters and Months (these are encircled numbers and characters) are not used at all.
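
As an illustration of how such a character class could be used, here is a self-contained Python 3 sketch that checks whether a whole string consists only of characters from the ranges listed above. The names `CJK_CLASS`, `HEADWORD` and `is_cedict_headword` are made up for this example and are not from the original project.

```python
import re

# The same ranges as in the list above, written as one character-class body.
CJK_CLASS = (
    '\u30FB'          # middle dot
    '\uFF0C'          # fullwidth comma
    '\u3007'          # ling (zero)
    '\u3400-\u4DBF'   # CJK Unified Ideographs Extension A
    '\u4E00-\u9FFF'   # CJK Unified Ideographs
    '\uF900-\uFAFF'   # CJK Compatibility Ideographs
    '\uFF21-\uFF3A'   # fullwidth Latin capital letters
)
HEADWORD = re.compile('[%s]+' % CJK_CLASS)


def is_cedict_headword(text):
    """Return True if text is non-empty and every character is in CJK_CLASS."""
    return HEADWORD.fullmatch(text) is not None
```

Using `fullmatch` makes this a strict filter: a single space or ASCII letter anywhere in the string causes a rejection, which suits validating a dictionary field more than scanning free text.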

Perl can also refer to code ranges by block name; for example:

use charnames ':full';    # only needed for \N{...} character names; the \p{In...} block properties below are built in

# Single-quoted strings keep the backslashes, so the regex engine sees the escapes
$cjkMiddleDot = '\x{30FB}';
$cjkFullwidthComma = '\x{FF0C}';
$cjkLingZero = '\x{3007}';
$cjkFullwidthLatin = '\x{FF21}-\x{FF3A}';

# Use the friendly 'InCJKUnifiedIdeographs' block names for the Chinese pattern match
$CJK_regexp = '[' . join('',
   '\p{InCJKUnifiedIdeographs}',
   '\p{InCJKUnifiedIdeographsExtensionA}',
   '\p{InCJKCompatibilityIdeographs}',
   $cjkMiddleDot,
   $cjkFullwidthComma,
   $cjkLingZero,
   $cjkFullwidthLatin
   ) . ']';

# result: [\p{InCJKUnifiedIdeographs}\p{InCJKUnifiedIdeographsExtensionA}\p{InCJKCompatibilityIdeographs}\x{30FB}\x{FF0C}\x{3007}\x{FF21}-\x{FF3A}]

Mix and match the ranges used, for example if you want to include Latin fullwidth letters or punctuation.


Don't suppose you remember off-hand which ones were from the compatibility ideographs?

I was under the impression that the compatibility ideographs were there to help with conversions to/from older standards, but that the main and extended CJK ideograph ranges contained an identical version of each of these characters, just with a different code-point.

I wonder if it's a hang-over from when CEDICT wasn't stored as unicode and was then converted? If so, it might be worth finding the corresponding ideograph in the main ranges and submitting a patch to CEDICT. The main reason being that IMEs don't seem to output the codepoints for the duplicated compatibility characters (preferring instead to use the codepoint from the main CJK ideographs) meaning that searches for that character that come from user input would fail.
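
If the goal is to make user input and dictionary entries agree, one option is to fold compatibility ideographs onto their unified counterparts before comparing. A minimal Python 3 sketch follows, relying on the fact that compatibility ideographs carry a canonical decomposition to a unified ideograph, so standard Unicode normalization replaces them. The function name `to_unified` is made up for illustration.

```python
import unicodedata


def to_unified(text):
    """Fold CJK compatibility ideographs onto unified ideographs via NFC."""
    return unicodedata.normalize('NFC', text)


# U+F934 and U+F967 are compatibility ideographs for U+8001 and U+4E0D.
for compat in ('\uF934', '\uF967'):
    unified = to_unified(compat)
    print('U+%04X -> U+%04X' % (ord(compat), ord(unified)))
```

Characters with no decomposition pass through unchanged, so applying this to both the dictionary keys and the search term would be harmless for ordinary text. Note that normalization also affects non-CJK characters (accented Latin letters, for instance), which may or may not be wanted.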


This is from the most recent CC-CEDICT (2009-09-08)

CJK Compatibility Ideographs

蘭 蘭 [lan2] /Unicode compatibility variant for 蘭/orchid/

盧 盧 [lu2] /Korean variant of 盧|卢/

老 老 [lao3] /unicode compatibility variant of 老/

不 不 [bu4] /variant of 不/(negative prefix)/not/no/

練 練 [lian4] /variant of 練|练, to practice/to train/to drill/to perfect (one's skill)/exercise/

識 識 [shi2] /Unicode compatibility variant of 識|识/

兀 兀 [wu4] /duplicate of Big Five A461/

Almost all of these were in CEDICT at least at the time it was imported into CC-CEDICT, but so were the entries for the corresponding canonical character. So it wasn't a conversion error, rather just extra entries with marginal usefulness. Even if they can't be entered from an IME, they can still be copy-pasted, so they're not completely inaccessible, if someone ever encountered the character and needed to look it up.

CJK Unified Ideographs Extension A

㑇 㑇 [zhou4] /beautiful/

㑳 㑳 [zhou4] /beautiful/

㗂 㗂{u+35c2} [sheng3] /variant of 省/tight-lipped/to examine/to watch/to scour (esp. Cantonese)/

㝵 㝵 [de2] /to obtain/archaic variant of 得|得[de2]/component in 礙|碍[ai4] and 鍀|锝[de2]/

㥁 㥁{u+3941} [de2] /variant of 德, ethics/

㨗 㨗{u+3a17} [jie2] /variant of 捷/quick/nimble/

㬎 㬎 [xian3] /old variant of 顯|显[xian3]/visible/apparent/

㶸 㶸{u+3db8} [xie2] /(precise meaning unknown, relates to iron)/variant of 劦 or of 協|协/

㺵 㺵{u+3eb5} [jiu2] /black jade/variant of 玖/

䯝 䯝{u+4bdd} [sui3] /variant of 髓/marrow/essence/quintessence/pith (soft interior of plant stem)/

喎僻不遂 㖞僻不遂 [wai1 pi4 bu4 sui2] /facial paralysis and hemiplegia after apoplexy (idiom)/

奕訢 奕䜣 [Yi4 xin1] /Grand Prince Yixin (1833-1898), sixth son of Emperor Daoguang, prominent politician, diplomat and modernizer in late Qing/

恭親王奕訢 恭亲王奕䜣 [Gong1 qin1 wang2 Yi4 xin1] /Grand Prince Yixin (1833-1898), sixth son of Emperor Daoguang, prominent politician, diplomat and modernizer in late Qing/

綵 䌽{u+433d} [cai3] /variant of 彩/(bright) color/variety/multicolored silk/motley/variegated/

訢 䜣 [xin1] /pleased/delighted/happy/variant of 欣/

These are more recent entries, and appear to be from user submissions. I would guess they are from classical texts or fanciful writing.


How about...

- parse the CEDICT dictionary and build list of Chinese characters contained within. (see my post http://www.chinese-forums.com/index.php?/topic/22160-top-characterfile-parser-utility-make-your-own-char-lists)

- store sorted instance of list in program

- search list for character, return true/false.
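
The three steps above could be sketched in Python roughly as follows. This is only an illustration: the two sample entries are made up in CC-CEDICT's "traditional simplified [pinyin] /definition/" layout rather than read from the real file, the function names are hypothetical, and a `frozenset` lookup is swapped in for the sorted list plus search described above.

```python
import re

# Made-up sample in CC-CEDICT line format; a real run would read the file.
SAMPLE = """\
# comment lines start with a hash
中國 中国 [Zhong1 guo2] /China/
你好 你好 [ni3 hao3] /hello/
"""

# Pre-filter so that middle dots, commas and Latin letters are not collected.
HANZI = re.compile('[\u3400-\u4DBF\u4E00-\u9FFF]')


def build_char_set(lines):
    """Collect every hanzi found in the headword fields of the given lines."""
    chars = set()
    for line in lines:
        if line.startswith('#') or not line.strip():
            continue
        headwords = line.split('[', 1)[0]  # text before the pinyin field
        chars.update(HANZI.findall(headwords))
    return frozenset(chars)


KNOWN = build_char_set(SAMPLE.splitlines())


def is_known_hanzi(ch):
    """Return True if ch appeared in a dictionary headword."""
    return ch in KNOWN
```

With the sample above, `KNOWN` holds five distinct characters; anything the dictionary never lists, however valid, is reported as unknown.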

I would actually prefer this over any other implementation. Chinese characters are those characters used in written Chinese, as opposed to just 'Asian' characters.

Yeah you could do a range comparison, but it doesn't look to me that Chinese occupy one large contiguous block. (there's gaps in the codes)

I've attached a list I've just generated. It came from CEDICT and the large sample sentence file referenced in my post above. Using both gets me another 2 characters over CEDICT alone, for a total of 12,402 characters. Note that this is with the default Hanzi regex filter on; there seem to be a few odd ones that sort before 一, such as 䜣, 㑇 and 㑳. Very odd, but it depends how perfect you want it to be. How many Chinese people know more than 10,000 characters? Also, the CEDICT dictionary contains Japanese characters too.

12402_Top_Chars.zip


"but it doesn't look to me that Chinese occupy one large contiguous block."

They don't. As mentioned in my first post: "All CJK characters fall within a number of contiguous ranges", i.e. there are several different contiguous ranges and you need to check them all if you want to determine if the character is Chinese. These ranges cover all ideographs common to Chinese, Japanese and Korean (CJK).

The reason checking these ranges is preferable to the method you propose is:

A) it's significantly faster and uses significantly less memory to check a few ranges than it does to search against a list of known characters (important characteristics for mobile application).

B) it doesn't limit you to an incomplete set of characters, but rather allows for any Chinese character that the user can enter, even rare/uncommon ones that aren't in dictionaries like CEDICT. You could choose something more complete, such as the Unihan database, but then the problems mentioned in A) become more pronounced.


It's also worth pointing out that these ranges aren't just "Asian Characters". The Unicode Standard very clearly lays out ranges for each script. The so-called unified CJK ideographs are those that are common to the CJK languages, so any character in this range can be considered Chinese. Characters specific to a given language (hiragana and katakana for Japanese, Hangul for Korean, and even Bopomofo for Chinese) each have their own distinct ranges.
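
To illustrate how distinct those ranges are, here is a small Python sketch that labels a character by block. The table and the name `script_of` are illustrative only; it lists just the main block for each script (for example, only the Hangul Syllables block for Korean) and is far from a complete script detector.

```python
# (label, first code point, last code point) for a few Unicode blocks.
SCRIPT_RANGES = (
    ('hiragana', 0x3040, 0x309F),
    ('katakana', 0x30A0, 0x30FF),
    ('bopomofo', 0x3100, 0x312F),
    ('hangul', 0xAC00, 0xD7AF),    # Hangul Syllables
    ('han', 0x3400, 0x4DBF),       # CJK Unified Ideographs Extension A
    ('han', 0x4E00, 0x9FFF),       # CJK Unified Ideographs
    ('han', 0x20000, 0x2A6DF),     # CJK Unified Ideographs Extension B
)


def script_of(ch):
    """Return the label of the block containing ch, or None if not listed."""
    cp = ord(ch)
    for label, lo, hi in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return label
    return None
```

A kana or Hangul character therefore never falls into the 'han' ranges, which is why a hanzi check based on the unified ideograph blocks does not accidentally accept Japanese or Korean phonetic text.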

