gsteemso Posted September 24, 2009 at 05:24 PM

Hi all!

I am a computing hobbyist who has gotten sidetracked, while designing my own computer from the register level up, into questions of what characters the finished machine should be able to process and display. For reasons which are not relevant to this discussion, I am limited to a repertoire of 4096 characters (with an additional 4096 accents, modifiers, and combinations thereof; there is also a facility for shrunken characters, mainly driven by their prevalence in Japanese kana).

I was annoyed to discover that this is not enough to do anything useful in Chinese, even if I discard all the other writing systems I wanted to include (basically anything commonly used in the Seattle-Vancouver, B.C., area: the many variants of Latin, Japanese [kana only], Cyrillic, Greek, Canadian syllabics, Tagalog, various Indic scripts… there are a lot of sizeable minorities around here).

While researching methods of writing Chinese and related languages, I came across this site and read a variety of interesting discussions. The consensus here seems to be that Chinese, at least of the Mandarin variety, is ill-suited to being written phonetically, despite the fact that people have been advocating exactly that for centuries now.

In any case, my question: Pinyin et al. being irrelevant to the topic at hand, as I am already including the Latin script in my homebrew computer design, would it be any real use to a user of Chinese if I include bopomofo capability?

If bopomofo is of too-limited use to bother with, are there any other ways of representing Chinese writing that use no more than a few hundred code points for the complete repertoire? I have seen fleeting references to something called radical/stroke indexes, but I don't know how general the idea is, or even what it's really about.
Eagerly awaiting your collective wisdom,
gsteemso

PS: You may ask, "Why include Chinese in a computer only I will ever use, when I don't know how to speak or write it?" Simply put, I have an impractical dream of someday learning that and many other languages. Being Canadian, I have a rudimentary grasp of French already, and they say learning languages gets easier as you learn more of them.
renzhe Posted September 24, 2009 at 06:09 PM

I think that nowadays, bopomofo is useful, but not nearly as useful as pinyin, which you have already covered. The only time you'd really want to use bopomofo instead of pinyin is if the person reading can read bopomofo, but cannot read pinyin. This might or might not be the case around your area, I don't know.

There's nothing wrong with bopomofo, but I don't know if it is used for anything other than teaching children to read in Taiwan nowadays. I imagine that the benefit of having bopomofo on top of pinyin would be minimal.

Why not encode Chinese characters using several characters, the way UTF-8 does?
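To make renzhe's suggestion concrete, here is a minimal Python sketch. The first function only shows how UTF-8 spends several 8-bit units on one Chinese character. The second part is a purely hypothetical analogue for 12-bit units: the split at 0xF00 and the names LEAD_BASE, encode12 and decode12 are assumptions made up for illustration, not anything proposed in this thread.

```python
def utf8_units(ch: str) -> list[int]:
    """Return the 8-bit code units UTF-8 uses for one character."""
    return list(ch.encode("utf-8"))

# Hypothetical 12-bit scheme (an assumption, for illustration only):
# units 0x000..0xEFF stand alone; a lead unit 0xF00..0xFFF carries the
# high 8 bits of an extended code and the next unit carries the low 12.
LEAD_BASE = 0xF00
EXT_LIMIT = 256 * 4096  # number of two-unit codes available

def encode12(index: int) -> list[int]:
    """Encode a character index as one or two 12-bit units."""
    if index < 0:
        raise ValueError("index must be non-negative")
    if index < LEAD_BASE:
        return [index]
    ext = index - LEAD_BASE
    if ext >= EXT_LIMIT:
        raise ValueError("index out of range")
    return [LEAD_BASE + (ext >> 12), ext & 0xFFF]

def decode12(units: list[int]) -> int:
    """Inverse of encode12 for a single encoded character."""
    if len(units) == 1:
        return units[0]
    lead, trail = units
    return LEAD_BASE + (((lead - LEAD_BASE) << 12) | trail)
```

Note that in this sketch a trailing unit is indistinguishable from a stand-alone unit unless you read from the start of the text, which is one of the ways multi-unit schemes complicate a simple character display.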
jbradfor Posted September 24, 2009 at 08:15 PM

gsteemso wrote: "would it be any real use to a user of Chinese if I include bopomofo capability?"

No, as bopomofo and pinyin are both systems of transliteration. The only difference between the two is that pinyin transliterates into the roman alphabet, and bopomofo transliterates into a made-up alphabet. That is, the information content in bopomofo is exactly the same as in pinyin. It provides no additional information about how the character is written. It may look like Chinese characters, but it's not. If you look at http://en.wikipedia.org/wiki/Bopomofo#Bopomofo_vs._tongyong_pinyin_.26_Hanyu_pinyin for example, you'll see the 1-to-1 correspondence between bopomofo and pinyin.

gsteemso wrote: "If bopomofo is of too-limited use to bother with, are there any other ways of representing Chinese writing that use no more than a few hundred code points for the complete repertoire?"

Maybe. I believe there are ways of entering Chinese characters using a standard keyboard in which a set sequence of keys corresponds to exactly one character (unlike pinyin-based input, in which a given pinyin typically corresponds to multiple characters). You could potentially display Chinese characters as their input sequence. It would be unique, but would require a fair amount of learning to be able to read it.

Otherwise, probably not. Characters are typically made up of component parts, so one could conceive of a system in which for a character you just display each part separately, and require the reader to mentally put the character together. This would again take some adjusting, but probably not as much as the above. Alas, there are probably too many "building pieces" to fit within the allotted code points.
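As a small illustration of the correspondence jbradfor mentions, and of how few code points bopomofo needs, here is a hedged Python sketch. It relies on the fact that Unicode places the 37 core bopomofo symbols at U+3105 through U+3129. The mapping table covers only eight initials and is deliberately incomplete: finals and tone marks are left out, and the names BOPOMOFO and INITIAL_TO_PINYIN are just illustrative.

```python
# The 37 core bopomofo symbols occupy U+3105..U+3129 in Unicode.
BOPOMOFO = [chr(cp) for cp in range(0x3105, 0x312A)]

# Partial table: a few initials only. Finals and tone marks are omitted,
# so this is not a complete bopomofo-to-pinyin converter.
INITIAL_TO_PINYIN = {
    "ㄅ": "b", "ㄆ": "p", "ㄇ": "m", "ㄈ": "f",
    "ㄉ": "d", "ㄊ": "t", "ㄋ": "n", "ㄌ": "l",
}

def initials_to_pinyin(symbols: str) -> str:
    """Transliterate a string of bopomofo initials symbol by symbol."""
    return "".join(INITIAL_TO_PINYIN[s] for s in symbols)
```

For a 4096-character repertoire, the relevant point is that the whole script costs only a few dozen code points, whatever its usefulness next to pinyin.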
jbradfor Posted September 24, 2009 at 08:29 PM

@renzhe: "Why not encode Chinese characters using several characters, the way utf-8 does?"

I assume he is discussing displaying the characters on the screen, not encoding a document on disk. That is, it is not a full bit-mapped screen; rather, each character cell on the screen is defined by a 12-bit value. Before bit-mapped screens became cheap, this was how text was typically displayed. For example, the original IBM PC had an 80x25 character display in which each cell had only 8 bits to select the character to display, and a ROM converted each 8-bit value into actual dots on the screen.
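The character-ROM lookup described in this post can be sketched roughly as follows in Python. This is a toy model under stated assumptions: the 8x8 glyph size and the two glyph bitmaps in FONT_ROM are made up for illustration and are not the contents of the IBM PC's ROM or of any real font.

```python
# Toy character generator: each screen cell holds a character code, and a
# lookup table (standing in for the font ROM) maps that code to row bitmaps.
# Glyph size and bitmap data are made-up illustration values.
GLYPH_W, GLYPH_H = 8, 8
FONT_ROM = {
    0x00: [0x00] * GLYPH_H,                                  # blank cell
    0x41: [0x18, 0x24, 0x42, 0x42, 0x7E, 0x42, 0x42, 0x00],  # rough 'A'
}

def render(screen_codes: list[list[int]]) -> list[str]:
    """Expand a grid of character codes into pixel rows ('#' lit, '.' dark)."""
    rows = []
    for text_row in screen_codes:
        for y in range(GLYPH_H):
            line = ""
            for code in text_row:
                bits = FONT_ROM[code][y]
                line += "".join(
                    "#" if bits & (0x80 >> x) else "." for x in range(GLYPH_W)
                )
            rows.append(line)
    return rows
```

The display memory only stores one small code per cell; the cost of a larger repertoire shows up in the size of the lookup table, not in the screen buffer.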
gsteemso Posted September 24, 2009 at 08:39 PM

Ah, I see. Thank you all, that simplifies things.

I wanted to avoid using multiple code units per character for a variety of reasons, most of which have to do with the unwieldiness of the resulting system. I don't know how they managed it on the Asian-language versions of 1980s microcomputers. (Maybe I should look that up.)

The reason I am limited to a total of 4096 displayable characters is indeed that I am using a 12-bit byte size (Why? Just because I feel like it! That's what makes this fun), and it gets a lot less hairy to implement if there is only one byte per displayed character. Even with a code repertoire of 4096 displayable characters, it still takes around 640KiB to hold the bitmapped font in ROM; I'd really rather not use a lot more than that, you know? With the design I've tentatively settled on, I've already got another 360KiB used up for the miniaturized characters' bitmap. Yeah, I could just do the 92 Japanese kana, but that feels a bit kludgy to me, plus this way I get to have small caps. :¬)

(More detail: I'm currently planning to have a character-based display mode modelled loosely after the Commodore 64's, as that's the kind I am most familiar with. I'm planning to allow the display to be bitmapped if desired, too, but writing the software to drive that is an order of magnitude more complex than just echoing characters to something that acts like a dumb terminal. Not what I want to spend my time on.)

Thanks again,
gsteemso
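The ROM figures in this post imply a per-glyph bit budget, which can be worked out with a short Python sketch. It is only a rough check under an assumption the post does not confirm: that "KiB" means 1024 units of 8 bits. Since the machine uses 12-bit bytes, the helper takes the unit width as a parameter; the name bits_per_glyph is illustrative.

```python
def bits_per_glyph(rom_kib: int, glyph_count: int, bits_per_unit: int = 8) -> float:
    """Bits of font ROM available per glyph.

    Assumes 1 KiB = 1024 storage units of bits_per_unit bits each.
    The default of 8 is an assumption; the post's machine uses 12-bit bytes.
    """
    return rom_kib * 1024 * bits_per_unit / glyph_count

full_size = bits_per_glyph(640, 4096)   # main font
miniature = bits_per_glyph(360, 4096)   # miniaturized characters
ratio = miniature / full_size           # area ratio between the two fonts
```

With 8-bit units this gives 1280 bits per full-size glyph and 720 bits per miniature glyph, an area ratio of 0.5625, which would correspond to shrinking each dimension to three quarters. One pair of sizes that happens to fit is 32x40 and 24x30 pixels, but the post does not state the actual glyph dimensions.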
xiaotao Posted September 24, 2009 at 09:08 PM

If you want to learn traditional characters, zhuyin fuhao (bopomofo) is very helpful. Our family learns both traditional and simplified characters, so we know both zhuyin and pinyin. It just gives us more flexibility to enjoy more Chinese books, like being able to drive a stick shift as well as an automatic.
Daan Posted September 24, 2009 at 11:25 PM

renzhe wrote: "There's nothing wrong with bopomofo, but I don't know if it is used for anything other than teaching children reading in Taiwan nowadays."

MTC textbooks here in Taiwan still use bopomofo, although they also include Hanyu pinyin transcriptions. One publisher that hasn't switched to pinyin yet, probably because of space constraints in its editions of the Chinese classics, is Sanmin. Most locals do not know Hanyu pinyin, so when you ask them to transcribe something they'll use bopomofo. Other than in those contexts, it's Hanyu (or Tongyong, but that was officially replaced a few months ago) pinyin all the way.
imron Posted September 25, 2009 at 12:16 AM

How many codepoints do the other characters take up? You could actually get a lot of useful characters into 2,000-3,000 codepoints. The top 3,000 characters by frequency of usage account for approximately 99% of all characters in modern texts. The top 2,000 account for approximately 97%. See this page for frequency statistics.

These figures only take Simplified characters into account, however. You'd probably need to choose between Simplified and Traditional characters (Simplified are used mostly on the mainland; Traditional are used mostly in Hong Kong, Taiwan, and overseas Chinese communities that were well established before the simplification efforts of the 1950s).
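Coverage figures like the ones imron quotes come from counting characters in a corpus and summing the share of the most frequent ones. A minimal Python sketch of that calculation follows. It does not reproduce the 97% and 99% figures, which depend on a real corpus; the function name coverage and the choice to ignore whitespace are assumptions for illustration.

```python
from collections import Counter

def coverage(text: str, top_n: int) -> float:
    """Fraction of the (non-whitespace) characters in text that belong to
    the top_n most frequent distinct characters."""
    counts = Counter(ch for ch in text if not ch.isspace())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for _, n in counts.most_common(top_n))
    return covered / total
```

Run over a large body of Chinese text with top_n set to the number of code points left over after the other scripts, this would estimate how much running text a partial character set could display.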
gsteemso Posted September 25, 2009 at 02:18 AM

Unfortunately, the other characters will probably eat around 2500-3000 codepoints… I think; it depends on what scripts I include and how much of Unicode, which I am using as a working guide, can safely be omitted. I'm still trawling through Unicode glyph charts trying to figure out what parts were only included for legacy encodings and may therefore safely be dropped for casual use.

In any case, the Chinese community here (Seattle/Vancouver) is well entrenched from before 1950, and a lot of the Vancouver one at least came over from Hong Kong in 1997, so I'd probably have to go with "traditional." It looks like it just is not going to work given a 12-bit code size. How many characters do you need for a working use of Traditional Chinese? I could probably kludge something up with a second 4096-point repertoire especially dedicated to it, but I doubt it would be worth the hassle when I'd be the only user.

This is all very interesting. Thank you for discussing it with me.
imron Posted September 25, 2009 at 03:35 AM

gsteemso wrote: "How many characters do you need for a working use of Traditional Chinese?"

The figures will be basically unchanged, as the written language is still essentially the same. Small differences in regional usage will slightly alter which characters make it into the top 3,000 most frequently used, but the core will remain the same.