imron Posted September 12, 2009 at 09:20 AM

Also, in answer to your other question, Unicode is the character set. That is, it specifies a unique codepoint for every single character. Separate fonts are then free to implement as many of the characters in that character set as they like. As long as the character you want is in the Unicode standard and you have a font that implements this character, then you will be able to display it on your computer.

For example, Unicode defines the codepoint for character 人 as 4EBA. All Unicode fonts that support the character 人 will make sure that character number 4EBA in their font looks like 人 (although a songti font will have this character stylized to look as if it was in songti, likewise a kaiti font would have the character stylized to look as if it was in kaiti etc). Each of these individual characters in the font is known as a glyph.

What happens however is that including glyph information for every single character defined by Unicode a) takes a long time and a lot of effort, and b) makes the font file huge. It simply doesn't make sense for most fonts to include glyphs for all the characters defined by Unicode, especially when most of them will never be used. Which is why only specialised fonts, or fonts that aim to be complete, will include them.

So assuming you had a font that supported the character you were interested in, then yes, you could locate it in Word using the method you specified, however that seems to me like a long way to do it. A better way would be to use the Unihan page (which is far easier to search/lookup) and just copy the "Your Browser" field into Word. If "Your Browser" just shows a question mark or a box, it is unlikely that Word will be able to display it either.
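To make the codepoint idea above concrete, here is a minimal Python sketch. It only uses the standard library; the variable names are illustrative and not something from the thread. It checks that 人 really sits at codepoint 4EBA and that the mapping works in both directions. Whether the character then displays correctly still depends on having a font with a glyph for it, which code like this cannot tell you.

```python
import unicodedata

# The character used as the example in the post above.
ch = "人"

# ord() returns the Unicode codepoint as an integer; format it as hex digits.
codepoint_hex = f"{ord(ch):04X}"

# Going the other way: chr() turns a codepoint back into the character.
round_trip = chr(0x4EBA)

print(codepoint_hex)         # the codepoint of the character, in hex
print(round_trip == ch)      # whether 4EBA maps back to the same character
print(unicodedata.name(ch))  # the character's formal Unicode name
```

The same two calls work for any character you copy out of the Unihan page, so they are a quick way to find out which codepoint you are actually dealing with.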
Lugubert Posted September 12, 2009 at 01:27 PM

imron wrote: If "Your Browser" just shows a question mark or a box, it is unlikely that Word will be able to display it either.

I'm not quite that pessimistic. On language boards, I sometimes encounter boxes, for example for vowels with a breve sign on them (that is, like the pinyin tone 3 mark). When I copy the text into Word, they are shown correctly.
imron Posted September 12, 2009 at 01:52 PM

That sounds more like an issue with combining diacritical marks though, which browsers typically haven't been so good with. I would be surprised to find it happen with a Chinese character from the "Your Browser" field of the Unihan page (but then I'm regularly surprised by all manner of things).
OneEye Posted September 12, 2009 at 02:20 PM

The Han Nom A and Han Nom B fonts are the most complete I've been able to find (not that I've looked a ton). Han Nom A is a Ming/Song font (serif) and Han Nom B is a Hei font (sans serif). From what I found, it contains all the CJK Unified Ideographs, plus the CJK Extensions A and B, for a total of 70,207 glyphs. It contains the character you mentioned and handles variants pretty well it seems. I haven't tested it a lot, but for instance, it shows both 說 and 説.

I'm not an expert on these things, but this is what I found on the Chinese Text Project website. According to the Font Test Page, for some browsers (IE and Opera) you need a registry update in order for CJK Extension B to show up. I don't know what that means, but maybe it will help.
Mark Yong (Author) Posted September 13, 2009 at 02:37 AM

Okay... I managed to get hold of the sursong.ttf file and downloaded it. The good news is, I now can view the "4-dragon" character, I can find it in MS Word, and I can copy-and-paste it from MS Word into here:
Mark Yong (Author) Posted September 13, 2009 at 02:39 AM

Okay... I managed to get hold of the sursong.ttf file and downloaded it. The good news is, I now can view the "4-dragon" character, and I can find it in MS Word. The funny thing is, when I copy-and-paste it into this post and submit it, the post is truncated right from the point where the character was inserted onwards! (imron, you will find that you had the same problem in your post above dated 30th July 2008, 03:52 PM). Any idea why?

Anyways, "Thank you" to imron and everyone else for helping out on this!

Now, the next question. With the Song Surrogate Fonts database loaded, I now have a wonderful database in excess of 50,000 characters to play with. The problem now is efficiently searching for what I want. The 4-dragons still does not appear in my Windows XP IME character recognition pad, so I am assuming that it is still drawing on another font set to find characters by recognition.

imron, suggestion taken on using the Unihan radical look-up as a starting point. Now, that works fine most of the time if you know what the radical is. But in those few instances where you don't, it could be a problem... per below.

imron wrote: Regarding the other characters, how are 勿 and 會 combined (left/right, top/bottom etc). I did a quick search on the unihan page and couldn't find the appropriate character (though there are quite a few others with 勿 as the radical).

They are combined from left to right. By the way, silly as it may sound, I could not find the 勿 as a radical in the Unihan database. It only lists the character under the 勹 radical with 2 residual strokes.
imron Posted September 13, 2009 at 03:06 AM

Yeah, the truncating of posts is a bug with MySQL (the database used by the forums) which doesn't support surrogates in UTF-8 and truncates any post containing them. It will hopefully be fixed in a future version.

Most IMEs will probably have trouble producing any surrogate character because they are so rare and most don't even have a known pronunciation (which makes it difficult for pinyin-based IMEs). For handwriting recognition, most probably don't have the stroke recognition order for the rarer characters. I know Sogou's IME allows you to piece characters together for some rarer characters (so the four dragons might work by typing longlonglonglong - though you'd need to test it to make sure).

As for radicals, yes, technically 勿 is not a radical, I was just being lazy in calling it that. According to my dictionary 勹 is the correct radical. I also tried looking for the character you mentioned under the 人 radical (used by 會) but didn't have any luck either. After a bit more hunting though, it turns out that it is under the 曰 radical (go figure). See here for the actual character.

It didn't actually occur to me that this would be the case and the way I found it was by googling 勿 會. The top result contained a post with that character in it, which I then copied/pasted into the Unihan search box. For most of these rarer characters, there's just not going to be an easy way to do it, and so you've got to get creative.
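The truncation described above comes down to how characters beyond U+FFFF are encoded. As far as I know, MySQL's `utf8` character set of that era stored at most three bytes per character, while these characters need four bytes in UTF-8 (and a surrogate pair in UTF-16). The Python sketch below illustrates the difference. The helper names are made up for this example, and the codepoint used for the four-dragons character (U+2A6A5) is an assumption, since the thread never states it; any character above U+FFFF behaves the same way.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_surrogates(ch: str):
    """Return the (high, low) UTF-16 surrogate pair, or None for BMP characters."""
    cp = ord(ch)
    if cp < 0x10000:
        return None  # fits in a single 16-bit unit, no surrogates needed
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


ren = "\u4EBA"               # 人, an ordinary character inside the BMP
four_dragons = "\U0002A6A5"  # assumed codepoint for the four-dragons character

for label, ch in (("ren", ren), ("four dragons", four_dragons)):
    pair = utf16_surrogates(ch)
    pair_text = "none" if pair is None else f"{pair[0]:04X} {pair[1]:04X}"
    print(f"{label}: U+{ord(ch):04X}, {utf8_length(ch)} UTF-8 bytes, surrogates: {pair_text}")
```

A storage layer that only accepts up to three bytes per character has no valid way to hold the second character, which is consistent with posts being cut off at exactly that point.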
imron Posted September 13, 2009 at 03:41 AM

I should add that most IMEs allow you to create your own user specified values for given key combinations, so once you've found an interesting character that you think you'll want to type regularly (or even just occasionally) you should be able to add bue as the input combination for the character 勿會 (not written here as a single character to prevent truncation) and then in the future you'll just need to type bue and it will appear as a choice. Similarly, you could use llll for the four dragons etc.
Mark Yong (Author) Posted September 13, 2009 at 10:01 AM

imron wrote: I know Sogou's IME allows you to piece characters together for some rarer characters (so the four dragons might work by typing longlonglonglong - though you'd need to test it to make sure)

imron wrote: I should add that most IMEs allow you to create your own user specified values for given key combinations...

That's a new one to me... I wasn't aware that one can custom-create new characters! Could you provide a brief step-by-step on how to do so in Windows XP? For want of a better example, let's just say that for some obscure reason, I want to create a character that combines 戶 with 同 'underneath' it (actually, 廈門方言誌 uses this as the character for t'ang 'window', but it's not found in Unihan).
Lugubert Posted September 13, 2009 at 10:51 AM

Another multi-animal question, which at the time was commented on in several places: http://wiesmann.codiferes.net/wordpress/?p=3727. The characters are OK in my browser, even if the three horses aren't as black as the single one. When I copy them to Word, they look fine but the three are very slightly smaller and should have been a fraction of a millimetre higher. Probably because of my standard font settings, Word tags the first character as Simsun, and the second as Simsun Founder Extended.

I can also copy the three from the Mojikyo Character Map. That looks even better.

The problem is of course that a normally saved Word file won't display either multihorse when opened on a computer without Simsun Founder Extended and Mojikyo's M108: the Simsun Founder Extended one is lost without even leaving a box, and the Mojikyo one shows up as a totally different character. Saving with the Word option Include TrueType etc. worked! (But such files can't be edited.)
imron Posted September 13, 2009 at 01:21 PM

@Mark Yong, it depends greatly on the IME that you use. Also you can only do it for characters that already exist in a given character set. You cannot use it to create new characters that don't exist in a character set already (you need a different program for that).

To get your IME working like that for existing characters you'll need to look in the help for your IME, it will be called something like "custom dictionary" or "user dictionary" or something (I use a Mac and the QIM IME. There it's called: 用户自定义字符串). Both the Sogou pinyin input method and the Google pinyin input method support similar things and I'm sure other IMEs do too. It basically just allows you to specify a key combination and a character/string of characters that should be suggested when that key combination is entered.

If the character itself is not in Unicode, you can still actually create it yourself provided you have software that can create/edit fonts - I'm not sure of the name of any, but I'm sure a Google search would turn up something. Unicode has set aside a block of codepoints for private use. You would need to create your font/edit an existing font and add the character to the private use block. Then you could configure your IME so that upon a certain key combination e.g. t'ang it would output the Unicode value for your new private character. This would then be displayed correctly on any machine that had your font installed (other users would just see ? or a box).

As I mentioned above, I use a Mac, so I can't really do a step by step guide that would work for you. It is however possible to do (although editing fonts can require a fair amount of work).
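The private-use approach above can be sketched roughly in Python. This is only an illustration under stated assumptions, not how any particular IME stores its user dictionary: the `user_dictionary` mapping, the `lookup` helper, and the choice of U+E000 as the slot for the custom character are all hypothetical. The one fixed fact is that the basic Private Use Area runs from U+E000 to U+F8FF and that Unicode gives those codepoints the general category `Co`.

```python
import unicodedata

# Bounds of the basic Private Use Area in the Basic Multilingual Plane.
PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF


def is_private_use(ch: str) -> bool:
    """True if Unicode classifies the character as private use (category 'Co')."""
    return unicodedata.category(ch) == "Co"


# Hypothetical slot for a hand-made glyph; it only displays correctly on
# machines that have the edited font installed.
custom_char = chr(PUA_FIRST)

# Hypothetical user dictionary: key combination -> character to suggest.
user_dictionary = {
    "t'ang": custom_char,
}


def lookup(keys: str):
    """Return the character registered for a key combination, if any."""
    return user_dictionary.get(keys)


print(f"U+{ord(lookup(chr(116) + chr(39) + 'ang')):04X}")  # codepoint suggested for t'ang
print(is_private_use(custom_char))                         # private-use check
print(is_private_use("人"))                                 # an assigned character
```

Because private-use codepoints carry no agreed meaning, two people who pick the same slot for different characters will see each other's text rendered with the wrong glyph, which is the trade-off of this route.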