Jockster Posted March 12, 2006 at 12:48 AM
The fact that there are no spaces between words throws a spanner in the works when building search engines for Chinese-language material. I am referring to search engines for a finite set of material here, NOT Internet search engines. Approaches used for European languages don't work, because a computer does not know a priori how to delimit the words in Chinese. You can index each character separately, or match the material against a dictionary when you index it, etc. Either way, it is hard.
Does anyone have an idea of how well Google, Alibaba and Baidu perform in this respect? Of course, if you type a few common search terms into the Chinese version of Google, you get a lot of hits. But - and this is a blessing for the Internet search engines - you cannot tell whether the search engine missed something, whether some hits are missing. The reason is that none of us knows what is out there, which is one of the reasons we use a search engine in the first place. When you have a finite set of documentation, say 10,000 pages if printed, you can (relatively) easily check whether the search engine's accuracy is 100%, because you can check the results against the material itself.
Any programmers out there who would like to share their insights?
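To make the "match the material against a dictionary" option concrete, here is a minimal sketch of forward maximum matching, one common way of doing it. The function name `segment`, the `max_len` cutoff and the toy dictionary are assumptions made up for illustration; nothing here claims to be what any particular search engine does.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest
    dictionary word, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted, so the loop always advances.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"北京", "大学", "北京大学", "学生"}

print(segment("北京大学学生", toy_dictionary))
print(segment("故宫", toy_dictionary))
```

With this toy dictionary the first call yields two words, while 故宫 falls apart into single characters because it is not listed. That is the weak spot of a purely dictionary-based approach: anything missing from the dictionary gets split.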
yingguoguy Posted March 12, 2006 at 11:28 AM
Agreed, not having spaces to delimit words is a big hassle for programmers, especially if you want to do something like count word frequencies. I'd guess the search engines don't use dictionaries, as a dictionary would be useless for names and for new words, which would be missing from dictionaries but are usually exactly the kinds of things people want information about.
At a guess, I'd say it wouldn't be a big problem for search engines though. They'd just treat each character as a separate word and use the same code/approach for finding compound words in Chinese as they do for adjacent words in English. The approach for indexing "Buckingham Palace" (with quotes) should be the same as for 故宫. I don't think Google needs to know whether something is a word or a phrase.
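A rough sketch of that guess: index every character together with its position, then answer a query by looking for its characters at consecutive positions, exactly as a quoted phrase of adjacent English words would be handled. The names `build_index` and `phrase_search` and the three sample documents are invented for this example, and this is only an illustration of the idea, not a description of how Google or Baidu actually work.

```python
from collections import defaultdict


def build_index(docs):
    """Map each character to {doc_id: set of positions where it occurs}."""
    index = defaultdict(lambda: defaultdict(set))
    for doc_id, text in docs.items():
        for pos, ch in enumerate(text):
            index[ch][doc_id].add(pos)
    return index


def phrase_search(index, query):
    """Return ids of documents that contain the query characters consecutively."""
    if not query:
        return set()
    hits = set()
    # Candidate start positions come from the first character's postings.
    for doc_id, starts in index.get(query[0], {}).items():
        for start in starts:
            # Every following character must sit at the next position.
            if all(start + k in index.get(ch, {}).get(doc_id, ())
                   for k, ch in enumerate(query)):
                hits.add(doc_id)
                break
    return hits


# Hypothetical sample documents.
docs = {1: "我昨天去了故宫", 2: "故事里的宫殿", 3: "故宫博物院"}
index = build_index(docs)

print(phrase_search(index, "故宫"))
```

Document 2 contains both 故 and 宫 but not next to each other, so it is not returned. No dictionary is involved, which is why names and new words pose no special problem with this approach.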