xaze Posted September 14, 2010 at 01:12 AM
Hello, does anyone know why searching for a bi-gram such as "信息" produces more search results than searching for an individual character that is part of that bi-gram, such as "息"? Some bi-grams return results in the millions, while the individual-character searches return about half that. Is there a special way to search Google when using Chinese characters?
roddy Posted September 14, 2010 at 03:38 AM
I'd imagine Google assumes that you aren't searching for a character so much as a word, and as 信息 is more common as a word than 息, you see that reflected in the results.
jbradfor Posted September 14, 2010 at 06:24 PM
That does seem strange. When I just tried it, it wasn't 2:1, but the single-character hit count was slightly higher. I would assume this is just error in the estimate of the number of page hits. While useful, the page count is just that, an estimate, and Google has stated that it is subject to error.
xaze Posted September 15, 2010 at 03:49 AM
I would say it was an estimation error too. However, some of the hit counts are extremely off, or the two counts are about equal, which probably shouldn't be the case. I was going to use the search results as a poor man's word frequency for long-tail Chinese words (even though it is not quite a word frequency, given that it counts matching pages and not word occurrences). However, the estimates are so inaccurate that they are useless. I did the same for Spanish irregular conjugations, and for western languages the estimates are a lot more useful.
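To make the pages-versus-words distinction concrete, here is a minimal Python sketch. The four "pages" are a made-up toy corpus, not real search data, and the function names are only illustrative. It assumes plain exact substring matching, which is probably not what Google actually does for Chinese.

```python
# Hypothetical toy "pages" for illustration only; not real search data.
pages = [
    "信息技术和信息安全",
    "休息一下",
    "没有消息",
    "今天天气很好",
]

def document_frequency(term, docs):
    """Number of documents containing the term at least once.
    This is what a search-engine hit count roughly approximates."""
    return sum(1 for d in docs if term in d)

def term_frequency(term, docs):
    """Total number of non-overlapping occurrences of the term in all documents.
    This is what a real word-frequency list would count."""
    return sum(d.count(term) for d in docs)

for term in ("息", "信息"):
    print(term, document_frequency(term, pages), term_frequency(term, pages))
```

Under exact substring matching, every page that contains "信息" also contains "息", so the page count for the bi-gram can never exceed the page count for the single character. Hit counts that break this rule suggest that something other than literal matching (or a rough estimate) is involved.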
roddy Posted September 15, 2010 at 05:46 AM
There are definitely word factors coming into play. If you do a search for 息 you'll notice that it's not just that character that gets highlighted: 信息, when it appears in the snippet, is also highlighted in red, so there's clearly an awareness of words in the algorithm. Quite how it works, I don't know.
tooironic Posted September 16, 2010 at 02:02 PM
Yes, Google's search methods continue to elude me.
"信息的": 31,600,000
"信息的一": 35,100,000 [??]
"信息的一個": 3,840,000
I don't get it.
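The oddity in those three numbers can be checked mechanically. This is only a sketch, assuming quoted searches behave like exact phrase matches (so a longer phrase should never get more hits than a phrase it contains); the function name is hypothetical and the counts are the ones quoted in the post above.

```python
# Hit counts as quoted in the post above.
reported_hits = {
    "信息的": 31_600_000,
    "信息的一": 35_100_000,
    "信息的一個": 3_840_000,
}

def inconsistent_pairs(hits):
    """Return (shorter, longer) pairs where the longer phrase contains the
    shorter one yet reports more hits, which exact matching would not allow."""
    pairs = []
    for shorter, n_short in hits.items():
        for longer, n_long in hits.items():
            if shorter != longer and shorter in longer and n_long > n_short:
                pairs.append((shorter, longer))
    return pairs

print(inconsistent_pairs(reported_hits))
```

Only the pair "信息的" / "信息的一" is flagged, which matches the line marked [??].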