webmagnets Posted December 25, 2018 at 10:45 PM
When I long-press a word on an English-language web page, the entire word gets highlighted. I can understand how the browser or OS knows where an English word starts and ends, because the words are separated by spaces. With Chinese, however, it still knows where the words are: when I long-press the 大 of 大家, the entire 大家 gets highlighted. How does that work?
Publius Posted December 26, 2018 at 08:16 AM
Doesn't seem to work for me.
wibr Posted December 26, 2018 at 08:22 AM
On iOS you can see word segmentation happening when you select something, but that's a function provided by iOS, not by the website.
webmagnets Posted December 26, 2018 at 12:36 PM
I'm seeing this on my Chromebook, but not on my Android.
imron Posted December 27, 2018 at 10:56 AM
I would guess it's real-time segmentation done by the OS. It never has to segment much text, because it only needs to look forward and back from the pressed character to the nearest punctuation and/or whitespace.
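
To illustrate the point about limited look-around, here is a minimal Python sketch of finding the stretch of text that would need segmenting around a pressed character. The function names `is_boundary` and `surrounding_run` are hypothetical, and treating every Unicode punctuation or whitespace character as a boundary is an assumption made for illustration, not a description of what any particular OS does.

```python
import unicodedata
from typing import Tuple


def is_boundary(ch: str) -> bool:
    """True for whitespace or any Unicode punctuation (categories P*)."""
    return ch.isspace() or unicodedata.category(ch).startswith("P")


def surrounding_run(text: str, index: int) -> Tuple[int, int]:
    """Return (start, end) of the run of non-boundary characters around index.

    Only this slice would have to be handed to a word segmenter.
    """
    if is_boundary(text[index]):
        return index, index  # pressed on punctuation or a space: nothing to segment
    start = index
    while start > 0 and not is_boundary(text[start - 1]):
        start -= 1
    end = index + 1
    while end < len(text) and not is_boundary(text[end]):
        end += 1
    return start, end


text = "你好，大家都很高兴。谢谢"
start, end = surrounding_run(text, text.index("大"))
print(text[start:end])  # 大家都很高兴
```

However long the page is, the segmenter only sees the clause between the two nearest boundaries, which is why doing this on every long press is cheap.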
mikelove Posted December 27, 2018 at 02:59 PM
Can confirm this is the case, yes; AFAIK all major OSes now include some sort of Chinese word segmentation support, though not every browser / text editor necessarily taps into it. The default approach (used by anybody without the AI chops to do better) is to use ICU's dictionary-based word segmenter, which finds possible breakdowns using a Chinese word list and then picks the most likely one based on word frequencies. (Pretty much the same thing we do, though our dictionary's bigger because we're not asking OEMs to devote flash storage to it on a billion devices.)
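
The general idea of dictionary-based segmentation (enumerate the breakdowns a word list allows, then pick the most likely one by word frequency) can be sketched in Python as below. This is a toy illustration, not ICU's actual code or data: the word list, the frequency counts, the unknown-character penalty, and the names `WORD_FREQ`, `segment`, and `word_at` are all made up for the example.

```python
import math
from typing import Dict, List

# Toy word list with invented frequency counts (assumption, not real ICU data).
WORD_FREQ: Dict[str, float] = {
    "大": 50, "家": 40, "大家": 200,
    "都": 150, "很": 150,
    "高": 30, "兴": 10, "高兴": 100,
}


def segment(text: str, word_freq: Dict[str, float] = WORD_FREQ,
            max_word_len: int = 4) -> List[str]:
    """Split text into the word sequence with the highest total log-probability."""
    total = sum(word_freq.values())
    n = len(text)
    best = [0.0] + [-math.inf] * n  # best[i]: best score for text[:i]
    back = [0] * (n + 1)            # back[i]: start index of the last word in text[:i]
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = text[start:end]
            freq = word_freq.get(word)
            if freq is None:
                if len(word) > 1:
                    continue  # multi-character strings must be in the word list
                freq = 0.5    # unknown single character: allowed, heavily penalised
            score = best[start] + math.log(freq / total)
            if score > best[end]:
                best[end] = score
                back[end] = start
    words = []
    end = n
    while end > 0:  # walk the back-pointers from the end of the text
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


def word_at(text: str, index: int) -> str:
    """Return the segmented word that contains the character at index."""
    pos = 0
    for word in segment(text):
        if pos <= index < pos + len(word):
            return word
        pos += len(word)
    raise IndexError(index)


print(segment("大家都很高兴"))     # ['大家', '都', '很', '高兴']
print(word_at("大家都很高兴", 0))  # 大家
```

With these toy counts, 大家 as one word scores far better than 大 followed by 家, so pressing on 大 selects the whole word. A real segmenter relies on a much larger dictionary, which is where the storage trade-off mentioned above comes from.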