chinasnippets Posted January 9, 2009 at 12:07 PM
Hi, I'm working on a project where a stop word list for Chinese characters would come in handy. I have searched around, but I have only found a few theses that studied this and no actual lists. Does anyone know of, or have, a Chinese character stop word list that I could use? Thanks a lot.
roddy Posted January 9, 2009 at 12:18 PM
停用词表 (stop word list)? Something like this?
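msittig Posted January 9, 2009 at 05:28 PM
I learned something new today. http://hi.baidu.com/seosky/blog/item/d18cfa3360fa4744ad4b5fc6.html
为节省存储空间和提高搜索效率,搜索引擎在索引页面或处理搜索请求时会自动忽略某些字或词,这些字或词即被称为Stop Words(停用词)。
(Translation: To save storage space and improve search efficiency, search engines automatically ignore certain characters or words when indexing pages or processing search queries. These characters or words are called stop words, 停用词.)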
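To make the definition quoted above concrete, here is a minimal Python sketch of character-level stop word filtering. The `STOP_CHARS` set and the `strip_stop_chars` name are made up for illustration; they are not taken from the linked list or from any search engine, and a real list would be much longer.

```python
# Hypothetical, very small stop-character list for illustration only.
STOP_CHARS = set("的了是在和也就都而及与")


def strip_stop_chars(text, stop_chars=STOP_CHARS):
    """Return text with every character found in stop_chars removed."""
    return "".join(ch for ch in text if ch not in stop_chars)


print(strip_stop_chars("今年党和政府工作的首要任务"))
```

Dropping single characters blindly can damage multi-character words that happen to contain a stop character, so this is usually only safe on text that has not been segmented into words at all.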
trevelyan Posted January 10, 2009 at 09:30 AM
If you want to do intelligent segmentation or text processing for Chinese text, perhaps you should take a look at Adso. It is a Chinese text segmentation and analysis engine.

The following command:

./adso -f [file] -g grammar/alexandre1.txt -g grammar/alexandre2.txt --no-phrases

Takes this as input:

在1 月9日召开的2009年全国卫生工作会议上,卫生部党组书记高强指出,促进经济平稳较快增长、进一步改善民生,是今年党和政府工作的首要任务。全国卫生系统和各级卫生行政部门要充分认识做好医疗卫生工作的重要意义和现实意义,坚定不移地贯彻落实党中央、国务院的部署和要求,加快推进医疗卫生体制改革,切实抓好重点项目建设,为改善民生和实现扩内需、保增长的目标服务。

And returns this as output:

卫生工作会议 卫生部党组书记 高强 经济 增长 民生 今年党 和政 首要任务 卫生系统 各级 卫生 部门 医疗卫生 重要意义 现实意义 党中央 国务院 推进 医疗卫生体制改革 重点项目建设 改善 民生 内需 目标

You can take a look at the two grammar files invoked to get a sense of what is happening. The software decides what content to filter based on part of speech, word length, and a few other criteria. You will generally need to customize it, but adding or removing rules is easy. The database ships with the software and includes POS information, so you can use it to generate your own stop word list easily enough.
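The last suggestion, building a stop word list from part-of-speech information, could look roughly like the Python sketch below. Everything in it is an assumption for illustration: the `LEXICON` entries, the tag names (`PREP`, `PARTICLE`, `CONJ`), and the function names are hypothetical, and the real Adso database layout and tags may well differ.

```python
# Hypothetical (word, part-of-speech) pairs standing in for a POS-tagged lexicon.
LEXICON = [
    ("在", "PREP"),
    ("的", "PARTICLE"),
    ("和", "CONJ"),
    ("是", "VERB"),
    ("民生", "NOUN"),
    ("国务院", "NOUN"),
    ("推进", "VERB"),
]

# Assumed set of "function word" tags to treat as stop words.
FUNCTION_POS = {"PREP", "PARTICLE", "CONJ"}


def build_stop_list(lexicon, function_pos=FUNCTION_POS, max_len=1, extra=()):
    """Collect short function words by POS tag, plus any hand-picked extras."""
    stops = {w for w, pos in lexicon if pos in function_pos and len(w) <= max_len}
    stops.update(extra)
    return stops


def filter_tokens(tokens, stops):
    """Drop tokens that appear in the stop list, keeping the original order."""
    return [t for t in tokens if t not in stops]


stops = build_stop_list(LEXICON, extra={"是"})
tokens = ["推进", "和", "改善", "民生", "的", "目标"]
print(filter_tokens(tokens, stops))
```

The `max_len` cutoff mirrors the word-length criterion mentioned above, and `extra` is there because some common words (such as 是) carry a content-word tag but are still usually wanted on a stop list.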
chinasnippets Posted January 10, 2009 at 01:27 PM
Thanks a lot for the great replies. Very useful. The list is great, and I'll have to study the adsotrans software a bit more (and ask my programmer as well) to see how it could be applied to the specific project I'm working on. Thanks again, I have a starting point now. Cheers, G.