Ednorog Posted October 18, 2011 at 08:08 AM

Hi, I need some help with the following: I have a list of Chinese words and a second list of characters. I'd like to see all the words on the first list that contain ONLY characters from the second list. Could someone advise me on how to do that? I know next to nothing about programming, and my Excel skills apparently aren't sufficient either (though I'd still prefer an Excel solution if possible). Any help would be greatly appreciated.
cababunga Posted October 18, 2011 at 06:19 PM

Here is an example of how it could be done in Python 2. You'll have to adapt it to your case. Let's say your list of words is actually a CEDICT release file: cedict_1_0_ts_utf-8_mdbg.txt.gz. And let's say your list of characters is stored in a file named characters.tab, which is a UTF-8 encoded, tab-delimited file with the first field in each row being the character.

    import gzip

    words = gzip.open('cedict_1_0_ts_utf-8_mdbg.txt.gz')
    # Set of allowed characters: first field of every non-blank row
    chars = set(x.split()[0].decode('utf-8')
                for x in open('characters.tab') if x.strip())
    for line in words:
        if line[0] == '#':  # skip CEDICT comment lines
            continue
        line = line.strip().decode('utf-8')
        word = line.split()[1]  # <- use .split()[0] for traditional
        # Keep the entry only if no character falls outside the allowed set
        if not (set(word) - chars):
            print line.encode('utf-8')  # <- or print word.encode('utf-8') for just the word

The script will probably spit out a lot of lines, so you'll want to redirect its output into a file, like this:

    python filter-words.py >words.txt
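For anyone on Python 3, the same idea can be sketched roughly as below. This is only a sketch under a few assumptions: the file layout is the one described above (CEDICT-style lines with the traditional form first and the simplified form second, and a character file with one character in the first field of each row), the function names `load_chars`, `words_using_only`, and `filter_cedict` are made up for illustration, and the three sample lines are hand-written in CEDICT style rather than copied from a release file.

```python
import gzip


def load_chars(path):
    """Collect the first field of each non-blank row into a set.

    Assumes one character per row; 'utf-8-sig' also tolerates a BOM,
    which some editors add when saving as UTF-8.
    """
    with open(path, encoding="utf-8-sig") as f:
        return {line.split()[0] for line in f if line.strip()}


def words_using_only(lines, allowed, column=1):
    """Yield (word, line) for entries whose word uses only allowed characters.

    column=1 picks the simplified form, column=0 the traditional form.
    """
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and comments
            continue
        word = line.split()[column]
        if set(word) <= allowed:  # every character is in the allowed set
            yield word, line


def filter_cedict(dict_path, chars_path, column=1):
    """Hypothetical wrapper: filter a gzipped dictionary file on disk."""
    allowed = load_chars(chars_path)
    with gzip.open(dict_path, "rt", encoding="utf-8") as f:
        return [line for _, line in words_using_only(f, allowed, column)]


# Small in-memory demonstration with hand-written sample lines.
sample = [
    "# sample comment line",
    "你好 你好 [ni3 hao3] /hello/hi/",
    "中國 中国 [Zhong1 guo2] /China/",
    "中文 中文 [Zhong1 wen2] /Chinese language/",
]
allowed = set("你好中文")
matches = [word for word, _ in words_using_only(sample, allowed)]
print(matches)
```

The subset test `set(word) <= allowed` does the same job as `not (set(word) - chars)` in the Python 2 version; it just states the condition directly. To run it against real files, something like `filter_cedict('cedict_1_0_ts_utf-8_mdbg.txt.gz', 'characters.tab')` should return the matching lines, assuming both files are in the current directory.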