Filtering only words that contain certain characters

October 18, 2011 at 08:08 AM

Hi, I need some help with the following:

I have a list of Chinese words, and then another list of characters. Then, I'd like to be able to see all words on the list that contain ONLY characters from the second list. Could someone advise me on how to do that? I know nearly nothing about programming and my Excel skills apparently are not sufficient either (still I'd prefer an Excel solution if possible). Any help would be greatly appreciated.

October 18, 2011 at 06:19 PM

Here is an example of how it could be done in Python 2. You'll have to adapt it to your case.

Let's say your list of words is actually a CEDICT release file: cedict_1_0_ts_utf-8_mdbg.txt.gz. And let's say your list of characters is stored in a file named characters.tab, which is a UTF-8 encoded, tab-delimited file with the first field in each row being the character.

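import gzip

# Allowed characters: the first field of each non-blank row of characters.tab
chars = set(x.split()[0].decode('utf-8') for x in open('characters.tab') if x.strip())
words = gzip.open('cedict_1_0_ts_utf-8_mdbg.txt.gz')
for line in words:
    if line.startswith('#'): continue    # skip CEDICT comment lines
    line = line.strip().decode('utf-8')
    if not line: continue                # skip blank lines
    word = line.split()[1]               # <- use .split()[0] for traditional
    if not (set(word) - chars):          # no character outside the allowed set
        print line.encode('utf-8')       # <- print word.encode('utf-8') if you want just the word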
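
In case the test in the loop looks cryptic: set(word) - chars is the set of characters in the word that are not on your list, so the word passes exactly when that set is empty. Here is a minimal sketch of the same check on made-up data, written for Python 3 rather than Python 2. The name uses_only and the sample words are only for illustration and are not part of the script above.

```python
def uses_only(word, allowed):
    """Return True if every character of word is in the allowed set."""
    # Equivalent to `not (set(word) - allowed)`, written as a subset test.
    return set(word) <= allowed


# Made-up sample data, for illustration only.
allowed = set("中国人大")
for word in ["中国", "大人", "中文"]:
    print(word, uses_only(word, allowed))
```

Note that a word with a single character outside the list is rejected as a whole, which is what "contain ONLY characters from the second list" asks for.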

The script will probably spit out a lot of lines, so you will want to redirect its output into a file, like this:

python filter-words.py >words.txt
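
If your words are in a plain text file rather than a CEDICT file, the same idea might look roughly like this in Python 3. Treat it as a sketch under assumptions: I'm guessing at a file with one word per line, the function names are made up for this example, and characters.tab is laid out as described earlier. It writes the result straight to an output file instead of printing, so no redirection is needed.

```python
def load_allowed(rows):
    """Collect the first whitespace-delimited field of each non-blank row."""
    return {row.split()[0] for row in rows if row.strip()}


def filter_words(words, allowed):
    """Yield the words that consist only of allowed characters."""
    for word in words:
        word = word.strip()
        if word and set(word) <= allowed:
            yield word


def main(words_path, chars_path, out_path):
    # "utf-8-sig" reads plain UTF-8 and also tolerates a leading byte order
    # mark, which some editors and spreadsheet exports add.
    with open(chars_path, encoding="utf-8-sig") as rows:
        allowed = load_allowed(rows)
    with open(words_path, encoding="utf-8-sig") as words, \
         open(out_path, "w", encoding="utf-8") as out:
        for word in filter_words(words, allowed):
            out.write(word + "\n")


# Small in-memory demonstration with made-up data.
print(list(filter_words(["中国\n", "中文\n"], load_allowed(["中\tzhong\n", "国\tguo\n"]))))
```

To run it on real files you would call something like main('my-words.txt', 'characters.tab', 'filtered.txt'), where the first and last file names are placeholders for your own.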

Ednorog

cababunga
