Chinese-Forums

Filtering only words that contain certain characters



Posted

Hi, I need some help with the following:

I have a list of Chinese words and a separate list of characters. I'd like to see all the words from the first list that contain ONLY characters from the second list. Could someone advise me on how to do that? I know next to nothing about programming, and my Excel skills apparently aren't sufficient either (though I'd still prefer an Excel solution if possible). Any help would be greatly appreciated.

Posted

Here is an example of how it could be done in Python 2. You'll have to adapt it to your case.

Let's say your list of words is actually a CEDICT release file: cedict_1_0_ts_utf-8_mdbg.txt.gz. And let's say your list of characters is stored in a file named characters.tab, which is a UTF-8 encoded, tab-delimited file with the character as the first field of each row.

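import gzip

words = gzip.open('cedict_1_0_ts_utf-8_mdbg.txt.gz')
# Allowed characters: the first field of every non-blank row in characters.tab
chars = set(x.split()[0].decode('utf-8') for x in open('characters.tab') if x.strip())
for line in words:
    if line[0] == '#':  # skip the comment lines at the top of the CEDICT file
        continue
    line = line.strip().decode('utf-8')
    word = line.split()[1]  # simplified form; use .split()[0] for traditional
    if not (set(word) - chars):  # empty difference: every character is allowed
        print line.encode('utf-8')  # use word.encode('utf-8') if you want just the word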

The script will probably print a lot of lines, so you'll want to redirect its output to a file, like this:

python filter-words.py >words.txt
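
For anyone on Python 3, the same idea could be sketched roughly as below. This is only a sketch under the same assumptions as above (a CEDICT-style file with the traditional form in the first column and the simplified form in the second, and a characters file with the character in the first field). The function names `load_allowed_chars`, `only_allowed`, `filter_cedict` and `main` are made up for illustration and are not part of any library.

```python
import gzip


def load_allowed_chars(path):
    """Return the set of first-field characters from a UTF-8, tab-delimited file."""
    # "utf-8-sig" also tolerates a byte-order mark at the start of the file.
    with open(path, encoding="utf-8-sig") as f:
        return {line.split()[0] for line in f if line.strip()}


def only_allowed(word, allowed):
    """True if every character of `word` is in the `allowed` set."""
    return set(word) <= allowed


def filter_cedict(lines, allowed, traditional=False):
    """Yield CEDICT-style entries whose headword uses only allowed characters."""
    column = 0 if traditional else 1  # assumed layout: traditional, then simplified
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and header comments
            continue
        fields = line.split()
        if len(fields) > column and only_allowed(fields[column], allowed):
            yield line


def main(chars_path, cedict_path, out_path):
    """Hypothetical driver: write the matching entries to `out_path` as UTF-8."""
    allowed = load_allowed_chars(chars_path)
    with gzip.open(cedict_path, "rt", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for entry in filter_cedict(src, allowed):
            dst.write(entry + "\n")
```

Calling `main('characters.tab', 'cedict_1_0_ts_utf-8_mdbg.txt.gz', 'words.txt')` would then write the results straight to words.txt, so no redirect is needed. If your word list is a plain text file with one word per line, `only_allowed(word, allowed)` on each line is all you need.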
