Chinese-Forums

Excel: filtering a dictionary list


fenlan

I have an Excel list of the characters by frequency, and I also have an Excel list of the Richwin dictionary (duoyuanpinyin ciku) from http://technology.chtsai.org/wordlist/. There are 120,329 words in the dictionary, all of more than one syllable. They are split across two columns because there are more words than Excel permits rows. How can I filter the list to remove every word that uses a character outside the 5000 most frequent Chinese characters? I posted a list of characters by frequency in the Vocabulary forum; I want to put the 5000 characters in one column and the words in two other columns, and get Excel to show me only the words made up entirely of those 5000 characters.


There may be an easier way to do this, but here is one way I know works. Assuming your first two-character word is in A1:

In cell B1, enter the formula

=leftb(A1)

to extract the leftmost double-byte character of the string in A1.

In cell C1, enter the formula

=rightb(A1)

to extract the rightmost double-byte character of the string in A1.

You can then use a VLOOKUP against your frequency list to flag the characters that are not on it. Once you have sorted and deleted the flagged rows, leaving only the two-character words you want to keep, you can paste the pairs back together in single cells using a formula like this in cell D1:

=CONCATENATE(B1," ",C1)
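For anyone who would rather do this outside Excel, the same split-and-look-up idea can be sketched in a few lines of Python. This is only a minimal sketch, not the method used in this thread: the function name `filter_words` and the toy lists are made up for illustration, and a real run would load the 5000 characters and the word list from your own files. Unlike the two-formula approach, it checks every character of a word, so it also handles words longer than two characters.

```python
def filter_words(words, allowed_chars):
    """Keep only the words whose every character is in allowed_chars."""
    allowed = set(allowed_chars)  # set membership tests are fast
    return [w for w in words if all(ch in allowed for ch in w)]


# Toy stand-ins (assumptions): a tiny "frequent characters" list and word list.
top_chars = ["的", "一", "是", "中", "国", "人"]
words = ["中国", "国人", "中国人", "蝴蝶"]

kept = filter_words(words, top_chars)
print(kept)  # ['中国', '国人', '中国人']
```

Building a set once and testing each character against it avoids the constant recalculation a sheet full of lookups would cause.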

Let me know if you have problems with this.

And send me a copy when you are finished 8)


Gary, thank you for your advice. This was hard to do in Excel, with constant recalculation of all cells etc. Seeing as there are around 60,000 words and 13,000 characters in the Contemporary Chinese Dictionary, I had thought that the Richwin dictionary on the Internet would contain a lot of entries for rarer characters. I filtered for the most common 5020 characters (I didn't want to cut off at 5000, as the character frequency spreadsheet I posted in the Vocab forum showed quite a few characters that cropped up at a similar frequency around the 5000 mark). The result was that the list of 120,329 words was filtered down to....119,017!! With only 5000 characters, you can make nearly 120,000 words. Amazing. The list contains place names and proper names, however, as well as other words. I can send you the spreadsheet, but it is 5.5 MB in size. That is no problem for my email - let me know if you still want it. I had hoped that around 40,000 words would pop out, making a target list of words to go through after the HSK vocab, but 120,000 is rather unwieldy! There are also unusual lexical items in the list, such as 全党全军和全国!
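To put those two figures in proportion, here is a quick back-of-the-envelope check. The variable names are just illustrative; the two counts are the ones quoted above.

```python
total_words = 120_329   # words in the Richwin list before filtering
kept_words = 119_017    # words left after filtering on the 5020 characters

removed = total_words - kept_words
share_kept = kept_words / total_words

print(removed)               # 1312
print(f"{share_kept:.1%}")   # 98.9%
```

So the filter removed only 1,312 words, roughly 1 percent of the list.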


I guess I'll pass on the list. Nice to know that I "only" need to memorize 5,000 characters to recognize 120,000 words. :-?

I think I've read that about 80 percent of printed words in Chinese are made up of the 1,000 most frequent characters, so maybe it's not so surprising that 5,000 characters will give you the keys to 120,000 words.
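If you have a frequency table with raw counts, a coverage figure like that can be checked directly. The sketch below is only an illustration under stated assumptions: the function name `coverage` is hypothetical and the counts are invented toy numbers, not real corpus data. Note that it measures the share of running characters in a text covered by the top N characters, which is related to, but not the same as, the share of words.

```python
def coverage(char_counts, top_n):
    """Share of all character occurrences covered by the top_n most frequent characters."""
    counts = sorted(char_counts.values(), reverse=True)
    total = sum(counts)
    return sum(counts[:top_n]) / total


# Toy counts (assumption): character -> number of occurrences in some corpus.
toy_counts = {"的": 50, "一": 30, "是": 15, "蝶": 5}

print(coverage(toy_counts, 2))  # 0.8
```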
