
Word frequency list based on a 15 billion character corpus: BCC (BLCU Chinese Corpus)



Posted

A strange thing: in the global file, 第 (2002074595) comes first and 的 (943370349) comes second, but in all the other lists 的 is always first.

Posted

15 billion characters? Do you mean words?

 

50 thousand characters was the biggest number I had ever come across in the past; that many can't have been added in the last few years.

Posted

The corpus has a total size of 15 billion characters. Of course it's not 15 billion unique characters. The 中华字海 lists 85,568 unique characters, by the way.

Posted

Surely just means the total character count of the corpus surveyed. Otherwise Chinese in three months just got harder!

Posted

@johnvaradero Yes, it does, but as that link says, 50,000 is the first half of the dictionary and the remainder is made up of:

 

characters missed by previous dictionaries, as a result of manual error or due to lack of knowledge of such characters. Among these include complex characters hidden in old Buddhist texts, rare characters found within the Dunhuang manuscripts, characters used during the Song, Yuan, Ming and Qing Dynasties that fell from use, dialectal characters, newly created characters as a result of advancement in science and technology (such as the Chinese character for the element Darmstadtium, 鐽, which is not present in prior dictionaries[6]), as well as rare characters used today in personal and location names.[5] Additionally, regional characters and variant characters from Taiwan, Hong Kong, Macau and Singapore, as well as non-native characters from Japanese Kanji and Korean Hanja, are also listed in the Zhonghua Zihai. 

 

I think that unless you have a use for any of these characters in part 2 of the dictionary, I would happily confine myself to the first 50,000 but even then I only expect to master enough to read newspapers and general things in everyday life.

 

It's good, though, to know where to go if I needed to.

Posted

Thank you for sharing this resource!

 

4 hours ago, johnvaradero said:

The corpus is much larger than the CCL (470 million characters), the CNC (100 million characters), the SUBTLEC-CH (47 million characters) and the LCMC (less than 2 million characters). It seems as if the frequency lists derived from this corpus might be the most reliable frequency lists currently available.

 

The LCMC is definitely smaller than the other corpuses, but it has one big advantage: all the word segmentation was done by hand. The other corpuses (including this one, presumably, given its mammoth size) were segmented by a computer algorithm.

 

Also, isn't it "SUBTLEX-CH", not "SUBTLEC-CH"?

 

Posted

Good to see 3 different punctuation marks (~-!) making it into the Weibo Top 20.

Posted

Character frequency list, or word frequency list?  

 

The former is useless while the latter is something I've been needing for a while.

Posted
10 hours ago, vellocet said:

while the latter is something I've been needing for a while

You can try making your own.

 

The higher your level of Chinese, the less useful general frequency lists will be, because your reading habits are usually limited to a set of specific fields and topics so the frequencies from a general list won't be consistent with the frequencies of words that you encounter in what you're reading.
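
As a rough sketch of what making your own list could look like: the snippet below counts words in text that has already been segmented, with words separated by spaces. That pre-segmentation is an assumption, since real Chinese text needs a segmenter first, and the sample sentence is made up purely for illustration.

```python
from collections import Counter


def word_frequencies(segmented_text):
    """Count words in text that is ALREADY segmented (space-separated).

    Assumption: segmentation was done beforehand by hand or by a tool;
    this function does not segment raw Chinese text itself.
    """
    return Counter(segmented_text.split())


# Made-up sample, pre-segmented with spaces for illustration only.
sample = "我 喜欢 看 侦探 小说 ， 我 也 喜欢 看 报纸"
freq = word_frequencies(sample)

# Print the words from most to least frequent.
for word, count in freq.most_common():
    print(word, count)
```

Run over the books or articles you actually read, the same counting gives a list that reflects your own material rather than a general corpus.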

Posted
13 hours ago, vellocet said:

Character frequency list, or word frequency list?  

These are word frequency lists.

 

3 hours ago, imron said:

because your reading habits are usually limited to a set of specific fields and topics so the frequencies from a general list won't be consistent with the frequencies of words that you encounter in what you're reading.

I completely agree, but I think it really depends on one's goals. If you are mainly interested in reading crime fiction in Chinese, a general frequency list is the wrong choice. However, if your goal is as general as "being able to read newspapers", a general frequency list can be very beneficial.

Posted
3 hours ago, johnvaradero said:

a general frequency list can be very beneficial.

It's still more beneficial to get frequency statistics from what you are reading. Even when reading a general newspaper, each person will have a different preference for the types of stories they read and for the newspapers they read, and both of those things have a significant impact on word frequencies, especially once you can understand a large portion of the words in any given piece of text, say more than 80% (which is still a very low percentage and won't allow for reading without dictionary help).
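
To make the percentage concrete, here is a minimal sketch of how that kind of figure could be computed: the share of running words in a text that appear in your known-word set. The token list and the known-word set are made-up examples, and the text is assumed to be segmented already.

```python
def coverage(tokens, known_words):
    """Return the fraction of running tokens that are in known_words."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


# Made-up example: five running words, four of them known.
tokens = "我 喜欢 看 侦探 小说".split()
known_words = {"我", "喜欢", "看", "小说"}

print(coverage(tokens, known_words))
```

Here one unknown word in five already means stopping for the dictionary once per short sentence, which is why 80% still counts as low.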

Posted
1 hour ago, imron said:

It's still more beneficial to get frequency statistics from what you are reading. 

Reading certain types of newspaper stories or certain types of newspapers are specific goals. In order to achieve these kinds of specific goals, creating your own frequency lists with the Chinese Text Analyzer (love that program btw, great work) is the way to go.

But a general goal such as being able to read newspapers (meaning at least the common ones, and all of their articles) can be achieved most efficiently if one is guided by a general news frequency list. That doesn’t have to mean just learning the words from that list; it can also mean to use the list (as a Pleco user dictionary) in order to determine which ones of the unknown words in an article to learn and which ones not to learn.
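
A minimal sketch of that second use, outside of Pleco: given a frequency list as a word-to-rank mapping, keep only the unknown words in an article that rank within some cutoff. The function name, the ranks and the cutoff are all hypothetical and only illustrate the idea; they are not taken from the BCC lists.

```python
def words_to_learn(article_tokens, known_words, rank, cutoff):
    """Pick unknown words from an article that are worth learning.

    rank maps a word to its position in a frequency list (1 = most
    frequent). Words missing from the list or ranked past the cutoff
    are skipped. The result is sorted from most to least frequent.
    """
    unknown = {t for t in article_tokens if t not in known_words}
    picked = [w for w in unknown if w in rank and rank[w] <= cutoff]
    return sorted(picked, key=lambda w: rank[w])


# Hypothetical ranks, invented for this example.
rank = {"的": 1, "经济": 2, "政策": 3, "饕餮": 4}
article = "经济 政策 的 饕餮 盛宴".split()
known = {"的"}

print(words_to_learn(article, known, rank, cutoff=3))
```

With these made-up numbers, 经济 and 政策 are kept, 饕餮 falls past the cutoff, and 盛宴 is skipped because it is not in the list at all.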

 

Jun Da concluded in this paper (page 15) that in order to achieve the general goal “reading news for information”, as a first step learners have to “acquire around 20,000 high to medium frequency words and phrases used in journalistic Chinese”. If one is guided by a general news frequency list in the process of acquiring these 20 000 words and phrases, according to his findings, one should have achieved a basic comprehension rate that allows for extensive reading of all kinds of news articles and subsequent extension of the passive vocabulary through exposure.

However, if one learns 20 000 words based on one's own frequency lists built from the kinds of news stories one usually likes to read, one will greatly improve the ability to read these kinds of stories. But regarding the general goal of reading newspapers/reading news for information, one’s comprehension rate will be worse than the comprehension rate that could have been achieved by learning 20 000 words based on a general news frequency list. In particular, one will not achieve the basic comprehension rate that is required according to Jun Da.

 

It’s a different question whether it’s a good idea to have such general goals. Pursuing specific goals is probably more motivating and allows for faster progress.

Posted
20 hours ago, johnvaradero said:

it can also mean to use the list (as a Pleco user dictionary) in order to determine which ones of the unknown words in an article to learn and which ones not to learn

 

I agree wholeheartedly with this.
