Word frequency list based on a 15 billion character corpus: BCC (BLCU Chinese Corpus)

June 15, 2018 at 11:51 AM

The Beijing Language and Culture University created a balanced corpus of 15 billion characters. It’s based on news (人民日报 1946-2018, 人民日报海外版 2000-2018), literature (books by 472 authors, including a significant portion of non-Chinese writers), non-fiction books, blog and Weibo entries, as well as classical Chinese. Frequency lists derived from the corpus can be downloaded here: http://bcc.blcu.edu.cn/downloads/resources/BCC_LEX_Zh.zip

The zip file contains a global frequency list based on the whole corpus, plus frequency lists for specific categories of the corpus (e.g. news, literature). These text files can easily be turned into a Pleco user dictionary.
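As a rough sketch of that conversion, the Python below assumes that each line of a list holds a word and its count separated by whitespace, that the list is sorted from most to least frequent, and that Pleco's text import accepts one tab-separated headword, pinyin, definition entry per line (with the pinyin left empty here). None of this is confirmed by the download itself, so check it against the actual files and Pleco's import documentation. The function name `to_pleco_entries` is made up for illustration.

```python
def to_pleco_entries(lines, limit=None):
    """Turn 'word count' lines into tab-separated dictionary entries.

    Assumed input: one word and one integer count per line, whitespace-separated,
    ordered from most to least frequent.
    Assumed output format: headword<TAB>pinyin<TAB>definition, with the pinyin
    left empty and the rank and raw count stored as the definition.
    """
    entries = []
    rank = 0
    for line in lines:
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip blank or malformed lines
        rank += 1
        word, count = parts
        entries.append(f"{word}\t\trank {rank}, count {count}")
        if limit is not None and rank >= limit:
            break
    return entries


# Two sample lines plus a blank and a malformed line that get skipped.
sample = ["第 2002074595", "的 943370349", "", "bad line here"]
for entry in to_pleco_entries(sample):
    print(entry)
```

With the rank stored in the definition, looking up a word in Pleco would show roughly how common it is in the corpus.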

The corpus is much larger than the CCL (470 million characters), the CNC (100 million characters), the SUBTLEC-CH (47 million characters) and the LCMC (less than 2 million characters). It seems as if the frequency lists derived from this corpus might be the most reliable frequency lists currently available. More detailed information about the corpus can be found in this paper.

Maybe this is useful to some of you.

June 15, 2018 at 01:23 PM

A strange thing: in the global file, 第 comes first (2002074595) and 的 second (943370349), but in all the other lists 的 is always first.

June 15, 2018 at 02:07 PM

15 billion characters? Do you mean words?

50 thousand characters was the biggest number I have ever come across in the past; that many can't have been added in the last few years.

June 15, 2018 at 02:13 PM

The corpus has a total size of 15 billion characters. Of course it's not 15 billion unique characters. The 中华字海 lists 85,568 unique characters, by the way.

June 15, 2018 at 02:16 PM

Surely just means the total character count of the corpus surveyed. Otherwise Chinese in three months just got harder!

June 15, 2018 at 02:57 PM

@johnvaradero Yes it does, but as that link says, 50,000 is the first half of the dictionary and the remainder is made up of:

characters missed by previous dictionaries, as a result of manual error or due to lack of knowledge of such characters. Among these include complex characters hidden in old Buddhist texts, rare characters found within the Dunhuang manuscripts, characters used during the Song, Yuan, Ming and Qing Dynasties that fell from use, dialectal characters, newly created characters as a result of advancement in science and technology (such as the Chinese character for the element Darmstadtium, 鐽, which is not present in prior dictionaries^[6]), as well as rare characters used today in personal and location names.^[5] Additionally, regional characters and variant characters from Taiwan, Hong Kong, Macau and Singapore, as well as non-native characters from Japanese Kanji and Korean Hanja, are also listed in the Zhonghua Zihai.

I think that unless you have a use for any of these characters in part 2 of the dictionary, I would happily confine myself to the first 50,000 but even then I only expect to master enough to read newspapers and general things in everyday life.

It's good, though, to know where to go if I need to.

June 15, 2018 at 04:12 PM

Thank you for sharing this resource!

4 hours ago, johnvaradero said:

The corpus is much larger than the CCL (470 million characters), the CNC (100 million characters), the SUBTLEC-CH (47 million characters) and the LCMC (less than 2 million characters). It seems as if the frequency lists derived from this corpus might be the most reliable frequency lists currently available.

The LCMC is definitely smaller than the other corpuses, but it has one big advantage: all the word segmentation was done by hand. The other corpuses (including this one, presumably, given its mammoth size) were segmented by a computer algorithm.

Also, isn't it "SUBTLEX-CH", not "SUBTLEC-CH"?

June 15, 2018 at 04:58 PM

Good to see 3 different punctuation marks (~-!) making it into the Weibo Top 20.

June 15, 2018 at 06:25 PM

Character frequency list, or word frequency list?

The former is useless while the latter is something I've been needing for a while.

June 16, 2018 at 05:18 AM

10 hours ago, vellocet said:

while the latter is something I've been needing for a while

You can try making your own.

The higher your level of Chinese, the less useful general frequency lists will be, because your reading habits are usually limited to a set of specific fields and topics so the frequencies from a general list won't be consistent with the frequencies of words that you encounter in what you're reading.
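A minimal sketch of building such a personal list, assuming the text you read has already been segmented into words separated by spaces (segmenting raw Chinese text needs a separate tool and is not shown here). The function name `word_frequencies` and the sample sentences are illustrative only.

```python
from collections import Counter


def word_frequencies(segmented_texts):
    """Count words across texts whose words are already separated by whitespace."""
    counts = Counter()
    for text in segmented_texts:
        counts.update(text.split())
    # most_common() returns (word, count) pairs sorted by descending count
    return counts.most_common()


# Made-up, pre-segmented sample sentences.
texts = ["我 喜欢 看 侦探 小说", "他 也 喜欢 看 小说"]
for word, count in word_frequencies(texts):
    print(word, count)
```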

June 16, 2018 at 08:21 AM

13 hours ago, vellocet said:

Character frequency list, or word frequency list?

These are word frequency lists.

3 hours ago, imron said:

because your reading habits are usually limited to a set of specific fields and topics so the frequencies from a general list won't be consistent with the frequencies of words that you encounter in what you're reading.

I completely agree, but I think it really depends on one's goals. If you are mainly interested in reading crime fiction in Chinese, a general frequency list is the wrong choice. However, if your goal is as general as "being able to read newspapers", a general frequency list can be very beneficial.

June 16, 2018 at 11:31 AM

3 hours ago, johnvaradero said:

a general frequency list can be very beneficial.

It's still more beneficial to get frequency statistics from what you are reading. Even when reading a general newspaper, each person will have a different preference for the types of stories they read and for the newspapers they read, and both of those things have a significant impact on word frequencies, especially once you can understand a large portion of the words in any given piece of text, say more than 80% (which is still a very low percentage and won't allow for reading without dictionary help).
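The 80% figure can be read as token coverage: the share of running words in a text that the reader already knows. A small sketch of that calculation, assuming pre-segmented text; `coverage` is a hypothetical helper and the sample sentence is made up.

```python
def coverage(tokens, known_words):
    """Return the fraction of running words (tokens) that are in known_words."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in known_words)
    return known / len(tokens)


# Made-up example: the reader knows 4 of the 5 running words.
article = "我 喜欢 看 侦探 小说".split()
print(coverage(article, {"我", "喜欢", "看", "小说"}))
```

Even at this level, one word in five would still need a dictionary lookup.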

June 16, 2018 at 12:46 PM

1 hour ago, imron said:

It's still more beneficial to get frequency statistics from what you are reading.

Reading certain types of newspaper stories or certain types of newspapers are specific goals. In order to achieve these kinds of specific goals, creating your own frequency lists with the Chinese Text Analyzer (love that program btw, great work) is the way to go.

But a general goal such as being able to read newspapers (meaning at least the common ones, meaning all of their articles) can be achieved most efficiently if one is guided by a general news frequency list. That doesn’t have to mean just learning the words from that list; it can also mean to use the list (as a Pleco user dictionary) in order to determine which ones of the unknown words in an article to learn and which ones not to learn.
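That filtering step could look roughly like the sketch below: given the unknown words from an article and a frequency-ranked word list, keep only the words that fall within a chosen rank cutoff. The function name, the cutoff and the sample words are all made up for illustration and are not real BCC ranks.

```python
def words_worth_learning(unknown_words, ranked_words, max_rank):
    """Keep unknown words that appear within the top max_rank of a frequency list.

    ranked_words: words ordered from most to least frequent.
    Returns the selected words sorted from most to least frequent.
    """
    rank_of = {}
    for rank, word in enumerate(ranked_words, start=1):
        rank_of.setdefault(word, rank)  # keep the first (best) rank on duplicates
    selected = [w for w in unknown_words if rank_of.get(w, max_rank + 1) <= max_rank]
    return sorted(selected, key=rank_of.get)


# Made-up ranking and article vocabulary.
ranked = ["的", "是", "经济", "改革", "饕餮"]
unknown = ["饕餮", "改革", "貔貅"]
print(words_worth_learning(unknown, ranked, max_rank=4))
```

Words missing from the list, or ranked below the cutoff, are simply left out, which matches the idea of skipping rare words until they come up again.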

Jun Da concluded in this paper (page 15) that in order to achieve the general goal “reading news for information”, as a first step learners have to “acquire around 20,000 high to medium frequency words and phrases used in journalistic Chinese”. According to his findings, if one is guided by a general news frequency list while acquiring these 20,000 words and phrases, one should reach a basic comprehension rate that allows for extensive reading of all kinds of news articles and subsequent extension of the passive vocabulary through exposure.

However, if one learns 20,000 words based on one's own frequency lists, built from the kinds of news stories one usually likes to read, one will greatly improve the ability to read these kinds of stories. But regarding the general goal of reading newspapers/reading news for information, one’s comprehension rate will be worse than the comprehension rate that could have been achieved by learning 20,000 words based on a general news frequency list; in particular, one will not achieve the basic comprehension rate that is required according to Jun Da.

It’s a different question whether it’s a good idea to have such general goals. Pursuing specific goals is probably more motivating and allows for faster progress.

June 17, 2018 at 06:20 AM

@johnvaradero well thought out opinion. Please post more. I want to read more of your opinions.

June 17, 2018 at 08:01 AM

@艾墨本 It's an honor to hear that from you. A few days ago I voted for you in the 汉教英雄会, good luck with that!

June 17, 2018 at 09:38 AM

20 hours ago, johnvaradero said:

it can also mean to use the list (as a Pleco user dictionary) in order to determine which ones of the unknown words in an article to learn and which ones not to learn

I agree wholeheartedly with this.

johnvaradero

furiop

Shelley

johnvaradero

Jim

Shelley

大块头

roddy

vellocet

imron

johnvaradero

imron

johnvaradero

艾墨本

johnvaradero

Guest realmayo
