Classical Chinese Frequency List


Posted

This list was generated from all the texts in the "Pre-Qin and Han" category of the ctext.org website, which includes all of the Classical Chinese corpus prior to the end of the Han dynasty (220 AD). The base file contains 12,236,622 characters. I took this massive data file (5,609 pages!) and ran it through an online character frequency counter, which found approximately 14,000 unique characters. After removing non-Chinese characters (, . ? ! 1 @ # [ 。、 etc.), I was left with a frequency-sorted list of 13,673 unique characters. I have included the raw frequency counts, tab-separated, so that you can see how common each character is within Pre-Qin and Han texts when viewing the table.
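For anyone who wants to reproduce this kind of count locally rather than with an online tool, here is a minimal Python sketch of the same idea. It is not the tool used for this list. The function names and the Unicode ranges chosen to decide what counts as a "Chinese character" are assumptions for illustration, and the sample sentence just stands in for the real corpus file.

```python
from collections import Counter


def is_cjk_ideograph(ch: str) -> bool:
    """Return True for characters in the main CJK ideograph blocks.

    Assumption: these ranges are one reasonable definition of
    "Chinese character"; the original list may have been cleaned differently.
    """
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF        # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF     # Extension A
        or 0x20000 <= cp <= 0x2EBEF   # Extensions B through F
        or 0xF900 <= cp <= 0xFAFF     # CJK Compatibility Ideographs
    )


def char_frequencies(text: str) -> list[tuple[str, int]]:
    """Count ideographs only, returning (character, count) pairs, most frequent first."""
    counts = Counter(ch for ch in text if is_cjk_ideograph(ch))
    return counts.most_common()


def to_tsv(freqs: list[tuple[str, int]]) -> str:
    """Format the counts as tab-separated lines: character, tab, count."""
    return "\n".join(f"{ch}\t{n}" for ch, n in freqs)


# Short sample in place of the real corpus; punctuation is dropped by the filter.
sample = "學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?"
freqs = char_frequencies(sample)
print(to_tsv(freqs[:3]))
```

To run it on a real file, read the text with `open(path, encoding="utf-8").read()` and pass it to `char_frequencies`. A file of around 12 million characters should fit in memory on an ordinary machine.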

 

https://docs.google.com/spreadsheets/d/e/2PACX-1vTk5SxXG_n-V6erAG0dJJOu7SYQwxyO6pEk0lzp2rdfpfdmiT_b1mbiKJK2rzdJMjwLejug-amMY15Y/pubhtml?gid=1128595619&single=true

 

Classical Chinese Frequency List.txt
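If you want to use the attached file programmatically, the sketch below shows one way to ask "how much of the text do the top N characters cover?". It assumes each line is a character, a tab, then a count. I have not confirmed the exact column layout of the file, so adjust the parsing if it differs. The function names are made up for this example, and the demo counts are invented, not taken from the list.

```python
import io


def load_frequencies(lines):
    """Parse 'character<TAB>count' lines into (character, count) pairs.

    Assumption: the count is in the second column; lines that do not
    fit this shape (headers, blanks) are skipped.
    """
    rows = []
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 2 or not parts[1].strip().isdigit():
            continue
        rows.append((parts[0], int(parts[1])))
    return rows


def coverage(rows, top_n):
    """Fraction of all character occurrences covered by the top_n most frequent characters."""
    total = sum(n for _, n in rows)
    ranked = sorted(rows, key=lambda r: r[1], reverse=True)
    covered = sum(n for _, n in ranked[:top_n])
    return covered / total if total else 0.0


# Invented counts, only to show the calculation.
demo = io.StringIO("之\t50\n不\t30\n也\t20\n")
rows = load_frequencies(demo)
print(f"top 2 characters cover {coverage(rows, 2):.0%} of the demo text")
```

With the real file you would pass an open file handle instead of the demo, for example `open("Classical Chinese Frequency List.txt", encoding="utf-8")`, assuming the file is UTF-8.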

Posted

Very cool. Consider sharing it via Github, Archive.org, or some other platform with greater file longevity.

Posted

@大块头 I still don't understand GitHub to this day, but maybe someone who has an account can do it for me.

 

The only other frequency list for Classical Chinese that I can find is on the Jun Da website https://lingua.mtsu.edu/chinese-computing/statistics/char/list.php?Which=CL, but as people have pointed out in the past, it contains modern characters (的, 你, 个, etc.), so I think this one is better. Modern texts may have been mixed into the Jun Da data, whereas the raw data files from ctext contain only the original text, without annotations, translations, or commentaries.

 

I may use this to make an Anki deck with sentences containing these characters specifically for learning Classical Chinese, but I myself am still learning, so we'll see. Maybe someone will use this and beat me to the punch.

Posted

I've updated the original post because my word processor was only showing a count of 261,484 characters (the file was too big). The actual number is 12,236,622, which makes more sense. 13,673 is still the unique character count and the number of entries in the list.
