叫我小山 Posted January 14, 2023 at 10:01 PM This list was generated from all the texts in the "Pre-Qin and Han" category of the ctext.org website, which covers the Classical Chinese corpus up to the end of the Han dynasty (220 AD). The base file contains 12,236,622 characters. I took this massive data file (5,609 pages!) and ran it through an online character frequency counter, which found approximately 14,000 unique characters. After removing non-Chinese characters (, . ? ! 1 @ # [ 。、 etc.), I was left with a frequency-sorted list of 13,673 unique characters. I have included the raw frequency counts, tab-separated, so that you can see how common each character is within Pre-Qin and Han texts when viewing the table. https://docs.google.com/spreadsheets/d/e/2PACX-1vTk5SxXG_n-V6erAG0dJJOu7SYQwxyO6pEk0lzp2rdfpfdmiT_b1mbiKJK2rzdJMjwLejug-amMY15Y/pubhtml?gid=1128595619&single=true Attached: Classical Chinese Frequency List.txt
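For anyone who wants to reproduce this kind of count locally instead of with an online counter, here is a rough Python sketch. It is not the tool used for the list above. The function names are made up for illustration, and the Unicode ranges used to decide what counts as a "Chinese character" are an assumption, so the totals may not match the 13,673 figure exactly.

```python
from collections import Counter


def is_cjk_ideograph(ch):
    """Return True if ch falls in one of a few CJK ideograph blocks.

    Assumption: these four ranges are "Chinese characters" for our purposes.
    Punctuation such as 。 and 、 sits outside them and is dropped.
    """
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF        # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF     # CJK Extension A
        or 0x20000 <= cp <= 0x2A6DF   # CJK Extension B
        or 0xF900 <= cp <= 0xFAFF     # CJK Compatibility Ideographs
    )


def char_frequencies(text):
    """Count ideographs only and sort by descending count.

    Ties are broken by code point so the ordering is deterministic.
    """
    counts = Counter(ch for ch in text if is_cjk_ideograph(ch))
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def to_tsv(freqs):
    """Render (character, count) pairs as tab-separated lines."""
    return "\n".join(f"{ch}\t{n}" for ch, n in freqs)


# A short sample (opening of the Analects) stands in for the full corpus.
sample = "學而時習之,不亦說乎?有朋自遠方來,不亦樂乎?"
freqs = char_frequencies(sample)
print(to_tsv(freqs[:3]))
```

To run it on a real corpus, read the file with `open(path, encoding="utf-8").read()` and pass the result to `char_frequencies`. Filtering by code point range rather than by a list of punctuation marks means stray digits, Latin letters, and symbols are all dropped in one step.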
大块头 Posted January 14, 2023 at 10:35 PM Very cool. Consider sharing it via GitHub, Archive.org, or some other platform with greater file longevity.
叫我小山 Posted January 15, 2023 at 01:18 AM (Author) @大块头 I still don't understand GitHub to this day, but maybe someone who has an account can do it for me. The only other frequency list for Classical Chinese that I can find is on the Jun Da website https://lingua.mtsu.edu/chinese-computing/statistics/char/list.php?Which=CL, but as people have said in the past, it contains modern characters (的, 你, 个, etc.), so I think this one is better. There may have been modern texts mixed into the Jun Da data, whereas the raw data files from ctext contain only the original text, without annotations, translations, or commentaries. I may use this to make an Anki deck with sentences containing these characters, specifically for learning Classical Chinese, but I am still learning myself, so we'll see. Maybe someone will use this and beat me to the punch.
叫我小山 Posted January 15, 2023 at 03:20 PM (Author) I've updated the original post because my word processor was only showing a character count of 261,484 (the file was too big). The actual number is 12,236,622, which makes more sense. 13,673 is still the unique character count and the number of entries in the list.