Guide to reading Chinese fiction, from absolute beginner to beyond HSK 6

January 25, 2022 at 12:13 PM

I’ve collaborated with a few fellow Chinese language learners to put together a document focusing on reading Chinese fiction, especially webnovels. Webnovels are extremely popular in China (many are adapted into anime, manga, audiobook/drama and TV shows), and are easily accessible online (both for free and paid).

We have divided the document into levels by character count and HSK level. We did our best to fill each section with useful resources and tips to help guide you on your Chinese reading journey. The resources in each level are ones we've personally used and found useful.

We are aware that the levels may not be perfect, and using character count may not work for everyone; however, it's one measure that most people will be able to relate to.

You can find the resource here: https://docs.google.com/document/d/e/2PACX-1vSjVsapt4NOZx0KuDwgBUfQggTyT15hdgUjHHdqZRnV8LTnzQ5lY-fKjJhV0cb7I06q3x_syq1DyE4H/pub

Hope you find it useful!

January 25, 2022 at 12:25 PM

Thanks.

I would however not rank the levels according to known character count, but word count.

(If someone studied 3000 characters by Heisig method and/or Anki and was then asked to read a novel or news site, it would not be pretty at all.)

January 25, 2022 at 12:30 PM

This is excellent. One tip: define "manhua." I have never run across that word before and had to look it up.

Also, is there a way to download the document? I don't normally use Google docs and did not see a download link.

January 25, 2022 at 12:30 PM

On 1/25/2022 at 12:25 PM, Jan Finster said:

I would however not rank the levels according to known character count, but word count.

(If someone studied 3000 characters by Heisig method and/or Anki and was then asked to read a novel or news site, it would not be pretty at all.)

Character count isn't the best way, but we can't think of another, as not everyone follows HSK, and using words like "beginner" or "intermediate" is meaningless as well. If someone does decide to only study 3000 characters and nothing else, then thinks they can read a novel, well ...... it's probably a good wake-up call when they realise that's impossible. However, having said all that, do you have an idea how we can rank the content?

January 25, 2022 at 12:35 PM

On 1/25/2022 at 12:30 PM, Moshen said:

Also, is there a way to download the document? I don't normally use Google docs and did not see a download link.

You might be able to save it as a PDF by going to File > Print > Save as PDF. We often update the resources (we find better ones, or some get taken down), so I would recommend coming back to check on it every now and then.

January 25, 2022 at 02:39 PM

Looks really good! I'll have to take a really good look!

On 1/25/2022 at 2:30 PM, MoonIvy said:

However, having said all that, do you have an idea how we can rank the content?

How about a word frequency?
You could take a frequency list like this: https://www.plecoforums.com/threads/word-frequency-list-based-on-a-15-billion-character-corpus-bcc-blcu-chinese-corpus.5859/

And then analyze each text and calculate the weighted average frequency of all the words in the text. The end result would be a score that tells you the relative "easiness" of the text based on the word frequencies. The more there are frequently used words in the text, the greater the score and, in theory, the easier the text.
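
A minimal sketch of that weighted average in Python. It assumes the text has already been split into words and that the frequency list has been loaded into a dictionary. The function name, the toy numbers and the choice to score unlisted words as 0 are all illustrative assumptions, not part of the BCC list or any existing tool.

```python
from collections import Counter

def easiness_score(tokens, corpus_freq):
    """Average corpus frequency of the words in a text, weighted by how
    often each word occurs in that text. Higher means more common words."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Assumption: words missing from the frequency list count as frequency 0.
    weighted = sum(corpus_freq.get(word, 0) * n for word, n in counts.items())
    return weighted / total

# Toy data: made-up frequencies, for illustration only.
toy_freq = {"我": 5000, "的": 9000, "书": 300, "饕餮": 1}
easy_text = ["我", "的", "书"]
hard_text = ["我", "的", "饕餮"]

print(easiness_score(easy_text, toy_freq))
print(easiness_score(hard_text, toy_freq))
```

With raw frequencies, a handful of very common words dominates the average, so scores for real texts may end up close together.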

January 25, 2022 at 03:43 PM

@Jan Finster ah, I misread your message; you suggested ranking it by word count. We thought about this when we first started the doc and came across the following problem.

When analysing Chinese text, every variation counts as a word. For example, a computer would count 第一次，第二次，第三次 and 一棵树，两棵树 all as different words; however, as a learner, you don't need to know them all individually. Another example is 走, 走着, 走着走着: a computer would count all these as different words.

We had previously analysed many native webnovels and found the word count to be extremely high (150k-300k); even children's books for 6-7 year olds were around 5-8k. If we were to change the ranks, it'd be something along the lines of 5,000, 50,000, 100,000, 200,000+. They're really high numbers, and I feel that'll just scare people.

We had a word count column in our reading resource: https://docs.google.com/spreadsheets/d/e/2PACX-1vTRuZcZySCSm6NRzXpXKbjp6KX5vWlqndQVNNYsmvpE9nJNpcYC9G-A8nt2BVhPdc8vzg6BRz2HuYyx/pubhtml# but the numbers were so high that we eventually got rid of it (besides the 儿童 section) because it just looked so scary to see 150k-300k.

Also, not many people keep an accurate list of their known words (they need some sort of tool for that, which not everyone has), and most people will probably go by the number of words in their flashcard deck, which doesn't reflect the actual number of words they know, as many words are learnt naturally from consuming content (or they are variations like 第一次，第二次，第三次, 星期一，星期二).

It's why we settled on character count, as people can use tools to get a super rough estimate of how many characters they know.

This is an issue we've pondered for a while, and we had previously gone back and forth on it. Maybe a written description of what it means to be at each level, and the sort of content people at that level can comfortably consume?

At the moment, the resources that sit in each section are placed there based on experience from myself and a few others.

January 25, 2022 at 04:09 PM

On 1/25/2022 at 10:39 PM, alantin said:

And then analyze each text and calculate the weighted average frequency of all the words in the text. The end result would be a score that tells you the relative "easiness" of the text based on the word frequencies. The more there are frequently used words in the text, the greater the score and, in theory, the easier the text.

Too complicated if you ask me. The Flesch readability score for English takes only three easy-to-obtain numbers: total sentences, total words, total syllables.
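
For reference, a small sketch of the standard Flesch Reading Ease formula for English, which uses exactly those three numbers. The function name is just an illustrative choice; how you count sentences, words and syllables is left to the caller.

```python
def flesch_reading_ease(total_sentences, total_words, total_syllables):
    """Flesch Reading Ease for English text: higher scores mean easier text."""
    if total_sentences <= 0 or total_words <= 0:
        raise ValueError("need at least one sentence and one word")
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# One 10-word sentence made only of one-syllable words.
print(flesch_reading_ease(1, 10, 10))
```

Longer sentences and longer words both pull the score down, which is the part that does not carry over to Chinese directly.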

As a side note, a high number of one-syllable words as well as four-syllable words may indicate an archaic, semi-classical style, so word length won't work in Chinese as it does in English.

If only someone could develop a method, or better still, build a website, that works for Chinese!

On 1/25/2022 at 8:30 PM, MoonIvy said:

On 1/25/2022 at 8:25 PM, Jan Finster said:

I would however not rank the levels according to known character count, but word count.

(If someone studied 3000 characters by Heisig method and/or Anki and was then asked to read a novel or news site, it would not be pretty at all.)

Character count isn't the best way, but we can't think of another, as not everyone follows HSK, and using words like "beginner" or "intermediate" is meaningless as well. If someone does decide to only study 3000 characters and nothing else, then thinks they can read a novel, well ...... it's probably a good wake-up call when they realise that's impossible.

Yeah, I reached the same conclusion when I started the short-lived First Chapter Project. There's not only the problem of word segmentation, there's also the fact that the Chinese school system actually uses character count as goals. Of course native kids aren't learning characters the Heisig way. They're asked to use the character they learned in a word (组词) and use the word in a sentence (造句), not to mention copying out the character by hand - all the things an adult learner hates. So if you have learned characters properly, the number of known characters is a good indicator of your literacy level.

January 25, 2022 at 04:50 PM

That guide looks pretty good! My comment would be that you should bump up the character count by 500 at each level.

I didn't feel comfortable starting reading native materials until 2500, and didn't feel like I was confident until >3500.

When I was reading subtitles / comics, I didn't feel comfortable starting until I got to 1500. Before then I was just picking at random fragments. I didn't recognize enough to piece together much meaning.

When a book has 1500 unique characters, I estimate you need to know about 2500-3000 characters off the frequency list to catch that book's particular subset of 1500 (I basically learned off the frequency list). I consider that subset of 1500 almost like the book's/author's character fingerprint. That's why when you read sequels or more books by that author, it tends to be much easier, cause the second time around, you already know the author's favored subset.

On 1/25/2022 at 10:39 PM, alantin said:

And then analyze each text and calculate the weighted average frequency of all the words in the text. The end result would be a score that tells you the relative "easiness" of the text based on the word frequencies. The more there are frequently used words in the text, the greater the score and, in theory, the easier the text.

This is very similar to the concept of information "entropy" or "surprisal". Basically you measure how surprised you are to find a character/word of a certain frequency in the book. If words that are supposed to be common turn up in a book, it's not much of a surprise. If they're supposed to be rare but appear often, it's a "surprise." Mathematically, it's measured by using the negative log of the expected frequency.

High surprisal generally means a book contains a lot of "information", and is harder to read. Content with low surprisal is easier to process. Easy books are those that have a lot of common words arranged in a very common order that is frequently seen, so you're never surprised when reading them.

https://en.wikipedia.org/wiki/Information_content

For a book, you can do it at the character level or at the word level. For the word level, I'd take a list of, say, 5000 common words, and then have a single bucket for everything not on the list of 5000.

You then add up the "surprisal" measure (negative log of the frequency) of each unit (char/word) in the book and divide by the total number of units in the book.
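
A rough sketch of that calculation in Python, assuming the expected frequencies are given as probabilities and that whatever probability is left over goes to the single "everything else" bucket. The function name and the toy probabilities are made up for illustration and do not come from any real frequency list.

```python
import math
from collections import Counter

def mean_surprisal(units, expected_prob):
    """Average surprisal in bits per unit (character or word).

    expected_prob maps each listed unit to its expected relative frequency.
    The probability mass not covered by the list is assigned to one shared
    bucket used for every unit that is not on the list.
    """
    other_prob = 1.0 - sum(expected_prob.values())
    if other_prob <= 0:
        raise ValueError("listed probabilities must sum to less than 1")
    if not units:
        raise ValueError("text is empty")
    counts = Counter(units)
    total_bits = sum(
        -math.log2(expected_prob.get(unit, other_prob)) * n
        for unit, n in counts.items()
    )
    return total_bits / len(units)

# Toy probabilities (hypothetical): the remaining 0.25 is the "other" bucket.
toy_prob = {"的": 0.5, "书": 0.25}
print(mean_surprisal(["的", "的", "书", "龘"], toy_prob))
```

Passing a plain string instead of a list scores the text character by character; passing a list of segmented words scores it at the word level.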

I've thought of doing this myself at some point, but have also thought of seeing if someone else has already done it. Someone must have done it already because entropy/surprisal is a very common measure in multiple scientific fields, but I haven't seen it in casual googling.

I almost expect to find it in some linguistic / AI library or on github somewhere, but haven't run across it yet. Maybe @AntonOfTheWoods knows somewhere this might be available?

January 25, 2022 at 05:35 PM

On 1/25/2022 at 6:50 PM, phills said:

This is very similar to the concept of information "entropy" or "surprisal". Basically you measure how surprised you are to find a character/word of a certain frequency in the book. If words that are supposed to be common turn up in a book, it's not much of a surprise. If they're supposed to be rare but appear often, it's a "surprise." Mathematically, it's measured by using the negative log of the expected frequency.

Wow! I didn't know that concept when I wanted to compare the difficulty of the chapters in my books and came up with that weighted average. I personally do it at character level though because there are a lot less of them to compare. I'll have to look into surprisal and change my measures!

January 25, 2022 at 07:33 PM

On 1/25/2022 at 4:43 PM, MoonIvy said:

We had previously analysed many native webnovels and found the word count to be extremely high (150k-300k); even children's books for 6-7 year olds were around 5-8k. If we were to change the ranks, it'd be something along the lines of 5,000, 50,000, 100,000, 200,000+. They're really high numbers, and I feel that'll just scare people.

I am not sure how you get to 150-300K individual words (???) (the total word count is not relevant).

I could imagine you could go in steps of 2.5K, 5K, 10K, 15K, 20K and 25K+ unique words for the levels. This is of course not an exact science, but numbers like these come up on this forum again and again (e.g. https://www.chinese-forums.com/forums/topic/61248-reading-material-chasm/?do=findComment&comment=480572). I believe Imron also said something along the lines of 5K to start, 10K for easy novels and 20K+ for pretty much the rest (I hope I am remembering this correctly, otherwise I apologise for misquoting).

On 1/25/2022 at 4:43 PM, MoonIvy said:

When analysing Chinese text, every variation counts as a word. For example, a computer would count 第一次，第二次，第三次 and 一棵树，两棵树 all as different words; however, as a learner, you don't need to know them all individually. Another example is 走, 走着, 走着走着: a computer would count all these as different words.

I know, but does this really matter? Such words might constitute less than 5% of all unique words and they should average themselves out across the levels as you learn more words. In other words, you acknowledge them as a source of error, but this error is similar across all levels (maybe a bit higher at the 0-5K word level).

On 1/25/2022 at 5:09 PM, Publius said:

there's also the fact that the Chinese school system actually uses character count as goals. Of course native kids aren't learning characters the Heisig way. They're asked to use the character they learned in a word (组词) and use the word in a sentence (造句), not to mention copying out the character by hand - all the things an adult learner hates. So if you have learned characters properly, the number of known characters is a good indicator of your literacy level.

Yes, in that particular setting, I believe character count makes sense. For Chinese as a second language learners, it probably does not.

On 1/25/2022 at 5:09 PM, Publius said:

The Flesch readability score for English takes only three easy-to-obtain numbers: total sentences, total words, total syllables.

I wonder if sentence length also plays a role in Chinese?

(In German it certainly does and we are famous for creating those long-winded and somewhat confusing word strings with nested subclauses, but not only that, even nested subclauses within nested subclauses, so that at the end of a sentence you do not really know how it started, but, if you do, you can consider yourself equal to the author Thomas Mann, who was famous for creating such long sentences and whose books challenge the minds of the TikTok generation... (you get the gist)?)

January 25, 2022 at 08:09 PM

The Chinese also seem to enjoy writing pagefuls of text using only commas as punctuation....
But it's not the same as German. In their case they just seem to indicate that what's being said next is still related to what was said before the comma.

January 25, 2022 at 08:19 PM

On 1/25/2022 at 7:33 PM, Jan Finster said:

I am not sure how you get to 150-300K individual words (???) (the total word count is not relevant).

Oh sorry, my bad! Was looking at the wrong data. Native content is around 15k-30k, so your suggestion could work. Do you know of a tool that people can use to work out roughly how many words they know?

On 1/25/2022 at 4:50 PM, phills said:

I didn't feel comfortable starting reading native materials until 2500, and didn't feel like I was confident until >3500.

What sort of content did you read? Myself and members of our Discord have been reading Chinese webnovels since around 2k characters. We've found some relatively easy, modern, slice of life webnovels.

January 25, 2022 at 08:29 PM

On 1/25/2022 at 12:30 PM, Moshen said:

One tip: define "manhua." I have never run across that word before and had to look it up.

Didn't think about this! Thanks. I've added an explanation

January 25, 2022 at 08:43 PM

On 1/25/2022 at 9:19 PM, MoonIvy said:

Do you know of a tool that people can use to work out roughly how many words they know?

Chinese Text Analyser (by Imron, one of the mods) would be the obvious suggestion. I guess anyone who is serious enough about Chinese to learn 5K+ words can spend $15 on this great tool.

January 25, 2022 at 08:50 PM

On 1/25/2022 at 3:43 PM, MoonIvy said:

At the moment, the resources that sit in each section are placed there based on experience from myself and a few others.

Perfect! If there was a useful way to quantify difficulty precisely, you or someone else would have done it already. So just go with your judgement and your experience, and perhaps explain your reasoning - but don't get sucked into too much left-brain analysis. People who are really axed to do the analysis will have their own preferred tools and parameters.

On 1/25/2022 at 8:43 PM, Jan Finster said:

Chinese Text Analyser (by Imron (one of the mods)) would be the obvious suggestion

Agreed.

January 25, 2022 at 11:01 PM

welp. i just followed your desktop OCR guide, and wanna say. THANK YOU!!!

I've been searching for a few weeks for something like pleco ocr and actually used the pleco live feature to read characters on my PC....

This seems especially useful for games, an area where my knowledge of vocab is especially poor

January 25, 2022 at 11:38 PM

On 1/25/2022 at 8:50 PM, realmayo said:

Perfect! If there was a useful way to quantify difficulty precisely, you or someone else would have done it already. So just go with your judgement and your experience, and perhaps explain your reasoning - but don't get sucked into too much left-brain analysis. People who are really axed to do the analysis will have their own preferred tools and parameters.

Yeah, it is hard to divide up levels; even within a level, each person's ability to comprehend the content will vary.

January 26, 2022 at 12:25 AM

On 1/26/2022 at 3:33 AM, Jan Finster said:

Yes, in that particular setting, I believe character count makes sense. For Chinese as a second language learners, it probably does not.

Yet they aim to one day read like a native. When in Rome... use hand gestures, is all I'm saying.

On 1/26/2022 at 3:33 AM, Jan Finster said:

we are famous for creating those long-winded and somewhat confusing word strings with nested subclauses, but not only that, even nested subclauses within nested subclauses, so that at the end of a sentence you do not really know how it started

Mark Twain summed it up pretty neatly. My first encounter with the German language was in the late 80s, when the West German embassy used to give out free textbooks (Auf Deutsch Gesagt, Familie Baumann) if asked politely. When I ventured into the literary world, however, I was like, wtf, don't shoot! I surrender! (Young Werther was the culprit, I believe.)

January 26, 2022 at 06:52 AM

On 1/26/2022 at 4:19 AM, MoonIvy said:

What sort of content did you read? Myself and members of our Discord have been reading Chinese webnovels since around 2k characters. We've found some relatively easy, modern, slice of life webnovels.

I was halfway through the HSK6 character list (which goes up to ~2700), when I started reading 活着. I was somewhere in the low/mid 2000s. I decided to pre-memorize all the unknown chars in the book (about ~250), before I started, which took me up to around 2500.

Then I tried reading other books, and I decided to pre-memorize the ~100-200 unknown chars in each book before I started. Basically I did this until I got to about 3500 chars (~4 books), then it wasn't worth the effort anymore.
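
One way this kind of pre-reading check could be sketched in Python, assuming you have the book as plain text and your known characters as a set. The function names and the sample sentence are illustrative only, and the CJK test covers just the basic Unified Ideographs block.

```python
from collections import Counter

def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"

def unknown_chars(text, known):
    """Return (number of unique characters in the text,
    list of (char, count) for unknown characters, most frequent first)."""
    counts = Counter(ch for ch in text if is_cjk(ch))
    unknown = [(ch, n) for ch, n in counts.most_common() if ch not in known]
    return len(counts), unknown

# Toy example: a made-up sentence and a tiny "known" set.
sample_text = "我看书，我看报。"
sample_known = {"我", "看"}
unique_total, to_learn = unknown_chars(sample_text, sample_known)
print(unique_total, to_learn)
```

Sorting the unknown characters by how often they appear in the book shows which ones are worth memorizing first and which rare ones can be left to context.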

I never tried web-novels. Also, perhaps you could know fewer chars if you can tolerate more ambiguity. Even after 4 books, I could still find ~100 unknown chars in each new book, but I decided to just figure them out in context as they came along. But to get a comfortable cushion of characters, suitable for lots of different texts, I think you need 2500-3500.

At least for me, I get frustrated when people under-estimate the number of characters you need to know. I suppose they want to make reading seem more accessible. But the fewer characters you know, the harder the slog when you first read. If you've already memorized 2000, another 500 isn't too much more. I'd rather know more chars ahead of time, and have an easier time reading, than interrupt my reading constantly with a dictionary. That could just be my subjective preference though (I prefer extensive > intensive reading). Some people are more tenacious about getting through the slog.

I suppose you talk about the same thing in your "Reading Pain" section. Your 2000/3000 numbers are set at the Reading Pain level rather than the Intensive or Extensive Reading level. I'd boost them by 500 if you want less Reading Pain when you start.

On 1/26/2022 at 4:19 AM, MoonIvy said:

Do you know of a tool that people can use to work out roughly how many words they know?

http://www.zhtoolkit.com/posts/tools/ does that, but strangely it seems busted now. I tried it a few months ago, and it worked. Maybe there's another site hosting it, or it'll recover by itself when they notice the bug.

Here's a nice graph from that site, summarizing test results:

http://www.zhtoolkit.com/posts/2011/06/skill-levels-quantified/
