Friday Posted October 4, 2015 at 01:00 PM

I want to find some way to measure the difficulty level of some Chinese eBooks and arrange them from easiest to most difficult. Here is what I tried:

1. I ran the text through word-segmenting software to identify each unique word.
2. In the text, I replaced all HSK 1 words with "1", all HSK 2 words with "2", etc., and all words not in the HSK lists with "7".
3. I removed all remaining symbols, leaving a document containing only numbers from 1 to 7.
4. I calculated the mean of all the numbers in the document.

I thought this would give a rough estimate of the difficulty level. However, the calculations all came out very close to 3.5. There was little difference between a children's story that I can easily read and a challenging novel. Is there some way to improve this? Or a better process for determining relative difficulty?
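For reference, the procedure above comes down to only a few lines of code once the pieces are in place. A minimal Python sketch, assuming the jieba segmenter and plain-text word lists named hsk1.txt through hsk6.txt (both are assumptions for illustration, not part of the original post):

```python
import jieba  # assumed segmenter; any word-segmenting library would do

def load_hsk_levels():
    """Map each HSK word to its level (1-6); one word per line per file."""
    levels = {}
    for level in range(1, 7):
        with open(f"hsk{level}.txt", encoding="utf-8") as f:
            for word in f:
                levels.setdefault(word.strip(), level)
    return levels

def is_cjk(token):
    """Keep only tokens containing at least one Chinese character."""
    return any("\u4e00" <= ch <= "\u9fff" for ch in token)

def mean_hsk_score(text, levels):
    """Average level over all words; words outside the lists count as 7."""
    words = [w for w in jieba.cut(text) if is_cjk(w)]
    scores = [levels.get(w, 7) for w in words]
    return sum(scores) / len(scores) if scores else 0.0
```

As the replies below point out, this score tends to cluster near the middle of the scale because most words in native text fall outside the lower HSK bands.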
Shelley Posted October 4, 2015 at 01:05 PM

Yes, there is an easier and better way: try Chinese Text Analyser, here: http://www.chinese-forums.com/index.php?/topic/44383-introducing-chinese-text-analyser/

Hope this is the sort of thing you meant.
querido Posted October 4, 2015 at 01:23 PM

It looks like you won't get as fine an ordering as you aimed for, so some coarser measure, which should be easier to calculate, should be good enough. If you feed Chinese Text Analyser, recommended above, a list of the words you already know and have it analyse the text, it will give you a word count and a list of the words you don't know. A simple measure of difficulty, "% unknown words", then sounds pretty good.

Hey, don't waste too much time ordering the texts. :-)
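The "% unknown words" measure is trivial to compute once the text is segmented. A toy sketch, where `words` is the segmented text and `known` is a set built from your known-word list (both hypothetical names, building on the sketch above):

```python
def percent_unknown(words, known):
    """Share of running words (tokens) not in the known-word set."""
    unknown = [w for w in words if w not in known]
    return 100.0 * len(unknown) / len(words) if words else 0.0
```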
HerrPetersen Posted October 4, 2015 at 01:40 PM

I think your step 4 is not strict enough. I would test different algorithms for step 4 until things line up with your expectations. An even better idea, however, would be to just leave the over-analysing alone. Reading the first page of a book and judging its difficulty by feel is the way I deem most fitting.
roddy Posted October 4, 2015 at 04:43 PM

Look also at sentence/clause length. Total vocabulary items (your approach will rate a text containing every HSK 4 word as equally hard as a text with ten HSK 4 words repeated over and over). Word length. Paragraph length. Range of structures.
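Some of these surface features are cheap to compute alongside a vocabulary score. A rough sketch of two of them, mean sentence length and distinct-word ratio (the sentence-splitting regex is an assumption; clause length and range of structures would need real parsing):

```python
import re

def surface_features(words, text):
    """Mean sentence length in words, and distinct words / running words."""
    sentences = [s for s in re.split(r"[。！？!?]+", text) if s.strip()]
    mean_sentence_len = len(words) / len(sentences) if sentences else 0.0
    distinct_ratio = len(set(words)) / len(words) if words else 0.0
    return mean_sentence_len, distinct_ratio
```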
imron Posted October 5, 2015 at 12:57 AM

The other thing that Chinese Text Analyser does is keep track of your known vocabulary. It's all very well to sort by HSK lists, but that might not be a reflection of difficulty for you, especially as you start reading native content and your vocab starts to diverge from what you might find in the HSK lists.

One of the design goals of CTA was to give you a reasonably fast and accurate way to assess whether a given text is appropriate for your level. By keeping track of your known vocabulary, CTA can very quickly (e.g. less than a second for most books) give you an approximate number of unknown words in the text (I say approximate because minor errors in the segmentation will throw off the count slightly). You can also use it to export lists of unknown words (sorted by frequency and/or first occurrence), which can then be used to study words in advance of reading.

*Disclaimer: I'm the developer of Chinese Text Analyser.
edelweis Posted October 5, 2015 at 10:25 AM

I don't think the new HSK lists are appropriate for the process described in the OP. The lower levels contain too few words to be useful with native content; that's why everything averages out at 3.5. But maybe they could work better with the various levels of the Chinese Breeze series? You'd need different lists for native content, I think. Perhaps try the old HSK lists.

Edit: the new HSK lists don't even have words such as 星期天! Useless.
Silent Posted October 5, 2015 at 03:00 PM

My approach is the number of distinct (unknown) vocabulary items relative to the length of the text. This is a very inaccurate measure, but there is some correlation with difficulty. I think that if you combine it with sentence/clause length, as Roddy suggests, you should get pretty decent results. It's not an exact science, however. Especially when sentence length and unknown vocabulary are unevenly distributed, reality may be quite different. Dialogue, for example, is generally relatively easy, while narrative is often a bit harder. Specialist subjects may contain a relatively small but subject-specific set of rare vocabulary, etc.
c_redman Posted October 9, 2015 at 09:56 PM

I've made attempts at a worthwhile method, but my best result is a very rough approximation. The biggest issue is that while difficulty does come down to unknown words, sometimes those words are unimportant fluff, and sometimes a single word is the critical item that is the key to understanding a whole paragraph.

The OP's formula isn't a bad try. You would have got better granularity using 10^level rather than the raw HSK level, since the levels correspond to different windows of word frequency, which decrease exponentially. But the biggest problem is that many words aren't in the HSK lists, so they can't be ranked.

In the US, various algorithms are used to grade reading levels. One common one is Lexiles. According to "The Origin of the Lexile Specification Equation", the formula is:

Theoretical Logit = (9.82247 * LMSL) - (2.14634 * MLWF)

where LMSL = log of the mean sentence length and MLWF = mean of the log word frequencies. LMSL and MLWF are used as proxies for syntactic complexity and semantic demand (Stenner & Burdick, 1997). This raw score gets plugged into a normalization formula to get the official Lexile score of 0 to 2000+. So it's basically a weighted combination of sentence length and word frequency.

Flesch-Kincaid is another well-known one. However, it uses the number of syllables as a proxy for word difficulty, which may not apply so well to Chinese. Of the methods listed in Wikipedia, most have the number of syllables or letters in a word as a factor.

I will throw in one final reference, from Donald Hayes, who has come up with his own measures. I have yet to find a formula for his "LEX" computed readability. However, a related measure is his "MeanU" number, which is the average word frequency of all words in the text. This is an easy number to calculate if you are using text-analysis software that will list the corpus frequencies for every word.
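The Lexile-style raw score is straightforward to compute if you have per-word corpus frequencies. A sketch under stated assumptions: `sentences` is a list of word lists, `corpus_freq` maps each word to its relative corpus frequency, the log base is natural log (the published specification may use a different base), and the small floor avoids log(0) for out-of-corpus words:

```python
import math

def theoretical_logit(sentences, corpus_freq):
    """Raw Lexile-style score: 9.82247 * LMSL - 2.14634 * MLWF.

    Assumes non-empty input.
    """
    words = [w for s in sentences for w in s]
    lmsl = math.log(len(words) / len(sentences))  # log of mean sentence length
    mlwf = sum(math.log(corpus_freq.get(w, 1e-9))
               for w in words) / len(words)       # mean of log word frequencies
    return 9.82247 * lmsl - 2.14634 * mlwf
```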
Jan Finster Posted November 20, 2020 at 11:19 AM

I hope it is OK to piggyback this question on this thread. Regarding the difficulty of a given text: does the "95% or 97% known words" recommendation refer to "% of all words" or "% of unique words" (CTA)?
Luxi Posted November 20, 2020 at 04:41 PM

I'm with Roddy on this.

On 10/4/2015 at 5:43 PM, roddy said: "Look also at sentence / clause length..." etc., etc., etc.

If only it were a question of vocabulary! Vocabulary is the easy part with all the electronic aids at hand. The rest of it isn't so easy; one only really knows one can read a book when reading it. I have a heap of abandoned reading attempts waiting for the right time, mood and circumstances. But it's not just unknown words. See here:

她们说。她们以为自己是王。她们嘱我跟她们去看屋子，我去了。我看见屋子，它和它的那些房子朋友们排了一种它们自家高兴排的队，占满整条大街的两边，如一座林。大屋它独个儿凹在一个角落上，别的房子高，它矮；别的房子瘦，它胖；别的房子开朗活泼，它笨，又呆。这，我想起来了，它完全如同我阿果。它正在睡觉，我由得它去睡。天气不冷，但它缩做一团，灰色的外石墙，有如裹了一件厚极了的粗呢外套，加上麻点子的绒毛围巾，以及手套，以及袜子。屋子的楼下有铁闸，由五把锁把守在一起。闸内有大门，门上是弹簧锁。门内的一边是楼梯，每一级上可以让五个我并排挤在一起坐。

Can you understand this paragraph without reading it several times? Known words? Probably a high percentage for intermediate students, 75-80%? It's only the 2nd or 3rd paragraph in Chapter 1 of 《我城》 by the Hong Kong writer 西西. She writes like a child: simple, common words, no allusions or literary convolutions. But does she play with the language?

Most e-book sites have free samples that one can read even without opening an account. These can be quite generous, quite enough to check whether a book is up to one's standard (or patience).
imron Posted November 21, 2020 at 09:09 AM

21 hours ago, Jan Finster said: "Does the '95% or 97% known words' recommendation refer to '% of all words' or '% of unique words' (CTA)?"

All words. It's a measure of how much you'll understand paragraph to paragraph when reading the text.
Jan Finster Posted November 21, 2020 at 09:22 AM

7 minutes ago, imron said: "All words. It's a measure of how much you'll understand paragraph to paragraph when reading the text."

Thank you! I could not really find the answer anywhere, but what you say makes a lot of sense. Still, reading a book with 4% unknown words can boil down to 3,000 new unique words! Here is one I had in mind (沃顿商学院最受欢迎的谈判课):

Total words: 183,829
Known: 176,322 (95.92%)
Unknown: 7,507 (4.08%)

Unique words: 11,136
Known: 8,087 (72.62%)
Unknown: 3,049 (27.38%)

Piece of cake?
imron Posted November 22, 2020 at 10:31 AM

What happens to unique unknown when you take out all the unknown words that only appear once?
Jan Finster Posted November 22, 2020 at 12:29 PM

1 hour ago, imron said: "What happens to unique unknown when you take out all the unknown words that only appear once?"

That is another good point. It leaves me with about 1,000 words, and only about 400 of them occur 4 or more times.
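This filtering step is also easy to automate if you are working outside CTA. A sketch, reusing the hypothetical `words` and `known` inputs from the earlier snippets:

```python
from collections import Counter

def unknown_worth_learning(words, known, min_count=2):
    """Unknown words that occur at least min_count times in the text."""
    counts = Counter(w for w in words if w not in known)
    return {w: n for w, n in counts.items() if n >= min_count}
```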
imron Posted November 23, 2020 at 07:24 AM

Most Chinese texts have a long tail of low-frequency words that won't have a huge impact on understanding if you only see one of them per paragraph. You can usually safely ignore these words until you are reading a different text that uses them more frequently.
Lu Posted November 23, 2020 at 10:13 AM

On 10/4/2015 at 3:40 PM, HerrPetersen said: "An even better idea, however, would be to just leave the over-analysing alone. Reading the first page of a book and judging its difficulty by feel is the way I deem most fitting."

Keep this in mind. Using CTA and calculating words and whatnot is all good and well, but be careful that you don't spend more time analysing than just picking up a text, starting to read it, and putting it aside if it proves too difficult. This kind of analysing can be a version of the Textbook Pitfall, where a prospective learner keeps searching for the perfect textbook instead of just starting to study with any reasonably good one.