
HSK 3.0 ... new, new HSK?


Guest realmayo


2 hours ago, 叫我小山 said:

made an Anki deck combining all of the 普通话水平测试 vocabulary; the HSK3.0 vocabulary

Someone found my flashcard decks. Cool.

 

2 hours ago, 叫我小山 said:

then removed all of the individual characters

I wouldn't remove all of the individual characters, as even within the first few dozen entries on the PSC(2) list, 癌 and 庵 are characters that don't get featured as parts of longer words.

 

2 hours ago, 叫我小山 said:

I figure if you learn these words you probably won’t run into any words that you: a) won’t understand from context and from the characters encompassing it, or b) would find extremely rare and probably don’t need to know (as you might hear it less than once a year or something).

You'd probably want to include the first 2,000 idioms from this list as well, then.

Btw, what dictionary did you use for the Anki entries? CC-CEDICT? I've been looking at doing something similar, but wanted to have both of my dictionaries, the Oxford C-E and the 现代汉语规范词典, as entries. But there probably isn't an automated process you can use for that, is there? Also, for the idioms one would probably want to include the 多功能成语词典, which is just a treat.
 


5 hours ago, 叫我小山 said:

I made an Anki deck combining all of the 普通话水平测试 vocabulary; the HSK3.0 vocabulary; as well as the current HSK1-5 (as I’ve learnt up to that point now) then removed all of the individual characters (as I have a deck for that already), and it came to approximately 15,500 words. I figure if you learn these words you probably won’t run into any words that you: a) won’t understand from context and from the characters encompassing it, or b) would find extremely rare and probably don’t need to know (as you might hear it less than once a year or something)

 

Have you tested that assumption with actual books you checked in CTA?

 

I have made a CTA word list consisting of the more comprehensive old HSK (pre-2010), HSK 3.0 and 普通话水平测试.

 

Then I tested a NY Times bestseller:

 

    Total words      191,024
      Known          163,423  (85.55%)
      Unknown         27,601  (14.45%)
    Unique words      14,533
      Known            8,112  (55.82%)
      Unknown          6,421  (44.18%)
 

I think it is important to know that the 普通话水平测试 does not equal every word a Chinese college student knows. Rather, it is an extra set of words they need to know on top of the who-knows-how-many words that are assumed to be common knowledge.

If a college student knew only the words of the 普通话水平测试, they would struggle with said NY Times bestseller:

 

    Total words      191,024
      Known           84,019  (43.98%)
      Unknown        107,005  (56.02%)
    Unique words      14,533
      Known            3,784  (26.04%)
      Unknown         10,749  (73.96%)
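
In case anyone wants to reproduce this kind of coverage figure outside of CTA, here is a rough Python sketch of the calculation. It assumes the book has already been segmented into space-separated words and that the vocabulary sits in a one-word-per-line file; both file names are placeholders.

    # coverage.py - token/type coverage of a pre-segmented text against a word list.
    # book_segmented.txt and known_words.txt are placeholder file names.

    def load_words(path):
        with open(path, encoding="utf-8") as f:
            return [w for w in f.read().split() if w]

    known = set(load_words("known_words.txt"))      # the vocabulary marked as known
    tokens = load_words("book_segmented.txt")       # every word occurrence in the book
    types = set(tokens)                             # distinct words only

    known_tokens = sum(1 for t in tokens if t in known)
    known_types = sum(1 for t in types if t in known)

    print(f"Total words   {len(tokens)}")
    print(f"  Known       {known_tokens}  ({known_tokens / len(tokens):.2%})")
    print(f"Unique words  {len(types)}")
    print(f"  Known       {known_types}  ({known_types / len(types):.2%})")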
 


@Jan Finster A lot of those "unknown words" can be guessed by the reader due to knowing the constituent characters though... 73.96% unknown? Seems rather extreme. How many of those are proper nouns?

 

2 hours ago, Jan Finster said:

I think it is important to know that the 普通话水平测试 does not equal every word a Chinese college student knows. Rather, it is an extra set of words they need to know on top of the who-knows-how-many words that are assumed to be common knowledge.


Actually, the 普通话水平测试 is a test of words that are assumed to be common knowledge. It's made by the same commissions that created the HSK. It's essentially a glorified frequency list. Apart from lacking most common idioms and overemphasizing neutral tones and erhua, the PSC is literally just the HSK on steroids. There is a REALLY large overlap.


1 hour ago, Weyland said:

A lot of those "unknown words" can be guessed by the reader due to knowing the constituent characters though... 73.96% unknown? Seems rather extreme. How many of those are proper nouns?

 

Why should only nouns count? Personally, I find verbs and, to some extent, adjectives just as important.

 

The book I am talking about is non-fiction, but again a NY Times bestseller, so while there are some "technical terms", it is written for an educated general audience.

 

1 hour ago, Weyland said:

Actually, the 普通话水平测试 is a test of words that are assumed to be common knowledge

 

For fun, I ran the highly imaginative sentence 我是学生 against 普通话水平测试 and 学生 is not even on the list. I tried another example: 外面很阳光 and 阳光 is not on the list. So, this cannot possibly be a comprehensive reflection of common knowledge.

 

1 hour ago, Weyland said:

It's made by the same commissions that created the HSK. It's essentially a glorified frequency list. Apart from lacking most common idioms and overemphasizing neutral tones and erhua, the PSC is literally just the HSK on steroids. There is a REALLY large overlap.

 

When I run the old HSK (pre-2010), which is supposed to be the more challenging HSK version, against the 普通话水平测试 (with the latter being the reference word list), I get:

 

    Total words        8,583
      Known            3,551  (41.37%)
      Unknown          5,032  (58.63%)
    Unique words       8,583
      Known            3,551  (41.37%)
      Unknown          5,032  (58.63%)
 

How is this a large overlap? I was shocked!


18 minutes ago, Jan Finster said:

For fun, I ran the highly imaginative sentence 我是学生 against 普通话水平测试 and 学生 is not even on the list. I tried another example: 外面很阳光 and 阳光 is not on the list. So, this cannot possibly be a comprehensive reflection of common knowledge.


I don't know what list you're using, but here is 学生 at #5427. And here is 阳光 at #5526.

EDIT: Also, when used as an adjective, 阳光 can only describe something "open to the public" (e.g. an investigation) or someone who is "upbeat / cheery". Otherwise it means "sunshine" and is a noun. You're probably thinking of 晴朗, which is part of HSK 6 and the PSC.

...

 

18 minutes ago, Jan Finster said:

When I run the old HSK (pre-2010), which is supposed to be the more challenging HSK version, against the 普通话水平测试 (with the latter being the reference word list)


What lists are you even using? We have a list for the new HSK3.0 with 11,092 words. And the PSC is 17,055 words.

 

18 minutes ago, Jan Finster said:

Why should only nouns count? Personally, I find verbs and, to some extent, adjectives just as important.


It isn't about them "being important", but about how open-ended they are. You can coin a proper noun, like a name or an invention, and then use it from then on without it having to be "known" by anyone past or present. E.g. the town name "Llanfairpwll-gwyngyllgogerychwyrndrob" in Wales is a proper noun. Heck, every person's name on this forum is a proper noun.

So in the future, before you "see what the overlap is", you might want to first prune that list of words against, I don't know, a dictionary. If a word isn't in a popular, everyday dictionary, then it's probably either a proper noun or the program made a mistake when segmenting the words.
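
If you wanted to automate that pruning, something like the rough sketch below would do; it simply keeps the entries that appear as headwords in CC-CEDICT. The word-list file name is a placeholder, and cedict_ts.u8 is the standard CC-CEDICT download.

    # prune_list.py - drop entries that are not CC-CEDICT headwords (likely proper
    # nouns or segmentation errors). wordlist.txt is a placeholder file name;
    # cedict_ts.u8 is the standard CC-CEDICT release, with lines like:
    #   中國 中国 [Zhong1 guo2] /China/

    def cedict_headwords(path):
        words = set()
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("#"):
                    continue
                parts = line.split(" ", 2)
                if len(parts) >= 2:
                    words.update(parts[:2])   # traditional and simplified forms
        return words

    dictionary = cedict_headwords("cedict_ts.u8")

    with open("wordlist.txt", encoding="utf-8") as f:
        candidates = [w.strip() for w in f if w.strip()]

    kept = [w for w in candidates if w in dictionary]
    dropped = [w for w in candidates if w not in dictionary]
    print(f"Kept {len(kept)} words, dropped {len(dropped)} not found in CC-CEDICT.")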


1 hour ago, Weyland said:

I don't know what list you're using, but here is 学生 at #5427. And here is 阳光 at #5526.

 

I used this list: https://www.chinese-forums.com/forums/topic/37109-psc-普通話水平測試-vocabulary-list/

 

Do you have the whole list as txt or xls so I can compare where the mistakes are?

 

1 hour ago, Weyland said:
1 hour ago, Jan Finster said:

When I run the old HSK (pre-2010), which is supposed to be the more challenging HSK version, against the 普通话水平测试 (with the latter being the reference word list)


What lists are you even using? We have a list for the new HSK3.0 with 11,092 words. And the PSC is 17,055 words.

 

I was not talking about HSK 3.0, but about the old HSK 1.0. I found that list here:

https://www.chinese-forums.com/forums/topic/53566-old-hsk-vocab-lists/

 

So, basically I ran https://www.chinese-forums.com/forums/topic/37109-psc-普通話水平測試-vocabulary-list/ against https://www.chinese-forums.com/forums/topic/53566-old-hsk-vocab-lists/

 

Let me know if there are mistakes in those lists. 

 

 


I should first state that this deck is for personal use, so its purpose is tailored to me, but it contains only the PSC, HSK 3.0 and HSK 1-5 vocabulary. (Someone could add the 2010 HSK level 6 as well if they wanted to be thorough, but I assume any vocabulary in it would already be in the PSC and HSK 3.0.)

 

Weyland, I removed all the individual characters because the 3,000 most common ones will form the base knowledge, and any single characters beyond that which I encounter "in the wild" I will add manually.

 

成语 are a completely different matter, as it seems natives know multiple thousands of them. I have a 成语 book with over 10,000 entries, and my wife breezed through it and knew almost every one. They simply get far more exposure to them than a lowly foreign student. I will be adding the most frequent 2,000 as you suggested, because there aren't that many in this deck I have made.

 

I used Anki and then the Chinese Support add-on to add definitions. I believe it uses CC-CEDICT, but it's formatted in such a way that it's only really usable CN -> EN, i.e. for passive recognition (reading). If you were to put the definition on the front it would be very difficult, as there are sometimes 12-15 English definitions; it doesn't pick the three most common ones, for example.
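
If the long glosses bother anyone, one rough workaround outside the add-on is to pull the definitions straight from the raw CC-CEDICT file, keep only the first few senses, and import the result into Anki as a tab-separated file. A minimal Python sketch, where the deck and output file names are placeholders:

    # trim_glosses.py - look up each deck word in CC-CEDICT and keep only the
    # first three senses. deck_words.txt and deck_with_glosses.tsv are
    # placeholders; cedict_ts.u8 is the standard CC-CEDICT download.
    import csv

    def load_cedict(path):
        entries = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.startswith("#"):
                    continue
                parts = line.rstrip("\n").split(" ", 2)
                if len(parts) < 3:
                    continue
                trad, simp, rest = parts
                senses = rest.split("/")[1:-1]     # the glosses between the slashes
                entries.setdefault(simp, senses)
        return entries

    cedict = load_cedict("cedict_ts.u8")

    with open("deck_words.txt", encoding="utf-8") as f, \
         open("deck_with_glosses.tsv", "w", encoding="utf-8", newline="") as out:
        writer = csv.writer(out, delimiter="\t")
        for word in (line.strip() for line in f):
            if word:
                gloss = "; ".join(cedict.get(word, [])[:3]) or "NOT FOUND"
                writer.writerow([word, gloss])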

 

Jan Finster, is it possible that the NYT bestseller you are referring to is a translation EN -> CN? If so, it isn't an accurate reflection of 普通话 word frequency, as translations are usually simpler in terms of lexical variety. I read 小王子 and it seemed easier than it should have been; the translation was likely "simplified" in how it rendered certain nouns or verbs into Chinese.


2 hours ago, Jan Finster said:


[Screenshots of the linked lists, showing 学生 and 阳光]

Both 学生 & 阳光 are part of the lists you've linked.

 

 

2 hours ago, Jan Finster said:

Do you have the whole list as txt or xls so I can compare where the mistakes are?


Already linked them.

 


19 minutes ago, Weyland said:

Both 学生 & 阳光 are part of the lists you've linked.

 

Thanks! 

 

Then it is CTA's fault, or the lists were somehow corrupted before I included them in CTA. I basically copy/pasted all words of the 普通话水平测试 into CTA and marked them all as known. I saved this as a reference word list. Then I copy/pasted 我是学生 and 外面很阳光 into CTA and got the results I mentioned above.

 

3 hours ago, Weyland said:

It isn't about them "being important", but about how open-ended they are. You can coin a proper noun, like a name or an invention, and then use it from then on without it having to be "known" by anyone past or present. E.g. the town name "Llanfairpwll-gwyngyllgogerychwyrndrob" in Wales is a proper noun. Heck, every person's name on this forum is a proper noun.

So in the future, before you "see what the overlap is", you might want to first prune that list of words against, I don't know, a dictionary. If a word isn't in a popular, everyday dictionary, then it's probably either a proper noun or the program made a mistake when segmenting the words.

 

I think this is a valid remark, but it should not apply when you compare the word lists we are talking about. They should not contain a significant number of proper nouns. 

 

So, if I cannot trust CTA's comparison, do you have a method to determine how much overlap there is between:

HSK 1.0 (old HSK) vs 普通话水平测试

HSK 3.0 vs 普通话水平测试

HSK 2.0 vs 普通话水平测试

 

Thanks!

 


4 hours ago, Jan Finster said:

HSK 1.0 (old HSK) vs 普通话水平测试

HSK 3.0 vs 普通话水平测试

HSK 2.0 vs 普通话水平测试

 

Based on Weyland's word lists, here is the comparison back to back according to CTA. The "reference word list" (column) is set as 100% known, and then HSK 1.0 (pre-2010), HSK 2.0 (2010-2020), HSK 3.0 (2021+) and the PSC (Weyland's 普通话水平测试) are each copy/pasted into CTA. The numbers give the count of words not covered by the reference list:

 

                     Reference word list
              HSK 1.0   HSK 2.0   HSK 3.0       PSC
    HSK 1.0         -      4153      1673       896
    HSK 2.0       567         -       490       435
    HSK 3.0      3905      6308         -      2359
    PSC         12564     14078      7596         -
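
For what it's worth, the same kind of matrix can be computed directly from the word-list files with plain set differences, which sidesteps any re-segmentation. A rough Python sketch, with all file names as placeholders:

    # overlap_matrix.py - for each pair of lists, count the words in the row list
    # that are NOT in the column (reference) list, using plain set differences.
    # One word per line per file; all file names are placeholders.

    lists = {
        "HSK 1.0": "hsk_1.0.txt",
        "HSK 2.0": "hsk_2.0.txt",
        "HSK 3.0": "hsk_3.0.txt",
        "PSC":     "psc.txt",
    }

    sets = {}
    for name, path in lists.items():
        with open(path, encoding="utf-8") as f:
            sets[name] = {w.strip() for w in f if w.strip()}

    names = list(lists)
    print("".ljust(12) + "".join(n.rjust(10) for n in names))
    for row in names:
        cells = ["-" if row == col else str(len(sets[row] - sets[col])) for col in names]
        print(row.ljust(12) + "".join(c.rjust(10) for c in cells))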

 


I'd be cautious about how you interpret CTA's numbers: it's a super-useful tool but (unless it's improved recently) it doesn't do a perfect job of segmenting. That means it'll often tie two characters together as one (often very rare) word, rather than realising that the first character is, say, the last part of someone's name, and the second character belongs to another word or should be standing alone.

The result is it overestimates - I think - the number of rare vocabulary items ... and because they're so rare, you're unlikely to have studied them, making you more pessimistic than perhaps you should be about how much vocab you'll already know in a novel.

Best to use it as a relative figure: 'it says XX%, that puts it halfway between that really easy book I read last week and that tricky one from last month' for instance.


On 11/30/2020 at 8:30 AM, Jan Finster said:

I basically copy/pasted all words of 普通话水平测试  into CTA and marked all as known. I saved this as a reference word list.

This might be part of the problem - you're relying on CTA's segmenter when the words are already segmented by line. It would be better to import this list of words, rather than copy and paste and mark as known.

 

18 hours ago, realmayo said:

it doesn't do a perfect job of segmenting.

This is probably another part of the problem. If you're looking for more accurate segmentation (and don't mind if it takes a couple of minutes to process a book, compared with less than a second in CTA), I recommend using the Stanford Segmenter to segment the text, then feeding the segmented text into CTA.
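
Roughly, that pipeline might look like the sketch below. It assumes you have downloaded and unpacked the Stanford Segmenter distribution, which ships with a segment.sh script and the ctb/pku models; the directory and file names are placeholders.

    # segment_then_cta.py - run the Stanford Segmenter over a book and save the
    # space-separated output, which can then be opened in CTA.
    # SEGMENTER_DIR and the file names are placeholders.
    import os
    import subprocess

    SEGMENTER_DIR = "path/to/stanford-segmenter"   # where segment.sh lives
    INPUT = os.path.abspath("book.txt")
    OUTPUT = "book_segmented.txt"

    # segment.sh usage: segment.sh [ctb|pku] <file> <encoding> <k-best size>
    result = subprocess.run(
        ["./segment.sh", "ctb", INPUT, "UTF-8", "0"],
        cwd=SEGMENTER_DIR, capture_output=True, text=True, check=True)

    with open(OUTPUT, "w", encoding="utf-8") as f:
        f.write(result.stdout)

    print(f"Segmented text written to {OUTPUT}; open that file in CTA.")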


1 hour ago, Yadang said:

This might be part of the problem - you're relying on CTA's segmenter when the words are already segmented by line. It would be better to import this list of words, rather than copy and paste and mark as known.

 

19 hours ago, realmayo said:

it doesn't do a perfect job of segmenting.

This is probably another part of the problem. If you're looking for more accurate segmentation (and don't mind if it takes a couple of minutes to process a book, compared with less than a second in CTA), I recommend using the Stanford Segmenter to segment the text, then feeding the segmented text into CTA.

@imron I would be curious what you have to say about this. Also, based on how you created CTA, do you think the data I showed above is accurate or flawed?

On 11/30/2020 at 10:25 PM, Jan Finster said:

The numbers give the count of words not covered by the reference list:

 

                     Reference word list
              HSK 1.0   HSK 2.0   HSK 3.0       PSC
    HSK 1.0         -      4153      1673       896
    HSK 2.0       567         -       490       435
    HSK 3.0      3905      6308         -      2359
    PSC         12564     14078      7596         -

 


CTA trades accuracy for speed, and produces figures that are ballpark correct. 
 

This is useful enough for extracting frequently occurring unknown words and for comparing texts to get an idea of relative difficulty - the two main design goals of CTA. 

 

I would like to improve the segmenter, but doing so would involve a non-trivial amount of effort for little extra improvement in the two areas. 
 

I generally agree with what realmayo and yadang said. 


No.

 

The Stanford Segmenter is generally considered best-of-breed, but:

 

1. It's licensed under the GPL, which means I couldn't incorporate it into CTA without also releasing CTA (and its source) under the GPL.

2. Even if I did that, it's written in Java whereas CTA is written in C++, and the languages are not compatible enough to make this a useful endeavour.

3. Even if I did decide to glue them together (it's not impossible), the Stanford Segmenter is significantly slower than CTA and uses significantly more memory (both because of Java and because of the better segmentation algorithm), which would cause problems because several parts of CTA require a fast segmenter to work properly (e.g. real-time highlighting of text).

4. I could use the same algorithm (a CRF word segmenter) and write my own implementation in C++, which would be faster and unencumbered by the GPL; *however*, that falls back to what I said above about non-trivial effort for little improvement.

5. If I had the time to do such a thing (which I currently don't), that time would be better spent on other parts of CTA - e.g. the corpus feature, graphs of learnt words over time, etc. - which would provide greater value to users than a minor improvement to segmentation.


  • 5 weeks later...

[Stacked histogram of word frequencies]

 

Credit goes to @alanmd at HSK东西 for the idea of plotting word frequencies in a stacked histogram.

Word frequency data is largely based on the gigantic 15-billion-character BLCU corpus, along with some supplementation from the Lancaster Corpus of Modern Chinese and the SUBTLEX-CH word frequency listings.
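
For anyone who wants to make a similar plot themselves, here is a rough sketch with matplotlib. It assumes a frequency-ranked master word list (most frequent first) plus the vocabulary lists as one-word-per-line files; every file name is a placeholder, and the chart above is of course based on the corpora just mentioned rather than on these toy inputs.

    # freq_histogram.py - stacked histogram of where each vocabulary list's words
    # fall in a frequency-ranked master list. All file names are placeholders;
    # frequency_list.txt is assumed to be ordered from most to least frequent.
    import matplotlib.pyplot as plt

    def load(path):
        with open(path, encoding="utf-8") as f:
            return [w.strip() for w in f if w.strip()]

    rank = {w: i for i, w in enumerate(load("frequency_list.txt"))}

    vocab_lists = {"HSK 3.0": "hsk_3.0.txt", "PSC": "psc.txt"}
    samples = [[rank[w] for w in load(path) if w in rank] for path in vocab_lists.values()]

    plt.hist(samples, bins=40, stacked=True, label=list(vocab_lists))
    plt.xlabel("Frequency rank in the master list")
    plt.ylabel("Number of words")
    plt.legend()
    plt.tight_layout()
    plt.savefig("histogram.png", dpi=150)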


  • 2 months later...

That was a bit of an unusual situation, as those were rival exams being offered by different bodies (BLCU and Hanban). I'm not sure if you'll see HSK 2.0 and 3.0 offered at the same time in the same place, although there may well be chances to take HSK 3.0 while they're trialing it, or to pick and choose by switching between local exam centres during the roll-out. 

 

 

