Chinese-Forums

Posted

I have found many frequency lists for the most common Chinese characters, but is there any such list for the most common words (including multiple character words rather than just single characters)?  I have had a look around but haven't been able to find anything yet.

 

Thanks

Posted

There's not a lot out there.  One I can think of off the top of my head is this one.  The problem with these types of lists, however, is that they may or may not be relevant to your vocabulary and to the type of content you want to use.  The more advanced your level, the more likely this is to be true.

 

Take the above list, for example: it's generated from film subtitles, so it will be relatively good for words found in dialogue and spoken text, but relatively poor for words found in newspaper articles and novels.

 

If you'll excuse the shameless plug, I wrote a tool that lets you generate your own wordlists based on frequency and/or several other metrics, from any piece of Chinese text.  It will also keep track of your known vocabulary over time, so you can use it to export the top 10 unknown words from a given article, and so on.
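To illustrate the general idea of pulling the most frequent unknown words out of a text, here is a minimal Python sketch. It is not how the tool above works internally. It assumes a small word list (`lexicon`) is available for segmentation and uses naive greedy longest-match segmentation, which real segmenters improve on considerably. The function names, the `lexicon` and `known` sets, and the sample sentence are all hypothetical.

```python
from collections import Counter


def segment(text, lexicon, max_len=4):
    """Greedy forward longest-match segmentation.

    Tries the longest candidate first; any character that starts no
    lexicon word falls through as a single-character token.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def top_unknown_words(text, lexicon, known, n=10):
    """Return the n most frequent tokens that are not in `known`.

    Tokens whose first character lies outside the basic CJK block
    (punctuation, Latin letters, digits) are skipped.
    """
    counts = Counter(
        tok for tok in segment(text, lexicon)
        if tok not in known and "\u4e00" <= tok[0] <= "\u9fff"
    )
    return counts.most_common(n)


if __name__ == "__main__":
    lexicon = {"我们", "喜欢", "学习", "中文"}  # hypothetical word list
    known = {"我们"}                            # words already learned
    text = "我们喜欢学习中文，我们学习中文，我们学习。"
    print(top_unknown_words(text, lexicon, known, n=2))
```

Running this over whatever you are about to read, and adding the words you learn to `known`, gives a list tailored to that text rather than to a general corpus.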

  • 3 weeks later...
  • New Members
Posted

It's better for you to know the background of those words.

Posted

SUBTLEX-CH has its own issues, mainly that the corpus it is based on contains a lot of translated subtitles for American movies and shows. As a result, you'll notice that a ton of transliterations of English proper nouns show up. And who knows what other biases are introduced by the fact that it's all translated material. For example, maybe some common chengyu are ranked really low because they wouldn't be a natural way to translate anything from English. Overall I see this as a serious problem with this particular frequency list.

If your goal is just to have a list of words to study then I think the official HSK vocab lists work very well for that purpose.

Posted
Overall I see this as a serious problem with this particular frequency list.

It's a serious problem with any frequency list generated from content you are not currently reading.  Even things like the HSK vocab lists are very poor in this regard.

Posted

Well, some content will be more general and well-rounded than others. To me, a corpus that heavily features translated texts is particularly problematic from a linguistic standpoint, in a way that goes beyond the problem that every corpus will have of not being perfectly tailored to your own interests.

As for the HSK list, my opinion is that up through level 5 the words are frequent enough that it's worth it to learn the entire list straight through. At HSK 6 it starts getting murky and it becomes worth it to mine your own vocab from content that you are reading/watching.

Posted
in a way that goes beyond the problem that every corpus will have of not being perfectly tailored to your own interests.

Which is why I advocate creating a corpus perfectly tailored to your own interests! :mrgreen: (or, if not your own interests, then at least what you are currently reading).

 

At HSK 6 it starts getting murky

I think it gets murkier before that.  Yes, the words on the earlier lists have a high frequency in general texts, but there are also plenty of other frequent words that are not on these lists, and which ones those are will change quite significantly depending on what you are reading.

 

If you have a choice between learning words that might be relevant in a few months' time, or words that will be directly relevant that day, or in the coming days, then for me it's really a simple choice.  You'll end up learning all the HSK vocab eventually, just on a more random schedule.

  • 3 weeks later...
Posted

 

 

https://en.wiktionar...Frequency_lists

Thanks, iand, I've been looking for something like this for a while.

Noticed something interesting in the first list: 台灣 台湾 is no. 80? 

 

Academia Sinica (not "Academica", as it is spelled in the Wikipedia article) is a Taiwanese academic institution, so their corpus probably reflects Mandarin as it is spoken and written in Taiwan.

Posted

A few minutes' browsing of the list will show that its vocabulary is not only Taiwan-centered but also very old; I think I saw various Soviet references in there.

  • 1 month later...
Posted

P.S. Are you sure you want to use a frequency dictionary? It's so boring to nail down all the words in a list. I prefer reading a lot instead of learning words out of context... "enjoy the process" :D
