Chinese-Forums

Is there an authoritative resource that measures how common terms 词 are in the Chinese language?


Posted

I came across the term 费解 (unintelligible, puzzled) recently. It’s not in HSK. Seems like it could be a fairly common word, so that led me to wonder: is there an authoritative source yet for finding out just how common or uncommon a Chinese term is (i.e., separately from how common a character is)? Seems like there’d be a market for an easily lookup-able database online of Chinese terms which provides some kind of ranking in terms of frequency of use. 

  • New Members
Posted

Hi my name is David, I'm new, and this is my first post.

 

There is one. I use Pleco, and it seems to sort 词 by usage frequency. For example, if you look up the character 人, it lists the 词 containing 人 in that order. I'm not 100% sure, but it does look that way.

 

pleco screenshot

 

  • Helpful 1
Posted
55 minutes ago, imron said:

The problem is, that once you're at HSK 4 or above, the relative frequencies of each word will be significantly different depending on what you are reading.

 

That means there is no authoritative source, and you're better off calculating the frequencies of words from content that you are reading.  I made a tool that will help you do just that.

 

I agree with that sentiment. It's really not ideal to go through frequency tables on the chance you might come across a word. I'd probably go a little further and say that following the list up to HSK 5 (maybe) is at least a fairly OK path to take, but then it's high time to veer off. Above HSK 5 you really need a good reason (like sitting the exam) to continue along the HSK word list. In any case, bashing through a pile of flashcards before you read something is a pretty difficult task. It is useful for a chapter of a book, a text lesson, etc., as long as it's limited to 50 or so words and you are certain you will read it in the next few days.

 

Looking at my deck now, it has 9618 words and only 3895 are from HSK 1-6. This spreadsheet came from all sorts of sources: WeChat, text messages, shopping apps, graded readers, textbooks, the odd movie, slang words people told me, etc. However, I probably only know 2/3 of this list.
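For anyone who wants to try the "count from what you actually read" idea, here is a rough sketch in Python. It assumes the text has already been split into words by some segmenter (the `tokens` list below is made up by hand), and `known_words` is a hypothetical stand-in for an HSK list or a flashcard deck. It is only an illustration, not how imron's tool works.

```python
from collections import Counter

def coverage(tokens, known_words):
    """Return (share of running text covered by known words,
    unknown words sorted from most to least frequent)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    known = sum(n for word, n in counts.items() if word in known_words)
    unknown = [(word, n) for word, n in counts.most_common()
               if word not in known_words]
    ratio = known / total if total else 0.0
    return ratio, unknown

# Toy, hand-segmented sample; a real run would use a whole chapter or book.
tokens = ["我", "觉得", "这", "本", "书", "很", "费解", "我", "觉得", "很", "难"]
known_words = {"我", "觉得", "这", "本", "书", "很", "难"}

ratio, to_learn = coverage(tokens, known_words)
print(f"{ratio:.0%} of the text is known; still to learn: {to_learn}")
```

The unknown list comes out ordered by how often each word shows up in that particular text, so the top of it is a reasonable place to pick the 50 or so flashcards worth making before reading.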

  • Thanks 1
Posted

Some time ago I found a set of word lists put together by 北京语言大学 showing word frequency in different areas. There are lists compiled from newspapers, literature, Weibo, and one or two other sources, plus a "global" one, which I assume is all of them together.

 

I have uploaded it to a folder in the Google Drive that I use for the Transcription Project. You can access it here: https://drive.google.com/drive/folders/1w2BPsbMmuruTONmr4xy6s1CwG7IJ5CTd?usp=sharing

 

There is also Subtlex which uses only data from subtitles of movies. 

 

No doubt the words are split using NLP which probably gives pretty good results but probably not 100%. 
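To show why automatic splitting can't be 100%, here is a toy example of one very simple technique, forward maximum matching: at each position, take the longest dictionary word that fits. I don't know what method the lists above actually used, so this is only an assumption for illustration, and the tiny `lexicon` is made up.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy segmentation: at each position take the longest lexicon
    entry that matches, falling back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words

# Hypothetical four-word lexicon, just enough to trigger the ambiguity.
lexicon = {"研究", "研究生", "生命", "起源"}
result = forward_max_match("研究生命起源", lexicon)
print(result)  # ['研究生', '命', '起源']
```

The intended reading is 研究 / 生命 / 起源 ("research the origin of life"), but the greedy match grabs 研究生 first. Real segmenters are much smarter than this, but mistakes of this kind still end up shifting the counts in any frequency list a little.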

 

 

I have in fact made a dictionary type thing which I use myself for looking up words and seeing their frequency.

 

Currently it only displays a percentage, which is the percentage of films that the word appears in, based on the Subtlex data that I linked to above, or no figure at all if the word is not part of the Subtlex dataset. I do at some stage hope to increase its scope and use a larger dataset with different categories. You are welcome to use it if you like; it's online here: http://www.sino-dex.com/

 

If you want to use it, you need to sign up to access it at all (not because I want anyone to sign up, but simply because I made it that way and haven't had a need to change it, since I'm the only one who uses it).
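In case the "percentage of films" figure is unclear, this is roughly the idea, sketched in Python with made-up data. Each film is reduced to the set of words found in its subtitles; the function name and the sample `films` are hypothetical and are not taken from the site's code or the real dataset.

```python
def film_percentage(word, films):
    """Percentage of films whose subtitles contain `word`,
    or None if the word appears in no film at all."""
    hits = sum(1 for words in films if word in words)
    if hits == 0:
        return None  # not in the dataset, so no figure is shown
    return 100.0 * hits / len(films)

# Toy data: one set of words per film.
films = [
    {"你好", "谢谢", "费解"},
    {"你好", "再见"},
    {"你好", "谢谢"},
    {"再见"},
]

for word in ["你好", "费解", "猫"]:
    print(word, film_percentage(word, films))
```

Counting films rather than total occurrences means a word repeated a hundred times in a single film doesn't look more common than it really is.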

  • Helpful 1
Posted

I clicked on Roddy's link only after posting, not realising that he already linked to it. Anyway, now it's been mentioned twice! 

  • Thanks 1
Posted

You can buy word frequency books in China. I'm not sure how common they are, but I have seen at least one.

 

As others have pointed out, though, relative frequency will depend a lot upon what you are reading. I'm not sure how the books are compiled.

  • Like 1
Posted

Thanks Imron. I may just test it out. I’m at an HSK 5 or 6 level in terms of reading vocabulary. I recently picked up the novel 活着. A tool that would analyze the terms in the novel and help create wordlists seems like it would be helpful. I guess I need to get the digital version though, haha.

 

“The problem is, that once you're at HSK 4 or above, the relative frequencies of each word will be significantly different depending on what you are reading.

 

That means there is no authoritative source, and you're better off calculating the frequencies of words from content that you are reading.  I made a tool that will help you do just that.”

Posted
7 hours ago, gedawei said:

I guess I need to get the digital version though, haha.

I've recently been using https://www.shutxt.com/ to find digital copies of books, and it seems to be quite good.

 

Here's 活着.  The download links are on the left.

  • Helpful 1
