gedawei Posted March 16, 2019 at 10:47 PM I came across the term 费解 (unintelligible, puzzled) recently. It's not in HSK. It seems like it could be a fairly common word, which led me to wonder: is there an authoritative source yet for finding out just how common or uncommon a Chinese term is (i.e., separately from how common a character is)? It seems like there'd be a market for an easily searchable online database of Chinese terms that ranks them by frequency of use.
Publius Posted March 17, 2019 at 02:59 AM Google Ngram is hardly an authoritative source, but it does give you a number, for example, 0.000080% for 费解 (in comparison, 费用 is 0.0140%, 175 times more common).
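The "175 times" figure follows directly from the two percentages quoted above. A minimal check, assuming only those two numbers (the variable names are illustrative, and `Decimal` is used so the division is not blurred by floating-point rounding):

```python
from decimal import Decimal

# Percentages as quoted in the post above.
pct_feijie = Decimal("0.000080")   # 费解
pct_feiyong = Decimal("0.0140")    # 费用

# How many times more common 费用 is than 费解.
ratio = pct_feiyong / pct_feijie
print(ratio)
```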
gedawei Posted March 17, 2019 at 04:33 AM Thank you!! “Google Ngram is hardly an authoritative source, but it does give you a number, for example, 0.000080% for 费解 (in comparison, 费用 is 0.0140%, 175 times more common).”
DavyJonesLocker Posted March 17, 2019 at 07:46 AM This might help: https://en.wiktionary.org/wiki/Appendix:Mandarin_Frequency_lists I didn't check where the data came from, how old it is, how valid it is, etc., but it's a start, I guess.
imron Posted March 17, 2019 at 09:28 AM The problem is that once you're at HSK 4 or above, the relative frequencies of each word will be significantly different depending on what you are reading. That means there is no authoritative source, and you're better off calculating the frequencies of words from content that you are reading. I made a tool that will help you do just that.
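As a rough illustration of calculating word frequencies from your own reading material, here is a minimal sketch. It is not imron's tool. The segmenter is a simple greedy longest-match against a word list, swapped in so the example has no dependencies; a real segmenter would do better. The names `segment`, `relative_frequencies`, the three-word `lexicon`, and the sample `text` are all made up for the example.

```python
from collections import Counter

def segment(text, lexicon, max_len=4):
    """Greedy forward longest-match segmentation against a word list."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words

def relative_frequencies(words):
    """Map each word to its share of all tokens in the text."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}

# Hypothetical toy data; in practice the text would be the book you are reading.
lexicon = {"费用", "费解", "令人"}
text = "费用令人费解费用"

words = segment(text, lexicon)
freqs = relative_frequencies(words)
for word, share in sorted(freqs.items(), key=lambda item: -item[1]):
    print(word, f"{share:.2%}")
```

Sorting the result by share gives a study list ordered by how often each word appears in that particular text, which is the point made above about frequencies depending on what you read.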
davidchenx Posted March 17, 2019 at 10:20 AM Hi, my name is David, I'm new, and this is my first post. There is one. I use Pleco, and it sorts the 词 (words) by usage frequency. For example, if you look up the word 人, it gives you the 词 containing 人 in that order. I'm not 100% sure, but it does look pretty much like that.
DavyJonesLocker Posted March 17, 2019 at 10:40 AM imron said: “The problem is that once you're at HSK 4 or above, the relative frequencies of each word will be significantly different depending on what you are reading. That means there is no authoritative source, and you're better off calculating the frequencies of words from content that you are reading. I made a tool that will help you do just that.” I agree with that sentiment. It's really not ideal to go through frequency tables on the off chance that you might come across a word. I'd probably go a little further and say that following the HSK lists up to HSK 5 (maybe) is at least a fairly OK path to take, but then it's high time to veer off. Above HSK 5, you really need a good reason (like sitting the exam) to continue along the HSK word list. In any case, bashing through a pile of flashcards before you read something is a pretty difficult task. It is useful for a chapter of a book, a text lesson, etc., if the list is limited to 50 or so words and you are certain you will read the text in the next few days. Looking at my deck now, it has 9618 words and only 3895 are from HSK 1-6. This spreadsheet came from all sorts of sources: WeChat, text messages, shopping apps, graded readers, textbooks, the odd movie, slang words people told me, etc. However, I probably only know two thirds of this list.
roddy Posted March 17, 2019 at 11:28 AM It all depends on the corpus. Some are more useful than others.
markhavemann Posted March 17, 2019 at 01:04 PM Some time ago I found a set of word lists put together by 北京语言大学 (Beijing Language and Culture University) showing word frequency in different areas: there is a list compiled from newspapers, literature, Weibo, and one or two other sources, plus a "global" one, which I assume is all of them together. I have uploaded it to a folder in the Google Drive that I use for the Transcription Project. You can access it here: https://drive.google.com/drive/folders/1w2BPsbMmuruTONmr4xy6s1CwG7IJ5CTd?usp=sharing There is also Subtlex, which uses only data from movie subtitles. No doubt the words are split using NLP, which probably gives pretty good results but probably not 100% accurate ones. I have in fact made a dictionary-type thing which I use myself for looking up words and seeing their frequency. Currently it only displays a percentage, which is the percentage of films that the word appears in, based on the Subtlex data that I linked to above, or no figure at all if the word is not part of the Subtlex dataset. I do at some stage hope to increase its scope and use a larger dataset with different categories. You are welcome to use it if you like; it's online here: http://www.sino-dex.com/ If you want to use it, you need to sign up to access it at all (not because I want anyone to sign up, but simply because I made it that way and haven't had a need to change it, since I'm the only one who uses it).
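The "percentage of films a word appears in" measure described above can be sketched as follows. This is only an illustration of the idea, not the code behind the site or the Subtlex data format: the function name `film_percentage` and the four toy "films" (each reduced to a set of already-segmented words) are hypothetical.

```python
def film_percentage(word, films):
    """Return the percentage of films whose subtitles contain `word`.

    `films` is a list of sets, one set of words per film. Returns None
    when the word occurs in no film, mirroring "no figure at all".
    """
    hits = sum(1 for film_words in films if word in film_words)
    if hits == 0:
        return None
    return 100.0 * hits / len(films)

# Hypothetical toy dataset: each set stands for the words in one film's subtitles.
films = [
    {"人", "费用"},
    {"人", "费解"},
    {"人"},
    {"人", "费用"},
]

for word in ["人", "费用", "费解", "你好"]:
    print(word, film_percentage(word, films))
```

Counting films rather than total occurrences means a word repeated many times in a single film does not look more common than one used once in every film.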
markhavemann Posted March 17, 2019 at 01:06 PM I clicked on Roddy's link only after posting, not realising that he already linked to it. Anyway, now it's been mentioned twice!
anonymoose Posted March 17, 2019 at 01:53 PM You can buy word frequency books in China. I'm not sure how common they are, but I have seen at least one. As others have pointed out, though, relative frequency will depend a lot upon what you are reading. I'm not sure how the books are compiled.
Bibu Posted March 17, 2019 at 03:10 PM I used to use the BCC corpus (bcc.blcu.edu.cn/) and CCL (http://ccl.pku.edu.cn:8080/ccl_corpus/index.jsp?dir=gudai). The BCC corpus is of better quality, but it has not been working for a while. There is now a third one, from an institute under the government's education department: http://corpus.zhonghuayuwen.org/
gedawei Posted March 17, 2019 at 03:30 PM Thanks, imron. I may just test it out. I'm at an HSK 5 or 6 level in terms of reading vocabulary. I recently picked up the novel 活着. A tool that would analyze the terms in the novel and help create word lists seems like it would be helpful. I guess I need to get the digital version though, haha. “The problem is, that once you're at HSK 4 or above, the relative frequencies of each word will be significantly different depending on what you are reading. That means there is no authoritative source, and you're better off calculating the frequencies of words from content that you are reading. I made a tool that will help you do just that.”
imron Posted March 17, 2019 at 10:52 PM gedawei said: “I guess I need to get the digital version though, haha.” I've recently been using https://www.shutxt.com/ to find digital copies of books, and it seems to be quite good. Here's 活着. The download links are on the left.