BearXiong Posted July 6, 2018 at 12:03 AM I created a user dictionary for Pleco so that when you look up a word, you'll also see the word's frequency ranking in 6 types of corpora: subtitles, weibo, blogs, literature, news and technology (see the screenshot below). The subtitle word frequencies are based on SUBTLEX, and the other 5 are based on the BLCU Chinese Corpus (see this thread). I thought others might find it useful. To install it, you'll need to purchase Pleco's flashcard add-on to access the user dictionary options. Next, download the zip file attached to this post and extract it to get a plecofreq.pqb file, which you'll need to put on your phone. In Pleco go to Settings -> Manage Dictionaries -> Add User -> Existing, then locate the plecofreq.pqb file. In the next post I'll attach a text file with the raw entries (it's too large to attach to this post). There are a total of 109207 entries, covering every word that is among the top 50,000 most frequent words in at least one of the 6 corpora. The pinyin of the words is based on the CEDICT dictionary, and is left blank if the word isn't in CEDICT. Attachment: plecofreq.zip
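The selection rule described above (keep a word if it is in the top 50,000 of at least one corpus) can be sketched as follows. This is a minimal illustration, not BearXiong's actual script: the function names, the toy counts and the tiny `top_n` are all made up for the example. It also shows a side effect of the rule: a word kept because of one corpus can carry a rank above the cutoff in another.

```python
from typing import Dict

def rank_words(counts: Dict[str, int]) -> Dict[str, int]:
    """Map each word to its 1-based rank (1 = most frequent)."""
    # Sort by descending count; break ties alphabetically so the result is stable.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: i for i, word in enumerate(ordered, start=1)}

def select_entries(corpora: Dict[str, Dict[str, int]], top_n: int) -> Dict[str, Dict[str, int]]:
    """Keep words ranked within top_n in at least one corpus.

    For each kept word, report its rank in every corpus where it occurs,
    even if that rank is larger than top_n.
    """
    ranks = {name: rank_words(counts) for name, counts in corpora.items()}
    keep = {w for r in ranks.values() for w, k in r.items() if k <= top_n}
    return {w: {name: r[w] for name, r in ranks.items() if w in r} for w in sorted(keep)}

# Toy word counts (hypothetical numbers, two corpora instead of six).
corpora = {
    "subtitles": {"笨蛋": 90, "你好": 80, "经济": 5, "算法": 1},
    "news": {"经济": 95, "你好": 40, "算法": 20, "笨蛋": 2},
}

entries = select_entries(corpora, top_n=2)
for word, word_ranks in entries.items():
    print(word, word_ranks)
```

With `top_n=2`, 笨蛋 is kept because of its subtitle rank but still shows its much lower news rank, while 算法 is dropped because it misses the cutoff in both corpora.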
BearXiong Posted July 6, 2018 at 12:07 AM I've attached the raw entries of the user dictionary to this post. You can install the user dictionary from the raw entries, but it could take a while to process (probably about an hour). To do so, in Pleco go to Settings -> Manage Dictionaries -> Add User -> New, then give it a name, e.g. "Freq". Tap the (empty) dictionary you created, go to Import Entries -> Choose File, select the file pleco_freq_dict.txt attached to this post, and finally tap Begin Import. Attachment: pleco_freq_dict.txt
Beelzebro Posted July 6, 2018 at 08:40 AM This is a great idea! Am I correct that adding user dictionaries requires the flashcard add-on? I use Anki but I might be tempted to buy the add-on just so I can add this dict haha.
BearXiong Posted July 6, 2018 at 09:02 AM Ah, I didn't notice that but you're right. You need the flashcard add-on to access the user dictionary options according to this.
dtcamero Posted July 7, 2018 at 12:20 PM this is super awesome... trying to decide whether or not to study this word... is it obscure or daily use? here's the immediate answer for every dictionary entry... simply brilliant. should be automatically adopted by pleco's @mikelove as a default service. well done sir!
mikelove Posted July 7, 2018 at 03:27 PM Cool project! We actually do have a frequency list in Pleco already, but we currently only use it for sorting search results; we aren't really confident enough in it to encourage people to base their vocabulary study decisions on it yet. (To be honest, for that I'd be more confident with something human-curated than with corpus analysis.)
ErwinHero Posted August 17, 2019 at 10:03 PM I know I'm late, but thank God for this BearXiong! I am not learning Chinese to pass exams. I am learning to increase work efficiency in China. This may save me hundreds of hours. Why this is not a default option to turn on is beyond me. It's not something that should be overthought--the value is incontrovertible.
sopsku Posted February 14, 2020 at 03:53 PM I loaded the plecofreq.pqb file and "played" with it a bit. I noticed some two-character words that have definitions in, say, the ABC dictionary do show up in this FREQ dictionary (e.g. 纸鸢 is in both ABC and CC as the word "kite"). This in itself is not a major problem, but it led me to take a look at the raw data file (pleco_freq_dict.txt). This file has freq numbers in the 300Ks, but is only 112137 lines long and therefore cannot have 300K+ entries. The file is rather difficult to sort on, say, the weibo numbers, but I think I was able to rewrite it and do so (perhaps not without error). I found that the first 50K words were contained in a bit more than 50K lines (some special symbols are mixed in), but then there are enormous gaps to get to the 300K freqs in the remaining 60K lines. Hence I would conclude the dictionary may only be useful up to about the 50K freqs. Lastly, I can only seem to make the FREQ dictionary available from the reader or from a direct entry in the Pleco dictionary search box. That is, if I take a word from the reader popup that is in the FREQ dictionary and then expand the popup (click the box with the up arrow), the FREQ dictionary is not available to entries in the CHARS tab. Nor is the FREQ dictionary ever made available for entries in the WORDS tab. Perhaps this is not the fault of this FREQ USR dictionary and has something to do with my installation. I have checked all of the boxes to configure it to be the same as all of my other dictionaries. Can someone comment on whether it is normal for USR dictionaries in general not to be available for entries in the CHARS and WORDS tabs?
sopsku Posted February 14, 2020 at 05:33 PM I asked: "Can someone comment on whether it is normal for USR dictionaries in general not to be available for entries in the CHARS and WORDS tabs?" So I installed the FREQ USR dictionary on an old tablet (4-5 years old). Both it and my newer phone (1 year old) are running Pleco 3.2.73, and FREQ is installed in an identical way on both. The FREQ dictionary seems to work in both the CHARS and WORDS tabs on the old tablet but not on my phone. The phone runs Android 7.0 and the tablet 7.1.1. I find this very strange.
mikelove Posted February 14, 2020 at 10:00 PM User dictionaries are not reliably supported in those at the moment, no - feature parity between user and Pleco dictionaries is an ongoing goal (and already in current development builds of 4.0 we’ve got them sharing about 95% of their search code, which makes that a lot easier) but at the time we launched the current design they were actually too slow to work well on those screens and so we left them out.
NanJingDongLu Posted May 8, 2021 at 02:34 PM This user dictionary is fantastic and I'm glad I saw this linked in another thread. I'm a little confused as to what exactly it all means though. In the original thread it lists the basis for the corpus as follows:
On 6/15/2018 at 12:51 PM, johnvaradero said: "It’s based on news (人民日报 1946-2018,人民日报海外版 2000-2018), literature (books by 472 authors, including a significant portion of non-Chinese writers), non-fiction books, blog and weibo entries as well as classical Chinese"
In the dictionary it lists the following 6 topics: Weibo, Subtitles, Blogs, Literature, News, Tech. Three of these seem fairly self-explanatory, but...
What is a blog in this context? My native friends say that they've never read a blog that wasn't on Weibo, or is it including 知乎、豆瓣、天涯 and the like?
Where does Classical Chinese fit into it? Is it merged with Literature? If you did this for English, the Bible, Sherlock Holmes, and Shakespeare would all come up as Literature, but they don't have much overlap if your goal in learning English is to read Harry Potter and Twilight. I'm assuming it also merged the non-fiction books in. That would be a shame, as you couldn't tell the difference between what people are writing in romance novels vs academic writing.
What is tech? We've already got weibo and blogs.
I'm mostly confused because if you search, for example, 笨蛋 you'd expect it to lean towards casual and spoken language, but it gives the following stats: Weibo - 4139, Subtitles - 1229, Blogs - 37322, Literature - 6494, News - 50704, Tech - 80233.
So this means that you're unlikely to read 笨蛋 in a weibo, but very likely to see it in a blog. Unlikely to see it in a TV show or book, but very likely to read it in the news or "tech".
BearXiong Posted May 8, 2021 at 06:21 PM Good question. A lower number represents a more frequent word! So 笨蛋 occurs more often in subtitles and on weibo, compared to in other corpora. The number 1229 for subtitles means that if you listed words by frequency as they occur in subtitles then it's the 1229th most frequent word. Hope that helps!
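Since the numbers are ranks rather than counts, one way to read an entry is to sort its corpora from the smallest rank to the largest. The sketch below does that with the 笨蛋 figures quoted earlier in the thread; the variable names are only illustrative and nothing here is part of Pleco or the dictionary file itself.

```python
# Ranks for 笨蛋 as quoted in the thread (lower rank = more frequent).
ranks = {
    "Weibo": 4139,
    "Subtitles": 1229,
    "Blogs": 37322,
    "Literature": 6494,
    "News": 50704,
    "Tech": 80233,
}

# Sort corpora from the one where the word is most frequent to the least.
by_rank = sorted(ranks.items(), key=lambda item: item[1])
order = [name for name, _ in by_rank]

for name, rank in by_rank:
    print(f"{name:<10} rank {rank}")
```

Read this way, the word is most at home in subtitles and weibo and least in news and tech, which matches the expectation that it is casual, spoken vocabulary.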
NanJingDongLu Posted May 8, 2021 at 07:50 PM Oh it's a ranking! I thought it was a frequency count. That makes so much more sense. Thank you. Could you help me with any of the other questions? The main one I'm keen to get an answer on is the definition of "tech". 5 hours ago, NanJingDongLu said: "What is tech? We've already got weibo and blogs."
williamwu123 Posted October 28, 2024 at 06:42 PM Amazing