
Sharing: A Pleco word-frequency user dictionary


BearXiong


I've attached the raw entries of the user dictionary to this post. 

 

You can install the user dictionary based on the raw entries, but it could take a while to process (probably like an hour). To do so, in Pleco go to Settings -> Manage Dictionaries -> Add User -> New. Then give it a name e.g. "Freq". Click on the (empty) dictionary which you created, go to Import Entries -> Choose File, and select the file pleco_freq_dict.txt which I've attached to this post, and finally Begin Import.

pleco_freq_dict.txt

  • Helpful 1
Link to comment
Share on other sites

this is super awesome...

trying to decide whether or not to study this word... is it obscure or daily use? here's the immediate answer for every dictionary entry... simply brilliant.

should be automatically adopted by pleco's @mikelove as a default service.

well done sir!

Link to comment
Share on other sites

Cool project!

 

We actually do have a frequency list in Pleco already, but we currently only use it for sorting search results; we aren't really confident enough in it to encourage people to base their vocabulary study decisions on it yet. (to be honest for that I'd be more confident with something human-curated rather than with corpus analysis)

Link to comment
Share on other sites

  • 1 year later...
  • New Members

I know I'm late, but thank God for this BearXiong! I am not learning Chinese to pass exams. I am learning to increase work efficiency in China. This may save me hundreds of hours. Why this is not a default option to turn on is beyond me. It's not something that should be overthought--the value is incontrovertible. 

  • Like 1
Link to comment
Share on other sites

  • 5 months later...
  • New Members

I loaded the plecofreq.pqb file and "played" with it a bit. I noticed that some two-character words that have definitions in, say, the ABC dictionary do not show up in this FREQ dictionary (e.g. 纸鸢 is in both ABC and CC as the word "kite"). This in itself is not a major problem, but it led me to take a look at the raw data file (pleco_freq_dict.txt). This file has freq numbers in the 300Ks but is only 112137 lines long, and therefore cannot have 300K+ entries.

 

The file is rather difficult to sort on, say, the weibo numbers, but I think I was able to rewrite it so that I could do so (perhaps not without error). I found that the first 50K words were contained in a bit more than 50K lines (some special symbols are mixed in), but then there are enormous gaps to get to the 300K freqs in the remaining 60K lines. Hence I would conclude the dictionary may only be useful up to about the 50K freqs.
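For anyone who wants to repeat this kind of check, here is a minimal sketch in Python. It assumes the rank numbers for one corpus (e.g. weibo) have already been pulled out of pleco_freq_dict.txt into a plain list; the parsing is left out because the file layout isn't described in this thread. The function name `dense_cutoff`, the `max_gap` threshold and the toy numbers are all made up for illustration.

```python
def dense_cutoff(ranks, max_gap=1000):
    """Return the last rank reached before the first jump larger than max_gap.

    ranks   -- iterable of integer frequency ranks for a single corpus
    max_gap -- largest step between consecutive ranks still counted as "dense"
               (an arbitrary threshold chosen for this sketch)
    """
    ordered = sorted(set(ranks))  # unique ranks, ascending
    if not ordered:
        return None
    previous = ordered[0]
    for rank in ordered[1:]:
        if rank - previous > max_gap:
            return previous  # the first big gap starts right after this rank
        previous = rank
    return previous  # no big gap found; the ranks are dense to the end


# Toy ranks, invented for illustration (NOT taken from the real file).
toy_ranks = [1, 2, 3, 4, 5, 250, 300000]
print(dense_cutoff(toy_ranks, max_gap=10))
```

On the real data, a cutoff near 50K would be consistent with the estimate above, but that depends on the threshold you pick and on how the file is parsed.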

 

Lastly, I can only seem to make the FREQ dictionary available from the reader or from a direct entry in the Pleco dictionary search box. That is, if I take a word from the reader popup that is in the FREQ dictionary and then expand the popup (click the box with the up arrow), the FREQ dictionary is not available for entries in the CHARS tab. Nor is the FREQ dictionary ever made available for entries in the WORDS tab. Perhaps this is not the fault of this FREQ USR dictionary and has something to do with my installation. I have checked all of the boxes to configure it to be the same as all of my other dictionaries. Can someone comment on whether it is normal for USR dictionaries in general not to be available for entries in the CHARS and WORDS tabs?

Link to comment
Share on other sites

  • New Members

I asked:

"Can some (one) comment if this is normal for USR dictionaries in general not to be available for entries in the CHARS and WORDS tabs? "

 

So I installed the FREQ USR dictionary on an old tablet (4-5 years old). Both it and my newer phone (1 year old) are running Pleco 3.2.73, and FREQ is installed in an identical way on both.

 

The FREQ dictionary seems to work in both the CHARS and WORDS tabs on the old tablet but not on my phone. The phone is on Android 7.0 and the tablet is on 7.1.1. I find this very strange.

Link to comment
Share on other sites

User dictionaries are not reliably supported in those at the moment, no - feature parity between user and Pleco dictionaries is an ongoing goal (and already in current development builds of 4.0 we’ve got them sharing about 95% of their search code, which makes that a lot easier) but at the time we launched the current design they were actually too slow to work well on those screens and so we left them out.

Link to comment
Share on other sites

  • 1 year later...

This user dictionary is fantastic and I'm glad I saw this linked in another thread. I'm a little confused as to what exactly it all means though.

 

In the original thread it lists the basis for the corpus as follows:

On 6/15/2018 at 12:51 PM, johnvaradero said:

It’s based on news (人民日报 1946-2018,人民日报海外版 2000-2018), literature (books by 472 authors, including a significant portion of non-Chinese writers), non-fiction books, blog and weibo entries as well as classical Chinese

 

In the dictionary it lists the following 6 topics:

  • Weibo
  • Subtitles
  • Blogs
  • Literature
  • News
  • Tech

Three of these seem fairly self explanatory, but...

  1. What is a blog in this context? My native friends say that they've never read a blog that wasn't on Weibo, or is it including 知乎、豆瓣、天涯 and the like?
  2. Where does Classical Chinese fit into it? Is it merged with Literature? If you did this for English, the Bible, Sherlock Holmes, and Shakespeare would all come up as Literature, but they don't have much overlap if your goal in learning English is to read Harry Potter and Twilight.
  3. I'm assuming it also merged the non-fiction books in. That would be a shame as you couldn't tell the difference between what people are writing in romance novels vs academic writing.
  4. What is tech? We've already got weibo and blogs.

I'm mostly confused because if you search, for example, 笨蛋 you'd expect it to lean towards casual and spoken language, but it gives the following stats:

  • Weibo - 4139
  • Subtitles - 1229
  • Blogs - 37322
  • Literature - 6494
  • News - 50704
  • Tech - 80233

So this means that you're unlikely to read 笨蛋 in a weibo, but very likely to see it in a blog. Unlikely to see it in a TV show or book, but very likely to read it in the news or "tech".

Link to comment
Share on other sites

Good question. A lower number represents a more frequent word! So 笨蛋 occurs more often in subtitles and on weibo, compared to in other corpora. The number 1229 for subtitles means that if you listed words by frequency as they occur in subtitles then it's the 1229th most frequent word. Hope that helps!
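To make the "lower is more frequent" reading concrete, here is a small Python sketch that sorts the six 笨蛋 numbers quoted above from most to least common corpus. The dictionary name `ranks` and the variable `by_commonness` are just names picked for this example; the numbers themselves are the ones from the post.

```python
# Frequency ranks for 笨蛋 as quoted in this thread (lower rank = more frequent).
ranks = {
    "Weibo": 4139,
    "Subtitles": 1229,
    "Blogs": 37322,
    "Literature": 6494,
    "News": 50704,
    "Tech": 80233,
}

# Sort corpus names by ascending rank, so the corpus where the word
# is relatively most common comes first.
by_commonness = sorted(ranks, key=ranks.get)

for corpus in by_commonness:
    print(f"{corpus}: rank {ranks[corpus]}")
```

This puts Subtitles and Weibo at the top and Tech at the bottom, which fits the expectation that 笨蛋 leans towards casual and spoken language.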

  • Like 1
Link to comment
Share on other sites

Oh it's a ranking! I thought it was a frequency count. That makes so much more sense. Thank you.

 

Could you help me with any of the other questions? The main one I'm keen to get an answer on is the definition of "tech"

 

5 hours ago, NanJingDongLu said:

What is tech? We've already got weibo and blogs.

 

Link to comment
Share on other sites

  • 3 years later...
