

Posted

For a while now, whenever I refer to or work through an online transcript for the TV show 锵锵三人行, I copy the text into a Word document. I've now accumulated 100 of these documents and since Word lets you combine whole directories worth of documents, I told it to do just that, then fed the result into Imron's Chinese Text Analyser software which spat out a list of all the words used in these 100 episodes, with their respective frequencies.

 

Then I realised I've got no use for this data ... but in case anyone else has, it's attached as a text file.

qq3rx.txt

 

 

Edit: and a file with definitions etc.: qq3rxFULL.txt

Posted

Info: 6103 unique words. 3826 of them are used only once each. Only 1227 are used more than three times. 

 

Edit: Sorry, I used a smaller dictionary when I extracted these numbers. They must be wrong. 

Posted

Nice work!  For a wide-ranging talk show, I'm amazed that there are so few unique words over the span of 100 episodes.

Posted

Whoops! I obtained the numbers from - of course - Chinese Text Analyser, but I forgot that I have it using a smaller dictionary now. Sorry for this oversight.

 

Edit: I just tried it with a fresh CEDICT and the numbers are the same as the above. I'll let somebody else sort it out.

Posted

Ok, here are the correct stats (not from CTA, but from looking at the file itself):

 

Total unique words: 12,232, of which 4,676 are only used once, and only 5,580 are used more than 3 times.
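For anyone wanting to reproduce this kind of summary, here is a minimal sketch in Python. It assumes the word list has already been loaded into a word-to-count mapping; the layout of the attached file is not described in the thread, so the toy `sample` data below is a stand-in and `summarize` is an illustrative name only.

```python
from collections import Counter

def summarize(freqs):
    """Return (unique words, words used exactly once, words used more than 3 times)."""
    unique = len(freqs)
    once = sum(1 for n in freqs.values() if n == 1)
    more_than_three = sum(1 for n in freqs.values() if n > 3)
    return unique, once, more_than_three

# Toy data standing in for the real word list (assumption, not the actual corpus).
sample = Counter("我 我 我 我 你 你 他 好 好 好 好 好 美女".split())
print(summarize(sample))  # (5, 2, 2)
```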

Posted

And for characters, total 338k characters, unique 2,994. I find it perversely satisfying, though I know it's ultimately meaningless, that the unique character count comes in below the magic 3,000 figure!
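A rough way to get total and unique character counts from raw transcript text might look like the sketch below. It assumes that only characters in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) are counted; how the figures above were actually produced isn't stated, so treat this as one possible approach.

```python
def char_stats(text):
    """Return (total, unique) counts of CJK characters in text."""
    # Assumption: count only the basic CJK Unified Ideographs block,
    # skipping punctuation, Latin letters, digits and whitespace.
    chars = [c for c in text if "\u4e00" <= c <= "\u9fff"]
    return len(chars), len(set(chars))

print(char_stats("锵锵三人行，三人行。"))  # (8, 4)
```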

Posted

With this carefully selected list of less than 2,500 Chinese words, YOU can understand 90% of a popular Chinese talk show!*

2474.txt
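A list like this can be derived by walking down the frequency table, most frequent word first, until the running total reaches the target share of all word occurrences. The sketch below assumes a word-to-count mapping and uses made-up counts; `words_for_coverage` is a hypothetical helper, not something from Chinese Text Analyser.

```python
def words_for_coverage(freqs, target):
    """Return how many of the most frequent words are needed to cover
    at least `target` (e.g. 0.9) of all word occurrences."""
    total = sum(freqs.values())
    running = 0
    ranked = sorted(freqs.values(), reverse=True)
    for rank, count in enumerate(ranked, start=1):
        running += count
        if running >= target * total:
            return rank
    return len(ranked)

# Made-up counts for illustration only.
toy = {"的": 50, "是": 30, "我": 8, "美女": 5, "观众": 4, "客人": 3}
print(words_for_coverage(toy, 0.90))  # 4
```

Because coverage is measured over occurrences rather than over distinct words, a small head of very frequent words accounts for most of the text.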

Posted

Actually what would be interesting (relative to the values of interesting with which we're working here) would be to compare that to other corpuses (corpii?) and see where the differences are. I note 美女 ranks in the top 1,000, with 客人 and 观众 lagging further behind. 

Posted

Seeing stats like this, I think it really helps drive home the point I have been making in other threads that once you reach a certain level, vocabulary becomes dramatically less important and provides only marginal increases in understanding.

 

That level is also significantly lower than most people believe.

Posted

I should point out that names of people appearing on the show appear very heavily, because of the nature of the transcripts where each chunk of speech is preceded by the name of the person saying it.

Posted
once you reach a certain level, vocabulary becomes dramatically less important and provides only marginal increases in understanding

 

And conversely, once you reach a certain level, every little step higher in understanding requires an enormous amount of exposure!

 

I find it a bit depressing: I could read for hours every day but after a year if I pick up a new book at random, my comprehension rate -- in theory -- will be basically the same as it would have been 12 months earlier.

 

In practical terms though this shows that once you're at that certain level, there's greater impact going over previously-learned but perhaps-forgotten, or only half-remembered, higher frequency words, to make sure you have them rock solid, than in learning new ones.

Posted
I find it a bit depressing: I could read for hours every day but after a year if I pick up a new book at random, my comprehension rate -- in theory -- will be basically the same as it would have been 12 months earlier.

Probably not though.

 

Take for example the corpus you put together. 338,000 characters, so at the relatively slow reading speed of 150 cpm, it would take you about 40 hours to read through the entire transcript. An hour a day would see you get through the above wordlist in a little over a month.
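The arithmetic behind that estimate, as a quick sketch using only the figures quoted above (338,000 characters, 150 characters per minute, one hour of reading a day):

```python
def reading_hours(total_chars, chars_per_minute):
    """Hours needed to read total_chars at a steady chars_per_minute."""
    return total_chars / chars_per_minute / 60

hours = reading_hours(338_000, 150)
print(round(hours, 1))  # 37.6 hours, i.e. "about 40"
print(round(hours))     # 38 days at one hour a day
```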

 

After a year of doing that, the numbers will be a bit higher for quite a few of those single item frequencies.  Maybe you'd see them 2-3 times in the entire year - which if you translate that to SRS terms is a revision every 4-6 months, which is probably not so bad if you spend some time learning the word properly the first time.

 

 

 

In practical terms though this shows that once you're at that certain level, there's greater impact going over previously-learned but perhaps-forgotten, or only half-remembered, higher frequency words, to make sure you have them rock solid, than in learning new ones.

I think this is very true.  It also highlights the need to make sure you are learning words from material that you are actually encountering, otherwise there's a high chance that you're learning words you'll never see/use.

Posted

Another interesting tidbit (which I got after getting the original Word document from realmayo and playing around with it myself in CTA): the magical 98% comprehension kicks in at almost the exact spot where the single-frequency items start (fewer than 10 words apart).

 

So it's not so dire as post #12 makes out.  You could never learn the single frequency items, and still have enough comprehension to learn new words almost entirely from context.

Posted

20-25 minutes.

 

So for 100 shows, it's about 3383 total characters per show and about 135-170 cpm.

 

Word wise, it's about 2334 total words per show, or about 93-115 wpm.

 

There will be some margin for error as it doesn't take into account pauses and cutting to ads, and title sequences and so forth.
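As a quick check of those rates, here is a sketch using the per-show totals and the 20-25 minute running time given above. The 20-minute case comes out nearer 117 wpm than 115, which is well inside the margin for error just mentioned.

```python
def per_minute(total_per_show, minutes):
    """Average items (characters or words) spoken per minute."""
    return total_per_show / minutes

chars_per_show = 3383
words_per_show = 2334
for minutes in (25, 20):
    cpm = per_minute(chars_per_show, minutes)
    wpm = per_minute(words_per_show, minutes)
    print(f"{minutes} min: {cpm:.0f} cpm, {wpm:.0f} wpm")
# 25 min: 135 cpm, 93 wpm
# 20 min: 169 cpm, 117 wpm
```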

Posted

One way of looking at that is 98% comprehension means one missed word or two per minute? 
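Putting numbers on that, a small sketch: it assumes the 93-115 wpm range from the previous post and treats every unknown word as one "miss", which is a simplification.

```python
def missed_per_minute(words_per_minute, comprehension):
    """Expected unknown words per minute at a given comprehension rate."""
    return words_per_minute * (1 - comprehension)

for wpm in (93, 115):
    print(wpm, round(missed_per_minute(wpm, 0.98), 2))
# 93 1.86
# 115 2.3
```

So roughly two unknown words a minute at normal talk-show pace.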

Posted

I'd say that's probably a fair assessment.

 

I'm quite impressed at how much useful data is coming from this.  I might at some point try scraping their site for transcripts to build a bigger corpus!

Posted

I know we all like the show, but the few transcripts I worked on were quite buggy, and a teacher I showed them to suggested they might have been generated by machine (doubtful) or at least quickly and cheaply (more likely). If you're going to put effort into such a project, I suggest something with more reliable transcripts. Unfortunately none spring to mind.
