锵锵三人行 'corpus'!

January 27, 2016 at 04:49 PM

For a while now, whenever I refer to or work through an online transcript for the TV show 锵锵三人行, I copy the text into a Word document. I've now accumulated 100 of these documents, and since Word lets you combine whole directories' worth of documents, I told it to do just that. I then fed the result into Imron's Chinese Text Analyser software, which spat out a list of all the words used in these 100 episodes, with their respective frequencies.

Then I realised I've got no use for this data ... but in case anyone else has, it's attached as a text file.

qq3rx.txt

Edit: and a file with definitions etc.: qq3rxFULL.txt

January 27, 2016 at 06:23 PM

~~Info: 6103 unique words. 3826 of them are used only once each. Only 1227 are used more than three times.~~

Edit: Sorry, I used a smaller dictionary when I extracted these numbers. They must be wrong.

January 27, 2016 at 10:15 PM

Nice work! For a wide-ranging talk show, I'm amazed that there are so few unique words over the span of 100 episodes.

January 27, 2016 at 10:24 PM

Whoops! I obtained the numbers from - of course - Chinese Text Analyser, but I forgot that I have it using a smaller dictionary now. Sorry for this oversight.

Edit: I just tried it with a fresh CEDICT and the numbers are the same as the above. I'll let somebody else sort it out.

January 27, 2016 at 10:36 PM

Ok, here are the correct stats (not from CTA, but from looking at the file itself):

Total unique words: 12,232, of which 4,676 are only used once, and only 5,580 are used more than 3 times.
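For anyone who wants to reproduce counts like these from a frequency list, here is a minimal Python sketch. It assumes each line has the form `word<TAB>count`; the actual layout of qq3rx.txt isn't described in the thread, so the parser and the function names are hypothetical and the sample data is made up.

```python
def parse_frequency_lines(lines):
    """Parse lines of the ASSUMED form 'word<TAB>count' into a dict."""
    freqs = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        word, count = line.split("\t")[:2]
        freqs[word] = int(count)
    return freqs


def frequency_stats(freqs):
    """Summarise a word -> count mapping the way the post above does."""
    counts = list(freqs.values())
    return {
        "unique": len(counts),
        "used_once": sum(1 for c in counts if c == 1),
        "used_more_than_3": sum(1 for c in counts if c > 3),
    }


# Made-up sample, not taken from the attached file.
sample = ["我\t120", "美女\t7", "观众\t3", "客人\t1", "锵锵\t1"]
stats = frequency_stats(parse_frequency_lines(sample))
print(stats)  # {'unique': 5, 'used_once': 2, 'used_more_than_3': 2}
```

To run it on a real list, pass an open file object to `parse_frequency_lines` instead of `sample`, adjusting the split if the file uses a different separator.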

January 28, 2016 at 09:34 AM

And for characters, total 338k characters, unique 2,994. I find it perversely satisfying, though I know it's ultimately meaningless, that the unique character count comes in below the magic 3,000 figure!
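A rough way to get total and unique character counts from raw text is sketched below. It only counts code points in the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which is an assumption: the thread doesn't say whether the 338k figure includes punctuation, Latin letters or rarer characters.

```python
def han_character_counts(text):
    """Return (total, unique) counts of basic CJK ideographs in text."""
    han = [ch for ch in text if "\u4e00" <= ch <= "\u9fff"]
    return len(han), len(set(han))


# Made-up sample sentence; full-width punctuation is ignored.
total, unique = han_character_counts("锵锵三人行,三个人聊天。")
print(total, unique)  # 10 7
```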

January 28, 2016 at 10:55 AM

With this carefully selected list of less than 2,500 Chinese words, YOU can understand 90% of a popular Chinese talk show!*

2474.txt
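A cutoff like this can be found by sorting words by frequency and accumulating counts until the target share of all word occurrences is reached. The sketch below shows the idea on made-up numbers; it is not necessarily how the 2474.txt list was produced, and the function name is hypothetical.

```python
def words_needed_for_coverage(freqs, target):
    """Smallest number of top-frequency words covering `target` of all tokens."""
    total = sum(freqs.values())
    running = 0
    for rank, count in enumerate(sorted(freqs.values(), reverse=True), start=1):
        running += count
        if running >= target * total:
            return rank
    return len(freqs)


# Toy frequency table (100 tokens in total), for illustration only.
toy = {"的": 50, "是": 30, "我": 12, "美女": 5, "观众": 2, "客人": 1}
print(words_needed_for_coverage(toy, 0.90))  # 3
print(words_needed_for_coverage(toy, 0.98))  # 5
```

Note that coverage of word occurrences is not the same thing as comprehension, which is presumably what the asterisk is for.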

January 28, 2016 at 11:06 AM

Actually what would be interesting (relative to the values of interesting with which we're working here) would be to compare that to other corpuses (corpii?) and see where the differences are. I note 美女 ranks in the top 1,000, with 客人 and 观众 lagging further behind.

January 28, 2016 at 11:20 AM

Seeing stats like this, I think it really helps drive home the point I have been making in other threads that once you reach a certain level, vocabulary becomes dramatically less important and provides only marginal increases in understanding.

That level is also significantly lower than most people believe.

January 28, 2016 at 11:22 AM

> corpuses (corpii?)

corpora

January 28, 2016 at 11:32 AM

I should point out that names of people appearing on the show appear very heavily, because of the nature of the transcripts where each chunk of speech is preceded by the name of the person saying it.

January 28, 2016 at 11:40 AM

> once you reach a certain level, vocabulary becomes dramatically less important and provides only marginal increases in understanding

And conversely, once you reach a certain level, every little step higher in understanding requires an enormous amount of exposure!

I find it a bit depressing: I could read for hours every day but after a year if I pick up a new book at random, my comprehension rate -- in theory -- will be basically the same as it would have been 12 months earlier.

In practical terms though this shows that once you're at that certain level, there's greater impact going over previously-learned but perhaps-forgotten, or only half-remembered, higher frequency words, to make sure you have them rock solid, than in learning new ones.

January 28, 2016 at 02:13 PM

> I find it a bit depressing: I could read for hours every day but after a year if I pick up a new book at random, my comprehension rate -- in theory -- will be basically the same as it would have been 12 months earlier.

Probably not though.

Take for example the corpus you put together. It has 338,000 characters, so at the relatively slow reading speed of 150 cpm, it would take you about 40 hours to read through the entire transcript. An hour a day would see you get through the above word list in a little over a month.
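The arithmetic behind that estimate, as a small sketch using only the figures quoted above (the function name is just for illustration):

```python
def reading_hours(total_chars, chars_per_minute):
    """Hours needed to read `total_chars` at a steady reading speed."""
    return total_chars / chars_per_minute / 60


hours = reading_hours(338_000, 150)
print(round(hours, 1))  # 37.6
```

At one hour a day that is roughly 38 days, which fits "a little over a month".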

After a year of doing that, the numbers will be a bit higher for quite a few of those single item frequencies. Maybe you'd see them 2-3 times in the entire year - which if you translate that to SRS terms is a revision every 4-6 months, which is probably not so bad if you spend some time learning the word properly the first time.

> In practical terms though this shows that once you're at that certain level, there's greater impact going over previously-learned but perhaps-forgotten, or only half-remembered, higher frequency words, to make sure you have them rock solid, than in learning new ones.

I think this is very true. It also highlights the need to make sure you are learning words from material that you are actually encountering, otherwise there's a high chance that you're learning words you'll never see/use.

January 28, 2016 at 02:17 PM

> I note 美女 ranks in the top 1,000

窦文涛 has a thing for 竹幼婷

January 28, 2016 at 02:33 PM

Another interesting tidbit (which I got after getting the original Word document from realmayo and playing around with it myself in CTA): the magical 98% comprehension kicks in at almost the exact spot where the single-frequency items start (fewer than 10 words' difference).

So it's not so dire as post #12 makes out. You could never learn the single frequency items, and still have enough comprehension to learn new words almost entirely from context.
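As a rough cross-check with numbers from elsewhere in the thread: if the 4,676 single-occurrence words are the only unknown ones, the share of running words you know is one minus their share of all word occurrences. The sketch below assumes about 2,334 words per show over 100 shows (a figure quoted later in the thread), and it mixes counts from the word list file with counts from CTA, so treat the result as approximate.

```python
def coverage_without_hapaxes(total_tokens, hapax_count):
    """Share of running words covered if every word seen only once is unknown."""
    return 1 - hapax_count / total_tokens


total_tokens = 2334 * 100  # assumed: words per show times number of shows
coverage = coverage_without_hapaxes(total_tokens, 4676)
print(f"{coverage:.1%}")  # 98.0%
```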

January 28, 2016 at 03:27 PM

How long is each show? Can we work out an average words per minute?

January 28, 2016 at 03:46 PM

20-25 minutes.

So for 100 shows, it's about 3383 total characters per show and about 135-170 cpm.

Word-wise, it's about 2334 total words per show, or about 93-117 wpm.

There will be some margin for error, as it doesn't take into account pauses, cuts to ads, title sequences and so forth.
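The per-minute rates are just the per-show totals divided by the running time. A quick sketch, assuming every show runs somewhere between 20 and 25 minutes with no allowance for ads or pauses:

```python
def per_minute_range(per_show, min_minutes, max_minutes):
    """Return (lowest, highest) per-minute rate for a given per-show total."""
    return per_show / max_minutes, per_show / min_minutes


cpm_low, cpm_high = per_minute_range(3383, 20, 25)
wpm_low, wpm_high = per_minute_range(2334, 20, 25)
print(f"{cpm_low:.0f}-{cpm_high:.0f} cpm")  # 135-169 cpm
print(f"{wpm_low:.0f}-{wpm_high:.0f} wpm")  # 93-117 wpm
```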

January 28, 2016 at 03:49 PM

One way of looking at that is 98% comprehension means one missed word or two per minute?
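Worked through with the speaking rates estimated above (roughly 93 to 117 words per minute), and treating comprehension as the share of running words known, which is a simplification:

```python
def unknown_words_per_minute(words_per_minute, comprehension):
    """Expected unknown words per minute at a given known-word share."""
    return words_per_minute * (1 - comprehension)


slow = unknown_words_per_minute(93, 0.98)
fast = unknown_words_per_minute(117, 0.98)
print(f"{slow:.1f} to {fast:.1f} unknown words per minute")  # 1.9 to 2.3 unknown words per minute
```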

January 28, 2016 at 03:56 PM

I'd say that's probably a fair assessment.

I'm quite impressed at how much useful data is coming from this. I might at some point try scraping their site for transcripts to build a bigger corpus!

January 28, 2016 at 04:03 PM

I know we all like the show, but the few transcripts I worked on were quite buggy, and a teacher I showed them to suggested they might have been generated by machine (doubtful) or at least quickly and cheaply (more likely). If you're going to put effort into such a project, I suggest something with more reliable transcripts. Unfortunately none spring to mind.

Guest realmayo

querido

imron

querido

imron

Guest realmayo

roddy

roddy

imron

imron

Guest realmayo

Guest realmayo

imron

imron

imron

roddy

imron

roddy

imron

li3wei1