Sentence Extraction Tool using Audio and Translation Alignment

January 13, 2013 at 02:15 AM

Hi All,

I've seen some vocabulary extraction tools out there, and I've even written one. But I just thought of a big step up from the basic sample-sentence mining tool. At the moment I don't quite have the NLP chops to pull it off, but maybe in the not-so-distant future I will attempt this. Here's the concept:

INPUT: target word list (L), ebook (E), audio book reading (A), translated ebook (T)

OUTPUT: set of example sentences ({S}) for each word in L.

where S = ( sentence from E with target word in bold, corresponding audio clip from A, corresponding sentence from T).

IMPLEMENTATION: Computer alignment of E/A/T. Search E for target words. Build and output list of S, optionally making electronic flashcards.
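The text-only part of this pipeline (search E for target words, collect the sentences, bold the word) can be sketched in a few lines of Python. This is only a rough illustration: the function names are made up, sentences are split naively on sentence-final punctuation, and words are matched as plain substrings with no word segmentation. The audio and translation alignment, which is the hard part, is left out entirely.

```python
import re

# Assumption: a "sentence" is a run of text ending in 。！？ (or ! ?).
SENTENCE = re.compile(r'[^。！？!?]+[。！？!?]?')

def split_sentences(text):
    """Split text into sentences, keeping the final punctuation mark."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]

def mine_sentences(text, targets):
    """Map each target word to the sentences that contain it.

    The target word is wrapped in <b> tags so it shows up bold
    on an HTML-based flashcard.
    """
    sentences = split_sentences(text)
    result = {}
    for word in targets:
        result[word] = [
            s.replace(word, f"<b>{word}</b>")
            for s in sentences
            if word in s  # plain substring match, no segmentation
        ]
    return result

if __name__ == "__main__":
    ebook = "企鹅很吵。我不喜欢冰。企鹅肉很难吃！"
    for word, found in mine_sentences(ebook, ["企鹅", "冰"]).items():
        print(word, found)
```

Substring matching will produce false hits when a short word is part of a longer one, so a real tool would probably want a segmenter in front of this.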

This could have a huge benefit for learners. Take a book that one is interested in -- perhaps a Chinese translation of a favorite novel, or a Chinese original with which one is already familiar -- then use that book as the source for massive sentence mining. But have the computer help you! No tedious recording/editing by oneself.

Anyone know of such a tool already? Anyone familiar with audio book to ebook alignment algorithms? I wonder how beneficial this would be, and who it would most benefit.

Comments? Criticisms?

Thanks,

-Chris

January 13, 2013 at 08:55 AM

Why not just read the book? And listen to it? If the book is at a vocabulary level higher than your target list, then many of your example sentences will contain words you don't know. You could adjust the model above so that it generates an S for every unique word in the book, and you just go down the list picking out the ones you don't know. But again, you're almost reading the book, and I think reading the book would be more interesting and enjoyable than reading a list of detached sentences.

Also a good example sentence will make the meaning of the word clear, and/or illustrate something about its usage, while a random sentence plucked out of a book may not, and may even mislead you. If you read the book, you'll probably get the same word in 3 or 4 sentences in close proximity, and the context of the story will help you fix the meaning in your memory. Imagine a sentence about someone looking down on a sheet of ice covered by a penguin colony, all noisy and crapping everywhere. Then later there's a description of this person fighting a penguin with an icepick, then a few sentences later a description of how disgusting penguin meat is. None of these sentences by themselves would really pin down 'penguin', while the three together do a better job (not an ideal job, I admit, but the first gives you the social structure, the second gives you an indication of the size, and the third tells you it's something we don't normally eat). Plus you get a vivid series of images to help you remember.

January 13, 2013 at 10:43 AM

Liwei,

You make some very valid points. I wholeheartedly agree that reading a book is infinitely more interesting than working with a list of disjointed sentences. For the most part, lists bore and exhaust me.

But perhaps this method could still have an application when (1) studying towards a test and (2) using books that you have already read. Here's my use case: I'm studying for the HSK. Normally I don't like word lists, but when Hanban gives you an exhaustive wordlist and you have limited time before the test, it only makes sense to prioritize learning those words. If I continue at the pace I am reading now, I will eventually learn all those words from natural contexts in new books I read. However, many of those words I have already been exposed to in earlier books, but I did not manage to learn them at the time. I imagine a tool like I describe being useful for going back into books one has already read to allow focused re-reading. The benefit is that the chosen wordlist is what you will be focusing on.

I suppose that if you don't have a fixed wordlist, you are just better off reading and listening. I can fall back on searching an ebook for just those words I'm having trouble with on the HSK list, to see if I already know a context that they will naturally slip into, without the sophisticated automation I describe.

In response to your penguin example, I agree that you usually cannot learn the meaning of a word from a single isolated instance. However, the meaning of a word can sometimes be gleaned even from a single occurrence, given a sufficiently rich imaginative context. That context may come from the author's descriptive language, or from the reader's previous familiarity with the work (from seeing a film version, from hearsay, from reading a translation, etc.).

January 13, 2013 at 10:52 AM

Okay, given that most people working to word lists are going to be using either the HSK or the Taiwan version, maybe the effort of all that programming would be better spent compiling a list of best possible sentences for the words on those lists. And, that's something that everyone could contribute to, whereas a programming task as you describe would be down to one or two very skilled people.

January 13, 2013 at 11:17 AM

Or, a compromise (assuming your computer monitor is wide enough):

Side by side:

a) Text in Chinese: text is exported into Excel file, one sentence on each row. Each row is numbered. First instances of HSK-list words highlighted. (Should be easy to set up?)

b) Text in English

c) Audio file of text, open in e.g. Audacity.

Listen to the first couple of sentences. Then check understanding against Chinese text and English translation. Then listen to the next chunk. And so on.

When you see a highlighted word in the Chinese text, copy the English translation over to the adjacent column in the Excel table, then excerpt the audio for that sentence as mp3 and save it with a filename corresponding to the row number in Excel.

Slowly work through the book. For the tricky bits that don't have HSK material, you'll understand simply because you've got an English translation. For the bits with HSK material, make sure you understand the sentence it's in.

Eventually you'll have read and listened to and understood a book. You'll have an Excel table with a Chinese sentence, an English translation, and a number. You put those into an SRS programme, along with the mp3 files which will match the numbered sentences.

Plus those sentences actually mean something to you because they are from a novel that you've pored over for some time; they have context.
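The spreadsheet described above could be set up along these lines. This is a sketch under assumptions: it writes CSV (which Excel opens) rather than a real .xlsx file, it marks first instances in a separate column instead of highlighting them, and the column names and the zero-padded mp3 naming scheme are invented for the example.

```python
import csv
import io

FIELDS = ["row", "chinese", "first_seen_hsk", "english", "audio"]

def build_rows(sentences, hsk_words):
    """Number each sentence and flag HSK words on their first appearance only."""
    seen = set()
    rows = []
    for number, sentence in enumerate(sentences, start=1):
        new_words = [w for w in hsk_words if w in sentence and w not in seen]
        seen.update(new_words)
        rows.append({
            "row": number,
            "chinese": sentence,
            "first_seen_hsk": " ".join(new_words),
            "english": "",                 # copied over by hand while reading
            "audio": f"{number:04d}.mp3",  # hypothetical naming scheme
        })
    return rows

def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

if __name__ == "__main__":
    rows = build_rows(["我喜欢企鹅。", "企鹅住在冰上。"], ["企鹅", "冰"])
    print(to_csv(rows))
```

Rows with an empty first_seen_hsk column are the ones you can read past without excerpting any audio.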

Edit: I think this works best if you don't need the English translation. Otherwise it's too tedious to synchronise everything. So on reflection, sorry this won't help OP.

January 13, 2013 at 12:30 PM

There is a tool already available which works at least similarly to your proposed program:

http://subs2srs.sourceforge.net/#how_to_use

What this program lacks is your concept of an input list of the desired new vocab.

You can use this procedure to manually align audio with text. I once used it to align audio and text for some ChinesePod lessons:

http://forum.koohii.com/viewtopic.php?id=5880

January 13, 2013 at 02:37 PM

It'll take a decent amount of work to make this run smoothly. You'll have to deal with cases where the recorded audio and the source text vary (this happens more often than you think), and it's not exactly straightforward to align a Chinese source text with an English translation either. Sentence boundaries in the original Chinese usually do not correspond to those in English, so you would in effect have to have some kind of statistical machine translation program deal with this for you.

Why not just look for the target words in a source text, extract the relevant sentences, and run them through a decent TTS engine? eSpeak springs to mind as a free, open-source TTS engine with okay performance for Mandarin and Cantonese. Clearly, it's not as good as some of the other TTS engines out there, but it's a start, and you won't have to worry about any audio-to-text alignment issues whatsoever. All you need is some source texts (Project Gutenberg? my free Leiden Weibo Corpus?), Linux grep, and eSpeak. With a small bash script to glue these together, it shouldn't be hard at all to create a corpus of example sentences + recordings for a list of target words.
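A rough sketch of the glue, written in Python rather than bash: it does the grep-like search and builds one eSpeak command per matching sentence. Assumptions to check before relying on it: the source text has one sentence per line, the Mandarin voice is called `zh` on your install (`espeak --voices` lists what is available), and the output file naming is made up for this example. eSpeak is only invoked if you pass run=True and the `espeak` binary is on the PATH.

```python
import shutil
import subprocess

def find_sentences(lines, word):
    """Return the lines (assumed one sentence each) that contain the word, like grep."""
    return [line.strip() for line in lines if word in line]

def espeak_command(sentence, wav_path, voice="zh"):
    """Build an eSpeak command that writes the spoken sentence to a WAV file.

    -v selects the voice and -w the output file; the voice name is an assumption.
    """
    return ["espeak", "-v", voice, "-w", wav_path, sentence]

def synthesize(lines, words, run=False):
    """Collect (word, sentence, wav_path, command) jobs; call eSpeak only if run=True."""
    jobs = []
    for word in words:
        for i, sentence in enumerate(find_sentences(lines, word), start=1):
            wav_path = f"{word}_{i}.wav"  # hypothetical naming scheme
            command = espeak_command(sentence, wav_path)
            if run and shutil.which("espeak"):
                subprocess.run(command, check=True)
            jobs.append((word, sentence, wav_path, command))
    return jobs

if __name__ == "__main__":
    corpus = ["企鹅很吵。", "今天很冷。", "我看见一只企鹅。"]
    for job in synthesize(corpus, ["企鹅"]):
        print(job)
```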

January 13, 2013 at 11:52 PM

@HerrPetersen:

Woah! Great idea to use subs2srs. This entirely accomplishes what I'm looking for, although for film/show content rather than books. But, come to think of it, the film format may be preferable for a number of reasons.... Then there is the fact that people have already generated audio AND subs with correct alignment for many languages to the same video. This is BRILLIANT.

A masochistic part of me still wants to dig up my NLP and DSP books and start hacking a statistical alignment algorithm, but that can be a PhD project. subs2srs plus some minor custom scripting is the ticket.

@Daan & @realmayo:

Good suggestions. I'll keep these tricks in mind. I haven't had much good TTS experience, and manual text alignment stresses me out... but it's worth a try at least to see how technology / my patience has improved in the last few years.

January 14, 2013 at 01:26 PM

You don't have to use film format. You can also use .mp3 files with subtitle (karaoke) files. I don't have matching pairs, but .lrc (lyrics) files can be found, for instance, via http://mp3.sogou.com/ (click on 歌词). subs2srs can also be used for audiobooks, but you need to have or generate the corresponding .lrc files.

Anyhow, please let me know once you have your script running, as I am also interested in having a tool that generates Anki cards containing example sentences.
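For anyone who wants to work with .lrc timings directly, here is a minimal sketch of reading them into clip boundaries. It assumes the simple form of the format, one `[mm:ss.xx]` timestamp at the start of each line, and treats the start of the next line as the end of the current clip; the function name is made up.

```python
import re

# One timestamp per line is assumed: [mm:ss.xx]text
LRC_LINE = re.compile(r'\[(\d+):(\d+(?:\.\d+)?)\](.*)')

def parse_lrc(text):
    """Return a list of (start_seconds, end_seconds, text) clips.

    The end of a clip is the start of the next one; the last clip
    has end None because the file does not say where it stops.
    """
    entries = []
    for line in text.splitlines():
        match = LRC_LINE.match(line.strip())
        if match:  # metadata tags such as [ti:...] do not match and are skipped
            minutes, seconds, lyric = match.groups()
            entries.append((int(minutes) * 60 + float(seconds), lyric.strip()))
    entries.sort()
    clips = []
    for i, (start, lyric) in enumerate(entries):
        end = entries[i + 1][0] if i + 1 < len(entries) else None
        clips.append((start, end, lyric))
    return clips

if __name__ == "__main__":
    sample = "[ti:Lesson]\n[00:01.50]你好\n[00:04.00]再见\n"
    for clip in parse_lrc(sample):
        print(clip)
```

The resulting start and end times are what you would hand to an audio tool to cut one clip per sentence.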

For my personal usage I would like to have the following:

1.) Create a database of example sentences (via subs2srs, smartfm, etc.)

2.) Create a tool which you feed with vocab items and which outputs Anki cards with said vocab items + example sentences + audio from 1.).

I plan to write such a program. Unfortunately, it will have to wait until summer, since only then will I have the necessary free time (and my programming skills are somewhat rusty).
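Step 2 could look roughly like this, assuming the database from step 1 is just a list of records with a sentence and an audio filename (that layout, the function names, and the three-field card are all assumptions). The output is tab-separated text, which Anki can import, with the audio referenced through Anki's `[sound:...]` tag; the audio files themselves would still have to be copied into the collection's media folder.

```python
def make_cards(vocab, examples, max_per_word=3):
    """Pick up to max_per_word example sentences for each vocab item.

    Each card is a (word, sentence, sound_tag) tuple.
    """
    cards = []
    for word in vocab:
        matches = [e for e in examples if word in e["sentence"]][:max_per_word]
        for example in matches:
            cards.append((word, example["sentence"], f"[sound:{example['audio']}]"))
    return cards

def to_tsv(cards):
    """Join the cards into tab-separated lines, one card per line."""
    return "\n".join("\t".join(fields) for fields in cards)

if __name__ == "__main__":
    database = [
        {"sentence": "我喜欢企鹅。", "audio": "0001.mp3"},
        {"sentence": "今天很冷。", "audio": "0002.mp3"},
    ]
    print(to_tsv(make_cards(["企鹅", "冷"], database)))
```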

Edit: I just found .lrc files I created for Chinese-Pod lessons:

http://www.mediafire.com/?nnnniyeqjyu

I don't remember how accurately I timed them, though.

January 14, 2013 at 07:13 PM

For sentence alignment, I have had success with the open source Champollion Tool Kit. It does need one sentence per line in the files, which requires some extra processing because most e-books aren't in that format. English-Chinese is one of the two language pairs that the software can do by default.
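The extra processing for the Chinese side might look something like the sketch below. It is only an approximation of what a given e-book needs: it assumes the text is hard-wrapped (so broken lines are rejoined first), treats 。！？ as the only sentence-final marks, and keeps closing quotes or brackets attached to the sentence they end. The function name is made up.

```python
import re

# A sentence: text up to 。！？, plus any closing quotes or brackets that follow.
ZH_SENTENCE = re.compile(r'[^。！？]+(?:[。！？]+[”’」』）]*)?')

def one_sentence_per_line(raw_text):
    """Rejoin hard-wrapped lines, then emit one sentence per output line."""
    joined = "".join(line.strip() for line in raw_text.splitlines())
    sentences = [m.group().strip() for m in ZH_SENTENCE.finditer(joined)]
    return "\n".join(s for s in sentences if s)

if __name__ == "__main__":
    wrapped = "他说：“你好！”我没\n回答。然后他走了"
    print(one_sentence_per_line(wrapped))
```

Rejoining every line also merges paragraphs, which is fine for alignment input but loses the original layout.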

For splitting of audio when you only have the text and the raw audio available but no timings, a while ago I detailed a method using the free tools Transcriber (to create timings) and Audacity (to split by the labels created by a custom Transcriber plugin).

Edit: I just noticed that subs2srs can import the .trs project files created by Transcriber. I suppose that means you could skip the Audacity step and the manual creation of the Anki data.

Edited January 14, 2013 at 07:20 PM by c_redman

January 14, 2013 at 09:27 PM

I do sentence mining in Anki for words I'm having trouble with, mostly words with multiple meanings. I just use www.ichacha.net, read the example sentences, and paste some of them into Anki. It's not that hard. I also feel like actually having to read all the example sentences in and of itself gives me a much better understanding of the word and the context it's usually used in.

I studied computer programming in college and understand the obsession with doing things efficiently and with automation, but I don't feel like the methods you guys talk about are really saving that much time. Sentence mining is basically a substitute for extensive reading, except it's a bad substitute. And putting multiple sentences into Anki for each word is inefficient. Think about it: You can have one card for a word's meaning, or multiple example sentences. As long as you remember the one card, it's more efficient. Like I said, I only use sentence mining for words I am really having trouble remembering.

Also, why do you need audio?

January 15, 2013 at 02:12 AM

@WestTexas:

I want audio because I like to set up my cards with only the audio on the front: it helps build 聽力, and it's easier to do while walking outside. I use the text to make searching easier, and to put on the back as confirmation that I heard it right.

Edit: I should add that I agree that extensive reading is preferable to SRS for many reasons... but when cramming for a test and/or working on problem areas, the SRS is a great way to focus on a certain subset of the language and really pound it into memory.


Posters, in thread order: navaburo, li3wei1, navaburo, li3wei1, realmayo (guest), HerrPetersen, Daan, navaburo, HerrPetersen, c_redman, WestTexas, navaburo
