I made a Windows app for processing language mp3 files

February 14, 2021 at 03:22 PM

Not sure if this is allowed, and whether this would be better in the resources forum?

Basically I had many mp3 files in the "exam format", where each dialog is spoken at natural speed followed by a long silence, and I found such files were not so useful for general listening practice (but almost every Chinese language book I've bought includes such files). So I made this app to batch process these files into a more digestible format.

It's a Windows app. It takes a set of mp3 files as input and outputs a set of mp3 files. The operations it performs are user-configurable (via a wizard) and include things like stripping out the silences, repeating dialogs and slowing down dialogs.

Let me know if you find it useful. If so, I may port it to Android and/or make it a real-time (instead of file-based) tool.

Download page (GitHub)

February 14, 2021 at 04:09 PM

I've been thinking of doing something very similar (although I don't know what your output sounds like) using SoX tools on the command line.

There's so much useful content out there but I guess it's all Hanban copyright... if only they had decided to make it CC instead.

I'm a Mac user but will check out your GitHub page later when I have time. Cheers!

February 14, 2021 at 04:20 PM

If I have a whole bunch of MP3s of single sentences at 100% speed, can it batch process them to 80% speed?

February 14, 2021 at 04:23 PM

Yes...

https://stackoverflow.com/questions/33957747/how-do-i-reduce-the-speed-of-a-voice-mp3-file-with-sox-to-75
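
For anyone who wants to script the SoX route, here is a minimal sketch in Python. It assumes SoX is installed, on the PATH and built with MP3 support; the function names `sox_tempo_command` and `slow_down_folder` are made up for illustration. It relies on SoX's `tempo` effect, which changes the speed while keeping the pitch, with the `-s` option that tunes it for speech.

```python
import subprocess
from pathlib import Path
from typing import List


def sox_tempo_command(src: Path, dst: Path, factor: float = 0.8) -> List[str]:
    """Build a SoX command that changes tempo (not pitch) by `factor`."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    # "tempo -s" is SoX's pitch-preserving speed change, tuned for speech.
    return ["sox", str(src), str(dst), "tempo", "-s", str(factor)]


def slow_down_folder(in_dir: Path, out_dir: Path, factor: float = 0.8) -> None:
    """Run the command above on every .mp3 in in_dir, writing to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(in_dir.glob("*.mp3")):
        cmd = sox_tempo_command(src, out_dir / src.name, factor)
        subprocess.run(cmd, check=True)  # raises if SoX reports an error
```

Calling `slow_down_folder(Path("sentences"), Path("sentences_80"), 0.8)` would then write an 80% tempo copy of each file into a separate folder (both folder names are placeholders).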

February 14, 2021 at 05:03 PM

36 minutes ago, Flickserve said:

If I have a whole bunch of MP3s of single sentences at 100% speed, can it batch process them to 80% speed?

Yes absolutely; the app is designed for just this kind of thing. And not just change the whole thing to 80% speed, but you could, say, play sentence 1 at 80%, then sentence 1 at 100%, then sentence 2 at 80%, then sentence 2 at 100%, and so on.
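
The interleaved order described here can be sketched in a few lines of Python. This is only an illustration of the ordering, not the app's actual code; `build_schedule` and the file names are hypothetical.

```python
from typing import List, Sequence, Tuple


def build_schedule(
    sentences: Sequence[str], tempos: Sequence[float] = (0.8, 1.0)
) -> List[Tuple[str, float]]:
    """Pair every sentence with every tempo, keeping sentences in order.

    Each sentence is played once per tempo before moving to the next one.
    """
    return [(sentence, tempo) for sentence in sentences for tempo in tempos]


for clip, tempo in build_schedule(["sentence1.mp3", "sentence2.mp3"]):
    print(f"{clip} at {tempo:.0%}")
```

The loop lists sentence 1 at 80%, sentence 1 at 100%, sentence 2 at 80%, then sentence 2 at 100%.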

Oh, and regarding sound quality: it changes the tempo, not just the speed, i.e. the voice pitch will not be changed.

BUT it is an early version of the app, and that's why I put it here and not in Resources because I'm expecting feedback. I've processed hundreds of files with it already but it's possible that there are bugs. Also NB the app will not overwrite files -- it insists that you output to a different folder than the original files, to protect against any possibility of losing data.
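
A guard like the one described (refusing to write into the input folder) might look roughly like this in Python. It is a sketch under the assumption that comparing resolved paths is enough; `check_output_dir` is a hypothetical name and not taken from the app.

```python
from pathlib import Path


def check_output_dir(in_dir: str, out_dir: str) -> Path:
    """Return the resolved output folder, refusing the input folder itself."""
    src = Path(in_dir).resolve()
    dst = Path(out_dir).resolve()
    if src == dst:
        # Same folder: writing here could overwrite the original files.
        raise ValueError("output folder must differ from the input folder")
    return dst
```

Resolving both paths first means that `audio` and `./audio` are treated as the same folder.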

February 14, 2021 at 05:18 PM

Sounds useful!

My most recent use case is recording a session with a tutor, then separating the two people speaking into two different audio files and stripping silences from those. My own voice for later to compare if I can discern any improvement and the tutor's for listening practice.

I wonder if anyone knows anything that would do this for me?

It probably would require some kind of AI tool to separate the two.
Currently I'm doing this by hand in Audacity.

February 14, 2021 at 05:23 PM

Well my app can auto strip the silences.
And it can cut the audio sections into separate files. For example, imagine you have a file with 10 sentences padded with silence in between. It can strip the silences and save the result as one file, or it can strip the silences then save each sentence to a separate mp3 file.

But no, currently there is no processing of different voices, so it cannot create, say, separate speakerA and speakerB files.

February 14, 2021 at 05:40 PM

I did some googling and it seems that this is called the "cocktail party problem", and it's so tricky that up until two or three years ago it wasn't possible to separate multiple voices talking on top of each other. Google (probably among others) has, however, been researching it with pretty impressive results for use in video-call noise cancelling and transcription, but I couldn't yet find any service that would do what I need. I think it should be available for anyone to use in a few years.

February 15, 2021 at 03:32 AM

Very nice. Looks professional too. I'm not sure I have a use for this right now, but it's definitely a nice tool to have available. Thanks for sharing!

February 15, 2021 at 05:00 AM

11 hours ago, alantin said:

but I couldn't yet find any service that would do what I need. I think it should be available for anyone to use in a few years.

Many Chinese programs have background music. If you are listening to sentences and trying to mimic them, reducing the background music would be great.

February 15, 2021 at 05:03 AM

11 hours ago, alantin said:

I wonder if anyone knows anything that would do this for me?

I use an app called Evaer. It can record just the teacher's voice on a single channel from a Skype call. But when you speak, there will be no sound on the recording.

I am not bothered about recording my own voice during the lesson. During the lesson, I don't concentrate on pronunciation very much and prefer to fine-tune it later.

February 15, 2021 at 05:43 AM

2 hours ago, markhavemann said:

Thanks for sharing!

Thanks for checking it out!

February 15, 2021 at 07:22 AM

2 hours ago, Flickserve said:

Many Chinese programs have background music. If you are listening to sentences and trying to mimic them, reducing the background music would be great.

I know a tool for removing background music!

https://vocalremover.org/

2 hours ago, Flickserve said:

I use an app called Evaer. It can record just the teacher's voice on a single channel from a Skype call. But when you speak, there will be no sound on the recording.

I am not bothered about recording my own voice during the lesson. During the lesson, I don't concentrate on pronunciation very much and prefer to fine-tune it later.

Perfect! Somehow I failed to realize that you could tap into the Skype call data directly!

I agree. Listening to your own voice too, albeit grueling for me, does help with improving pronunciation.

I tried recording a whole lesson for the first time last week using the Skype recording function and I've listened to the recording multiple times now.

This is a new technique for me and I find it very helpful. I can understand enough to keep the conversation going, but I miss a lot of small things. Listening to it later allows me to pick up the missing pieces: the tutor using grammar points I didn't notice before (like 才 at a specific point in one sentence), the kinds of filler words and ways of correcting a sentence when the train of thought changes, and a lot of vocabulary that I missed the first time too. This gets a lot more efficient when you can separate out only the tutor's voice and then strip the silences.

February 17, 2021 at 08:03 PM

On 2/14/2021 at 4:22 PM, Mijin said:

Let me know if you find it useful. If so, I may port it to Android and/or make it a real-time (instead of file-based) tool.

I tried it with an audio file from TheChairMansBao and it did not work at all. I wanted it to split the audio sentence by sentence according to silence, and I always ended up with one file, which was identical to the original. Splitting it at fixed intervals is not useful, as sentence length obviously varies.

Incidentally, I found this recommendation from Steve Kaufmann (Lingq). The program he uses is WavePad. It does the job you are trying to do perfectly: https://www.nch.com.au/splitter/index.html

February 17, 2021 at 08:06 PM

Oh dear. Thanks for trying my app anyway. Can I get a copy of that audio file so I can find what went wrong?

thanks

I would guess that maybe the silences aren't entirely silent. Of course there is a tolerance for a certain amount of noise, but it's just an arbitrary cutoff.

I guess WavePad must do some actual signal processing to tell the difference between content and hiss.

Thanks for pointing out WavePad too. I'll take a look at it and figure out if what I was trying to do is completely redundant.

February 17, 2021 at 08:21 PM

11 minutes ago, Mijin said:

Oh dear. Thanks for trying my app anyway. Can I get a copy of that audio file so I can find what went wrong?

thanks

I would guess that maybe the silences aren't entirely silent. Of course there is a tolerance for a certain amount of noise, but it's just an arbitrary cutoff.

I guess WavePad must do some actual signal processing to tell the difference between content and hiss.

Thanks for pointing out WavePad too. I'll take a look at it and figure out if what I was trying to do is completely redundant.

You can just check it with some of the free lessons: https://www.thechairmansbao.com/

Yeah, I was wondering if you were trying to reinvent the wheel. From a personal programming challenge point of view, I can totally see the value of what you are doing. But if you want to have a competitive product, make it better than WavePad, and free.

February 17, 2021 at 08:53 PM

Ok, I just tested it out with the first lesson "Shanghai couples tie the knot".

This lesson is just a single dialogue though. Any pauses between sentences are typically less than 2 seconds; that's shorter than the threshold of my app for cutting sentences. The silence detection assumes that there is deliberate silence of over 2 seconds spacing out separate dialogs or questions. It's not designed for finding short pauses within one contiguous dialog.
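
To make the two knobs concrete (the noise tolerance and the minimum silence length), here is a minimal Python sketch of threshold-based splitting. It is not the app's actual code: it assumes the audio is already decoded to a list of samples between -1 and 1, and `split_on_silence`, `threshold` and `min_silence_s` are hypothetical names with arbitrary default values.

```python
from typing import List, Sequence, Tuple


def split_on_silence(
    samples: Sequence[float],
    sample_rate: int,
    threshold: float = 0.02,
    min_silence_s: float = 2.0,
) -> List[Tuple[int, int]]:
    """Return (start, end) sample ranges of speech, end exclusive.

    A sample counts as quiet when abs(sample) <= threshold. Only quiet
    runs of at least min_silence_s seconds separate two segments;
    shorter pauses stay inside the surrounding segment.
    """
    min_gap = max(1, int(min_silence_s * sample_rate))
    segments: List[Tuple[int, int]] = []
    start = None      # first loud sample of the current segment
    last_loud = None  # most recent loud sample seen so far
    for i, value in enumerate(samples):
        if abs(value) <= threshold:
            continue
        if start is None:
            start = i
        elif i - last_loud - 1 >= min_gap:
            # The quiet run just ended was long enough: close the segment.
            segments.append((start, last_loud + 1))
            start = i
        last_loud = i
    if start is not None:
        segments.append((start, last_loud + 1))
    return segments
```

With a 2 second minimum, a recording whose pauses are all about 1 second comes back as a single segment, which matches the "one file identical to the original" result reported above. Lowering `min_silence_s` produces more, shorter pieces. Background hiss louder than `threshold` has the same single-segment effect, because no sample is ever counted as quiet.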

Have you tried this with WavePad too? Because it does seem to me that if we're cutting sentences with shorter pauses it starts to get arbitrary what a sentence is. For example, people often pause for a second or more before or after a conjunction ("because", "otherwise", "in reality"), so those might get parsed into two or more sentences.

But I guess breaking on a conjunction may actually be desirable for language-learning anyway.

I'll make a version where the silence duration for detection is adjustable.

Regarding "competitive product", this was just something I made for my own use. I had no intention of even sharing it; it was only when a friend told me she never bothers to download the free audio files that come with her textbooks that I thought my tool might be useful for others.

February 18, 2021 at 05:39 AM

8 hours ago, Mijin said:

Have you tried this with WavePad too? Because it does seem to me that if we're cutting sentences with shorter pauses it starts to get arbitrary what a sentence is. For example, people often pause for a second or more before or after a conjunction ("because", "otherwise", "in reality"), so those might get parsed into two or more sentences.

But I guess breaking on a conjunction may actually be desirable for language-learning anyway.

Yes, I tried it with WavePad, but so far only with one file. It does seem to make meaningful splits: not strictly sentence by sentence but, as you say, possibly also at sub-clauses. This is fine for me.

February 18, 2021 at 02:19 PM

Hi @Mijin

I just tried out your application.

I first recorded a session using Evaer as per @Flickserve's suggestion and then ran the audio through your application, leaving only one-second gaps between sections. It is beautiful! The time I needed after the lesson to end up with an audio file containing only the tutor's speech was reduced from over an hour to just a few minutes! Worked like a charm!

February 18, 2021 at 03:04 PM

Awesome! Glad to have helped!

Let me know if you have any suggestions or feedback.

Mijin

mungouk

Flickserve

mungouk

Mijin

alantin

Mijin

alantin

markhavemann

Flickserve

Flickserve

Mijin

alantin

Jan Finster

Mijin

Jan Finster

Mijin

Jan Finster

alantin

Mijin