Common Voice open-source transcribed audio dataset

July 8, 2020 at 04:33 PM

Has anybody here ever used the Common Voice dataset for their language studies? They released an update last week and the Mandarin Chinese parts of the dataset now have a total of about 140 hours of recorded sentences for China and Taiwan.

From Wikipedia:

Quote

Common Voice is a crowdsourcing project started by Mozilla to create a free database for speech recognition software. The project is supported by volunteers who record sample sentences with a microphone and review recordings of other users. The transcribed sentences will be collected in a voice database available under the public domain license CC0. This license ensures that developers can use the database for voice-to-text applications without restrictions or costs.

I immediately thought of the MorphMan add-on for Anki when I read about this update (paging @NinKenDo), though not having English translations for these sentences is a limitation.

About 15% of the sentences are tagged with the speaker's birthplace. Perhaps this dataset could be used to find good examples of regional accents?

Some example sentences from the corpus are below.

宋朝末年年间定居粉岭围。
渐渐行动不便
二十一年去世。
他们自称恰哈拉。
局部干涩的例子包括有口干、眼睛干燥、及阴道干燥。
嘉靖三十八年，登进士第三甲第二名。
这一名称一直沿用至今。
为了惩罚西扎城和塞尔柱的结盟，盟军在抵达后将外城烧毁。
河内盛产黄色无鱼鳞的鳍射鱼。
他主要演出泰米尔语电影。
福崎町是位于日本兵库县中部的行政区划。
下行月台设有厕所。
耶尔河畔圣伊莱尔人口变化图示
光绪八年再中举人。
赫拉克勒斯是希腊神话中的半神英雄。
蔡声白。
该区舰队主要负责为公海舰队的战列分舰队提供屏护。
雷诺在回归的第一年比赛中以第四名的成绩完成了比赛。
这样都可以啊
此原理也广泛应用于家庭之中用于生产软水。
本片的导演是赵秀贤和梁铉锡。
奥特拉德诺耶农村居民点是俄罗斯联邦沃罗涅日州新乌斯曼区所属的一个农村居民点。
吉内斯塔。

July 9, 2020 at 12:23 PM

This is actually pretty cool. I will try to use some of this in a MorphMan deck. I just need to figure out how to translate the sentences properly and make sure the ones I add are spoken accurately. Do they have a separate section for the ones spoken well? I'll have to take a look.

July 9, 2020 at 01:23 PM

53 minutes ago, thelearninglearner said:

do they have a separate section of the ones spoken well?

After a sentence is recorded it then goes through a validation process where other volunteers listen to it and confirm that the speaker read the sentence correctly. The number of up or down votes a sentence recording got is a part of the dataset.
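For anyone who wants to keep only the well-reviewed recordings, here is a minimal sketch of filtering on those vote counts. It assumes the tab-separated metadata files in the download (for example `validated.tsv`) have `up_votes` and `down_votes` columns. The thresholds and the sample rows are made up for illustration, and the function name `well_reviewed` is my own.

```python
import csv
import io


def well_reviewed(tsv_file, min_up=2, max_down=0):
    """Yield rows with at least min_up up-votes and at most max_down down-votes."""
    # QUOTE_NONE so that quotation marks inside sentences are read literally.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if int(row["up_votes"]) >= min_up and int(row["down_votes"]) <= max_down:
            yield row


# Made-up rows in the assumed column layout (not real vote counts).
SAMPLE = (
    "path\tsentence\tup_votes\tdown_votes\n"
    "a.mp3\t这一名称一直沿用至今。\t2\t0\n"
    "b.mp3\t下行月台设有厕所。\t2\t1\n"
    "c.mp3\t渐渐行动不便\t1\t0\n"
)

kept = [row["path"] for row in well_reviewed(io.StringIO(SAMPLE))]
print(kept)
```

With a real file you would pass `open("validated.tsv", encoding="utf-8", newline="")` instead of the `StringIO` sample, and you might want stricter thresholds for flashcards.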

July 9, 2020 at 02:15 PM

Interesting dataset, especially since it's public domain.

Any idea how the "accent" field is encoded? Where present it's just large numbers like 370000.

July 9, 2020 at 02:21 PM

370000 is 山东 I think?

Or maybe the postal code?

July 9, 2020 at 02:27 PM

Oh, they're postal codes?
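For what it's worth, 370000 is the six-digit administrative division code (GB/T 2260) for 山东, while postal codes in Shandong begin with 25 to 27, so the values look more like province codes than postal codes. That is a guess about how the dataset encodes the field, not something confirmed here. Under that assumption, a rough sketch for turning the codes into readable labels might look like the following. The lookup table is deliberately partial, and the sample list is made up.

```python
from collections import Counter

# Partial lookup of province-level division codes (assumed encoding of "accent").
PROVINCE_CODES = {
    "110000": "北京",
    "310000": "上海",
    "370000": "山东",
    "440000": "广东",
}


def accent_label(code):
    """Map a raw accent value to a province name, keeping unknowns visible."""
    code = code.strip()
    if not code:
        return "(untagged)"
    return PROVINCE_CODES.get(code, f"unknown code {code}")


# Made-up values standing in for the "accent" column of a few rows.
sample_accents = ["370000", "", "440000", "370000", "999999"]
counts = Counter(accent_label(c) for c in sample_accents)
print(counts.most_common())
```

Tallying like this would also show how many recordings per region are actually available before you go looking for examples of a particular accent.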

August 8, 2020 at 05:11 AM

Sounds awesome. Thanks for tagging me in. I'm getting confused navigating the site though, how does one access the dataset?

EDIT: Never mind, I'm an idiot, but I'll blame it on listening to 王菲 too loud and disorienting myself.

One thing I noticed about the Taiwan set (haven't checked China yet) is that the dataset is overwhelmingly male, which I view as a good thing for myself. So many learning materials are female-spoken.
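A quick way to check that kind of skew is to tally the speaker gender recorded for each clip. This is a sketch that assumes the metadata TSV has a `gender` column that is left empty when the speaker did not provide it. The sample rows are made up and the function name `gender_share` is my own.

```python
import csv
import io
from collections import Counter


def gender_share(tsv_file):
    """Return the fraction of rows per gender value, counting blanks separately."""
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter(row["gender"] or "(not given)" for row in reader)
    total = sum(counts.values())
    return {gender: n / total for gender, n in counts.items()}


# Made-up rows in the assumed column layout (not real statistics).
SAMPLE = (
    "path\tgender\n"
    "a.mp3\tmale\n"
    "b.mp3\tmale\n"
    "c.mp3\t\n"
    "d.mp3\tfemale\n"
)

shares = gender_share(io.StringIO(SAMPLE))
print(shares)
```

Note that this counts clips, not speakers, so a handful of prolific contributors can dominate the result.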

大块头

thelearninglearner

大块头

mungouk

大块头

mungouk

NinKenDo
