大块头 Posted July 8, 2020 at 04:33 PM
Has anybody here ever used the Common Voice dataset for their language studies? They released an update last week, and the Mandarin Chinese parts of the dataset now have a total of about 140 hours of recorded sentences for China and Taiwan.
From Wikipedia: "Common Voice is a crowdsourcing project started by Mozilla to create a free database for speech recognition software. The project is supported by volunteers who record sample sentences with a microphone and review recordings of other users. The transcribed sentences will be collected in a voice database available under the public domain license CC0. This license ensures that developers can use the database for voice-to-text applications without restrictions or costs."
I immediately thought of the MorphMan add-on for Anki when I read about this update (paging @NinKenDo), though not having English translations for these sentences is a limitation.
About 15% of the sentences are tagged with the speaker's birthplace. Perhaps this dataset could be used to find good examples of regional accents?
Some example sentences from the corpus are below.
宋朝末年年间定居粉岭围。
渐渐行动不便
二十一年去世。
他们自称恰哈拉。
局部干涩的例子包括有口干、眼睛干燥、及阴道干燥。
嘉靖三十八年,登进士第三甲第二名。
这一名称一直沿用至今。
为了惩罚西扎城和塞尔柱的结盟,盟军在抵达后将外城烧毁。
河内盛产黄色无鱼鳞的鳍射鱼。
他主要演出泰米尔语电影。
福崎町是位于日本兵库县中部的行政区划。
下行月台设有厕所。
耶尔河畔圣伊莱尔人口变化图示
光绪八年再中举人。
赫拉克勒斯是希腊神话中的半神英雄。
蔡声白。
该区舰队主要负责为公海舰队的战列分舰队提供屏护。
雷诺在回归的第一年比赛中以第四名的成绩完成了比赛。
这样都可以啊
此原理也广泛应用于家庭之中用于生产软水。
本片的导演是赵秀贤和梁铉锡。
奥特拉德诺耶农村居民点是俄罗斯联邦沃罗涅日州新乌斯曼区所属的一个农村居民点。
吉内斯塔。
thelearninglearner Posted July 9, 2020 at 12:23 PM
This is actually pretty cool. I will try to use some of this in a MorphMan deck. I just need to figure out how to translate the sentences properly and make sure the ones I add are spoken accurately. Do they have a separate section for the ones that are spoken well? I'll have to take a look.
大块头 Posted July 9, 2020 at 01:23 PM
thelearninglearner said: "do they have a separate section of the ones spoken well?"
After a sentence is recorded, it goes through a validation process in which other volunteers listen to it and confirm that the speaker read the sentence correctly. The number of up and down votes each recording received is part of the dataset.
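A rough sketch of how those vote counts could be used to keep only the better-reviewed recordings. It assumes the release ships a tab-separated file with `path`, `sentence`, `up_votes` and `down_votes` columns; the `well_validated` helper, its thresholds, and the sample rows are made up for illustration.

```python
import csv
import io


def well_validated(rows, min_up=2, max_down=0):
    """Yield rows with at least min_up up votes and at most max_down down votes."""
    for row in rows:
        up = int(row["up_votes"])
        down = int(row["down_votes"])
        if up >= min_up and down <= max_down:
            yield row


# Tiny in-memory stand-in for a dataset TSV file (file names and votes are invented).
SAMPLE_TSV = (
    "path\tsentence\tup_votes\tdown_votes\n"
    "clip_a.mp3\t这一名称一直沿用至今。\t2\t0\n"
    "clip_b.mp3\t下行月台设有厕所。\t2\t1\n"
    "clip_c.mp3\t这样都可以啊\t3\t0\n"
)

reader = csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t")
kept = [row["path"] for row in well_validated(reader)]
print(kept)  # ['clip_a.mp3', 'clip_c.mp3']
```

With a real file, the `io.StringIO(...)` stand-in would be replaced by `open(..., encoding="utf-8", newline="")`, and the thresholds tuned to taste for flashcard use.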
mungouk Posted July 9, 2020 at 02:15 PM
Interesting dataset, especially since it's public domain. Any idea how the "accent" field is encoded? Where present, it's just large numbers like 370000.
大块头 Posted July 9, 2020 at 02:21 PM
370000 is 山东 (Shandong), I think? Or maybe it's the postal code?
mungouk Posted July 9, 2020 at 02:27 PM
Oh, they're postal codes?
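One way to test the guess in the posts above is to try reading the numbers as six-digit administrative division codes rather than postal codes. The sketch below assumes the GB/T 2260 scheme, in which the first two digits identify the province-level division (37 for 山东); nothing in the thread confirms that the dataset actually uses this scheme, and the `province_for` helper and its four-entry table are illustrative only.

```python
# Assumption: accent values are GB/T 2260-style division codes.
# Only a few province-level prefixes are listed here; the table is not complete.
PROVINCE_PREFIXES = {
    "11": "北京",
    "31": "上海",
    "37": "山东",
    "44": "广东",
}


def province_for(accent_code):
    """Return the province name for a six-digit code, or None if it is unknown."""
    code = str(accent_code).strip()
    if len(code) != 6 or not code.isdigit():
        return None  # empty or malformed accent field
    return PROVINCE_PREFIXES.get(code[:2])


print(province_for("370000"))  # 山东
print(province_for(""))        # None
```

If most non-empty accent values in the file map cleanly to provinces this way, that would support the division-code reading over the postal-code one.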
NinKenDo Posted August 8, 2020 at 05:11 AM
Sounds awesome. Thanks for tagging me in. I'm getting confused navigating the site, though. How does one access the dataset?
EDIT: Never mind, I'm an idiot, but I'll blame it on listening to 王菲 too loud and disorientating myself. One thing I noticed about the Taiwan set (I haven't checked China yet) is that the speakers are overwhelmingly male, which I view as a good thing for myself. So many learning materials are female-spoken.