mungouk Posted March 31, 2021 at 09:46 AM
Looks like we have an official implementation date: 1 July.
《国际中文教育中文水平等级标准》发布 (Chinese Proficiency Grading Standards for International Chinese Language Education released), 2021-03-31, source: 教育部 (Ministry of Education)
http://www.moe.gov.cn/jyb_xwfb/gzdt_gzdt/s5987/202103/t20210329_523304.html
The web page has a link to a PDF of the 260-page draft standard. Knock yourselves out.
Guest realmayo Posted March 31, 2021 at 10:19 AM
Looks like quite a comprehensive document, including character lists, word lists and grammar lists for the nine levels. For levels 1-6 these are split up, one list per level, but for the final three levels (the "new" ones) they're shown grouped together as "7-9". That might suggest the 高级 (advanced) exam will be similar to how the very first HSK was organised: all 高级 students take the same exam and get graded 7, 8 or 9 (or fail) depending on how well they do.
Edit: or, they've just not got around to splitting up 7-9 yet.
... or there will just be three exam papers (初级、中级、高级: elementary, intermediate, advanced)???
mungouk Posted March 31, 2021 at 10:35 AM
Also "syllable" lists (音节表), p9-14. I'm curious to know how they're going to test these.
Guest realmayo Posted March 31, 2021 at 10:45 AM
8 minutes ago, mungouk said: "Are those character lists or "syllable" lists (音节表)? (with the first few levels in pinyin)"
Looks like, for each level, syllable lists followed by character lists followed by word lists (音节表、汉字表、词汇表). I would have assumed that for foreign learners, knowing a character would include knowing its pronunciation, but they appear to have broken it down differently.
Guest realmayo Posted March 31, 2021 at 11:09 AM
Also it looks like from level 4 onwards the exams will test translation/interpretation, alongside listening, speaking, reading and writing. For example, it seems Level 6 requires oral interpretation of informal language, 非正式场合的口译任务, and Level 9 seems to have a simultaneous translation requirement, 同声传译任务. Interesting!
roddy Posted March 31, 2021 at 11:41 AM
29 minutes ago, realmayo said: "Level 9 seems to have a simultaneous translation requirement"
What on earth? Into and from what language? He wrote, bemusedly, while the massive PDF downloaded.
mikelove Posted March 31, 2021 at 11:43 AM
Ha, they use //s in the readings for splittable verbs just like we do. Now maybe people will stop emailing us to ask what's up with those.
Also it seems to be a scanned PDF with a watermark, so extracting these is going to be a major pain.
roddy Posted March 31, 2021 at 11:50 AM
Oh, interesting that they specify characters per minute for listening speeds. Up to a max of 800 [edit: ignore, I misread it]. And also for reading.
This simultaneous translation thing, though.... 能够完成正式场合专业内容的同声传译任务 (able to complete simultaneous interpreting tasks on specialist content in formal settings) - that's a postgraduate degree in itself. And the logistics. Good God, the logistics. Are they doing both directions? Which languages?
21 hours ago, mikelove said: "scanned PDF with a watermark"
仅供查阅 (for reference only) indeed. 仅供 very slow scrolling while trying to scan the text for the terms I'm interested in, more like.
mungouk Posted March 31, 2021 at 12:07 PM
In Adobe Acrobat*, if you make sure Scanned Documents > Settings is set to CHINESE (SIMPLIFIED), then open the PDF and hit "Edit PDF", you can convert it to proper searchable and copyable text, but you might have to do it for each page separately.
Sounds like a job for @大块头 ??
* I'm using Acrobat Pro DC 2021; older versions may be different.
mungouk Posted March 31, 2021 at 12:13 PM
7 minutes ago, mungouk said: "you might have to do it for each page separately."
It looks like just scrolling through the document does the conversion a page at a time, but on my 3-year-old MacBook Pro this is not very quick.
mungouk Posted March 31, 2021 at 12:35 PM
50 minutes ago, mikelove said: "extracting these is going to be a major pain."
I'm currently converting the PDF to Word using Acrobat Pro, but if anyone has a faster computer and wants a race, please feel free.
Presumably though they will release an Excel version in good time, like they did for the last set of vocab...?
大块头 Posted March 31, 2021 at 12:37 PM
29 minutes ago, mungouk said: "Sounds like a job for @大块头 ??"
More like a job for the freelancer I'd hire!
mungouk Posted March 31, 2021 at 12:59 PM
24 minutes ago, mungouk said: "converting the PDF to Word using Acrobat Pro"
Well, it looks like Word choked on it. At least it messed up many pages. Good luck everyone!
大块头 Posted March 31, 2021 at 01:09 PM
They have that table again on p. 7 of this PDF. None of the totals have changed since they shared the proposed vocabulary list last year, which I'll take as a sign that I can do something during my weekends this spring besides writing flashcards.
mikelove Posted March 31, 2021 at 01:41 PM
This looks pretty clear / regular. I won't have time to try this until later, but if anybody has ImageMagick I'd suggest running it through that to convert the pages into images, chunking them up into smaller images for each column, and removing anything light enough to be a watermark, then putting that all back into a PDF and running that through OCR.
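The "chunk them up into smaller images for each column" step can be sketched as a small helper that computes one crop rectangle per column, written in ImageMagick's `WxH+X+Y` geometry notation. This is only a sketch under stated assumptions: the function name `column_crops`, the equal-width columns, and the page size and margins in the usage lines are all hypothetical and not taken from the PDF, whose real column boundaries would need measuring.

```python
def column_crops(page_width, page_height, columns, top=0, bottom=0):
    """Return one ImageMagick-style crop geometry (WxH+X+Y) per column.

    Assumes equal-width columns spanning the full page width, which is a
    guess about the layout rather than something read from the PDF.
    """
    if columns < 1:
        raise ValueError("columns must be at least 1")
    usable_height = page_height - top - bottom
    if usable_height <= 0:
        raise ValueError("top and bottom margins leave no usable height")

    geometries = []
    for i in range(columns):
        # Integer edges so that adjacent columns neither overlap nor leave gaps.
        left = page_width * i // columns
        right = page_width * (i + 1) // columns
        geometries.append(f"{right - left}x{usable_height}+{left}+{top}")
    return geometries


if __name__ == "__main__":
    # Hypothetical page: A4 rendered at 300 dpi, four columns, header and
    # footer trimmed by made-up margins.
    for geometry in column_crops(2480, 3508, 4, top=300, bottom=200):
        print(geometry)
```

Each string could then be passed to a crop step for the corresponding page image; whether four columns and these margins fit any given page is something to check by eye first.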
大块头 Posted March 31, 2021 at 02:12 PM
@mikelove I have Tesseract chewing on the PDF right now. Maybe all that pre-processing won't be necessary?
edit: Here's the raw text output. Doing something like what Mike suggested may be the best way of extracting these wordlists.
raw_ocr_output.txt
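For anyone wanting to try Tesseract on page or column images, here is a minimal sketch of driving its command-line tool from Python. It is not necessarily how the raw output above was produced. The helper names are hypothetical, the `chi_sim` (Simplified Chinese) language data is assumed to be installed, and page segmentation mode 6 ("assume a single uniform block of text") is a guess that may need tuning for these lists.

```python
from pathlib import Path
import shutil
import subprocess


def tesseract_command(image, lang="chi_sim", psm=6):
    """Build the argument list for one Tesseract run on one image.

    Tesseract appends ".txt" to the output base itself, so the base is
    just the image path without its extension.
    """
    image = Path(image)
    out_base = image.with_suffix("")
    return ["tesseract", str(image), str(out_base), "-l", lang, "--psm", str(psm)]


def ocr_directory(directory, pattern="*.png"):
    """Run Tesseract over every matching image in a directory, in name order."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    for image in sorted(Path(directory).glob(pattern)):
        subprocess.run(tesseract_command(image), check=True)
```

Calling `ocr_directory("columns")` on a (hypothetical) folder of cropped column images would leave one `.txt` file next to each image.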
Guest realmayo Posted March 31, 2021 at 02:40 PM
2 hours ago, mikelove said: "now maybe people will stop emailing us to ask what's up with those"
Or maybe new people will start asking you now.
2 hours ago, roddy said: "This simultaneous translation thing, though.... 能够完成正式场合专业内容的同声传译任务 - that's a postgraduate degree in itself. And the logistics. Good God, the logistics. Are they doing both directions? Which languages?"
Really interesting, isn't it! If they're trying to test examinees' skills in simultaneous translation, that sounds a little unfair: surely lots of people are likely to be proficient in Chinese but poor at simultaneous translation. But then, in the same way, listening or reading comprehensions are exam skills rather than language skills. So maybe it's reasonable? Especially with material limited to just HSK9-level vocabulary and grammar patterns. But it would have to be teachable first, too. And yes, the logistics: regardless of which direction(s), it's hard to see how a teacher in a Chinese university with a class whose students speak a mix of French, English, Vietnamese, Korean and Russian could deal with this.
roddy Posted March 31, 2021 at 03:21 PM
Actually, I think the standard foreigner-learn-Chinese degree curriculum includes interpretation skills, so if they're working off that... still odd though. Reading and listening comprehension tasks at least try to mimic real-world skills you could reasonably expect to use.
mikelove Posted March 31, 2021 at 03:42 PM
OK, ImageMagick does a beautiful job removing the watermark:
    convert -density 300 W020210329527301787356.pdf -quality 100 hsk2022.jpg
will convert it to a bunch of JPEGs, then:
    mogrify -threshold 70% *.jpg
will remove the watermark. (Make sure you do this in a separate directory or it'll de-watermark your other JPEGs too.)
Run those through OCR and you should get relatively clean text. I quickly ran a test page (page 59) through Pleco OCR and, when isolated to single columns, it was 100% accurate, so we just have to chunk this up into smaller images and do that in bulk.
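The two commands, plus the "separate directory" warning, can be wrapped in a short script so the threshold step only ever touches files the script itself created. This is a sketch under assumptions: the function names and the `page` prefix are made up for illustration, the `convert` and `mogrify` options are exactly the ones quoted in the post, and on ImageMagick 7 installs the commands may need to be invoked via `magick` instead.

```python
from pathlib import Path
import subprocess


def rasterise_command(pdf, work_dir, prefix="page", density=300):
    """Build the convert call that renders each PDF page to a JPEG.

    For a multi-page PDF, ImageMagick numbers the outputs itself
    (page-0.jpg, page-1.jpg, ...).
    """
    target = Path(work_dir) / f"{prefix}.jpg"
    return ["convert", "-density", str(density), str(pdf),
            "-quality", "100", str(target)]


def threshold_command(work_dir, prefix="page", percent=70):
    """Build the mogrify call, listing files explicitly instead of *.jpg."""
    jpegs = sorted(Path(work_dir).glob(f"{prefix}*.jpg"))
    return ["mogrify", "-threshold", f"{percent}%", *map(str, jpegs)]


def dewatermark(pdf, work_dir):
    """Render and threshold a PDF inside a freshly created directory."""
    # exist_ok=False refuses to reuse a directory that may hold other JPEGs.
    Path(work_dir).mkdir(parents=True, exist_ok=False)
    subprocess.run(rasterise_command(pdf, work_dir), check=True)
    command = threshold_command(work_dir)
    if len(command) > 3:  # only run if at least one JPEG was produced
        subprocess.run(command, check=True)
```

Because `mogrify` overwrites its inputs in place, building the file list from a directory the script has just created is the main safeguard here.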
大块头 Posted March 31, 2021 at 03:44 PM
The following also does a good job at removing the watermark:
    convert input.png -fuzz 15% -fill white -opaque "#bdbcc0" result.png
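To see why both approaches in this thread clear the watermark, and how they differ, here is a simplified per-pixel model in plain Python. It is an approximation for illustration only: ImageMagick's real `-threshold` channel handling and `-fuzz` distance metric may differ in detail, and the sample pixel values other than `#bdbcc0` are invented.

```python
import math

WATERMARK = (0xBD, 0xBC, 0xC0)  # the colour quoted in the -opaque command


def threshold_pixel(rgb, percent=70):
    """Simplified -threshold: each channel above the cutoff becomes 255, else 0."""
    cutoff = 255 * percent / 100
    return tuple(255 if channel > cutoff else 0 for channel in rgb)


def fuzzy_replace(rgb, target=WATERMARK, fuzz_percent=15, fill=(255, 255, 255)):
    """Simplified -fuzz/-opaque: recolour pixels close to the target colour.

    Closeness is modelled as Euclidean RGB distance, with 100% taken to be
    the black-to-white distance. That normalisation is an assumption.
    """
    limit = fuzz_percent / 100 * math.dist((0, 0, 0), (255, 255, 255))
    return fill if math.dist(rgb, target) <= limit else rgb


if __name__ == "__main__":
    samples = {
        "watermark grey": WATERMARK,
        "dark text": (20, 20, 20),
        "mid-grey edge pixel": (120, 120, 120),
    }
    for name, pixel in samples.items():
        print(name, threshold_pixel(pixel), fuzzy_replace(pixel))
```

In this model the watermark grey sits at roughly 74% brightness, so a 70% threshold pushes it to white while turning everything darker into pure black. The fuzz approach whitens only colours near the watermark and leaves other greys, such as anti-aliased character edges, untouched.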