Chinese-Forums

Batch Processing, Numbers?!


何傻疯


Hello there!

Firstly, thank you for this fine piece of software. It is certainly one of the best at what it does of all the options I have tried.

That said, I have run into a few issues that I would like to ask about, which probably have simple solutions but which have evaded me:

First, my situation: I have a series of flat text files containing all of the region names of China down to the street level. They are split by level, with each file holding the unique entities at that level (region1 = province, region2 = city, region3 = district, region4 = village, region5 = street). Each entry is on its own line. As an example, here is a selection from the street file:

泉掌镇
新城东路
清水口镇
西乌兰不浪乡
金垭镇
丹清河镇
荔香街
南街南门内
龙湖镇古盈村
南二弄
岗山中路
枸乃甸乡
赤坎镇五堡管区
石埠村
省府路8号

My goal is to translate each one of these entities into pinyin using the adso tool. The tool correctly transliterated the above entries and the output looks like this:

quan2 zhang3 zhen4 
xin1cheng2 dong1 lu4 
qing1shui3 kou3 zhen4 
xi1 wu1lan2 bu4 lang4 xiang1 
Jin1 Ya1zhen4 
dan1 zhen4 qing1he2 
li4 xiang1 jie1 
nan2 jie1 Nan2men2 Nei4 
Long2 Hu2zhen4 gu3 ying2 cun1 
nan2 er4 nong4 
gang3 shan1 zhong1 lu4 
gou1 nai3 dian4 xiang1 
chi4 kan3 zhen4 wu3 bao3 guan3qu1 
shi2 bu4 cun1 
sheng3 fu3 lu4 ba1 hao4 

But there are a couple of problems:

1) It is important that the pinyin files and character files line up, because I will be loading them into a map for programmatic use.

However, there appears to be an odd bug involving endline characters. A 500-line input hanzi file will occasionally result in a ~498-line output pinyin file. After examining the data we have the following issue:

From the hanzi file, at roughly lines 59-62 (showing the hidden LF characters):

珠玑镇 (LF)

岚天乡 (LF)

六一五东路 (LF)

蒋集镇 (LF)

From pinyin file:

zhu1 ji1 zhen4 (LF)

lan2 tian1 xiang1 liu4 yi1wu3 dong1 lu4 (LF) **Note, these lines have been combined**

jiang3 ji2 zhen4 (LF)

If I convert the (LF) to a (CR)(LF), we get the following:

Hanzi file:

珠玑镇 (CR)(LF)

岚天乡 (CR)(LF)

六一五东路 (CR)(LF)

蒋集镇 (CR)(LF)

Pinyin file:

zhu1 ji1 zhen4 (CR)(LF)

lan2 tian1 xiang1 liu4 yi1wu3 dong1 lu4 (CR)

(LF)

jiang3 ji2 zhen4 (CR)(LF)

It's unclear what is happening here, but this error has resulted in a major slowdown with these large files. Is it possible to look into this sort of issue? It makes batch processing difficult, even if I split the file into numerous 250-line chunks.
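To narrow down where the two files drift apart, something like the following sketch could help. It is not part of adso; it is a heuristic that assumes every pinyin syllable in the output ends in a tone digit 1-5 (as in the samples above) and that each hanzi yields one syllable. The function name `find_merged_lines` is made up for illustration. Lines containing Arabic digits or expanded numbers will not match this assumption and are simply skipped over.

```python
import re

# Assumption: each syllable is letters followed by a tone digit, e.g. "zhen4".
SYLLABLE = re.compile(r"[A-Za-z]+[1-5]")
# Assumption: entries use characters from the basic CJK Unified Ideographs block.
HANZI = re.compile(r"[\u4e00-\u9fff]")


def find_merged_lines(hanzi_lines, pinyin_lines):
    """Return 1-based hanzi line numbers whose pinyin looks fused with the next line."""
    merged = []
    h = 0  # index of the hanzi line we expect the current pinyin line to match
    for pinyin in pinyin_lines:
        if h >= len(hanzi_lines):
            break
        syllables = len(SYLLABLE.findall(pinyin))
        here = len(HANZI.findall(hanzi_lines[h]))
        following = None
        if h + 1 < len(hanzi_lines):
            following = len(HANZI.findall(hanzi_lines[h + 1]))
        # A pinyin line as long as two consecutive hanzi lines suggests a lost newline.
        if syllables != here and following is not None and syllables == here + following:
            merged.append(h + 1)
            h += 2
        else:
            h += 1
    return merged


hanzi = ["珠玑镇", "岚天乡", "六一五东路", "蒋集镇"]
pinyin = [
    "zhu1 ji1 zhen4 ",
    "lan2 tian1 xiang1 liu4 yi1wu3 dong1 lu4 ",
    "jiang3 ji2 zhen4 ",
]
print(find_merged_lines(hanzi, pinyin))  # prints [2]
```

On the four lines quoted above this reports hanzi line 2 as the one that swallowed its successor, so the pinyin line can be split by hand before the files are loaded into a map.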

By the way, this is just a rough guide--the pinyin translations will be combed over manually afterwards for accuracy reasons.

A bit of info about my setup: I am using the internally-prepared database on 32-bit Ubuntu (guest VM on OS X through VirtualBox) with the run command

./adso -ie utf8 -oe -utf8 -f "hanzi.txt" -y > "pinyin.txt"

2) Problem number two is probably much easier to deal with, but I would like not to translate into pinyin any numbers I come across. This is for two reasons:

a. Sequences such as 六一 get translated as liu4bai3 shi2yi1, which in this instance is perhaps too smart for its own good, as this is the name of a street and not shorthand for 610.

b. I need to keep numbers as is in instances such as 195号--I do not want the pinyin.

I tried using the --extra-code option, but I am unsure where to go from here:

./adso -ie utf8 -oe -utf8 -f "hanzi.txt" -y --extra-code "<IF><CLASS Number></IF><THEN><Print english><Print newline></THEN>"

But this just prints the English text at the beginning and then proceeds as usual. I can use <DELETE all> in its place, but then it doesn't print out the number portion at all!

Again, this is likely a relatively simple fix, probably involving "<INSERT>", but it's stumping me at the moment.
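One possible workaround outside of adso, sketched below, is to split each line around its number runs before transliteration, so that only the non-numeric fragments are ever handed to the tool. This is an assumption-laden sketch, not an adso feature: `protect_numbers` and the `transliterate` callable are hypothetical names, and `transliterate` stands in for whatever actually runs adso on a fragment. Arabic digits are kept verbatim, and Chinese digit characters are read one by one from a small table instead of being interpreted as a quantity.

```python
import re

# Per-character readings for Chinese digit characters (no 十/百/千 handling).
NUMERAL_READINGS = {
    "〇": "ling2", "零": "ling2", "一": "yi1", "二": "er4", "三": "san1",
    "四": "si4", "五": "wu3", "六": "liu4", "七": "qi1", "八": "ba1", "九": "jiu3",
}
# A run of ASCII digits, or a run of Chinese digit characters.
NUMBER_RUN = re.compile(r"[0-9]+|[〇零一二三四五六七八九]+")


def protect_numbers(line, transliterate):
    """Transliterate `line`, passing only non-numeric fragments to `transliterate`."""
    parts = []
    pos = 0
    for match in NUMBER_RUN.finditer(line):
        if match.start() > pos:
            parts.append(transliterate(line[pos:match.start()]).strip())
        run = match.group()
        if run[0] in NUMERAL_READINGS:
            # Chinese digits: read each character on its own.
            parts.append(" ".join(NUMERAL_READINGS[ch] for ch in run))
        else:
            # Arabic digits: keep exactly as written.
            parts.append(run)
        pos = match.end()
    if pos < len(line):
        parts.append(transliterate(line[pos:]).strip())
    return " ".join(parts)


# Stand-in for adso, for demonstration only; real output would come from the tool.
fake_output = {"省府路": "sheng3 fu3 lu4 ", "号": "hao4 ", "东路": "dong1 lu4 "}
print(protect_numbers("省府路8号", fake_output.__getitem__))   # prints sheng3 fu3 lu4 8 hao4
print(protect_numbers("六一五东路", fake_output.__getitem__))  # prints liu4 yi1 wu3 dong1 lu4
```

The trade-off is that the tool no longer sees the whole line, so any segmentation that depends on context across a number boundary is lost, and characters such as 一 inside ordinary words would also be read individually. Since the results are being checked by hand afterwards, that may be acceptable.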

Any help you could provide would be appreciated!
