Basic Python module for adso

November 11, 2006 at 08:09 AM

Following on from the discussion in this thread, here is a basic Python module that performs web-based queries against the adsotrans website and returns the results as a list of tuples.

There are 3 files:

adso.py - the main module

adsotatepage.py - class that handles processing of the adsotrans webpage

test.py - simple test harness

If anyone is interested, it probably wouldn't be too hard to add a translatepage or a pinyinpage class that processes and returns the results of a translate or pinyin query.

To use the module, import it, and create an object of the Adso class.

I decided to write an Adso class rather than just having functions in the module, so that all the different adso options (conjugation, grammar, encoding, encoding_out, numeric_pinyin and quality) can easily be preserved across multiple calls. These values are set in the constructor and are simply strings that correspond to the values passed in the adso URL.

Default values are:

conjugation='on'

grammar='on'

encoding='UTF-8S'

encoding_out='UTF-8S'

numeric_pinyin='off'

quality='high'
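As a rough illustration of how options like these can be kept across calls, here is a minimal sketch written for Python 3. It is not the code from adso.py: the class name AdsoOptions, the method query_string and the 'text' parameter name are all assumptions made for the example. Only the option names and defaults come from the list above.

```python
from urllib.parse import urlencode


class AdsoOptions:
    """Hypothetical stand-in for the option handling in the Adso class."""

    def __init__(self, conjugation='on', grammar='on', encoding='UTF-8S',
                 encoding_out='UTF-8S', numeric_pinyin='off', quality='high'):
        # Store every option once so each later call reuses the same values.
        self.options = {
            'conjugation': conjugation,
            'grammar': grammar,
            'encoding': encoding,
            'encoding_out': encoding_out,
            'numeric_pinyin': numeric_pinyin,
            'quality': quality,
        }

    def query_string(self, text):
        # 'text' as the parameter name is an assumption for this sketch.
        params = dict(self.options)
        params['text'] = text
        return urlencode(params)


opts = AdsoOptions(quality='low')
print(opts.query_string('hello'))
```

Overriding a single option in the constructor leaves the other defaults untouched, which is the main convenience of using a class here.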

To use, simply import the module, create an Adso object, and call the adsotate member function with the text that you want.

from adso import Adso

adso = Adso()

result = adso.adsotate( '你好世界' )

result will be a list of tuples of the form (chinese, pinyin, translation), with one tuple per segment of text, in the order the segments appear in the original text. For example, the call above produces:

[ ( '你好', 'nǐhǎo', 'hello' ), ( '世界', 'shìjiè', 'world' ) ]
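Once you have the list, working with it is just ordinary tuple unpacking. A minimal sketch in Python 3, using the sample result above as a literal so it runs without a network connection (the helper names format_segments and join_pinyin are made up for this example):

```python
# Sample result copied from the example above.
result = [('你好', 'nǐhǎo', 'hello'), ('世界', 'shìjiè', 'world')]


def format_segments(segments):
    """Return one tab-separated line per (chinese, pinyin, translation) tuple."""
    lines = []
    for chinese, pinyin, translation in segments:
        lines.append(f"{chinese}\t{pinyin}\t{translation}")
    return "\n".join(lines)


def join_pinyin(segments):
    """Return only the pinyin of each segment, separated by spaces."""
    return " ".join(pinyin for _, pinyin, _ in segments)


print(format_segments(result))
print(join_pinyin(result))
```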

Note: the text you pass in should be encoded in whatever encoding you specified when creating the Adso object (the default is 'UTF-8S').

Anyway, it's all pretty basic at the moment, and doesn't do anything more advanced than generate a query to the main adsotrans webpage and then parse the resulting HTML. There's also very little in the way of error checking, so you'll get exceptions if, for example, you can't connect to the internet. It was done more as a proof-of-concept than anything else. Is this the sort of thing you had in mind, Kudra?
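Until there is proper error checking, callers can guard the call themselves. The sketch below (Python 3) assumes that connection failures surface as urllib's URLError or another OSError, which is a guess about how the module fetches the page. safe_adsotate is a hypothetical helper, not part of adso.py.

```python
from urllib.error import URLError


def safe_adsotate(adso, text):
    """Return adso.adsotate(text), or None if the request fails.

    Assumes network problems are raised as URLError/OSError.
    """
    try:
        return adso.adsotate(text)
    except (URLError, OSError) as exc:
        # URLError is itself an OSError subclass; both are listed for clarity.
        print(f"adsotate failed: {exc}")
        return None
```

Any object with an adsotate method works here, so the wrapper can be tried out with a stub before pointing it at the real class.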

BTW, speaking of errors, I don't know if this is of interest to you, Trevelyan, but the Python HTMLParser says the output generated by Adso has malformed start tags at various places in the HTML. The w3.org validator reports errors at the same lines/columns, but it seems to be because it's treating the

adso.zip

November 11, 2006 at 11:45 AM

Haven't played with it yet, but from all appearances, in the words of Will Smith in Men in Black I, "Now that's what I'm talking about!"

Thanks.

November 11, 2006 at 04:24 PM

Hi Imron,

Awesome work. Would you mind if I ported something like this over to Java? I'd love to be able to use it in the ZDT and I'm sure others would use it as well.

Chris

November 11, 2006 at 07:04 PM

Looks good. Let me know if any changes are necessary on this end to help out. It would be possible to create a script that just spat out the information delimited in a more convenient way for parsing/processing if that would help or be faster.

November 11, 2006 at 07:52 PM

@trevelyan -- that would be convenient. In my experience of parsing Yahoo pages, it is always a pain when they change the HTML format. By essentially providing an API, neither you nor we Python (or other language) programmers will have to worry if you change stuff around in the HTML.

November 12, 2006 at 01:40 AM

@bogleg - go for it, it's not even 100 lines of code, so I can't imagine it'd take too long. Though you might want to wait until trevelyan can produce a page with a more streamlined output.

@trevelyan - yeah, a more suitable format would be nice, and would certainly be more future-proof. Maybe just a simple XML file along the lines of:

你好

nǐhǎo

hello

(or less verbosely

)

You could of course add any other extra info that was relevant/useful (part of speech, simplified/traditional conversion etc.), all of which (including the 3 listed above) could be toggled by parameters.

This format would also lend itself nicely to the other styles of queries (translation/pinyin), which would simply have one segment containing the entire body of text with the appropriate pinyin/translation.
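To show how little client code a format like that would need, here is a minimal Python 3 sketch. The element names (adso, segment, chinese, pinyin, translation) are assumptions chosen for the example, not an agreed schema. The parser turns each segment into the same (chinese, pinyin, translation) tuple the module currently returns.

```python
import xml.etree.ElementTree as ET

# Hypothetical response; element names are assumptions for this sketch.
SAMPLE = """<adso>
  <segment>
    <chinese>你好</chinese>
    <pinyin>nǐhǎo</pinyin>
    <translation>hello</translation>
  </segment>
  <segment>
    <chinese>世界</chinese>
    <pinyin>shìjiè</pinyin>
    <translation>world</translation>
  </segment>
</adso>"""


def parse_segments(xml_text):
    """Return a list of (chinese, pinyin, translation) tuples, one per segment."""
    root = ET.fromstring(xml_text)
    return [
        (seg.findtext('chinese'), seg.findtext('pinyin'),
         seg.findtext('translation'))
        for seg in root.iter('segment')
    ]


print(parse_segments(SAMPLE))
```

Extra fields such as part of speech would just be more child elements, and a client that doesn't know about them could ignore them.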

February 12, 2007 at 04:28 PM

Currently takes GB2312 as input, but it will make sense to switch to UTF8. I'm not sure which server to put it on. Probably the new one. Ping me if anyone is clamouring to set anything up using it and I'll jump on supporting UTF sooner rather than later.

http://www.adsotate.com/adso/api.pl?text=%CB%FB%C3%C7
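The text parameter in that URL is just the GB2312 bytes of the input, percent-encoded. A small Python 3 sketch of the round trip using only the standard library (the sample string 他们 is what the bytes in the URL above decode to):

```python
from urllib.parse import quote, unquote

# Percent-encode the GB2312 bytes of the text for use in the query string.
encoded = quote("他们", encoding="gb2312")
print(encoded)

# Decode the parameter from the example URL back into readable text.
decoded = unquote("%CB%FB%C3%C7", encoding="gb2312")
print(decoded)
```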

February 13, 2007 at 07:41 AM

I'm clamouring! Hook us up!

Chris

February 13, 2007 at 01:18 PM

That's great! Thanks for that

February 20, 2007 at 10:04 PM

Ok. First file here takes in GB2312. The second takes in UTF8. Because of the need to support both simplified and traditional, both files return content in UTF8.

http://www.adsotate.com/adso/api-gb2312.pl?text=TEXT

http://www.adsotate.com/adso/api-utf8.pl?text=TEXT

There's no guarantee these files will stay online here, so if you set up anything using them, send me an email so I can notify you if they move.
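For anyone wiring this up, a minimal Python 3 sketch of choosing between the two scripts follows. The helper names api_url and decode_response are made up for the example, and the response body is left undecoded beyond UTF-8 because its layout isn't described here. No request is actually sent.

```python
from urllib.parse import quote

# Endpoints as posted above; they may move or go offline.
ENDPOINTS = {
    "gb2312": "http://www.adsotate.com/adso/api-gb2312.pl",
    "utf-8": "http://www.adsotate.com/adso/api-utf8.pl",
}


def api_url(text, encoding="utf-8"):
    """Build the query URL for the script that matches the input encoding."""
    base = ENDPOINTS[encoding]
    return f"{base}?text={quote(text, encoding=encoding)}"


def decode_response(raw_bytes):
    """Both scripts return UTF-8, whichever input encoding was used."""
    return raw_bytes.decode("utf-8")


print(api_url("他们", "gb2312"))
print(api_url("他们", "utf-8"))
```

Keeping the URLs in one dictionary means only that table has to change if the files move.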

December 5, 2007 at 09:58 AM

Is there an API currently? The links above are non-functional. Been looking at restarting a couple of old dead projects and easy Adsoing would be handy.

December 25, 2007 at 02:49 PM

I'm heading down to Australia at the end of this week and will be back in China on January 10th. I'll be getting a new server then and will look into setting up a revised API. If you need anything in particular before that, just email me, Roddy.

December 25, 2007 at 02:54 PM

No rush on my behalf. A reliable Adsotrans API would be a pretty cool thing to have though, and I'm sure it would get used.


Posts above are by, in thread order: imron, kudra, bogleg, trevelyan, kudra, imron, trevelyan, bogleg, imron, trevelyan, roddy, trevelyan, roddy
