character Posted February 13, 2008 at 04:26 PM
Is there a unified set of instructions for compiling Adso? The two READMEs seem slightly out of sync, and the README.txt in the root directory doesn't say where the scripts it tells you to run are located.
README.txt:
    COMPILATION: [...]
    ./prepare_internal or ./prepare_mysql
    These scripts will prepare the source code for compilation. Once done simply type "make". This will create the binary "adso".
--------------------
scripts/README.txt:
    The scripts in this directory create the necessary files for the compilation of the external MySQL database into the actual Adso source code. Just type:
    ./run
    Then go into the source directory and type:
    ./prepare_internal
    Once that is done, type:
    make
-------------------
I tried building an "internal" version last night but it didn't produce the expected output. I'll try a MySQL version later. Thanks!
trevelyan Posted February 14, 2008 at 10:46 AM
To install the internal version:
(1) go to the source directory
(2) type "./prepare_internal"
(3) type "make"
If you're doing this and getting an error message please send me details of what system you are using and what error message it is giving you. The database files necessary to compile the internal version already exist in the "database" directory by default (you may have deleted them accidentally if you ran the scripts in the "scripts" directory).
You don't need to touch anything in the scripts directory unless you are already running the MySQL version and want to compile a stand-alone version of the software using your (edited) version of the dictionary/database instead of the default version that comes with the distribution.
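The three steps above could be scripted roughly as follows. This is only a sketch: the function name build_adso_internal and the idea of passing the source directory as an argument are assumptions made for illustration and are not part of the Adso distribution. Only ./prepare_internal and make come from the instructions themselves.

```shell
# Sketch: build the "internal" (stand-alone) version of Adso.
# build_adso_internal is a hypothetical helper, not shipped with Adso.
# Usage: build_adso_internal path/to/adso/source
build_adso_internal() {
    src_dir=$1
    if [ ! -x "$src_dir/prepare_internal" ]; then
        echo "prepare_internal not found or not executable in $src_dir" >&2
        return 1
    fi
    # Run in a subshell so the caller's working directory is left untouched.
    (
        cd "$src_dir" || exit 1
        ./prepare_internal || exit 1   # step 2: prepare the sources
        make                           # step 3: should produce the "adso" binary
    )
}
```

Stopping at the first failing step makes it easier to report exactly which stage produced an error message.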
character Posted February 14, 2008 at 12:03 PM
I thought I did that, but the results seem to indicate something is wrong:
    Unit:NonChinese:Punctuation:Terminal:Newline
    Unit:Punctuation
    Unit:Punctuation
    陸 陸 陸 Unit:NonChinese
    小 小 Xiao Unit:Noun:Name
    風 風 風 Unit:NonChinese
    ...
I used a UTF-8 file and tried both databases, but the results seem to be the same. What is the format of the output, BTW? The Unit:Punctuation above were spaces, but the output has them as control characters. I removed the control characters so this post doesn't cause problems.
Is Adso lacking a lot of traditional characters or is there some more basic problem causing it to fail to recognize characters such as feng1?
My system is Ubuntu 7.10. Is there a certain set of packages (besides the compiler and mysql) you expect to be installed?
trevelyan Posted February 14, 2008 at 06:58 PM
Traditional support is worse (we don't have many contributors from Taiwan), but I'd guess the problem if you're annotating really short passages is that the system doesn't have enough data to guess that it is traditional Chinese. Try specifying the input encoding and script explicitly, as with:
    ./adso -f [input file] -ie utf8 -is traditional
If that doesn't solve your problem, please send me the text you're trying to annotate and I'll see what the problem is. We're working with a relatively new version of the software, and the code for parsing traditional characters hasn't received a lot of testing, so it's possible there's a problem somewhere and you're the unfortunate guinea pig who's running into it.
Good news is, if there *is* a problem with the software, I should be able to fix it quite quickly.
trevelyan Posted February 14, 2008 at 07:01 PM
You can specify output encoding and script as well:
    -oe utf8 -os simplified
or whatever. For help with the command line instructions, just type:
    ./adso --help
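Putting the input and output options from the two posts above together, a small wrapper might look like this. It is a sketch, not an official recipe: the function name annotate_traditional and the ADSO variable (which lets you point at a binary somewhere other than ./adso) are hypothetical conveniences. The -f, -ie, -is, -oe and -os flags are the ones quoted in this thread.

```shell
# Sketch: annotate a UTF-8 traditional-Chinese file with explicit encodings.
# annotate_traditional and the ADSO variable are hypothetical conveniences;
# the -f/-ie/-is/-oe/-os flags are the ones quoted in the posts above.
# Usage: annotate_traditional input.u8 output.u8
annotate_traditional() {
    "${ADSO:-./adso}" -f "$1" \
        -ie utf8 -is traditional \
        -oe utf8 -os traditional > "$2"
}
```

Swapping "traditional" for "simplified" after -os would request simplified output instead.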
character Posted February 15, 2008 at 01:23 AM
Thanks, that made a big difference and produced usable output. Wenlin still doesn't recognize the output file as UTF-8, but perhaps that's a Windows/Linux problem, or a Wine problem.
Is there a way to get traditional and pinyin output instead of simplified and traditional? How do the "-cn -y -t" options work? Adding them all didn't seem to change anything in the output.
Some things I noticed in the output:
    月圆 月圓 full moon Unit:Noun
    ...
    圆 圓 to justify Unit:Verb
    月 月 moon Unit:Noun
and
    他 他 他 Unit:NonChinese
and
    传奇性 傳奇性 traditional story's sex Unit:Noun (Wenlin says "legendary")
Thanks again!
trevelyan Posted February 15, 2008 at 02:12 AM
> Is there a way to get traditional and pinyin output instead of simplified and traditional?
> How do the "-cn -y -t" options work? Adding them all didn't seem to change anything
> in the output.
-cn produces Chinese (gb2312 by default)
-y produces pinyin
-t produces English output
If the program defaults aren't good enough for your needs, you can customize output by telling the engine exactly what form of output you want. The following command should be fairly self-explanatory ("" tells the engine to pick the most likely entry for each word before continuing):
    ./adso -f [input file] --code --extra-code " AND "
This will produce output that looks like this, one word per line:
    gb2312 / simplified utf8 / traditional utf8 / english / pinyin / class
    gb2312 / simplified utf8 / traditional utf8 / english / pinyin / class
    gb2312 / simplified utf8 / traditional utf8 / english / pinyin / class
    ....
If you need to do anything more complex, take a look at the files in the grammar directory. The simplest (giza++.grammar) is designed to preprocess texts for Franz Och's GIZA++ statistical machine translation program.
> 传奇性 傳奇性 traditional story's sex Unit:Noun (Wenlin says "legendary")
传奇性 wasn't in the backend dictionary: the nonsensical definition is the giveaway. What is happening is that Adso by default looks for compound phrases and is more aggressive about putting them together in this version. Most of the time you'll see Adso do this when it has two words which are defined ONLY as nouns. This design decision is an attempt to increase the visibility of grammatical/semantic misclassifications in the backend dictionary.
We've never had access to commercial dictionary data, so we are building a better alternative. On the upside, this means we can afford to share the data freely. I've just added 传奇性 as "legendary".
If you spot a missing word, or a bad entry, the editing interface is at the address below. It's also possible now to annotate the text through the main site and use the point-and-click editing functions (click to edit words, highlight to add new phrases):
http://adsotrans.com/uniedit.php
I'll need to change the editing interface to enable the editing of traditional characters as well.
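If the engine does emit one word per line in the six-field layout described in the post above, a small filter could keep just the traditional form and the pinyin, which is what was asked for. This is a sketch under a loud assumption: the fields are taken to be separated by " / " exactly as the sample layout is printed in the post, and the real delimiter depends on the --extra-code template and may well differ. The function name extract_trad_pinyin is hypothetical.

```shell
# Sketch: keep only the traditional and pinyin columns.
# ASSUMPTION: one word per line, six fields separated by " / " in the order
#   gb2312 / simplified utf8 / traditional utf8 / english / pinyin / class
# extract_trad_pinyin is a hypothetical helper, not part of Adso.
# Usage: extract_trad_pinyin adso_output.txt   (or pipe the output in)
extract_trad_pinyin() {
    # LC_ALL=C makes awk treat the input as raw bytes, since the line
    # mixes gb2312 and UTF-8 text. Lines with fewer than 6 fields are skipped.
    LC_ALL=C awk -F ' / ' 'NF >= 6 { printf "%s\t%s\n", $3, $5 }' "$@"
}
```

The result is tab-separated, which is easy to massage further or import elsewhere.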
character Posted February 15, 2008 at 03:20 AM
> ./adso -f [input file] --code --extra-code " AND "
I'm afraid this produces an empty file when I run it. I tried different things, but IIRC adding --vocab before --code results in the vocab output format. The vocab format is useful enough, though if you figure out that the behavior I'm seeing is a bug and fix it, that would be great.
> On the upside, this means we can afford to share the data freely.
Except for "legendary" from Wenlin's ABC.
I should be able to massage the vocab output into something closer to what I need. Thanks again!
trevelyan Posted February 15, 2008 at 10:02 AM
I'll take a look with the text you forwarded later this weekend. Thx. Suggestions for better translations than legendary would be welcome.
trevelyan Posted February 17, 2008 at 11:34 AM
Still haven't fixed the issue with 他, but the others are fixed in the software now. Details:
http://www.chinese-forums.com/showthread.php?p=141265#post141265
We don't have many vocal users who are working on traditional. Let me know what problems you run into.
trevelyan Posted February 22, 2008 at 01:40 AM
@character, The command copied above works for me. The only difference is that "input file" needs to be replaced with the name of your input file. Also, it takes a few minutes to process because the file is quite large and you're manipulating it with external code.
character Posted February 22, 2008 at 02:13 AM
    steve@wearable:~/chinese_dictionary/adso-v5.022/source$ ./adso -f lxf_prologue.u8 --code --extra-code " AND " > lxf_adso_output.u8
    steve@wearable:~/chinese_dictionary/adso-v5.022/source$ ls -al *.u8
    -rw-r--r-- 1 steve steve 0 2008-02-21 20:48 lxf_adso_output.u8
    -rw-r--r-- 1 steve steve 22809 2008-02-21 20:47 lxf_prologue.u8
---------------
Is there some debug mode I can run for you?
> The command copied above works for me.
As a developer myself, "it works on my machine" isn't proof of a program's proper functioning. I'm perfectly willing to believe there's something present/missing in my system's environment which is causing the problem. But without knowing what needs to be present for Adso to work correctly, what versions, what environment settings, etc., it's hard for me to figure out what is wrong.
trevelyan Posted February 23, 2008 at 08:30 AM
Strange. I've never run into this problem or heard of anyone else having it.... Can you let me know what OS you're running the program on, along with (ideally) your version of g++?
One suggestion would be trying on a smaller file (one or two lines) rather than throwing more than 200 at the software at a time. At the least, it will speed up testing (i.e. head -n 2 [input file] > [output file]). It could be that there is a particular line that is causing problems on your machine - identifying the line in question would be useful.
Also, if the file has been created on Windows can you try running dos2unix on it before running it through Adso? It would be useful to confirm or eliminate the possibility that the problems are related to Windows file formatting.
I'm working on some ways to speed up processing for those who are using Adso primarily for segmentation. With the exception of the above suggestions, I'm not sure how exactly to deal with this problem in the meantime.
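The two suggestions above (a short sample and Unix line endings) can be combined in one step. The sketch below is an illustration only: make_sample is a hypothetical helper name, and it strips carriage returns with tr rather than calling dos2unix, which approximates the same effect for CRLF files and works when dos2unix is not installed.

```shell
# Sketch: build a small, Unix-formatted test file from a larger input.
# make_sample is a hypothetical helper. It uses tr -d '\r' to strip
# carriage returns, which approximates dos2unix for CRLF line endings.
# Usage: make_sample input.u8 sample.u8 [line_count]   (default: 2 lines)
make_sample() {
    head -n "${3:-2}" "$1" | tr -d '\r' > "$2"
}
```

Raising the line count step by step (2, 20, 200, ...) is one way to narrow down whether a particular line triggers the problem.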
trevelyan Posted February 23, 2008 at 08:44 AM
Also, output redirection requires write permissions on the directory. Can you confirm that you have those write permissions and/or try the command using "sudo" (root will definitely have write permission)? I've had this bite me before in other situations.
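A quick way to rule out a permissions problem is to test whether the target directory is writable before running the command. The helper name below is hypothetical. One caveat worth knowing: in a command of the form sudo [command] > [output file], the redirection is performed by the calling shell, not by root, so sudo on its own does not change who opens the output file.

```shell
# Sketch: check that a redirect target could be created.
# can_write_output is a hypothetical helper name.
# It only checks the directory; an existing read-only file is not detected.
# Usage: can_write_output path/to/output_file
can_write_output() {
    out_dir=$(dirname "$1")
    [ -d "$out_dir" ] && [ -w "$out_dir" ]
}
```

For example, can_write_output lxf_adso_output.u8 || echo "cannot write here" >&2. If root really is needed to write the file, the redirection has to happen inside the privileged shell, as in sudo sh -c '[command] > [output file]'.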
character Posted February 23, 2008 at 10:36 AM
I created the adso directories and have write permission in them. Using sudo didn't change the (lack of) results, nor did using a two-line file instead.
    Ubuntu 7.10 32-bit
    g++ (GCC) 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)
    2GB RAM, plenty of free disk space
-----
Again,
    ./adso -f ../lxf_prologue.u8 -ie utf8 -is traditional -oe utf8 -os traditional --vocab > lxf_adso10.u8
produces output, while the command using "--code..." does not.
Is there some way the codepath used by "--code..." can fail silently? Adso seems to churn for a while, then exit, having produced no output. Is it just exiting if some problem occurs (some variable is null, some error from a method call, some exception, etc.) instead of printing an error message?
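Even without a debug build, the shell can show whether a run that prints nothing exited cleanly or died. The sketch below is a generic diagnostic, not an Adso feature: run_and_report is a hypothetical helper that keeps standard output and standard error in separate files and prints the exit status and the size of each.

```shell
# Sketch: run a command, keeping stdout, stderr and the exit status apart.
# run_and_report is a hypothetical helper, not part of Adso.
# Usage: run_and_report out_file err_file command [args...]
run_and_report() {
    out_file=$1
    err_file=$2
    shift 2
    status=0
    # Capture the exit status without aborting if the command fails.
    "$@" > "$out_file" 2> "$err_file" || status=$?
    echo "exit status: $status"
    echo "stdout bytes: $(wc -c < "$out_file" | tr -d ' ')"
    echo "stderr bytes: $(wc -c < "$err_file" | tr -d ' ')"
    return "$status"
}
```

A typical call would be run_and_report out.u8 err.txt ./adso -f [input file] followed by the options under test. A non-zero status with empty output points to an early exit, and a status above 128 usually means the process was killed by a signal (139 corresponds to a segmentation fault).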
trevelyan Posted February 25, 2008 at 06:59 AM
The strange thing is that the same command works for me. This suggests that it isn't an issue with the "--code" flag itself, although there might be library issues. I know that I've been using Ubuntu 7.04. Think I'll have to install Ubuntu 7.10 to find out what is happening.
Will stick something in the command line version of the next release that gives you the output you want in the meantime.
trevelyan Posted February 25, 2008 at 07:24 AM
@character -- version 5.023 is up and should solve your issues. Two new command-line options:
--trad-vocab: traditional/pinyin vocabulary export option
--tonalize: converts numeric pinyin to UTF8 tone marks
Problem otherwise might be with the compile options in the Makefile (mtune, fno-reduce, etc.), or the default --static compile flag.
character Posted February 25, 2008 at 05:53 PM
Thanks! I'll try to test it in the next couple of days. If you want to make a debug version (printing out method entry and exit, and possibly key parameters/variable values) I'll be happy to run it and send you the output.
geek_frappa Posted February 26, 2008 at 03:53 AM
this is nice. looks like the makings of an embedded solution for mobiles and blackberries.
trevelyan Posted February 26, 2008 at 05:54 AM
Should work on anything that supports the GNU C++ compiler. Issue with mobile devices is probably more with the interface: inputting content and displaying the output.