MTH123 Posted May 8, 2022 at 06:38 PM

Introduction

I use the Chinese Text Analyser (CTA) software (https://www.chinesetextanalyser.com/) to process a text document and generate a list of words to learn. (This is only one of countless uses of the software.) The list of words is sorted by frequency from high to low. It also includes cumulative frequency. (A word's cumulative frequency is its own frequency plus the frequencies of all the words that are more frequent than it is. For example, if the three most frequent words have frequencies of 5%, 4% and 3%, then the third word's cumulative frequency is 12%.) A cumulative frequency somewhere between 90% and 95% is used as the threshold at which a text document can be read comfortably, without having to spend a lot of time looking up and learning new words.

Because Chinese doesn't use spaces to separate words the way English does, CTA has a segmenter that determines where to separate words. The Stanford Word Segmenter (SWS) software has been discussed here-and-there as the most accurate segmenter. It takes a text document as input, and it outputs a text document with spaces between words. I was curious about putting a text document through SWS first and CTA second to maximize the capabilities of CTA. Then Imron was so kind as to send me a development version of CTA that provides an option to segment by spaces instead of using its normal segmenter.

So, this post makes various comparisons between the following:

· SWS: A text document is processed through the SWS software only.
· SWS_CTA_spaces: A text document is processed through the SWS software first. Then, the result is processed through the CTA software using the option to segment by spaces.
· SWS_CTA: A text document is processed through the SWS software first. Then, the result is processed through the CTA software using its normal segmenter.
· CTA: A text document is processed through the CTA software only, using its normal segmenter.

I've attached all the files I created for this post, including input files, output files and spreadsheets.

I chose two very different examples to try. Example #1 is a transcript of a 45-minute episode of a Chinese TV drama (Episode 1 of Put Your Head on My Shoulder). I also attached the English translation of the transcript. Example #2 is a book that has about ten times more words (Just One Smile is Very Alluring aka Love O2O by Gu Man). For two different English translations of the book, see the link below.

https://www.shushengbar.net/%e5%be%ae%e5%be%ae%e4%b8%80%e7%ac%91%e5%be%88%e5%80%be%e5%9f%8e-wei-wei-yi-xiao-hen-qing-cheng-%e9%a1%be%e6%bc%ab/

Stanford Word Segmenter Software

Stanford Word Segmenter (https://nlp.stanford.edu/software/segmenter.html#Tutorials) is free open-source software that uses the command-line interface. I installed the macOS version, but it is also available on other platforms, like Windows. After downloading the ZIP file, I uncompressed it and copied the uncompressed folder to the Applications folder.

The SWS software requires Java (https://www.java.com/en/), which is free for personal use and is available on several platforms. I downloaded and installed it.

In the Terminal application, change the folder to the folder containing the SWS software, using the command below.

cd /applications/stanford-segmenter-2020-11-17

In Terminal, run the SWS software using the command below. The first argument can be either pku (for Beijing (Peking) University) or ctb (for Chinese Treebank).
pku results in smaller vocabulary sizes and out-of-vocabulary rates on test data than ctb, so pku is selected. input.txt is the input file name, UTF-8 is the encoding, and SWS.txt is the output file name.

./segment.sh pku ./input.txt UTF-8 0 > ./SWS.txt

The contents of SWS.txt need to be manipulated into a format that matches output from the CTA software. (A scripted alternative to these manual steps is sketched after the Chinese Text Analyser section below.)

· Copy-and-paste the contents of SWS.txt into a Word document. Search for a space and replace all with ^p, which is a line break. This puts each word on its own line.
· Copy-and-paste the contents of the Word document into an Excel spreadsheet called Comparison.xlsx. Select all the words and sort them in alphabetical order.
· In a second column, do some simple math to count the number of occurrences of each word.
· In a third column, do some simple math to flag the maximum count for each word. Sort by this column. Delete all the rows that don't have the maximum count for a word. The result is that the second column is the frequency. Delete the third column, since it is no longer needed.
· Delete the rows with words containing English letters, numbers and/or special symbols (like the yen sign).
· Sort the rows by the second column (frequency) from largest to smallest.
· In a third column, calculate the % frequency.
· In a fourth column, calculate the cumulative frequency.

For each of Example #1 and Example #2, I use the same file names but keep the files in separate folders. (This makes it easier to write this post.)

Chinese Text Analyser Software

In the CTA software that includes the additional option to segment by spaces,

· In the File menu, select Open and SWS.txt.
· In the Tools menu, select Segmenter and Spaces.
· In the File menu, select Export and To File…. A window pops up. For both Words and Rows, I selected All. For Sort by, I selected the default of Frequency (Descending). From Available Fields, I selected Word, Frequency, % Frequency, and Cumulative Frequency. Click the Export button. Type the file name SWS_CTA_spaces.txt. Click the Save button.
· In the Tools menu, select Segmenter and Chinese.
· In the File menu, select Export and To File…. After the window pops up, click the Export button. Type the file name SWS_CTA.txt. Click the Save button.
· In the File menu, select Open and input.txt.
· In the Tools menu, select Segmenter and Chinese. (input.txt has no spaces, so the normal Chinese segmenter is the one to use here.)
· In the File menu, select Export and To File…. After the window pops up, click the Export button. Type the file name CTA.txt. Click the Save button.

The Excel spreadsheet Comparison.xlsx already has SWS in it. Add the contents of SWS_CTA_spaces.txt, SWS_CTA.txt, and CTA.txt. Cross compare various sets of words and their frequencies.
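As promised above, the Word/Excel steps can also be scripted. Below is a minimal Python sketch of the same post-processing, assuming the input is the space-segmented SWS.txt from above. The output file name WordList.txt is my own choice, and the basic-CJK-character filter is a stand-in for the manual "delete rows with English letters/numbers/symbols" step, so treat both as assumptions rather than part of the original workflow.

import re
from collections import Counter

# Read the space-segmented text and split it into word tokens.
with open('SWS.txt', encoding='utf-8') as f:
    words = f.read().split()

# Keep only tokens made up entirely of Chinese characters (basic CJK range);
# this approximates deleting rows with English letters, numbers and symbols.
words = [w for w in words if re.fullmatch(r'[\u4e00-\u9fff]+', w)]

counts = Counter(words)
total = sum(counts.values())

# Write word, frequency, % frequency and cumulative frequency,
# sorted by frequency from high to low.
cumulative = 0.0
with open('WordList.txt', 'w', encoding='utf-8') as out:
    out.write('Word\tFrequency\t%Frequency\tCumulative Frequency\n')
    for word, freq in counts.most_common():
        pct = 100.0 * freq / total
        cumulative += pct
        out.write(f'{word}\t{freq}\t{pct:.4f}\t{cumulative:.4f}\n')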
Example #1: An Episode of a Chinese TV Drama

SWS and SWS_CTA_spaces produced results that are nearly identical. All 945 of the Chinese words in SWS_CTA_spaces are in SWS. SWS has two extra words. They each occur only once in the episode and have a frequency of only 0.03%. (They are 懂懂; dǒng dǒng; understand, and 行那; xíng nà; all right then.)

SWS and SWS_CTA produced results that are fairly similar. SWS and CTA also produced results that are fairly similar. SWS has 947 Chinese words, SWS_CTA has 905 Chinese words, and CTA has 947 Chinese words. For both SWS_CTA and CTA, the normal segmenter in the CTA software produced results similar to the SWS software for all words that have a frequency higher than 0.16%, with the exceptions of names and 一个 (yī gè; one). These are the top 119 most frequent words and have a cumulative frequency of 61.00%.

The normal segmenter in the CTA software also produced results similar to the SWS software for most words that have a frequency lower than 0.16%.

There are 151 words in SWS that aren't in SWS_CTA, and there are 109 words in SWS_CTA that aren't in SWS. Of the 151 words in SWS that aren't in SWS_CTA, 50% of the words (75 words) are within a cumulative frequency of 90% and have a total combined frequency of 4.27%. 78% of the words (118 words) are within a cumulative frequency of 95% and have a total combined frequency of 5.63%. I take this to mean that the segmenting in SWS is better than the segmenting in SWS_CTA by roughly 5%.

There are 205 words in SWS that aren't in CTA, and there are 204 words in CTA that aren't in SWS. Of the 205 words in SWS that aren't in CTA, 52% of the words (107 words) are within a cumulative frequency of 90% and have a total combined frequency of 5.72%. 79% of the words (161 words) are within a cumulative frequency of 95% and have a total combined frequency of 7.43%. I take this to mean that the segmenting in SWS_CTA is better than the segmenting in CTA by roughly 1%.
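For anyone who wants to reproduce these "better by roughly X%" figures, below is a sketch of the set arithmetic I did in the spreadsheet. The tab-separated format and the file names carry over from the post-processing sketch earlier and are my own assumptions; I also assume the cumulative-frequency cutoff is read from the reference (SWS) list.

def load(path):
    """Return {word: (%frequency, cumulative frequency)} from a word list file."""
    rows = {}
    with open(path, encoding='utf-8') as f:
        next(f)  # skip the header row
        for line in f:
            word, freq, pct, cum = line.rstrip('\n').split('\t')
            rows[word] = (float(pct), float(cum))
    return rows

reference = load('SWS_WordList.txt')      # hypothetical name: the SWS word list
other = load('SWS_CTA_WordList.txt')      # hypothetical name: the list to compare

# Words the reference list has but the other list doesn't, restricted to
# words within a cumulative frequency of 95% in the reference list.
missing = [w for w in reference if w not in other and reference[w][1] <= 95.0]
combined = sum(reference[w][0] for w in missing)

print(f'{len(missing)} words within 95% cumulative frequency are missing')
print(f'their total combined frequency is {combined:.2f}%')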
Of the 151 words in SWS that aren't in SWS_CTA, 29 have a frequency of 2 or more and are shown in the right half of the table below. In the table, rank is the rank of the word in the entire list of 947 words in SWS. Of the 29 words, the most frequent word is a name. The second most frequent word is 一个 (yī gè; one). The third most frequent word is another name.

Of the 205 words in SWS that aren't in CTA, 40 have a frequency of 2 or more and are shown in the left half of the table below. Of the 40 words, the 3 most frequent words are the same as the ones above. (Actually, with a little reordering, the 13 most frequent words are the same. All 29 of the words in SWS_CTA are amongst the 40 words in CTA.)

SWS vs. CTA                             |  SWS vs. SWS_CTA
 #  Word    Freq  %Freq   CumFreq  Rank |   #  Word    Freq  %Freq   CumFreq  Rank
 1  末末    11    0.3479  43.6749   42  |   1  末末    11    0.3479  43.6749   42
 2  一个     6    0.1898  55.0917   85  |   2  一个     6    0.1898  55.0917   85
 3  未易     5    0.1581  60.5313  116  |   3  未易     5    0.1581  60.5313  116
 4  老校     5    0.1581  61.1638  120  |   4  老校     5    0.1581  61.1638  120
 5  擦破     4    0.1265  65.4965  152  |   5  擦破     4    0.1265  65.4965  152
 6  真的     4    0.1265  65.8760  155  |   6  真的     4    0.1265  65.8760  155
 7  这次     4    0.1265  66.2555  158  |   7  这次     4    0.1265  66.2555  158
 8  你别     3    0.0949  67.3941  169  |   8  你别     3    0.0949  67.3941  169
 9  同       3    0.0949  68.1531  177  |  --  ---    --    ---     ---      ---
10  坏       3    0.0949  69.0070  186  |  --  ---    --    ---     ---      ---
11  帮你     3    0.0949  69.7660  194  |   9  帮你     3    0.0949  69.7660  194
12  换屏     3    0.0949  70.5250  202  |  10  换屏     3    0.0949  70.5250  202
13  摸摸     3    0.0949  70.6199  203  |  11  摸摸     3    0.0949  70.6199  203
14  算       3    0.0949  71.3789  211  |  --  ---    --    ---     ---      ---
15  系里     3    0.0949  71.4738  212  |  12  系里     3    0.0949  71.4738  212
16  走走走   3    0.0949  72.0430  218  |  13  走走走   3    0.0949  72.0430  218
17  不知道   2    0.0633  73.2448  233  |  14  不知道   2    0.0633  73.2448  233
18  人份     2    0.0633  73.4978  237  |  15  人份     2    0.0633  73.4978  237
19  伤       2    0.0633  73.6875  240  |  --  ---    --    ---     ---      ---
20  你先     2    0.0633  73.7508  241  |  16  你先     2    0.0633  73.7508  241
21  先走     2    0.0633  74.0670  246  |  17  先走     2    0.0633  74.0670  246
22  口       2    0.0633  75.1423  263  |  --  ---    --    ---     ---      ---
23  只       2    0.0633  75.2056  264  |  --  ---    --    ---     ---      ---
24  太久     2    0.0633  76.2808  281  |  18  太久     2    0.0633  76.2808  281
25  好吧     2    0.0633  76.4073  283  |  19  好吧     2    0.0633  76.4073  283
26  对啊     2    0.0633  76.9133  291  |  20  对啊     2    0.0633  76.9133  291
27  对对     2    0.0633  76.9766  292  |  21  对对     2    0.0633  76.9766  292
28  想到     2    0.0633  77.6724  303  |  --  ---    --    ---     ---      ---
29  感       2    0.0633  77.7989  305  |  --  ---    --    ---     ---      ---
30  手机屏   2    0.0633  77.9886  308  |  22  手机屏   2    0.0633  77.9886  308
31  掉       2    0.0633  78.3681  314  |  --  ---    --    ---     ---      ---
32  推走     2    0.0633  78.4314  315  |  23  推走     2    0.0633  78.4314  315
33  来就     2    0.0633  78.8741  322  |  24  来就     2    0.0633  78.8741  322
34  甜度     2    0.0633  79.5066  332  |  25  甜度     2    0.0633  79.5066  332
35  睡著     2    0.0633  79.6331  334  |  --  ---    --    ---     ---      ---
36  著手     2    0.0633  80.1392  342  |  --  ---    --    ---     ---      ---
37  行行行   2    0.0633  80.2024  343  |  26  行行行   2    0.0633  80.2024  343
38  西药房   2    0.0633  80.2657  344  |  27  西药房   2    0.0633  80.2657  344
39  觉到     2    0.0633  80.3922  346  |  28  觉到     2    0.0633  80.3922  346
40  走走     2    0.0633  80.7084  351  |  29  走走     2    0.0633  80.7084  351

Example #2: A Book

SWS and SWS_CTA_spaces produced results that are nearly identical. All 9,274 of the Chinese words in SWS are in SWS_CTA_spaces. SWS_CTA_spaces has one extra word. It occurs only twice in the book and has a frequency of only 0.002%. (It is 龟; guī; turtle.)

SWS and SWS_CTA produced results that are fairly similar. SWS and CTA also produced results that are fairly similar. SWS has 9,274 Chinese words, SWS_CTA has 7,706 Chinese words, and CTA has 8,421 Chinese words. The normal segmenter in the CTA software produced results similar to the SWS software for all words that have a frequency higher than 0.11%, with the exceptions of names and 一个 (yī gè; one). These are the top 145 most frequent words and have a cumulative frequency of 53.68%.

The normal segmenter in the CTA software also produced results similar to the SWS software for most words that have a frequency lower than 0.11%.

There are 2,052 words in SWS that aren't in SWS_CTA, and there are 484 words in SWS_CTA that aren't in SWS. Of the 2,052 words in SWS that aren't in SWS_CTA, 10% of the words (214 words) are within a cumulative frequency of 90% and have a total combined frequency of 3.27%. 44% of the words (897 words) are within a cumulative frequency of 95% and have a total combined frequency of 4.43%.
I take this to mean that the segmenting in SWS is better than the segmenting in SWS_CTA by roughly 4%.

There are 2,265 words in SWS that aren't in CTA, and there are 1,412 words in CTA that aren't in SWS. Of the 2,265 words in SWS that aren't in CTA, 10% of the words (226 words) are within a cumulative frequency of 90% and have a total combined frequency of 3.37%. 43% of the words (971 words) are within a cumulative frequency of 95% and have a total combined frequency of 4.63%. I take this to mean that the segmenting in SWS_CTA is better than the segmenting in CTA by less than 1%.

Of the 2,052 words in SWS that aren't in SWS_CTA, 432 have a frequency of 2 or more. Of these 432 words, 6 of the 7 most frequent words are names. The second most frequent word is 一个 (yī gè; one).

Of the 2,265 words in SWS that aren't in CTA, 463 have a frequency of 2 or more. Of these 463 words, the 7 most frequent words are the same as above.

Summary

SWS and SWS_CTA_spaces are essentially identical. The differences are negligible. (Again, SWS_CTA_spaces is exported from a development version of the CTA software, using an option to segment by spaces.)

SWS, SWS_CTA and CTA are all fairly similar. SWS provides roughly 4% to 5% better segmenting than SWS_CTA, which in turn provides roughly 1% better segmenting than CTA. The two examples, though very different from each other, produced similar results.

Example1.zip Example2.zip
jannesan Posted May 9, 2022 at 07:03 AM

I would be interested in how https://github.com/lancopku/pkuseg-python compares. I was never able to verify its accuracy claims, because the common evaluation dataset used by these segmenters is not openly available.
MTH123 (Author) Posted May 10, 2022 at 07:39 PM

On 5/9/2022 at 2:03 AM, jannesan said:
I would be interested in how https://github.com/lancopku/pkuseg-python compares.

Installing pkuseg

I initially had trouble figuring out how to install pkuseg (https://github.com/lancopku/pkuseg-python/blob/master/readme/readme_english.md). It turns out that it was written for older versions of the Python programming language (https://www.python.org/), specifically 3.7, 3.6 and/or 3.5. So, to cut to the chase, I just installed everything on an older Mac.

I wanted to use Homebrew (https://brew.sh/) to install Python 3.7, because it's simpler. So, I had to install Homebrew first. To do this, I copied-and-pasted the command below into the Terminal application. (Always hit the return key after putting in a command.)

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

To install Python 3.7, copy-and-paste the command below into Terminal.

brew install python@3.7

pkuseg also requires NumPy (whatever that is, lol). I looked up the latest version of NumPy that goes along with Python 3.7. It's 1.21.5. To install NumPy 1.21.5, copy-and-paste the command below into Terminal.

pip3 install numpy==1.21.5

To install pkuseg, copy-and-paste the commands below into Terminal. I'm not sure both are needed, but I ran both anyway.

pip3 install pkuseg
pip3 install -U pkuseg

Running pkuseg

Below is the Python script to run pkuseg on an input text file. This script uses the default mixed-domain model for segmentation, which is recommended if you don't know what model to pick. (The other models are news, web, medicine and tourism, so none of them seem to directly fit my input.) I named the output file pkuseg.txt. nthread is the number of threads; 20 is recommended for input text files.

import pkuseg

# Take file 'input.txt' as input.
# The segmented result is stored in the output file 'pkuseg.txt'.
pkuseg.test('input.txt', 'pkuseg.txt', nthread=20)

I put the script above in a text file named segtextfile.py. This file has to be in the same folder as input.txt. In the Terminal application, change the folder to the folder with input.txt in it, using the command below. In the command below, replace folderpath with the actual path, e.g., /Users/mth123/Documents/Example1.

cd folderpath

In Terminal, run pkuseg using the command below.

python3 segtextfile.py
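As a quick sanity check that the installation works before processing a whole file, pkuseg can also segment a string in memory, following the examples in its readme. A minimal sketch (the sample sentence is just an illustration):

import pkuseg

# Load the default (mixed-domain) model; this takes a few seconds.
seg = pkuseg.pkuseg()

# Segment one string in memory and print the resulting word list.
print(seg.cut('我爱北京天安门'))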
Post-Processing Output from pkuseg

The rest of the steps are the same as in the original post. I've attached a bunch of files. The results are discussed below.

Example #1 – An Episode of a Chinese TV Drama

SWS and pkuseg produced results that are fairly similar. SWS has 947 Chinese words, and pkuseg has 896 Chinese words. The pkuseg script produced results similar to the SWS software for all words that have a frequency higher than 0.32%, with the exception of 2 names. These are the top 51 most frequent words and have a cumulative frequency of 46.62%.

The pkuseg script also produced results similar to the SWS software for most words that have a frequency lower than 0.32%. There are 165 words in SWS that aren't in pkuseg, and there are 114 words in pkuseg that aren't in SWS. Of the 165 words in SWS that aren't in pkuseg, 52% of the words (86 words) are within a cumulative frequency of 90% and have a total combined frequency of 5.63%. 79% of the words (130 words) are within a cumulative frequency of 95% and have a total combined frequency of 7.02%.

I take this to mean that the segmenting in SWS is better than the segmenting in pkuseg by roughly 7%.

Example #2 – A Book

SWS and pkuseg produced results that are fairly similar. SWS has 9,274 Chinese words, and pkuseg has 9,721 Chinese words. The pkuseg script produced results similar to the SWS software for all words that have a frequency higher than 0.16%. These are the top 95 most frequent words and have a cumulative frequency of 47.59%.

The pkuseg script also produced results similar to the SWS software for most words that have a frequency lower than 0.16%. There are 1,760 words in SWS that aren't in pkuseg, and there are 2,207 words in pkuseg that aren't in SWS. Of the 1,760 words in SWS that aren't in pkuseg, 8.13% of the words (143 words) are within a cumulative frequency of 90% and have a total combined frequency of 1.38%. 40% of the words (697 words) are within a cumulative frequency of 95% and have a total combined frequency of 2.31%.

I take this to mean that the segmenting in SWS is better than the segmenting in pkuseg by roughly 2%.

Summary

The results of the two examples are noticeably different. SWS is 7% better than pkuseg for Example #1, which is an episode of a Chinese TV drama. It is 2% better for Example #2, which is a book. CTA segmented Example #1 better than pkuseg, while pkuseg segmented Example #2 better than CTA.

Maybe pkuseg produces better results for news, web, medicine and tourism. I don't consume these in Chinese, so I don't have examples to try. This has been an interesting exercise, but I'm going to put my old Mac away again.

Example2.zip Example1.zip
imron Posted May 11, 2022 at 12:14 AM

Thanks for these writeups; they're very insightful, and also the reason why, after all these years, I still haven't put much effort into improving the CTA segmenter. One of CTA's design goals was to make it easy to extract frequently used unknown words from a document. The current segmenter still performs acceptably in that regard, and it would take considerable effort to improve, for only marginal gains in overall utility. This makes the segmenter not a great choice if you are using CTA as a document reader, but still a good choice if you are using it for word extraction or to compare the relative difficulty of texts with respect to your current vocabulary.
MTH123 (Author) Posted May 11, 2022 at 02:14 AM

On 5/10/2022 at 7:14 PM, imron said:
Thanks for these writeups; they're very insightful, and also the reason why, after all these years, I still haven't put much effort into improving the CTA segmenter.

You're very welcome! I had to think hard about how to analyze the results in a useful and practical way. The two days I spent reading the epic thread https://www.chinese-forums.com/forums/topic/44383-introducing-chinese-text-analyser/ came in handy in an unexpected way. It'll probably be many years before I try SWS again, because I'm happy with CTA as it is and I'm early in my Chinese learning/re-learning.