Chinese Text Analyser and Stanford Word Segmenter

May 8, 2022 at 06:38 PM

Introduction

I use the Chinese Text Analyser (CTA) software (https://www.chinesetextanalyser.com/) to process a text document and generate a list of words to learn. (This is only one of countless uses of the software.) The list of words is selected to be in order of frequency from high to low. It also includes cumulative frequency. (For example, a word that has a cumulative frequency of 90% includes the frequency of itself and the frequencies of all the words that are more frequent than it is.) A cumulative frequency of somewhere between 90% and 95% is used to determine when a text document can be read comfortably without having to spend a lot of time looking up and learning new words.
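As a rough illustration of cumulative frequency (not how CTA computes it internally), here is a minimal Python sketch. The counts are made up, and the function names coverage_rows and words_needed are just placeholders of mine.

```python
from itertools import accumulate

def coverage_rows(word_counts):
    """Return (word, count, cumulative_pct) rows, most frequent first."""
    ordered = sorted(word_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    total = sum(count for _, count in ordered)
    running = accumulate(count for _, count in ordered)
    return [(word, count, 100.0 * cum / total)
            for (word, count), cum in zip(ordered, running)]

def words_needed(rows, target_pct):
    """Count how many top words are needed to reach target_pct coverage."""
    for position, (_, _, cum_pct) in enumerate(rows, start=1):
        if cum_pct >= target_pct:
            return position
    return len(rows)

# Invented counts, for illustration only (they total 100 occurrences).
toy_counts = {"的": 50, "我": 30, "你": 10, "手机": 6, "甜度": 4}
rows = coverage_rows(toy_counts)
print(words_needed(rows, 90))  # the top 3 words cover 90% of this toy text
```

In this toy text, knowing the three most frequent words is enough to reach the 90% mark, which is the kind of cutoff described above.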

Because spaces aren’t used to separate words in Chinese like they are in English, CTA has a segmenter that determines where to separate words. The Stanford Word Segmenter (SWS) software has been discussed here-and-there as the most accurate segmenter. It takes a text document as input, and it outputs a text document with spaces between words.

I was curious about putting a text document through SWS first and CTA second to maximize the capabilities of CTA. Then, Imron was kind enough to send me a development version of CTA that provides an option to segment by spaces, instead of using its normal segmenter. So, this post makes various comparisons between the following:

· SWS: A text document is processed through the SWS software only.

· SWS_CTA_spaces: A text document is processed through the SWS software first. Then, the result is processed through the CTA software using the option to segment by spaces.

· SWS_CTA: A text document is processed through the SWS software first. Then, the result is processed through the CTA software using its normal segmenter.

· CTA: A text document is processed through the CTA software only, using its normal segmenter.

I’ve attached all the files I created for this post, including input files, output files and spreadsheets. I chose two very different examples to try. Example #1 is a transcript of a 45-minute episode of a Chinese TV drama (Episode 1 of Put Your Head on My Shoulder). I also attached the English translation of the transcript. Example #2 is a book that has about ten times more words (Just One Smile is Very Alluring aka Love O2O by Gu Man). For two different English translations of the book, see the link below.

https://www.shushengbar.net/%e5%be%ae%e5%be%ae%e4%b8%80%e7%ac%91%e5%be%88%e5%80%be%e5%9f%8e-wei-wei-yi-xiao-hen-qing-cheng-%e9%a1%be%e6%bc%ab/

Stanford Word Segmenter Software

Stanford Word Segmenter (https://nlp.stanford.edu/software/segmenter.html#Tutorials) is free open-source software that uses the command-line interface. I installed the MacOS version, but it is also available on other platforms, like Windows. After downloading the ZIP file, I uncompressed it and copied the uncompressed folder to the Applications folder.

The SWS software requires Java (https://www.java.com/en/), which is free for personal use and is available on several platforms. I downloaded and installed it.

In the Terminal application, change the folder to the folder containing the SWS software, using the command below.

cd /applications/stanford-segmenter-2020-11-17

In Terminal, run the SWS software using the command below. The first argument selects the model and can be either pku (for the Peking University standard) or ctb (for the Chinese Treebank standard). pku results in smaller vocabulary sizes and out-of-vocabulary rates on test data than ctb. So, pku is selected. input.txt is the input file name. UTF-8 is the encoding. The 0 is the k-best setting; with 0, only the single best segmentation is printed. SWS.txt is the output file name.

./segment.sh pku ./input.txt UTF-8 0 > ./SWS.txt

The contents of SWS.txt need to be manipulated into a format that matches output from the CTA software.

· Copy-and-paste the contents of SWS.txt into a Word document. Search for a space and replace all with ^p, which is a line break. This puts each word in its own line.

· Copy-and-paste the contents of the Word document into an Excel spreadsheet called Comparison.xlsx. Select all the words and sort them in alphabetical order.

· In a second column, do some simple math to count the number of occurrences of each word.

· In a third column, do some simple math to flag the maximum count for each word. Sort by this column. Delete all the rows that don’t have the maximum count for a word. The result is that the second column is the frequency. Delete the third column, since it is no longer needed.

· Delete the rows with words containing an English letter(s) and/or number(s) and/or special symbol (like the yen sign).

· Sort the rows by the second column (frequency) from largest to smallest.

· In a third column, calculate the % frequency.

· In a fourth column, calculate the cumulative frequency.
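For anyone who would rather not go through Word and Excel, the steps above can be sketched in a few lines of Python. This is only a sketch under my own assumptions: a "Chinese word" is taken to be a token made up entirely of characters in the CJK Unified Ideographs block, and the function name frequency_table is hypothetical. The spreadsheet filtering described above may keep or drop slightly different tokens.

```python
import re
from collections import Counter

# Assumption: a "Chinese word" consists only of CJK Unified Ideographs
# (U+4E00 to U+9FFF). The manual filter in the post may differ slightly.
CHINESE_ONLY = re.compile(r"^[\u4e00-\u9fff]+$")

def frequency_table(segmented_text):
    """Build (word, freq, pct_freq, cum_freq) rows from space-separated text."""
    tokens = segmented_text.split()  # one token per segmented word
    counts = Counter(t for t in tokens if CHINESE_ONLY.match(t))
    total = sum(counts.values())
    rows, running = [], 0
    # Sort by frequency, largest first; ties are broken by the word itself.
    for word, freq in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        running += freq
        rows.append((word, freq, 100.0 * freq / total, 100.0 * running / total))
    return rows

# Invented sample line, standing in for the contents of SWS.txt.
sample = "我 喜欢 你 ， 你 喜欢 我 吗 ？ OK 123"
table = frequency_table(sample)
for word, freq, pct, cum in table:
    print(f"{word}\t{freq}\t{pct:.4f}\t{cum:.4f}")
```

To run it on a real file, read the text first, for example with open("SWS.txt", encoding="utf-8").read(), and pass the result to frequency_table.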

For both Example #1 and Example #2, I use the same file names, but keep the files in separate folders. (This makes it easier to write this post.)

Chinese Text Analyser Software

In the CTA software that includes the additional option to segment by spaces,

· In the File menu, select Open and SWS.txt.

· In the Tools menu, select Segmenter and Spaces.

· In the File menu, select Export and To File…. A window pops up. For both Words and Rows, I selected All. For Sort by, I selected the default of Frequency (Descending). From Available Fields, I selected Word, Frequency, % Frequency, and Cumulative Frequency. Click the Export button. Type the file name SWS_CTA_spaces.txt. Click the Save button.

· In the Tools menu, select Segmenter and Chinese.

· In the File menu, select Export and To File…. After the window pops up, click the Export button. Type the file name SWS_CTA.txt. Click the Save button.

· In the File menu, select Open and input.txt.

· In the Tools menu, select Segmenter and Chinese. (input.txt has no spaces between words, so the CTA result uses the normal segmenter.)

· In the File menu, select Export and To File…. After the window pops up, click the Export button. Type the file name CTA.txt. Click the Save button.

The Excel spreadsheet Comparison.xlsx already has SWS in it. Add the contents of SWS_CTA_spaces.txt, SWS_CTA.txt, and CTA.txt. Cross compare various sets of words and their frequencies.
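The cross comparison can also be sketched with Python sets instead of spreadsheet lookups. The word lists below are invented toy data, not taken from my files, and cross_compare is a hypothetical helper name.

```python
def cross_compare(freq_a, freq_b):
    """Split two word->frequency dicts into shared and unshared words."""
    words_a, words_b = set(freq_a), set(freq_b)
    return {
        # Unshared words are listed most frequent first.
        "only_in_a": sorted(words_a - words_b, key=lambda w: (-freq_a[w], w)),
        "only_in_b": sorted(words_b - words_a, key=lambda w: (-freq_b[w], w)),
        "in_both": sorted(words_a & words_b),
    }

# Invented toy data: list A keeps two words whole, list B splits them.
list_a = {"手机屏": 2, "一个": 6, "你": 40}
list_b = {"手机": 2, "屏": 2, "一": 6, "个": 6, "你": 40}
result = cross_compare(list_a, list_b)
print(result["only_in_a"])  # words that appear only in list A
print(result["only_in_b"])  # words that appear only in list B
```

Sorting the unshared words by frequency puts the differences that matter most for a learner at the top.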

Example #1: An Episode of a Chinese TV Drama

SWS and SWS_CTA_spaces produced results that are nearly identical. All 945 of the Chinese words in SWS_CTA_spaces are in SWS. SWS has two extra words. They each occur only once in the episode and have a frequency of only 0.03%. (They are 懂懂; dǒng dǒng; understand and 行那; xíng nà; all right then.)

SWS and SWS_CTA produced results that are fairly similar. SWS and CTA also produced results that are fairly similar. SWS has 947 Chinese words, SWS_CTA has 905 Chinese words, and CTA has 947 Chinese words. For both SWS_CTA and CTA, the normal segmenter in the CTA software produced results similar to those of the SWS software for all words that have a frequency higher than 0.16%, with the exceptions of names and 一个 (yī gè; one). These are the top 119 most frequent words and have a cumulative frequency of 61.00%. The normal segmenter in the CTA software also produced results similar to those of the SWS software for most words that have a frequency lower than 0.16%.

There are 151 words in SWS that aren’t in SWS_CTA, and there are 109 words in SWS_CTA that aren’t in SWS. Of the 151 words in SWS that aren’t in SWS_CTA, 50% of the words (75 words) are within a cumulative frequency of 90% and have a total combined frequency of 4.27%. 78% of the words (118 words) are within a cumulative frequency of 95% and have a total combined frequency of 5.63%. I take this to mean that the segmenting in SWS is better than the segmenting in SWS_CTA by roughly 5%.
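The numbers in this kind of comparison can be reproduced with a short script. The sketch below assumes my reading of "within a cumulative frequency of 90%", namely that the word's cumulative frequency in the SWS list is at or below the cutoff. The rows are invented toy data and missing_word_stats is a hypothetical name.

```python
def missing_word_stats(reference_rows, other_words, cum_limit):
    """Summarise reference words that the other list lacks.

    reference_rows: (word, freq, pct_freq, cum_freq) tuples, most frequent first.
    other_words: set of words produced by the other segmenter.
    cum_limit: cumulative-frequency cutoff in percent, e.g. 90 or 95.
    """
    missing = [r for r in reference_rows if r[0] not in other_words]
    # Assumption: "within" means cumulative frequency <= cutoff.
    within = [r for r in missing if r[3] <= cum_limit]
    share = 100.0 * len(within) / len(missing) if missing else 0.0
    return {
        "missing": len(missing),
        "within": len(within),
        "share_pct": share,
        "combined_pct": sum(r[2] for r in within),
    }

# Invented toy rows in the same shape as the spreadsheet columns.
toy_rows = [
    ("你", 50, 50.0, 50.0),
    ("一个", 30, 30.0, 80.0),
    ("手机屏", 15, 15.0, 95.0),
    ("甜度", 5, 5.0, 100.0),
]
toy_other = {"你", "一", "个"}
stats_90 = missing_word_stats(toy_rows, toy_other, 90)
stats_95 = missing_word_stats(toy_rows, toy_other, 95)
print(stats_90)
print(stats_95)
```

The combined_pct value corresponds to the "total combined frequency" quoted in the paragraphs here, and share_pct corresponds to the percentage of missing words that fall within the cutoff.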

There are 205 words in SWS that aren’t in CTA, and there are 204 words in CTA that aren’t in SWS. Of the 205 words in SWS that aren’t in CTA, 52% of the words (107 words) are within a cumulative frequency of 90% and have a total combined frequency of 5.72%. 79% of the words (161 words) are within a cumulative frequency of 95% and have a total combined frequency of 7.43%. I take this to mean that the segmenting in SWS_CTA is better than the segmenting in CTA by roughly 1%.

Of the 151 words in SWS that aren’t in SWS_CTA, 29 have a frequency of 2 or more and are shown in the table further below. In the table, rank is the rank of the word in the entire list of 947 words in SWS. Of the 29 words, the most frequent word is a name. The second most frequent word is 一个 (yī gè; one). The third most frequent word is another name.

Of the 205 words in SWS that aren’t in CTA, 40 have a frequency of 2 or more and are shown in the table below. Of the 40 words, the 3 most frequent words are the same as the ones above. (Actually, with a little reordering, the 13 most frequent words are the same. All 29 of the words in SWS_CTA are amongst the 40 words in CTA.)

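SWS vs. CTA (left six columns) and SWS vs. SWS_CTA (right six columns)

| # | Word | Freq | %Freq | CumFreq | Rank | # | Word | Freq | %Freq | CumFreq | Rank |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 末末 | 11 | 0.3479 | 43.6749 | 42 | 1 | 末末 | 11 | 0.3479 | 43.6749 | 42 |
| 2 | 一个 | 6 | 0.1898 | 55.0917 | 85 | 2 | 一个 | 6 | 0.1898 | 55.0917 | 85 |
| 3 | 未易 | 5 | 0.1581 | 60.5313 | 116 | 3 | 未易 | 5 | 0.1581 | 60.5313 | 116 |
| 4 | 老校 | 5 | 0.1581 | 61.1638 | 120 | 4 | 老校 | 5 | 0.1581 | 61.1638 | 120 |
| 5 | 擦破 | 4 | 0.1265 | 65.4965 | 152 | 5 | 擦破 | 4 | 0.1265 | 65.4965 | 152 |
| 6 | 真的 | 4 | 0.1265 | 65.8760 | 155 | 6 | 真的 | 4 | 0.1265 | 65.8760 | 155 |
| 7 | 这次 | 4 | 0.1265 | 66.2555 | 158 | 7 | 这次 | 4 | 0.1265 | 66.2555 | 158 |
| 8 | 你别 | 3 | 0.0949 | 67.3941 | 169 | 8 | 你别 | 3 | 0.0949 | 67.3941 | 169 |
| 9 | 同 | 3 | 0.0949 | 68.1531 | 177 | -- | -- | -- | -- | -- | -- |
| 10 | 坏 | 3 | 0.0949 | 69.0070 | 186 | -- | -- | -- | -- | -- | -- |
| 11 | 帮你 | 3 | 0.0949 | 69.7660 | 194 | 9 | 帮你 | 3 | 0.0949 | 69.7660 | 194 |
| 12 | 换屏 | 3 | 0.0949 | 70.5250 | 202 | 10 | 换屏 | 3 | 0.0949 | 70.5250 | 202 |
| 13 | 摸摸 | 3 | 0.0949 | 70.6199 | 203 | 11 | 摸摸 | 3 | 0.0949 | 70.6199 | 203 |
| 14 | 算 | 3 | 0.0949 | 71.3789 | 211 | -- | -- | -- | -- | -- | -- |
| 15 | 系里 | 3 | 0.0949 | 71.4738 | 212 | 12 | 系里 | 3 | 0.0949 | 71.4738 | 212 |
| 16 | 走走走 | 3 | 0.0949 | 72.0430 | 218 | 13 | 走走走 | 3 | 0.0949 | 72.0430 | 218 |
| 17 | 不知道 | 2 | 0.0633 | 73.2448 | 233 | 14 | 不知道 | 2 | 0.0633 | 73.2448 | 233 |
| 18 | 人份 | 2 | 0.0633 | 73.4978 | 237 | 15 | 人份 | 2 | 0.0633 | 73.4978 | 237 |
| 19 | 伤 | 2 | 0.0633 | 73.6875 | 240 | -- | -- | -- | -- | -- | -- |
| 20 | 你先 | 2 | 0.0633 | 73.7508 | 241 | 16 | 你先 | 2 | 0.0633 | 73.7508 | 241 |
| 21 | 先走 | 2 | 0.0633 | 74.0670 | 246 | 17 | 先走 | 2 | 0.0633 | 74.0670 | 246 |
| 22 | 口 | 2 | 0.0633 | 75.1423 | 263 | -- | -- | -- | -- | -- | -- |
| 23 | 只 | 2 | 0.0633 | 75.2056 | 264 | -- | -- | -- | -- | -- | -- |
| 24 | 太久 | 2 | 0.0633 | 76.2808 | 281 | 18 | 太久 | 2 | 0.0633 | 76.2808 | 281 |
| 25 | 好吧 | 2 | 0.0633 | 76.4073 | 283 | 19 | 好吧 | 2 | 0.0633 | 76.4073 | 283 |
| 26 | 对啊 | 2 | 0.0633 | 76.9133 | 291 | 20 | 对啊 | 2 | 0.0633 | 76.9133 | 291 |
| 27 | 对对 | 2 | 0.0633 | 76.9766 | 292 | 21 | 对对 | 2 | 0.0633 | 76.9766 | 292 |
| 28 | 想到 | 2 | 0.0633 | 77.6724 | 303 | -- | -- | -- | -- | -- | -- |
| 29 | 感 | 2 | 0.0633 | 77.7989 | 305 | -- | -- | -- | -- | -- | -- |
| 30 | 手机屏 | 2 | 0.0633 | 77.9886 | 308 | 22 | 手机屏 | 2 | 0.0633 | 77.9886 | 308 |
| 31 | 掉 | 2 | 0.0633 | 78.3681 | 314 | -- | -- | -- | -- | -- | -- |
| 32 | 推走 | 2 | 0.0633 | 78.4314 | 315 | 23 | 推走 | 2 | 0.0633 | 78.4314 | 315 |
| 33 | 来就 | 2 | 0.0633 | 78.8741 | 322 | 24 | 来就 | 2 | 0.0633 | 78.8741 | 322 |
| 34 | 甜度 | 2 | 0.0633 | 79.5066 | 332 | 25 | 甜度 | 2 | 0.0633 | 79.5066 | 332 |
| 35 | 睡著 | 2 | 0.0633 | 79.6331 | 334 | -- | -- | -- | -- | -- | -- |
| 36 | 著手 | 2 | 0.0633 | 80.1392 | 342 | -- | -- | -- | -- | -- | -- |
| 37 | 行行行 | 2 | 0.0633 | 80.2024 | 343 | 26 | 行行行 | 2 | 0.0633 | 80.2024 | 343 |
| 38 | 西药房 | 2 | 0.0633 | 80.2657 | 344 | 27 | 西药房 | 2 | 0.0633 | 80.2657 | 344 |
| 39 | 觉到 | 2 | 0.0633 | 80.3922 | 346 | 28 | 觉到 | 2 | 0.0633 | 80.3922 | 346 |
| 40 | 走走 | 2 | 0.0633 | 80.7084 | 351 | 29 | 走走 | 2 | 0.0633 | 80.7084 | 351 |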

Example #2: A Book

SWS and SWS_CTA_spaces produced results that are nearly identical. All 9,274 of the Chinese words in SWS are in SWS_CTA_spaces. SWS_CTA_spaces has one extra word. It occurs only twice in the book and has a frequency of only 0.002%. (It is 龟; guī; turtle.)

SWS and SWS_CTA produced results that are fairly similar. SWS and CTA also produced results that are fairly similar. SWS has 9,274 Chinese words, SWS_CTA has 7,706 Chinese words, and CTA has 8,421 Chinese words. The normal segmenter in the CTA software produced results similar to those of the SWS software for all words that have a frequency higher than 0.11%, with the exceptions of names and 一个 (yī gè; one). These are the top 145 most frequent words and have a cumulative frequency of 53.68%. The normal segmenter in the CTA software also produced results similar to those of the SWS software for most words that have a frequency lower than 0.11%.

There are 2,052 words in SWS that aren’t in SWS_CTA, and there are 484 words in SWS_CTA that aren’t in SWS. Of the 2,052 words in SWS that aren’t in SWS_CTA, 10% of the words (214 words) are within a cumulative frequency of 90% and have a total combined frequency of 3.27%. 44% of the words (897 words) are within a cumulative frequency of 95% and have a total combined frequency of 4.43%. I take this to mean that the segmenting in SWS is better than the segmenting in SWS_CTA by roughly 4%.

There are 2,265 words in SWS that aren’t in CTA, and there are 1,412 words in CTA that aren’t in SWS. Of the 2,265 words in SWS that aren’t in CTA, 10% of the words (226 words) are within a cumulative frequency of 90% and have a total combined frequency of 3.37%. 43% of the words (971 words) are within a cumulative frequency of 95% and have a total combined frequency of 4.63%. I take this to mean that the segmenting in SWS_CTA is better than the segmenting in CTA by less than 1%.

Of the 2,052 words in SWS that aren’t in SWS_CTA, 432 have a frequency of 2 or more. Of these 432 words, 6 of the 7 most frequent words are names. The second most frequent word is 一个 (yī gè; one). Of the 2,265 words in SWS that aren’t in CTA, 463 have a frequency of 2 or more. Of these 463 words, the 7 most frequent words are the same as above.

Summary

SWS and SWS_CTA_spaces are essentially identical. The differences are negligible. (Again, SWS_CTA_spaces is exported from a development version of the CTA software, using an option to segment by spaces.)

SWS, SWS_CTA and CTA are all fairly similar. SWS provides roughly 4% to 5% better segmenting than SWS_CTA, which in turn provides roughly 1% better segmenting than CTA. These results are based on two very different examples with similar results.

Example1.zip Example2.zip

May 9, 2022 at 07:03 AM

I would be interested how https://github.com/lancopku/pkuseg-python compares. I was never able to verify the claims on accuracy because the common evaluation dataset used by these segmenters is not openly available.

May 10, 2022 at 07:39 PM

On 5/9/2022 at 2:03 AM, jannesan said:

I would be interested how https://github.com/lancopku/pkuseg-python compares.

Installing pkuseg

I initially had trouble figuring out how to install pkuseg (https://github.com/lancopku/pkuseg-python/blob/master/readme/readme_english.md). It turns out that it was written for an older version(s) of the Python programming language (https://www.python.org/), specifically 3.7, 3.6 and/or 3.5. So to cut to the chase, I just installed everything on an older Mac.

I wanted to use Homebrew (https://brew.sh/) to install Python 3.7, because it’s simpler. So, I had to install Homebrew first. To do this, I copied-and-pasted the command below into the Terminal application. (Always hit the return key after putting in a command.)

/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

To install Python 3.7, copy-and-paste the command below into Terminal.

brew install python@3.7

pkuseg also requires Numpy (whatever that is, lol). I looked up the latest version of Numpy that goes along with Python 3.7. It’s 1.21.5. To install Numpy 1.21.5, copy-and-paste the command below into Terminal.

pip3 install numpy==1.21.5

To install pkuseg, copy-and-paste the commands below into Terminal. I’m not sure both are needed, but I ran both anyway.

pip3 install pkuseg

pip3 install -U pkuseg

Running pkuseg

Below is the Python script to run pkuseg on an inputted text file. This script uses the default mixed-domain model for segmentation, which is recommended, if you don’t know what model to pick. (The other models are news, web, medicine and tourism, so none of them seem to directly fit my input.) I named the output file pkuseg.txt. nthread is the number of threads. 20 is recommended for inputted text files.

import pkuseg

#Take file 'input.txt' as input.

#The segmented result is stored in output file 'pkuseg.txt'.

pkuseg.test('input.txt', 'pkuseg.txt', nthread=20)

I put the script above in a text file named segtextfile.py. This file has to be in the same folder as input.txt.

In the Terminal application, change the folder to the folder with input.txt in it, using the command below. In the command below, replace folderpath with the actual path, e.g., /Users/mth123/Documents/Example1.

cd folderpath

In Terminal, run pkuseg using the command below.

python3 segtextfile.py

Post-Processing Output from pkuseg

The rest of the steps are the same as the original post. I’ve attached a bunch of files. The results are discussed below.

Example #1 – An Episode of a Chinese TV Drama

SWS and pkuseg produced results that are fairly similar. SWS has 947 Chinese words, and pkuseg has 896 Chinese words. The pkuseg script produced results similar to those of the SWS software for all words that have a frequency higher than 0.32%, with the exception of 2 names. These are the top 51 most frequent words and have a cumulative frequency of 46.62%. The pkuseg script also produced results similar to those of the SWS software for most words that have a frequency lower than 0.32%.

There are 165 words in SWS that aren’t in pkuseg, and there are 114 words in pkuseg that aren’t in SWS. Of the 165 words in SWS that aren’t in pkuseg, 52% of the words (86 words) are within a cumulative frequency of 90% and have a total combined frequency of 5.63%. 79% of the words (130 words) are within a cumulative frequency of 95% and have a total combined frequency of 7.02%. I take this to mean that the segmenting in SWS is better than the segmenting in pkuseg by roughly 7%.

Example #2 – A Book

SWS and pkuseg produced results that are fairly similar. SWS has 9,274 Chinese words, and pkuseg has 9,721 Chinese words. The pkuseg script produced results similar to those of the SWS software for all words that have a frequency higher than 0.16%. These are the top 95 most frequent words and have a cumulative frequency of 47.59%. The pkuseg script also produced results similar to those of the SWS software for most words that have a frequency lower than 0.16%.

There are 1,760 words in SWS that aren’t in pkuseg, and there are 2,207 words in pkuseg that aren’t in SWS. Of the 1,760 words in SWS that aren’t in pkuseg, 8.13% of the words (143 words) are within a cumulative frequency of 90% and have a total combined frequency of 1.38%. 40% of the words (697 words) are within a cumulative frequency of 95% and have a total combined frequency of 2.31%. I take this to mean that the segmenting in SWS is better than the segmenting in pkuseg by roughly 2%.

Summary

The results of the two examples are noticeably different. SWS is 7% better than pkuseg for Example #1, which is an episode of a Chinese TV drama. It is 2% better for Example #2, which is a book. CTA segmented Example #1 better than pkuseg. pkuseg segmented Example #2 better than CTA.

Maybe pkuseg produces better results for news, web, medicine and tourism. I don’t consume these in Chinese, so I don’t have examples to try. This has been an interesting exercise, but I’m going to put my old Mac away again.

Example2.zip Example1.zip

May 11, 2022 at 12:14 AM

Thanks for these writeups, they're very insightful, and also the reason why after all these years I still haven't put much effort in to improving the CTA segmenter. One of CTA's design goals was to make it easy to extract frequently used unknown words from a document, and the current segmenter still performs acceptably in that regard, and would require considerable effort to improve it, for only marginal improvements in overall utility.

This makes the segmenter not a great choice if you are using CTA as a document reader, but still a good choice if you are using it for word extraction or to compare the relative difficulties of text with respect to your current vocabulary.

May 11, 2022 at 02:14 AM

On 5/10/2022 at 7:14 PM, imron said:

Thanks for these writeups, they're very insightful, and also the reason why after all these years I still haven't put much effort in to improving the CTA segmenter.

You’re very welcome! I had to think hard about how to analyze the results in a useful-and-practical way. The two days I spent reading the epic thread https://www.chinese-forums.com/forums/topic/44383-introducing-chinese-text-analyser/ came in handy in an unexpected way. It’ll probably be many years, before I try SWS again, because I’m happy with CTA as it is and I’m early in my Chinese learning/re-learning.
