grep with CC-CEDICT? (UTF8)

December 5, 2018 at 10:03 AM

Bit of a techie question here. I'm pretty rusty with command-line grep as I've not used it in a few years now.

I'm doing a bit of file processing that involves finding words in the CC-CEDICT dictionary cedict_ts.u8 which I downloaded from MDBG.

I can't get grep (or egrep) to find spaces in the input, whereas a file I've created myself works fine.

Has anyone come across this before? Is there something about spaces after hanzi in Unicode text that stops them matching the "\s" pattern?

I created a small test file test.u8, and I can find hanzi followed by spaces in that no problem.

See terminal output below. I've added a space before each shell prompt for readability. I'm using bash on OSX 10.11.6.

$ cat test.u8
Weekend News
Weekend
周末快乐 other stuff
周末 other stuff

$ file test.u8
test.u8: UTF-8 Unicode text

$ grep "^周末\s" test.u8
周末 other stuff

$ grep "周末" test.u8
周末快乐 other stuff
周末 other stuff

$ file cedict_ts.u8
cedict_ts.u8: UTF-8 Unicode English text, with very long lines, with CRLF line terminators

$ head -10 cedict_ts.u8
# CC-CEDICT
# Community maintained free Chinese-English dictionary.
#
# Published by MDBG
#
# License:
# Creative Commons Attribution-Share Alike 3.0
# http://creativecommons.org/licenses/by-sa/3.0/
#
# Referenced works:

$ grep "周末" cedict_ts.u8
南方周末 南方周末 [Nan2 fang1 Zhou1 mo4] /Southern Weekend (newspaper)/
週末 周末 [zhou1 mo4] /weekend/
週末愉快 周末愉快 [zhou1 mo4 yu2 kuai4] /Have a nice weekend!/

$ grep "^周末\s" cedict_ts.u8
$ (no output)

Actually this also produces no output, so maybe it's not just a space issue?

$ grep "^周末" cedict_ts.u8

December 5, 2018 at 12:05 PM

It's not a space issue. The format of CC-CEDICT is

Trad Simp [pinyin] /definition/

Your regex starts with ^ so it's searching for 周末 at the start of the line. CC-CEDICT always has the traditional form at the start of the line, so guess what the traditional version of 周末 is? (hint, it's not 周末).
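If the aim is to look a word up by its simplified form, one option is to compare against the second whitespace-separated field instead of anchoring a regex at the start of the line. A minimal sketch, assuming the Trad Simp [pinyin] /definition/ layout above; `lookup_simplified` is a made-up helper name, and the sample file just reuses entries quoted in this thread:

```shell
# Hypothetical helper: print entries whose simplified form (2nd field) equals the word.
# $1 = word to look up, $2 = dictionary file in CC-CEDICT layout
lookup_simplified() {
  awk -v word="$1" '$2 == word' "$2"
}

# Small sample in the same layout, so the sketch runs without the full dictionary.
sample=$(mktemp)
printf '%s\n' \
  '南方周末 南方周末 [Nan2 fang1 Zhou1 mo4] /Southern Weekend (newspaper)/' \
  '週末 周末 [zhou1 mo4] /weekend/' \
  '週末愉快 周末愉快 [zhou1 mo4 yu2 kuai4] /Have a nice weekend!/' > "$sample"

lookup_simplified 周末 "$sample"
# prints: 週末 周末 [zhou1 mo4] /weekend/
```

Against the real file this would be `lookup_simplified 周末 cedict_ts.u8`. Since awk splits fields on spaces and tabs, the CR from the CRLF line endings stays in the last field and should not affect the comparison.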

December 5, 2018 at 01:36 PM

GAH!! Thanks @imron, actually I was kind of hoping it was a #PEBKAC

I had been working in the terminal at rather low resolution and hadn't noticed. Also StickyStudy format (which I'm aiming for) has simplified first, traditional second, so somehow that was in my head.

So this works as expected:

$ grep "^.*\s周末\s" cedict_ts.u8

週末 周末 [zhou1 mo4] /weekend/

And now, back to your scheduled discussion of scholarships, workplace relationships and other ships...

Cheers!

December 5, 2018 at 02:17 PM

31 minutes ago, mungouk said:

So this works as expected:

$ grep "^.*\s周末\s" cedict_ts.u8

My preference would be

grep "^\S\+\s周末\s" cedict_ts.u8

Which should be slightly faster.
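Speed aside, the two patterns can also match different lines: `.*` is free to run across several fields, whereas a single run of non-space characters pins 周末 to the second field. A rough sketch of the difference follows. It assumes `grep -E` with POSIX bracket classes in place of `\s` and `\S` (which are extensions, not POSIX syntax), and the second sample line is a made-up entry, not taken from CC-CEDICT:

```shell
# Two-line sample: one real-looking entry, one MADE-UP entry whose gloss mentions the word.
sample=$(mktemp)
printf '%s\n' \
  '週末 周末 [zhou1 mo4] /weekend/' \
  '例 例 [li4] /made-up entry mentioning 周末 in its gloss/' > "$sample"

# Loose: ".*" may span fields, so the word inside the definition also counts as a match.
loose=$(grep -c -E '^.*[[:space:]]周末[[:space:]]' "$sample")

# Strict: exactly one run of non-space characters (the traditional field) comes first.
strict=$(grep -c -E '^[^[:space:]]+[[:space:]]周末[[:space:]]' "$sample")

echo "loose=$loose strict=$strict"
# prints: loose=2 strict=1
```

Whether a real entry ever has the word surrounded by spaces inside its definition is not something checked here, so treat the made-up line as an illustration only.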
