mungouk Posted December 5, 2018 at 10:03 AM

Bit of a techie question here. I'm pretty rusty with command-line grep as I've not used it in a few years now.

I'm doing a bit of file processing that involves finding words in the CC-CEDICT dictionary cedict_ts.u8, which I downloaded from MDBG. I can't get grep (or egrep) to find spaces in the input, whereas a file I've created myself works fine.

Has anyone come across this before? Is it something to do with spaces after Hanzi in Unicode that means they don't match the "\s" pattern?

I created a small test file, test.u8, and I can find hanzi followed by spaces in that with no problem. See the terminal output below. I've added a space before each shell prompt for readability. I'm using bash on OSX 10.11.6.

 $ cat test.u8
Weekend News
Weekend
周末快乐 other stuff
周末 other stuff

 $ file test.u8
test.u8: UTF-8 Unicode text

 $ grep "^周末\s" test.u8
周末 other stuff

 $ grep "周末" test.u8
周末快乐 other stuff
周末 other stuff

 $ file cedict_ts.u8
cedict_ts.u8: UTF-8 Unicode English text, with very long lines, with CRLF line terminators

 $ head -10 cedict_ts.u8
# CC-CEDICT
# Community maintained free Chinese-English dictionary.
#
# Published by MDBG
#
# License:
# Creative Commons Attribution-Share Alike 3.0
# http://creativecommons.org/licenses/by-sa/3.0/
#
# Referenced works:

 $ grep "周末" cedict_ts.u8
南方周末 南方周末 [Nan2 fang1 Zhou1 mo4] /Southern Weekend (newspaper)/
週末 周末 [zhou1 mo4] /weekend/
週末愉快 周末愉快 [zhou1 mo4 yu2 kuai4] /Have a nice weekend!/

 $ grep "^周末\s" cedict_ts.u8
 $
(no output)

Actually this also produces no output, so maybe it's not just a space issue?

 $ grep "^周末" cedict_ts.u8
imron Posted December 5, 2018 at 12:05 PM

It's not a space issue. The format of CC-CEDICT is

Trad Simp [pinyin] /definition/

Your regex starts with ^, so it's searching for 周末 at the start of the line. CC-CEDICT always has the traditional form at the start of the line, so guess what the traditional version of 周末 is? (Hint: it's not 周末.)
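Since the entry layout is whitespace-separated, one way to sidestep the regex anchoring question entirely is to compare the second field directly. The following is only a minimal sketch: `lookup_simplified` is a hypothetical helper name, and the sample lines are the three entries from the grep output in the first post, not the full dictionary.

```shell
# Sample lines in the CC-CEDICT layout: Trad Simp [pinyin] /definition/
# (copied from the grep output earlier in the thread).
sample=$(mktemp)
cat > "$sample" <<'EOF'
南方周末 南方周末 [Nan2 fang1 Zhou1 mo4] /Southern Weekend (newspaper)/
週末 周末 [zhou1 mo4] /weekend/
週末愉快 周末愉快 [zhou1 mo4 yu2 kuai4] /Have a nice weekend!/
EOF

# lookup_simplified is a hypothetical helper, not part of any existing tool.
# awk splits each line on whitespace, so $1 is the traditional form
# and $2 is the simplified form; only exact matches on $2 are printed.
lookup_simplified() {
    awk -v word="$1" '$2 == word' "$2"
}

lookup_simplified 周末 "$sample"
```

With the sample above this should print only the 週末 周末 entry, because 南方周末 and 周末愉快 are not exact matches for the second field.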
mungouk Posted December 5, 2018 at 01:36 PM (Author)

GAH!! Thanks @imron, actually I was kind of hoping it was a #PEBKAC.

I had been working in the terminal at rather low resolution and hadn't noticed. Also, StickyStudy format (which I'm aiming for) has simplified first, traditional second, so somehow that was in my head.

So this works as expected:

 $ grep "^.*\s周末\s" cedict_ts.u8
週末 周末 [zhou1 mo4] /weekend/

And now, back to your scheduled discussion of scholarships, workplace relationships and other ships...

Cheers!
imron Posted December 5, 2018 at 02:17 PM

31 minutes ago, mungouk said:

    So this works as expected:
    $ grep "^.*\s周末\s" cedict_ts.u8

My preference would be

grep "^\S\+\s周末\s" cedict_ts.u8

which should be slightly faster.
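Beyond speed, the two patterns can also differ in what they match: `^.*\s` allows any prefix, so 周末 surrounded by spaces inside a definition would also hit, whereas `^\S\+\s` pins the match to the second field. The sketch below illustrates this with POSIX character classes under `grep -E` (swapped in for `\s` and `\S`, which are extensions and may not be available in every grep). The second sample line is a made-up entry written purely for this illustration; it is not taken from CC-CEDICT.

```shell
# First line is the real entry quoted in the thread. The second line is a
# MADE-UP entry (hypothetical, for illustration only) whose definition
# happens to contain " 周末 " surrounded by spaces.
sample=$(mktemp)
cat > "$sample" <<'EOF'
週末 周末 [zhou1 mo4] /weekend/
例子 例子 [li4 zi5] /made-up entry mentioning 周末 in its definition/
EOF

# Any prefix allowed: 周末 can sit anywhere after whitespace, so both lines count.
loose=$(grep -c -E '^.*[[:space:]]周末[[:space:]]' "$sample")

# Exactly one run of non-space characters, then whitespace: only field 2 can match.
strict=$(grep -c -E '^[^[:space:]]+[[:space:]]周末[[:space:]]' "$sample")

echo "loose=$loose strict=$strict"
```

On this sample the loose pattern counts both lines and the anchored one counts only the genuine 周末 entry. Whether the real dictionary contains such definitions is not something this sketch checks.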