colorize hanzi and pinyin by tone (Mandarin)

Xyne · 2017-09-09 07:20:36

update (2017-09-27)
The code has been cleaned up, modularized and packaged. In addition to the command-line tool, the package now includes a server for easily copypasting input.

See the project page for up-to-date info: https://xyne.archlinux.ca/projects/hapi/
Screenshot links below may be broken or old. Here are the latest screenshots.

I discovered that some sites for learning Mandarin use different colors as a visual memory aid for learning the tones of characters. I thought the idea was interesting so I ended up writing a script to generate colorized text and charts. This is a toy at the moment but it's already useful for simple tasks. If there's enough interest then I will hurry up and turn it into an actual project and package it.

Disclaimer: For all I know, there are probably much better tools that already do all of this and more. I just felt like playing around with the idea and ended up with this. As usual, I'm sharing it in case someone else finds it amusing or useful.

What it can do so far
Colorize hanzi from CLI args, STDIN or a file. When outputting to HTML, it can optionally include aligned pinyin with diacritic tone marks. It also generates a table at the end of the text input with all hanzi, their matching pinyin, their tones, their Unihan definitions, and an audio element to play the sound (single hanzi only).
screenshot.

The command-line output only prints the colorized hanzi (too lazy to deal with aligning pinyin under hanzi on the command-line at the moment).
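
A minimal sketch of the colorizing idea for console output. This is not the actual script: the tone table and the ANSI color choices below are made up for illustration, whereas the script takes its readings from the Unihan data and supports named color schemes.

```python
# Hypothetical tone table; a real tool would look these up in the Unihan data.
TONES = {"汉": 4, "语": 3, "拼": 1, "音": 1}

# Illustrative ANSI foreground color codes per tone (1-4, 5 = neutral).
ANSI_COLORS = {1: 31, 2: 33, 3: 32, 4: 34, 5: 90}


def colorize(text, tones=TONES, colors=ANSI_COLORS):
    """Wrap each known hanzi in the ANSI color escape for its tone."""
    out = []
    for ch in text:
        tone = tones.get(ch)
        if tone is None:
            out.append(ch)  # punctuation and unknown characters pass through
        else:
            out.append(f"\x1b[{colors[tone]}m{ch}\x1b[0m")
    return "".join(out)


print(colorize("汉语拼音。"))
```

Swapping the escape sequences for HTML spans or BBCode color tags would give the other output targets.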

It can also generate tables for hanzi and pinyin along with the Unihan definitions (provided by the Unicode Consortium) for either hanzi or pinyin output. For example, if you have some characters and you need to know how to pronounce them, you can just pass them on the command-line. Or if you need to find a character, you can use pinyin. HTML output includes audio elements.
screenshot

At the moment the script is provided in an archive with hardcoded relative paths (the Unihan file is expected in dat/unihan and generated HTML pages expect dat/mp3). You can get it here. The README file contains instructions and scripts for downloading the Unihan and audio files, along with a list of deps for the main script and helper scripts. Run the get_all.sh and test.sh scripts to generate example files and output.

More screenshots here: https://xyne.archlinux.ca/misc/hapi/screenshots/

Don't hesitate to post questions or suggestions.

edit: BBCode output example, equivalent to ANSI console output on the command-line

hapi wrote:

$ ./hapy.py --target bbcode --color mdbg --pinyin --file test/input.txt
汉语拼音，在中国大陆常簡稱为拼音，是一種以拉丁字母作普通话（現代標準漢語）標音的方案。是目前的中文羅馬拼音國際標準規範。汉语拼音在中国大陆作为基础教育内容全面使用，是义务教育的重要内容。在海外，特别是常用現代標準漢語的地区如新加坡、马来西亚、菲律宾和美国唐人街等，目前也在汉语教育中进行汉语拼音教学。臺灣自2008年開始，中文譯音使用原則也採用漢語拼音[1]，但舊護照姓名和部分地名仍採用舊式威妥瑪拼音[2]
HànYǔPīnYīn ， ZàiZhōngGúoDàLùChángJiǎnChēngWèiPīnYīn ， ShìYīZhǒngYǐLāDīngZìMǔZùoPǔTōngHuà （ XiànDàiBiāoZhǔnHànYǔ ） BiāoYīnDeFāngÀn 。 ShìMùQiánDeZhōngWénLúoMǎPīnYīnGúoJìBiāoZhǔnGūiFàn 。 HànYǔPīnYīnZàiZhōngGúoDàLùZùoWèiJīChǔJiàoYùNèiRóngQuánMiànShǐYòng ， ShìYìWùJiàoYùDeZhòngYàoNèiRóng 。 ZàiHǎiWài ， TèBiéShìChángYòngXiànDàiBiāoZhǔnHànYǔDeDeQūRúXīnJiāPō 、 MǎLáiXīYà 、 FēiLǜBīnHéMěiGúoTángRénJiēDěng ， MùQiánYěZàiHànYǔJiàoYùZhōngJìnXíngHànYǔPīnYīnJiàoXué 。 TáiWānZì 2008 NiánKāiShǐ ， ZhōngWénYìYīnShǐYòngYuánZéYěCǎiYòngHànYǔPīnYīn [1]， DànJìuHùZhàoXìngMíngHéBùFēnDeMíngRéngCǎiYòngJìuShìWēiTǔoMǎPīnYīn [2]
$ ./hapy.py --target bbcode --color mdbg --chart h2p 教常簡稱
常 cháng 2 common, normal, frequent, regular
教 jiào 4 teach, class
稱 chēng 1 call; name, brand; address; say
簡 jiǎn 3 simple, terse, succinct; letter
$ ./hapy.py --target bbcode --color dummit --chart p2h deng3 cháng qian
㙊 cháng 2 (same as 場) an area of level ground; an open space, a threshing floor, arena for drill, etc. a place, to pile a sand-hills
㦂 cháng 2 (ancient form of 常) constantly, frequently, usually habitually, regular, common, a rule, a principle
䗅 cháng 2 millipede, (of a road) winding; zigzag
䠆 cháng 2 to kowtow; to kneel and make obeisance
䯴 cháng 2 a coiffure with a topknot
偿 cháng 2 repay, recompense;
...
䒭 děng 3 (same as 等) rank; grade, same; equal, to wait, and so on; etc.
戥 děng 3 a small steelyard for weighing money, etc.
朩 děng 3 kwukyel: rank, grade; wait; equal; 'etc.'
等 děng 3 rank, grade; wait; equal; 'etc.'
㗔 qiān 1 (a dialect) joy; happiness
㩃 qiān 1 to take or capture (a city, etc.), to gather or to collect
㪠 qiān 1 (same as 鵮鵮) to peck, poverty; poor, things of the same value, to take; to fetch; to obtain, to select; to choose, (of a bird) to peck
䀒 qiān 1 gloomy; dark; obscure
䇂 qiān 1 (ancient form) fault; sin
...
㦮 qián 2 (abbreviated form of 錢) money; cash, a unit of weight, a Chinese family name
㨜 qián 2 to help each other, to shoulder; to take upon oneself
㩮 qián 2 to lift up or off; to raise high, to unveil
...
㦿 qiǎn 3 a window, a small door, (ancient form 戶) a door, a household
㧄 qiǎn 3 to take; to receive; to fetch; to take hold of
㹂 qiǎn 3 untamed and indocility cattle, huge; big; large
䇜 qiǎn 3 small bamboo, a kind of bamboo
䭤 qiǎn 3 to chew; to eat, to roll round with the hand, cakes; biscuits
...
㐸 qiàn 4 (non-classical form of 欠) to owe money, deficient, to yawn, last name
㜞 qiàn 4 beautiful; pretty, used in girl's name
㟻 qiàn 4 (same as 塹) the moat around a city, a pit; a hole or cavity in the ground (same as 嶄) (of a mountain) high and pointed, novel; new
㯠 qiàn 4 a cross-beam; an axle, etc.
䈴 qiàn 4 a cage; a basket; a noose...
籖 qian 5 tally; lot; marker
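
The p2h example above accepts syllables written with a tone number (deng3) as well as with diacritics (cháng). A rough sketch of converting the numbered form to the marked form is shown below. It is an illustration only: the function name is hypothetical and it follows the standard pinyin placement rules, which is not necessarily what hapi does internally.

```python
import unicodedata

# Combining marks for tones 1-4; the neutral tone takes no mark.
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}


def numbered_to_marked(syllable):
    """Convert e.g. 'deng3' to the marked form; 'v' is accepted for 'ü'."""
    syllable = syllable.replace("v", "ü")
    if not syllable or syllable[-1] not in "012345":
        return syllable  # no tone number given
    base, tone = syllable[:-1], int(syllable[-1])
    if tone not in TONE_MARKS:
        return base  # neutral tone (0 or 5): no mark
    lower = base.lower()
    # Standard placement: 'a' or 'e' if present, else the 'o' of 'ou',
    # else the last vowel of the syllable.
    if "a" in lower:
        i = lower.index("a")
    elif "e" in lower:
        i = lower.index("e")
    elif "ou" in lower:
        i = lower.index("o")
    else:
        vowels = [k for k, c in enumerate(lower) if c in "aeiouü"]
        if not vowels:
            return base
        i = vowels[-1]
    marked = base[: i + 1] + TONE_MARKS[tone] + base[i + 1:]
    # Compose the base letter and combining mark into a single code point.
    return unicodedata.normalize("NFC", marked)


for s in ("deng3", "qian1", "lv4", "qian"):
    print(s, "->", numbered_to_marked(s))
```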

Last edited by Xyne (2017-09-27 05:44:33)

Spyhawk · 2017-09-09 12:07:25

This looks great!

You might have come across it already, but this reminds me of the cjklib project. It's in python 2 and sadly unmaintained (doesn't work with recent SQL alchemy), but this might be useful for some of your planned roadmap.
And in case you become interested in characters decomposition in the future, this might very well be worth a look.

Xyne · 2017-09-09 14:50:47

Thanks! cjklib has given me some ideas. So far I had only looked at Unihan_Readings.txt but now I will probably add information about radicals and strokes (and maybe expand support to other readings). I'll post an update when I get around to it (probably soon, as this is my current go-to distraction when I'm on the computer).

Xyne · 2017-09-12 06:23:50

Made some changes:

Use HTML ruby and rt tags for aligning Pinyin with Hanzi. This places the Pinyin above the Hanzi. I would prefer to keep it below but I couldn't find a way to do it with CSS. The advantage of using the ruby tags is that text can be copied without newline elements from the divs that I was using to align things before.
Buttons to toggle visibility of the Hanzi and Pinyin in text output. This makes it easy to copy a sequence of Hanzi or Pinyin separately. N.B. The rt elements can be selected but not copied in Firefox. Use Chromium or another browser if you need to copy the Pinyin. The advantage of Firefox is that you don't need to toggle the Pinyin to copy just the Hanzi.
All Hanzi in full-page HTML output now link to the matching page on yellowbridge, which provides English translations of the character and common pairs that include it. The page also provides an animation of the stroke order.
Console output now supports different targets (ansi, html, bbcode) for easy insertion of the output (bonus of using the ColorizedText module).
Added bbcode test output to OP.
Fixed handling of ü and v which broke some audio elements.
Added a --reset option to reset the database.
Added a full page screenshot of the test page for Pinyin-to-Hanzi output with phantomjs (which is why it says that the browser does not support audio elements).
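
For anyone curious what the ruby markup from the first item looks like, here is a small sketch that builds it from (hanzi, pinyin) pairs. The function name and the input format are assumptions for the example, not the script's actual interface.

```python
from html import escape


def ruby(pairs):
    """Build HTML that shows each reading above its character via ruby/rt."""
    parts = []
    for hanzi, pinyin in pairs:
        if pinyin:
            parts.append(
                f"<ruby>{escape(hanzi)}<rt>{escape(pinyin)}</rt></ruby>"
            )
        else:
            parts.append(escape(hanzi))  # punctuation has no reading
    return "".join(parts)


print(ruby([("汉", "hàn"), ("语", "yǔ"), ("。", None)]))
```

Because each character and its reading live in one inline element, selecting the text does not pick up the line breaks that separate div rows would introduce.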

NoSuck · 2017-09-19 16:57:01

Thank you. This is a great tool. I'm glad I stumbled upon it before the resources downloaded by the automatic download script are taken off the web (as has been known to happen). The download script worked like a charm.

If there's enough interest then I will hurry up and turn it into an actual project and package it.

I'd wager that every person in the world who uses both Chinese and the *nix command-line would be interested in this project. Personally, I use sites like Purple Culture and Chinese Converter to obtain zhuyin fuhao (a.k.a. bopomofo) conversions of hanzi. Have you considered adding such functionality? It's basically pinyin with its own set of hanzi-inspired symbols.

Questions and Observations:

Is it "hapi" or "hapy"?
test.sh should probably look in ../dat instead of dat (or be moved elsewhere).
The historical naming scheme of Python packages (python3-colorsysplus) makes more sense than the current, official scheme--which would drop the "3". Explicit is better than implicit. Years from now, when Python 4 is released, perhaps we will realize this.</sidetrack>
Finally, I get no output and non-colorized output for a couple of test commands.

■ ./hapi.py --color mdbg --chart h2p 教常簡稱
INFO:root:database path: /home/grady/.local/share/hapi/hapi.sqlite3
■ ./hapi.py --color mdbg --pinyin --file test/input.txt
INFO:root:database path: /home/grady/.local/share/hapi/hapi.sqlite3
汉语拼音，在中国大陆常簡稱为拼音，是一種以拉丁字母作普通话（現代標準漢語）標音的方案。是目前的中文羅馬拼音國際標準規範。汉语拼音在中国大陆作为基础教育内容全面使用，是义务教育的重要内容。在海外，特别是常用現代標準漢語的地区如新加坡、马来西亚、菲律宾和美国唐人街等，目前也在汉语教育中进行汉语拼音教学。臺灣自2008年開始，中文譯音使用原則也採用漢語拼音[1]，但舊護照姓名和部分地名仍採用舊式威妥瑪拼音[2]


汉语拼音，在中国大陆常簡稱为拼音，是一種以拉丁字母作普通话（現代標準漢語）標音的方案。是目前的中文羅馬拼音國際標準規範。汉语拼音在中国大陆作为基础教育内容全面使用，是义务教育的重要内容。在海外，特别是常用現代標準漢語的地区如新加坡、马来西亚、菲律宾和美国唐人街等，目前也在汉语教育中进行汉语拼音教学。臺灣自2008年開始，中文譯音使用原則也採用漢語拼音[1]，但舊護照姓名和部分地名仍採用舊式威妥瑪拼音[2]

Last edited by NoSuck (2017-09-19 16:58:13)

Spyhawk · 2017-09-19 17:25:24

NoSuck wrote:

Is it "hapi" or "hapy"?

Yeah, forgot to report this: there is an inconsistent echo that uses "hapy" instead of "hapi" in the download script, but that is very minor.
Also, I noticed the building process of the sqlite db file is quite slow (like 30-40min or more). I'm not sure why, but once it's built the script works great.

Last edited by Spyhawk (2017-09-19 17:26:30)

Xyne · 2017-09-23 23:50:09

NoSuck wrote:

Have you considered adding such functionality? It's basically pinyin with its own set of hanzi-inspired symbols.

I hadn't considered it yet because I was only vaguely aware of it. I took a look and I am willing to implement it, but it seems that the conversion is not one-to-one. I will need to write special conversion functions that handle different cases and that will require time and testing to get it right (but not too much). I'll add it to the TODO list.

NoSuck wrote:

Is it "hapi" or "hapy"?

"hapi" (hanzi+pinyin). I have updated the typos in the README and test script.

NoSuck wrote:

test.sh should probably look in ../dat instead of dat (or be moved elsewhere).

Doesn't it? The only reference to the dat dir is a symlink created in the test dir (pointing to ../dat), and all paths are created relative to the parent directory of the script directory. Maybe I'm just stupidly tired right now, but I don't see where the path mismatch would be. The get_all.sh script should ensure that the dat directory is created where expected.

Note that all of the hard-coded paths will be fixed to use appropriate environment variables and command-line parameters.

NoSuck wrote:

The historical naming scheme of Python packages (python3-colorsysplus) makes more sense than the current, official scheme--which would drop the "3". Explicit is better than implicit. Years from now, when Python 4 is released, perhaps we will realize this.</sidetrack>

That's exactly my reasoning and I've argued the point several times. Technical correctness and future-proofing the dependency hierarchy should be design goals, even if it costs an extra character in a few hundred package names.

NoSuck wrote:

Finally, I get no output and non-colorized output for a couple of test commands.

I can't reproduce this. The duplication of hanzi output instead of the hanzi followed by pinyin for the second example makes me suspect that it's due to changes in the database. I have added code to check the existing tables and report an error when they mismatch the expected layout. Update to the latest version and check if it reports a mismatch, then reset the database regardless of what it says and try again.
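
A rough sketch of such a layout check with Python's sqlite3 module, in case it helps anyone debugging their own database. The helper names and the expected layout in the demo are simplified assumptions, not the code in hapi.

```python
import sqlite3


def table_columns(conn, table):
    """Return [(name, type), ...] for a table, or [] if it does not exist."""
    # PRAGMA statements cannot take bound parameters, so quote by hand.
    quoted = '"' + table.replace('"', '""') + '"'
    rows = conn.execute(f"PRAGMA table_info({quoted})").fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return [(name, ctype) for _cid, name, ctype, *_rest in rows]


def find_mismatches(conn, expected):
    """Yield (table, expected, detected) for each table whose layout differs."""
    for table, cols in expected.items():
        detected = table_columns(conn, table)
        if detected != cols:
            yield table, cols, detected


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Variants (char TEXT PRIMARY KEY, kZVariant TEXT)")
expected = {
    "Variants": [("char", "TEXT"), ("kZVariant", "TEXT")],
    "Readings": [("char", "TEXT"), ("kMandarin", "TEXT")],  # never created
}
for table, want, got in find_mismatches(conn, expected):
    print(f'mismatch in table "{table}": expected {want}, detected {got}')
```

A missing table shows up as an empty list of detected fields.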

Spyhawk wrote:

Also, I noticed the building process of the sqlite db file is quite slow (like 30-40min or more). I'm not sure why, but once it's built the script works great.

The database building function is not optimal but it should only take a few seconds, not 30-40 minutes:

time ./hapi.py --reset
INFO:root:database path: ~/.local/share/hapi/hapi.sqlite3
INFO:root:inserted 47202 rows (228753 values) into Readings
INFO:root:inserted 9980 rows (12355 values) into Variants


real	0m6.252s
user	0m4.240s
sys	0m1.960s

Check the Unihan Readings and Variants files. Readings should be 5.5 MB (187564 lines) and Variants 437 kB (12376 lines). hapi.sqlite3 should be 4.2 MB with the same number of rows inserted as above.

If all of those seem to be correct, I will add debugging messages to try to identify the problem.

edit
I have added preliminary support for zhuyin (screenshot). Adding support for zhuyin input (similar to the current p2h functionality along with zhuyin->pinyin and pinyin->zhuyin) is on the TODO list.

I realized that I didn't need to do character conversions following the reduced tables that I had initially found (such as this one) and that I could just dump the full conversion table into the database to ensure correctness and take advantage of more advanced queries.
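
To illustrate the table-based approach, here is a sketch with a two-column table and a handful of well-known pinyin/zhuyin pairs. The column names, the helper functions and the tiny data set are assumptions for the example; the full table covers every syllable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE phonetics (Pinyin TEXT PRIMARY KEY, Zhuyin TEXT UNIQUE)"
)
# A few pairs for illustration. Note that "zhi" and "yu" do not map
# letter by letter, which is why a plain lookup table is convenient.
conn.executemany(
    "INSERT INTO phonetics VALUES (?, ?)",
    [("ma", "ㄇㄚ"), ("hao", "ㄏㄠ"), ("zhi", "ㄓ"), ("yu", "ㄩ")],
)


def pinyin_to_zhuyin(conn, pinyin):
    """Look up the zhuyin for a toneless pinyin syllable, or None."""
    row = conn.execute(
        "SELECT Zhuyin FROM phonetics WHERE Pinyin = ?", (pinyin,)
    ).fetchone()
    return row[0] if row else None


def zhuyin_to_pinyin(conn, zhuyin):
    """Look up the pinyin for a toneless zhuyin syllable, or None."""
    row = conn.execute(
        "SELECT Pinyin FROM phonetics WHERE Zhuyin = ?", (zhuyin,)
    ).fetchone()
    return row[0] if row else None


print(pinyin_to_zhuyin(conn, "zhi"), zhuyin_to_pinyin(conn, "ㄩ"))
```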

I'm still curious about an algorithm to interconvert zhuyin to pinyin based on initials, medials and finals. If anyone happens to find one, please post a link here.

A full pinyin-IPA chart would be useful too (Wikipedia's IPA/Mandarin page is nice but the pronunciations change in different words).

Last edited by Xyne (2017-09-24 03:58:59)

Spyhawk · 2017-09-24 15:11:42

Xyne wrote:

The database building function is not optimal but it should only take a few seconds, not 30-40 minutes

time ./hapi.py --reset
INFO:root:database path: ~/.local/share/hapi/hapi.sqlite3
INFO:root:inserted 47202 rows (228753 values) into Readings
INFO:root:inserted 9980 rows (12355 values) into Variants

real	0m6.252s
user	0m4.240s
sys	0m1.960s

Yeah, I figured something isn't quite right. The Unihan source files are correct in size and number of lines.
A new, clean install (removed the existing .local/share/hapi folder) gives the following:

$ ./scripts/test.sh
appending to /home/remy/documents/hapi/test/ansi.txt
INFO:root:database path: /home/remy/.local/share/hapi/hapi.sqlite3
ERROR:root:mismatch in table "phonetics":
  expected fields:
    Pinyin TEXT PRIMARY KEY
    WadeGiles TEXT
    Yale TEXT
    Zhuyin TEXT UNIQUE
  detected fields:
    
You may need to reset the database (--reset).
ERROR:root:mismatch in table "Readings":
  expected fields:
    char TEXT PRIMARY KEY
    kCantonese TEXT
    kDefinition TEXT
    kHangul TEXT
    kHanyuPinlu TEXT
    kHanyuPinyin TEXT
    kJapaneseKun TEXT
    kJapaneseOn TEXT
    kKorean TEXT
    kMandarin TEXT
    kMandarinTW TEXT
    kMandarinTW_tone INTEGER
    kMandarin_tone INTEGER
    kTang TEXT
    kVietnamese TEXT
    kXHC1983 TEXT
  detected fields:
    
You may need to reset the database (--reset).
ERROR:root:mismatch in table "Variants":
  expected fields:
    char TEXT PRIMARY KEY
    kSemanticVariant TEXT
    kSimplifiedVariant TEXT
    kSpecializedSemanticVariant TEXT
    kTraditionalVariant TEXT
    kZVariant TEXT
  detected fields:
    
You may need to reset the database (--reset).
^Cappending to /home/remy/documents/hapi/test/bbcode.txt
INFO:root:database path: /home/remy/.local/share/hapi/hapi.sqlite3
ERROR:root:mismatch in table "Readings":
  expected fields:
    char TEXT PRIMARY KEY
    kCantonese TEXT
    kDefinition TEXT
    kHangul TEXT
    kHanyuPinlu TEXT
    kHanyuPinyin TEXT
    kJapaneseKun TEXT
    kJapaneseOn TEXT
    kKorean TEXT
    kMandarin TEXT
    kMandarinTW TEXT
    kMandarinTW_tone INTEGER
    kMandarin_tone INTEGER
    kTang TEXT
    kVietnamese TEXT
    kXHC1983 TEXT
  detected fields:
    
You may need to reset the database (--reset).
ERROR:root:mismatch in table "Variants":
  expected fields:
    char TEXT PRIMARY KEY
    kSemanticVariant TEXT
    kSimplifiedVariant TEXT
    kSpecializedSemanticVariant TEXT
    kTraditionalVariant TEXT
    kZVariant TEXT
  detected fields:

You may need to reset the database (--reset).
Traceback (most recent call last):
  File "/home/remy/documents/hapi/hapi.py", line 1391, in <module>
    main()
  File "/home/remy/documents/hapi/hapi.py", line 1375, in main
    for hz, *rest in hapidb.lookup_hanzi(hzs, with_variants=True)
  File "/home/remy/documents/hapi/hapi.py", line 1374, in <genexpr>
    (hz, rest)
  File "/home/remy/documents/hapi/hapi.py", line 1053, in lookup_hanzi
    yield from self.lookup_hanzi_with_variants_wrapper(where=where, order_by=order_by)
  File "/home/remy/documents/hapi/hapi.py", line 1037, in lookup_hanzi_with_variants_wrapper
    yield from self.lookup(select, where=where, order_by=order_by)
  File "/home/remy/documents/hapi/hapi.py", line 1021, in lookup
    c.execute(select, where[1])
sqlite3.OperationalError: no such table: Readings

I don't remember having seen that error the previous time. Looks like the new db file is a bit bigger than the previous version, so the generation time is even more ridiculously slow.

$ time ./hapi.py --reset
INFO:root:database path: /home/remy/.local/share/hapi/hapi.sqlite3
INFO:root:inserted 406 rows into phonetics
INFO:root:inserted 47202 rows (311187 values) into Readings
INFO:root:inserted 9980 rows (12355 values) into Variants


real    81m17.497s
user    0m22.144s
sys     0m48.972s

The hapi.sqlite3 file is 4.3 MB as expected. Note: the row counts match yours, but the Readings value count is different (311187 vs 228753).

To me, it looks like sqlite inserts rows one at a time instead of using a bulk insert. Sqlite is known to be quite slow with independent insert operations (at most a dozen or so inserts per second, which is consistent with the time observed here: about 57,600 rows in 81 minutes is roughly 12 rows per second). I have no idea why this happens only on my machine though. I'm using a regular HDD, so maybe this doesn't happen with an SSD?

Xyne · 2017-09-24 17:54:16

I didn't notice anything because my database was being created on a ramdisk. When I switched to an HDD path I ended up with the same slowdown. After some quick reading I discovered that sqlite3 treats each insertion as a transaction (even when using executemany, which the script now does). The "trick" was to insert the entire table in a single transaction. It should only take a few seconds now.
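
For reference, the single-transaction approach can be sketched like this with Python's built-in sqlite3 module. The table is cut down to two columns and the function name is made up, so treat it as an illustration rather than the code in hapi.

```python
import sqlite3


def bulk_insert(conn, rows):
    """Insert all rows inside one transaction, with a single commit."""
    # Using the connection as a context manager commits once on success
    # and rolls back if any insert raises.
    with conn:
        conn.executemany(
            "INSERT INTO Readings (char, kMandarin) VALUES (?, ?)", rows
        )


conn = sqlite3.connect(":memory:")  # use a file path to see the disk effect
conn.execute("CREATE TABLE Readings (char TEXT PRIMARY KEY, kMandarin TEXT)")
bulk_insert(conn, [("常", "cháng"), ("教", "jiào"), ("等", "děng")])
print(conn.execute("SELECT COUNT(*) FROM Readings").fetchone()[0])
```

With one commit per table instead of one per row, the disk only has to sync a handful of times, which is what matters on an HDD.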

Spyhawk · 2017-09-25 08:39:43

Indeed, the script needs only 4 seconds now. What an improvement!

Also, you'll need to adjust the man page:

./hapi.py --color mdbg --file test/input.txt --html --pinyin > test/text.htm (old)
./hapi.py --color mdbg --file test/input.txt --html -p pinyin > test/text.htm (new)

Last edited by Spyhawk (2017-09-25 09:07:36)

Xyne · 2017-09-27 05:43:00

Major code cleanup and modularization. hapi is now a full Python library and a Pacman package (available in my repo and the AUR).

Changes include:

The command-line tool ("hapi") has some new options for selecting phonetics, displaying tones, etc. Check the help message for new usage.
The "p2h" table mode now accepts words in any recognized phonetic system so you can mix and match Pinyin, Zhuyin, Wade-Giles and Yale.
Data paths have been cleaned up and it now uses XDG data directories or an optional custom environment variable (but static HTML output still expects all audio files to be in "dat/audio" relative to the file itself), but....
There is now also a server ("hapi-srv") so you can skip file generation and just copy+paste text directly in the browser. The server provides a form with multiple options. I'll add some CSS to make it look nice later (suggestions welcome).

If you want to avoid redownloading existing Unihan and audio files, create the directory $XDG_DATA_HOME/hapi/dat (defaults to ~/.local/share/hapi/dat) and move the Unihan.zip file and unihan and yabla_mp3 directories there, then "ln -s yabla_mp3 audio" in the same directory.

Otherwise, run "hapi-download_data -uy".

Once the data is in place, run "hapi-srv", open up http://localhost:8000/ and copypaste away.
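
The data directory lookup described above can be sketched roughly as follows. The override variable name HAPI_DATA_DIR is purely hypothetical (the custom variable is not named in this post), and the function is an illustration rather than the package's actual code.

```python
import os
from pathlib import Path


def hapi_data_dir(environ=os.environ, override_var="HAPI_DATA_DIR"):
    """Resolve the data directory: custom variable first, then XDG rules."""
    override = environ.get(override_var)  # hypothetical variable name
    if override:
        return Path(override)
    # XDG_DATA_HOME defaults to ~/.local/share when unset or empty.
    xdg = environ.get("XDG_DATA_HOME") or os.path.join(
        os.path.expanduser("~"), ".local", "share"
    )
    return Path(xdg) / "hapi" / "dat"


print(hapi_data_dir())
```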
