Add mappings for CJK Extension E/F/G #10

JLHwung · 2021-03-16T02:38:15Z

Fixes #5

This PR adds mappings for these characters. The mappings are copied from https://github.com/Jackchows/Cangjie5, with additional fixes from Jackchows/Cangjie5#209. All mappings are included, with the following exceptions:

mappings starting with z for CJK compatibility characters
mappings starting with x

The first commit fixes ordering issues in the current mappings.
The second commit adds the new mappings and orders them consistently with the others.

While authoring this PR, I wrote two scripts. Feel free to reuse them, as Ext. H will hopefully be targeted for 2022. (link of scripts)

sort-dict.mjs sorts the given dict according to the order Block > Cangjie code
merge-cangjie5-txt.mjs merges the mappings defined in Cangjie5.txt of https://github.com/Jackchows/Cangjie5.

Credits to https://github.com/Jackchows/Cangjie5

JLHwung · 2021-03-16T02:43:14Z

cangjie5.dict.yaml

@@ -79372,3 +79371,4 @@ encoder:
 𫜉	ywwlb
 𫐼	yybs
 𫑉	yysb
+𬌗	mhomr


𬌗 is an Ext. E character and thus should be placed after all other characters.

Note that it is OK to move 𬌗 here because mhomr is unique, so there are no observable behavior changes. In the next commit it is sorted with the other Ext. E characters.

LEOYoon-Tsaw · 2021-03-16T20:10:09Z

Objection.
I don't think reordering the characters according to Unicode index is necessary.
Ordering according to Cangjie code makes it easy to see the relative order of same-code characters. Your reordering will break those orders and mess up with users, unless you add explicit character frequency for almost all characters.

JLHwung · 2021-03-16T21:00:45Z

> Objection.
> I don't think reordering the characters according to Unicode index is necessary.
> Ordering according to Cangjie code makes it easy to see the relative order of same-code characters. Your reordering will break those orders and mess up with users, unless you add explicit character frequency for almost all characters.

Can you check out the branch in a local repo and view the diff there, since GitHub does not support showing such a big diff? The order of code mappings in this PR is consistent with what we already have. I didn't reorder the mappings by Unicode code point.

I have mentioned that

> sort-dict.mjs sorts the given dict according to the order Block > Cangjie code

This means the mappings are sorted first by CJK block (CJK Unified Ideographs > CJK Unified Ideographs Ext. A > Ext. B > ... > Ext. G). When two mappings fall within the same block, they are then sorted by Cangjie code.

For example, assume we have the following unordered mappings:

日 a
𣈁 aad
卜 y
𨓃 yysf

sort-dict.mjs will sort them to:

日 a
卜 y
𣈁 aad
𨓃 yysf

because 𣈁 and 𨓃 are Ext. B characters. This is consistent with what we have in cangjie5.dict.yaml:

mapping	line
日 a	44
卜 y	20427
𣈁 aad	30866
𨓃 yysf	74984

in which the mappings are already sorted in the following order:

(CJK Unified Ideographs a)
...
(CJK Unified Ideographs y)
(CJK Unified Ideographs Ext. A a)
...
(CJK Unified Ideographs Ext. A y)
(CJK Unified Ideographs Ext. B a)
...
(CJK Unified Ideographs Ext. B y)
(CJK Unified Ideographs Ext. C & D a)
...
(CJK Unified Ideographs Ext. C & D y)

In this PR we have added mappings for CJK Ext. E to G, so the mappings are now sorted in this order:

(CJK Unified Ideographs a)
...
(CJK Unified Ideographs y)
(CJK Unified Ideographs Ext. A a)
...
(CJK Unified Ideographs Ext. A y)
(CJK Unified Ideographs Ext. B a)
...
(CJK Unified Ideographs Ext. B y)
(CJK Unified Ideographs Ext. C & D a)
...
(CJK Unified Ideographs Ext. C & D y)
(CJK Unified Ideographs Ext. E a)
...
(CJK Unified Ideographs Ext. E y)
(CJK Unified Ideographs Ext. F a)
...
(CJK Unified Ideographs Ext. F y)
(CJK Unified Ideographs Ext. G a)
...
(CJK Unified Ideographs Ext. G y)

Note that the new mappings are always appended after the current mappings, so I don't see how this would break anything for current users.

Note that when characters share the same Cangjie code, the sorter does NOT reorder them, because V8's sort is stable. In the new mappings in this PR, characters sharing a Cangjie code keep their original order from the upstream repo.
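The Block > Cangjie code ordering described in this comment can be sketched roughly as follows. This is only an illustration, not the actual sort-dict.mjs: the block table, the `{ char, code }` record shape, and the names `blockRank`, `compareCodes` and `sortMappings` are assumptions made for this example. It relies on `Array.prototype.sort` being stable (required by ES2019 and true of current V8), so entries with the same block and the same code keep their input order.

```javascript
// Hypothetical block table. Ext. C and Ext. D share one rank because the
// existing dictionary keeps them in a single "Ext. C & D" chunk.
const BLOCKS = [
  { name: "URO", rank: 0, from: 0x4e00, to: 0x9fff },
  { name: "Ext. A", rank: 1, from: 0x3400, to: 0x4dbf },
  { name: "Ext. B", rank: 2, from: 0x20000, to: 0x2a6df },
  { name: "Ext. C", rank: 3, from: 0x2a700, to: 0x2b73f },
  { name: "Ext. D", rank: 3, from: 0x2b740, to: 0x2b81f },
  { name: "Ext. E", rank: 4, from: 0x2b820, to: 0x2ceaf },
  { name: "Ext. F", rank: 5, from: 0x2ceb0, to: 0x2ebef },
  { name: "Ext. G", rank: 6, from: 0x30000, to: 0x3134f },
];

// Rank of the block containing the character; unknown blocks sort last.
function blockRank(char) {
  const cp = char.codePointAt(0);
  const block = BLOCKS.find((b) => cp >= b.from && cp <= b.to);
  return block ? block.rank : BLOCKS.length;
}

// Plain code-unit comparison of two Cangjie codes ("a" < "aad" < "y").
function compareCodes(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Sort by block first, then by Cangjie code. Ties keep their input order
// because the sort is stable. The input array is not modified.
function sortMappings(mappings) {
  return [...mappings].sort(
    (x, y) =>
      blockRank(x.char) - blockRank(y.char) || compareCodes(x.code, y.code)
  );
}

// The unordered example from this comment.
const sorted = sortMappings([
  { char: "日", code: "a" },
  { char: "𣈁", code: "aad" },
  { char: "卜", code: "y" },
  { char: "𨓃", code: "yysf" },
]);
console.log(sorted.map((m) => `${m.char} ${m.code}`).join("\n"));
```

Run on the four example mappings, this puts 日 and 卜 first and the two Ext. B characters after them, matching the order shown in the example.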

Please consider reopening this PR, since the claim that it sorts mappings by Unicode index is invalid.

LEOYoon-Tsaw · 2021-03-16T23:24:21Z

Sorting by block is what I mean by sorting by Unicode. A Unicode Ext G character is not necessarily more rare than a Unicode Ext B character.

JLHwung · 2021-03-16T23:44:45Z

I agree, but we have already sorted Ext. B, C and D this way. I can reorder the mappings if we reach consensus, but I believe that should be addressed in a separate PR, as this PR focuses on adding new mappings.

LEOYoon-Tsaw · 2021-03-17T00:13:26Z

Then only add new characters, don’t reorder. We can reopen it once you revert reordering.

JLHwung · 2021-03-17T00:26:22Z

This PR mostly adds new characters. Only 3 existing characters are reordered, because they were not in Cangjie code order, as you can see in the first commit. It's an editorial fix IMO.

JLHwung · 2021-03-17T00:43:54Z

cangjie5.dict.yaml

@@ -27368,7 +27369,6 @@ encoder:
 䩁	lyhuu
 䘻	lyib
 䘪	lyiu
-䫍	lmsyc


According to the Cangjie code order, 䫍 lmsyc should not be placed here, so it is moved from line 27371 to line 27300. No observable changes because lmsyc is unique.

JLHwung · 2021-03-17T00:46:14Z

cangjie5.dict.yaml

 䛁	yrbmm
+䯩	yrbu


䯩 yrbu should be placed after 䛁 yrbmm according to Cangjie code order. No observable changes because there is only one yrbu in Ext. A.

LEOYoon-Tsaw · 2021-03-17T02:21:15Z

Why are there so many codes starting with z?

LEOYoon-Tsaw · 2021-03-17T02:54:16Z

The encoding scheme for new characters isn't compatible with existing codes.

JLHwung · 2021-03-17T04:14:51Z

> Why are there so many codes starting with z?

Good catch! These characters are in the CJK COMPATIBILITY IDEOGRAPHS block, which should not be included. I have refined the merge script and these characters are now removed. I have also removed the ordering across Ext. E/F/G, so they are now mixed in one new chunk, in the order defined in the upstream repo.
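For illustration, the exclusion of compatibility ideographs could look something like the sketch below. It is not the refined merge-cangjie5-txt.mjs: the `{ char, code }` record shape and the names `isCompatibilityIdeograph` and `dropCompatibilityIdeographs` are assumptions, and whether the real script checks one or both compatibility blocks is not stated here. Only the two block ranges come from Unicode.

```javascript
// Code point ranges of the two Unicode compatibility ideograph blocks.
const COMPATIBILITY_BLOCKS = [
  [0xf900, 0xfaff], // CJK Compatibility Ideographs
  [0x2f800, 0x2fa1f], // CJK Compatibility Ideographs Supplement
];

// True when the first code point of `char` lies in one of those blocks.
function isCompatibilityIdeograph(char) {
  const cp = char.codePointAt(0);
  return COMPATIBILITY_BLOCKS.some(([from, to]) => cp >= from && cp <= to);
}

// Keep only mappings whose character is outside the compatibility blocks.
// The relative order of the remaining mappings is unchanged.
function dropCompatibilityIdeographs(mappings) {
  return mappings.filter((m) => !isCompatibilityIdeograph(m.char));
}
```

A block-range check like this is coarse: a handful of code points inside the first block are actually unified ideographs, so a real script may want a more precise test.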

Among the newly added mappings, the following 6 characters may need new mappings, because the upstream author uses Z as a rotation operator (see section 14 of https://github.com/Jackchows/Cangjie5/blob/master/change_summary.md), which is not defined in Cangjie 5:

𮗙	buhuz
𰒥	izi
𫸪	nnz
𰨇	ozmmf
𰲞	yniz
𬢆	yzbuu
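A list like the one above could be produced with a small filter such as the following sketch. The function name `findRotationCodes` and the `{ char, code }` record shape are assumptions for this example; it simply reports codes that contain z anywhere after the first letter, since codes that start with z were already excluded from the merge.

```javascript
// Return mappings whose Cangjie code contains "z" but does not start with it.
function findRotationCodes(mappings) {
  return mappings.filter(
    ({ code }) => code.includes("z") && !code.startsWith("z")
  );
}

// Three of the mappings discussed in this thread.
const flagged = findRotationCodes([
  { char: "𬌗", code: "mhomr" },
  { char: "𫸪", code: "nnz" },
  { char: "𬢆", code: "yzbuu" },
]);
console.log(flagged.map((m) => m.code).join(" "));
```

Here only nnz and yzbuu are reported, because mhomr contains no z.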

@LEOYoon-Tsaw Can you come up with alternative mappings? That would be very helpful.

I will come up with a new summary script to show how many characters we have added in this PR, so we can confirm it covers all the new extensions.

Note that, as a member of the project, you can choose Request changes in the Review changes dropdown if you want to indicate clearly that this PR is still pending changes. You can also convert this PR to a draft, since we have more to do before it can be merged.

LEOYoon-Tsaw · 2021-03-17T12:59:30Z

Z for rotation is adopted for Cangjie6, so I really doubt the quality of these encodings as for Cangjie5. It seems it’s just a simple copy paste from cangjie6 without enough inspection
How confident are you about the quality of the newly added encodings?

LEOYoon-Tsaw · 2021-03-17T13:38:03Z

And while some use z for rotated shapes, but some do not. E.g.: 𠄏, 𠄔, 𣀨 does not use z.
I have a NN for automatic encoding based on bitmap of characters. If someone trains it on the stable encodings and tests it on the new encodings, it produces a confidence level for each character's encoding. Sorting those confidence values from low to high then makes it easy to pick out the potentially wrong encodings.

JLHwung · 2021-03-17T13:57:33Z

@LEOYoon-Tsaw I use it quite a lot when I am working on replacing the unrecognized □ in 四庫全書 (see https://zh.wikisource.org/wiki/Special:%E7%94%A8%E6%88%B7%E8%B4%A1%E7%8C%AE/Jlhwung ). Based on all the Ext. E - G characters I have successfully typed, the mappings are good for general use.

> It seems it’s just a simple copy paste from cangjie6 without enough inspection

I don't know Cangjie6, so I can't determine whether it is derived from https://github.com/LEOYoon-Tsaw/Cangjie6 or was authored independently. My personal experience is good, but I may just have been lucky enough to hit only characters whose mappings are the same in Cangjie5 and Cangjie6. Maybe @Jackchows can shed some light here.

> Z for rotation is adopted for Cangjie6

The author gives a complete list of Z operator usages in section 14 of https://github.com/Jackchows/Cangjie5/blob/master/change_summary.md and is well aware of them: z is used there deliberately because it is easier to encode.

> And while some use z for rotated shapes, but some do not. E.g.: 𠄏, 𠄔, 𣀨 does not use z

Those are not Ext. E - Ext. G characters, so they are not covered by this PR. That is also why I am asking for mappings for 𮗙𰒥𫸪𰨇𰲞𬢆, so that we stay consistent with what we already have for 𠄏𠄔𣀨. We can discuss whether to use z on specific characters as a compatible mapping, but in this PR the newly added characters should be consistent with the current mappings.

> I have a NN for automatic encoding based on bitmap of characters.

That's great work! I am working on deriving Cangjie codes from IDS data, which is maintained by Andrew West: https://www.babelstone.co.uk/CJK/IDS.TXT. The quality of the IDS data is quite good, since nowadays the IRG requires every newly submitted character to have an IDS. I think it can supplement the NN with contextual shape information.

Jackchows/Cangjie5#209 is a summary of incorrect mappings derived from IDS. It is still an early work but it can already contribute to an errata. I will open a similar PR to this project after this one gets merged.

LEOYoon-Tsaw · 2021-03-17T14:31:15Z

@lotem Any ideas? I’ve forgotten many cangjie5 rules, perhaps you want to have a look. And also I’m ok with using z, but I don’t know if cangjie5 users are generally ok with it

JLHwung · 2021-03-17T14:47:55Z

> I will come up with a new summary script to show how many characters we have added in this PR, so we can confirm it covers all the new extensions.

I checked the current PR and it now covers all the Ext. E - G characters. However, I also found that some characters in Ext. A, Ext. B and URO+ have no mappings. I will open a new issue.

Update: see JLHwung@3ef8d73
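A summary of this kind could be sketched as below. This is not the script from the linked commit: the names `EXTENSION_BLOCKS` and `countByBlock` and the `{ char, code }` record shape are assumptions for this example. It counts the distinct characters per block, so a character with several Cangjie codes is counted once. Confirming full coverage would still mean comparing each count with the number of assigned characters in that block, taken from the Unicode data.

```javascript
// Code point ranges of the three blocks added in this PR.
const EXTENSION_BLOCKS = {
  "Ext. E": [0x2b820, 0x2ceaf],
  "Ext. F": [0x2ceb0, 0x2ebef],
  "Ext. G": [0x30000, 0x3134f],
};

// Count distinct characters per block among the given mappings.
function countByBlock(mappings) {
  const counts = Object.fromEntries(
    Object.keys(EXTENSION_BLOCKS).map((name) => [name, 0])
  );
  const seen = new Set();
  for (const { char } of mappings) {
    if (seen.has(char)) continue; // same character mapped to another code
    seen.add(char);
    const cp = char.codePointAt(0);
    for (const [name, [from, to]] of Object.entries(EXTENSION_BLOCKS)) {
      if (cp >= from && cp <= to) counts[name] += 1;
    }
  }
  return counts;
}
```

Characters outside the three blocks are ignored, so the same function can be run over the whole dictionary.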

@LEOYoon-Tsaw Can you reopen the PR? You can convert it to a draft or leave a Request changes review if it is not meant to be merged soon. But a closed PR will not receive updates from the source branch.

LEOYoon-Tsaw · 2021-03-19T14:59:27Z

I can't, the reopen button is greyed out, and I don't know why

JLHwung · 2021-03-19T15:43:10Z

Thanks. I will open a new PR later.

JLHwung added 2 commits March 15, 2021 21:58

fix mapping order

d197805

Add Ext. E, F and G

ffaa58c

Credits to https://github.com/Jackchows/Cangjie5

LEOYoon-Tsaw closed this Mar 16, 2021

LEOYoon-Tsaw reopened this Mar 17, 2021

LEOYoon-Tsaw closed this Mar 17, 2021
