Add mappings for CJK Extension E/F/G #10
cangjie5.dict.yaml
Outdated
@@ -79372,3 +79371,4 @@ encoder:
𫜉 ywwlb
𫐼 yybs
𫑉 yysb
𬌗 mhomr
𬌗 is an Ext. E character and thus should be placed after all other characters.
Note that it is OK to move 𬌗 here because mhomr is unique, so there are no observable behavior changes. In the next commit it is sorted with the other Ext. E characters.
Objection.
Can you check out the branch to a local repo and view the diff there, since GitHub does not support showing such a big diff? The current order of code mappings in this PR is consistent with what we already have. I didn't reorder the mappings according to Unicode code points. I have mentioned that
It means sorting the mappings by CJK block (CJK Unified Ideographs > CJK Unified Ideographs Ext. A > Ext. B > ... > Ext. G). When two mappings fall within the same block, we then sort them by Cangjie code. For example, assume we have the following unordered mappings
sort-dict.mjs will sort them to
Because 𣈁 and 𨓃 are Ext. B characters. This is consistent with what we have in
in which we have already sorted the mappings in the following order
In this PR we have added mappings for CJK Ext. E to G, so the new mappings are sorted by
Note that the new mappings are always appended to the current mappings, so I don't see how such handling will break current users. Note also that when characters share the same Cangjie code, the sorter does NOT reorder them, because V8's sort is stable. In the new mappings in this PR, characters sharing a Cangjie code keep their original order from the upstream repo. Please consider reopening this PR, since the claim that this PR sorts mappings by Unicode index is invalid.
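The ordering described in this comment (block first, then Cangjie code, ties left in their original order) can be sketched roughly as follows. This is an illustrative reconstruction, not the actual sort-dict.mjs: the names `BLOCKS`, `blockRank` and `sortMappings` are hypothetical, the `{ char, code }` entry shape is an assumption, and the block ranges are taken from the Unicode block list.

```javascript
// Hypothetical sketch; NOT the actual sort-dict.mjs from this PR.
// Blocks are listed in the priority order described in the comment.
const BLOCKS = [
  [0x4e00, 0x9fff],   // CJK Unified Ideographs
  [0x3400, 0x4dbf],   // Ext. A
  [0x20000, 0x2a6df], // Ext. B
  [0x2a700, 0x2b73f], // Ext. C
  [0x2b740, 0x2b81f], // Ext. D
  [0x2b820, 0x2ceaf], // Ext. E
  [0x2ceb0, 0x2ebef], // Ext. F
  [0x30000, 0x3134f], // Ext. G
];

// Position of the character's block in BLOCKS; unknown blocks sort last.
function blockRank(char) {
  const cp = char.codePointAt(0);
  const rank = BLOCKS.findIndex(([lo, hi]) => cp >= lo && cp <= hi);
  return rank === -1 ? BLOCKS.length : rank;
}

// Sort by block, then by Cangjie code. Array.prototype.sort is stable
// (ES2019+), so entries with the same block and code keep their input order.
function sortMappings(mappings) {
  return [...mappings].sort((a, b) => {
    const byBlock = blockRank(a.char) - blockRank(b.char);
    if (byBlock !== 0) return byBlock;
    if (a.code < b.code) return -1;
    if (a.code > b.code) return 1;
    return 0;
  });
}
```

Because the comparator returns 0 for entries that share both block and code, the relative order of such entries is decided entirely by the input, which is the property the comment relies on.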
Sorting by block is what I mean by sorting by Unicode. A Unicode Ext. G character is not necessarily rarer than a Unicode Ext. B character.
I agree, but we have already sorted Ext. B, C and D. I can reorder the mappings if we reach consensus, but I believe that should be addressed in another PR, as this PR focuses on adding new mappings.
Then only add new characters; don't reorder. We can reopen it once you revert the reordering.
This PR mostly adds new characters. Actually, only 3 characters are reordered, because they were not ordered by Cangjie code, as you can review in the first commit. It's an editorial fix IMO.
cangjie5.dict.yaml
Outdated
@@ -27368,7 +27369,6 @@ encoder:
䩁 lyhuu
䘻 lyib
䘪 lyiu
䫍 lmsyc
䫍 lmsyc should not be placed here according to Cangjie code order; it is moved from line 27371 to line 27300. No observable changes, because lmsyc is unique.
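The "no observable changes" argument rests on the code being unique: if no other entry shares the code, moving the entry cannot change the order of candidates for that code. A minimal sketch of that check, with hypothetical helper names and an assumed `{ char, code }` entry shape:

```javascript
// Hypothetical helpers for illustration; not part of this PR.

// All characters that share a given Cangjie code, in dictionary order.
function charsForCode(mappings, code) {
  return mappings.filter((m) => m.code === code).map((m) => m.char);
}

// Moving an entry is treated as safe when its code maps to exactly one
// character, so no relative order between same-code entries can change.
function isSafeToMove(mappings, code) {
  return charsForCode(mappings, code).length === 1;
}
```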
cangjie5.dict.yaml
Outdated
䛁 yrbmm
䯩 yrbu
䯩 yrbu should be placed after 䛁 yrbmm according to Cangjie code order. No observable changes, because there is only one yrbu in Ext. A.
Why are there so many codes starting with
The encoding scheme for new characters isn't compatible with existing codes.
Good catch! These characters are in the CJK COMPATIBILITY IDEOGRAPH block, which should not be included. I have refined the merge script and these characters are now removed. I have also removed the ordering across Ext. E/F/G, so they are now mixed in a new chunk in the order defined in the upstream repo. In the newly added mappings, these 6 characters may need a new mapping as the author uses
@LEOYoon-Tsaw Can you come up with alternative mappings? That would be very helpful. I will write a new summary script to show how many characters we have added in this PR, so we can confirm it covers all the new extensions. Note that as a member of the project, you can click
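Excluding compatibility ideographs during the merge could look roughly like the sketch below. The refined merge script itself is not shown in this thread, so the function names and the `{ char, code }` entry shape are hypothetical, and treating both the CJK Compatibility Ideographs block and its Supplement as excluded is an assumption; the ranges come from the Unicode block list.

```javascript
// Hypothetical sketch; the real merge script is not shown in this thread.
const COMPATIBILITY_BLOCKS = [
  [0xf900, 0xfaff],   // CJK Compatibility Ideographs
  [0x2f800, 0x2fa1f], // CJK Compatibility Ideographs Supplement
];

// True when the character's first code point falls in either block.
function isCompatibilityIdeograph(char) {
  const cp = char.codePointAt(0);
  return COMPATIBILITY_BLOCKS.some(([lo, hi]) => cp >= lo && cp <= hi);
}

// Keep only entries whose character is outside the compatibility blocks.
function dropCompatibilityIdeographs(mappings) {
  return mappings.filter(({ char }) => !isCompatibilityIdeograph(char));
}
```

Note that this drops the whole U+F900 to U+FAFF range by block membership alone; a real script might want a finer rule.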
Z for rotation is adopted in Cangjie6, so I really doubt the quality of these encodings for Cangjie5. It seems to be a simple copy and paste from Cangjie6 without enough inspection.
And while some use z for rotated shapes, some do not. E.g., 𠄏, 𠄔, 𣀨 do not use z.
@LEOYoon-Tsaw I use it quite a lot when I am working on replacing the unrecognized □ in 四庫全書: https://zh.wikisource.org/wiki/Special:%E7%94%A8%E6%88%B7%E8%B4%A1%E7%8C%AE/Jlhwung Based on all the Ext. E - G characters I have successfully summoned, the mapping is good for general use.
I don't know Cangjie6, so I can't determine whether it is derived from https://github.com/LEOYoon-Tsaw/Cangjie6 or was authored independently. My personal experience is good, but I may have been lucky enough to hit only characters whose mappings are shared between Cangjie5 and Cangjie6. Maybe @Jackchows can shed some light here.
The author gives a complete list of Z operator usage in section 14: https://github.com/Jackchows/Cangjie5/blob/master/change_summary.md. He/she is well aware that z is used here deliberately because it is easier to encode.
They are not Ext. E - Ext. G characters, so they are not covered in this PR. That's also why I am asking for mappings for 𮗙𰒥𫸪𰨇𰲞𬢆, so we can be consistent with what we already have for 𠄏𠄔𣀨. We can discuss whether we should use
That's great work! I am working on learning the CJCode from IDS, which is maintained by Andrew West: https://www.babelstone.co.uk/CJK/IDS.TXT. The quality of the IDS data is quite good, since nowadays IRG requires that any newly submitted character must have an IDS. I think it can supplement contextual shape information for the NN. Jackchows/Cangjie5#209 is a summary of incorrect mappings derived from IDS. It is still early work, but it can already contribute to an errata. I will open a similar PR to this project after this one gets merged.
@lotem Any ideas? I've forgotten many Cangjie5 rules; perhaps you want to have a look. Also, I'm OK with using z, but I don't know whether Cangjie5 users are generally OK with it.
I checked the current PR, and it now covers all the Ext. E - G characters. However, I also found that some characters in Ext. A, Ext. B and URO+ are still not covered. I will open a new issue. Update: see JLHwung@3ef8d73 @LEOYoon-Tsaw Can you reopen the PR? You can convert this PR to draft or leave a
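A coverage summary of the kind mentioned here could be computed along these lines. This is only a sketch under stated assumptions: `NEW_BLOCKS` and `countByBlock` are hypothetical names, the input is assumed to be a plain list of mapped characters, and the block ranges are taken from the Unicode block list. The resulting counts would still have to be compared against the number of assigned characters in each block, which is not encoded here.

```javascript
// Hypothetical sketch of a coverage summary; not the script used in this PR.
const NEW_BLOCKS = {
  "Ext. E": [0x2b820, 0x2ceaf],
  "Ext. F": [0x2ceb0, 0x2ebef],
  "Ext. G": [0x30000, 0x3134f],
};

// Count distinct mapped characters per block. A character with several
// Cangjie codes appears more than once in the dictionary, so the input is
// deduplicated with a Set first.
function countByBlock(chars) {
  const counts = {};
  for (const name of Object.keys(NEW_BLOCKS)) counts[name] = 0;
  for (const char of new Set(chars)) {
    const cp = char.codePointAt(0);
    for (const [name, [lo, hi]] of Object.entries(NEW_BLOCKS)) {
      if (cp >= lo && cp <= hi) counts[name] += 1;
    }
  }
  return counts;
}
```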
I can't; the reopen button is greyed out, and I don't know why.
Thanks. I will open a new PR later.
Fixes #5
This PR adds mappings for these characters. The mappings are copied from https://github.com/Jackchows/Cangjie5 with additional fixes in Jackchows/Cangjie5#209. All mappings are with the following exceptions.
z
for CJK compatibility charactersx
The first commit fixes ordering issues in the current mappings.
The second commit adds the new mappings and orders them consistently with the existing ones.
While authoring this PR, I wrote two scripts; feel free to reuse them, as Ext. H will hopefully be targeted for 2022. (link to scripts)
Cangjie5.txt
of https://github.com/Jackchows/Cangjie5.