Add mappings for CJK Extension E/F/G #10

Closed
wants to merge 2 commits into from

@JLHwung JLHwung commented Mar 16, 2021

Fixes #5

This PR adds mappings for these characters. The mappings are copied from https://github.com/Jackchows/Cangjie5, with additional fixes from Jackchows/Cangjie5#209. All upstream mappings are included, with the following exceptions:

  • mappings starting with z, which are used for CJK compatibility characters
  • mappings starting with x

The first commit fixes ordering issues in the current mappings.
The second commit adds the new mappings and orders them consistently with the existing ones.

While authoring this PR, I wrote two scripts. Feel free to reuse them, since Ext. H is hopefully targeted for 2022. (link of scripts)

  • sort-dict.mjs sorts the given dict according to the order Block > Cangjie code
  • merge-cangjie5-txt.mjs merges the mappings defined in Cangjie5.txt of https://github.com/Jackchows/Cangjie5.
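
The merge step with its two exceptions can be sketched roughly as follows. This is not the code of merge-cangjie5-txt.mjs: the function names are hypothetical, the input is assumed to be one tab-separated `character<TAB>code` pair per line, and the ranges are the Unicode block boundaries for Ext. E, F and G.

```javascript
// Hypothetical sketch; none of these names come from merge-cangjie5-txt.mjs.
// Unicode block boundaries for CJK Unified Ideographs Ext. E, F and G.
const NEW_BLOCKS = [
  [0x2b820, 0x2ceaf], // Ext. E
  [0x2ceb0, 0x2ebef], // Ext. F
  [0x30000, 0x3134f], // Ext. G
];

function isNewBlockChar(char) {
  const cp = char.codePointAt(0);
  return NEW_BLOCKS.some(([lo, hi]) => cp >= lo && cp <= hi);
}

// Assumption: each upstream line looks like "<character><TAB><code>".
function selectNewMappings(upstreamText) {
  const picked = [];
  for (const line of upstreamText.split("\n")) {
    const [char, code] = line.trim().split("\t");
    if (!char || !code) continue; // skip blank or malformed lines
    if (!isNewBlockChar(char)) continue; // keep only Ext. E/F/G characters
    // The two exceptions listed above: codes starting with x or z.
    if (code.startsWith("x") || code.startsWith("z")) continue;
    picked.push({ char, code });
  }
  return picked;
}
```

The returned entries keep the upstream order, so they can be appended to the existing dictionary as they are.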

@@ -79372,3 +79371,4 @@ encoder:
𫜉 ywwlb
𫐼 yybs
𫑉 yysb
𬌗 mhomr
Author

𬌗 is an Ext. E character and thus should be placed after all other characters.

Author

@JLHwung JLHwung Mar 17, 2021

Note that it is OK to move 𬌗 here because mhomr is unique, so there is no observable behavior change. In the next commit it is sorted with the other Ext. E characters.

@LEOYoon-Tsaw
Member

Objection.
I don't think reordering the characters according to Unicode index is necessary.
Ordering according to Cangjie code makes it easy to see the relative order of same-code characters. Your reordering will break that order and confuse users, unless you add an explicit character frequency for almost every character.

@JLHwung
Author

JLHwung commented Mar 16, 2021

Objection.
I don't think reordering the characters according to Unicode index is necessary.
Ordering according to Cangjie code makes it easy to see the relative order of same-code characters. Your reordering will break that order and confuse users, unless you add an explicit character frequency for almost every character.

Can you check the branch out to a local repo and view the diff there? GitHub does not support showing such a big diff. The order of the code mappings in this PR is consistent with what we already have. I didn't reorder the mappings according to Unicode code points.

I have mentioned that

sort-dict.mjs sorts the given dict according to the order Block > Cangjie code

This means the mappings are sorted first by CJK block (CJK Unified Ideographs > CJK Unified Ideographs Ext. A > Ext. B > ... > Ext. G). Two mappings within the same block are then sorted by Cangjie code.
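
A minimal sketch of this two-level ordering is shown here. It is not the actual sort-dict.mjs: `blockRank`, `compareCodes` and `sortDict` are illustrative names, entries are assumed to be `{ char, code }` objects, and Cangjie codes are assumed to compare as plain strings.

```javascript
// Illustrative sketch, not the real sort-dict.mjs.
// Blocks in the order the dictionary uses; Ext. C and D share one chunk.
const BLOCKS = [
  ["URO", 0x4e00, 0x9fff],
  ["Ext. A", 0x3400, 0x4dbf],
  ["Ext. B", 0x20000, 0x2a6df],
  ["Ext. C & D", 0x2a700, 0x2b81f],
  ["Ext. E", 0x2b820, 0x2ceaf],
  ["Ext. F", 0x2ceb0, 0x2ebef],
  ["Ext. G", 0x30000, 0x3134f],
];

// Position of the character's block in BLOCKS; unknown blocks sort last.
function blockRank(char) {
  const cp = char.codePointAt(0);
  const i = BLOCKS.findIndex(([, lo, hi]) => cp >= lo && cp <= hi);
  return i === -1 ? BLOCKS.length : i;
}

// Plain string comparison of two Cangjie codes (an assumption about the order).
function compareCodes(a, b) {
  return a < b ? -1 : a > b ? 1 : 0;
}

// Sort by block first, then by code. Returns a new array.
function sortDict(entries) {
  return [...entries].sort(
    (x, y) =>
      blockRank(x.char) - blockRank(y.char) || compareCodes(x.code, y.code)
  );
}
```

Because the comparator returns 0 for two entries in the same block with the same code, a stable sort leaves such entries in their input order.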

For example, assume we have the following unordered mappings

日 a
𣈁 aad
卜 y
𨓃 yysf

sort-dict.mjs will sort them into

日 a
卜 y
𣈁 aad
𨓃 yysf

This is because 𣈁 and 𨓃 are Ext. B characters. It is consistent with what we have in cangjie.dict.yaml:

mapping line
日 a 44
卜 y 20427
𣈁 aad 30866
𨓃 yysf 74984

in which the mappings are already sorted in the following order

(CJK Unified Ideographs a)
...
(CJK Unified Ideographs y)
(CJK Unified Ideographs Ext. A a)
...
(CJK Unified Ideographs Ext. A y)
(CJK Unified Ideographs Ext. B a)
...
(CJK Unified Ideographs Ext. B y)
(CJK Unified Ideographs Ext. C & D a)
...
(CJK Unified Ideographs Ext. C & D y)

In this PR we have added mappings for CJK Ext. E to G, so the new mappings are sorted by

(CJK Unified Ideographs a)
...
(CJK Unified Ideographs y)
(CJK Unified Ideographs Ext. A a)
...
(CJK Unified Ideographs Ext. A y)
(CJK Unified Ideographs Ext. B a)
...
(CJK Unified Ideographs Ext. B y)
(CJK Unified Ideographs Ext. C & D a)
...
(CJK Unified Ideographs Ext. C & D y)
(CJK Unified Ideographs Ext. E a)
...
(CJK Unified Ideographs Ext. E y)
(CJK Unified Ideographs Ext. F a)
...
(CJK Unified Ideographs Ext. F y)
(CJK Unified Ideographs Ext. G a)
...
(CJK Unified Ideographs Ext. G y)

Note that the new mappings are always appended after the current mappings, so I don't see how this handling would break anything for current users.

Note that when characters share the same Cangjie code, the sorter does NOT reorder them, because V8's sort is stable. For the new mappings in this PR, characters sharing a Cangjie code keep their original order from the upstream repo.
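
The stability claim can be checked mechanically. The sketch below is a hypothetical helper, not part of the PR's scripts. It compares a list of `{ char, code }` entries before and after sorting and reports whether characters sharing a code kept their relative order. ECMAScript 2019 and later require `Array.prototype.sort` to be stable, which is what the claim relies on.

```javascript
// Hypothetical helper; not part of sort-dict.mjs.
// Group characters by code, keeping their order of appearance.
function orderByCode(entries) {
  const groups = new Map();
  for (const { char, code } of entries) {
    if (!groups.has(code)) groups.set(code, []);
    groups.get(code).push(char);
  }
  return groups;
}

// Returns false if, for any code in `before`, a character is missing from
// `after` or appears there in a different relative order. Characters that
// exist only in `after` (newly appended mappings) are ignored.
function sameCodeOrderPreserved(before, after) {
  const afterGroups = orderByCode(after);
  for (const [code, chars] of orderByCode(before)) {
    const kept = (afterGroups.get(code) || []).filter((c) => chars.includes(c));
    if (kept.join(",") !== chars.join(",")) return false;
  }
  return true;
}
```

Running this on the dictionary before and after a sort gives a quick yes/no answer to whether any same-code candidates changed places.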

Please consider reopening this PR, since the claim that it sorts mappings by Unicode index is invalid.

@LEOYoon-Tsaw
Member

Sorting by block is what I mean by sorting by Unicode. A Unicode Ext G character is not necessarily more rare than a Unicode Ext B character.

@JLHwung
Author

JLHwung commented Mar 16, 2021

I agree, but we have already sorted Ext. B, C and D this way. I can reorder the mappings if we reach consensus, but I believe that should be addressed in a separate PR, as this PR focuses on adding new mappings.

@LEOYoon-Tsaw
Member

Then only add new characters, don’t reorder. We can reopen it once you revert reordering.

@JLHwung
Author

JLHwung commented Mar 17, 2021

This PR mostly adds new characters. Only 3 characters are actually reordered, because they were not in Cangjie code order, as you can see in the first commit. It's an editorial fix IMO.

@@ -27368,7 +27369,6 @@ encoder:
䩁 lyhuu
䘻 lyib
䘪 lyiu
䫍 lmsyc
Author

䫍 lmsyc should not be placed here according to Cangjie code order; it is moved from line 27371 to line 27300. No observable changes because lmsyc is unique.

䛁 yrbmm
䯩 yrbu
Author

䯩 yrbu should be placed after 䛁 yrbmm according to Cangjie code order. No observable changes because there is only one yrbu in Ext. A.

@LEOYoon-Tsaw LEOYoon-Tsaw reopened this Mar 17, 2021
@LEOYoon-Tsaw
Member

Why are there so many codes starting with z?

@LEOYoon-Tsaw
Member

The encoding scheme for new characters isn't compatible with existing codes.

@JLHwung
Author

JLHwung commented Mar 17, 2021

Why are there so many codes starting with z?

Good catch! These characters are in the CJK COMPATIBILITY IDEOGRAPHS block, which should not be included. I have refined the merge script, and these characters are now removed. I have also removed the ordering across Ext. E/F/G, so those characters are now mixed in one new chunk, in the order defined in the upstream repo.

Among the newly added mappings, these 6 characters may need a new mapping, because the upstream author uses Z as a rotation operator (see section 14 of https://github.com/Jackchows/Cangjie5/blob/master/change_summary.md), which is not defined in Cangjie 5 anyway:

𮗙	buhuz
𰒥	izi
𫸪	nnz
𰨇	ozmmf
𰲞	yniz
𬢆	yzbuu
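
A list like the one above can be produced with a small filter. The sketch below is hypothetical (`splitZCodes` is not a name from the PR's scripts). It separates codes that start with z from codes that contain z elsewhere. Treating a non-leading z as the rotation operator is a heuristic assumption based on the six codes listed here.

```javascript
// Hypothetical helper; entries are assumed to be { char, code } objects.
function splitZCodes(entries) {
  const leading = []; // code starts with z
  const inner = []; // z appears after the first letter
  for (const entry of entries) {
    const pos = entry.code.indexOf("z");
    if (pos === 0) leading.push(entry);
    else if (pos > 0) inner.push(entry);
  }
  return { leading, inner };
}

// The six mappings listed above all land in `inner`.
const listed = [
  { char: "𮗙", code: "buhuz" },
  { char: "𰒥", code: "izi" },
  { char: "𫸪", code: "nnz" },
  { char: "𰨇", code: "ozmmf" },
  { char: "𰲞", code: "yniz" },
  { char: "𬢆", code: "yzbuu" },
];
console.log(splitZCodes(listed).inner.length);
```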

@LEOYoon-Tsaw Can you come up with alternative mappings? That would be very helpful.

I will come up with a new summary script to show how many characters we have added in this PR, so we can confirm it covers all the new extensions.

Note that as a member of the project, you can click Request Change in the dropdown of the Review Changes button if you want to indicate clearly that this PR is still pending changes. You can also convert this PR to a draft, since we have more to do before it can be merged.

@LEOYoon-Tsaw
Member

LEOYoon-Tsaw commented Mar 17, 2021

Z for rotation is adopted for Cangjie6, so I really doubt the quality of these encodings as for Cangjie5. It seems it’s just a simple copy paste from cangjie6 without enough inspection
How confident are you about the quality of the newly added encodings?

@LEOYoon-Tsaw
Member

And while some characters use z for rotated shapes, others do not. E.g., 𠄏, 𠄔 and 𣀨 do not use z.
I have a NN for automatic encoding based on bitmap of characters. If someone trains it on stable encodings and tests it on the new encodings, it produces a confidence level for each character's encoding. It is then easy to pick out potentially wrong encodings by sorting the confidence values from low to high.
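
The review step described here, once confidence values exist, might look like the sketch below. It says nothing about the NN itself. The input shape `{ char, code, confidence }`, the function name and the default threshold are all assumptions for illustration.

```javascript
// Hypothetical sketch: rank encodings for manual review by model confidence.
// `scored` is assumed to be an array of { char, code, confidence } objects,
// with confidence between 0 and 1. The 0.5 default is arbitrary.
function flagLowConfidence(scored, threshold = 0.5) {
  return scored
    .filter((entry) => entry.confidence < threshold) // keep doubtful encodings
    .sort((a, b) => a.confidence - b.confidence); // least confident first
}
```

A reviewer would then work through the returned list from the top, so the encodings most likely to be wrong are inspected first.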

@JLHwung
Author

JLHwung commented Mar 17, 2021

@LEOYoon-Tsaw I use it quite a lot while replacing the unrecognized □ in 四庫全書 (Siku Quanshu): https://zh.wikisource.org/wiki/Special:%E7%94%A8%E6%88%B7%E8%B4%A1%E7%8C%AE/Jlhwung Based on all the Ext. E - G characters I have successfully typed, the mappings are good for general use.

It seems it’s just a simple copy paste from cangjie6 without enough inspection

I don't know Cangjie6, so I can't determine whether it is derived from https://github.com/LEOYoon-Tsaw/Cangjie6 or was authored independently. My personal experience is good, but I may just have been lucky and hit only characters that share a mapping between Cangjie5 and Cangjie6. Maybe @Jackchows can shed some light here.

Z for rotation is adopted for Cangjie6

The author gives a complete list of Z operator usage in section 14: https://github.com/Jackchows/Cangjie5/blob/master/change_summary.md. He/she is well aware of it: z is used here deliberately because it makes these characters easier to encode.

And while some characters use z for rotated shapes, others do not. E.g., 𠄏, 𠄔 and 𣀨 do not use z.

They are not Ext. E - Ext. G characters, so they are not covered by this PR. That's also why I asked for mappings for 𮗙𰒥𫸪𰨇𰲞𬢆, so that we can be consistent with what we already have for 𠄏𠄔𣀨. We can discuss whether to use z on specific characters as a compatible mapping, but in this PR the newly added characters should be consistent with the current mappings.

I have a NN for automatic encoding based on bitmap of characters.

That's great work! I am working on learning the CJCode from IDS data, which is maintained by Andrew West: https://www.babelstone.co.uk/CJK/IDS.TXT. The quality of the IDS data is quite good, since IRG nowadays requires that any newly submitted character have an IDS. I think it can supplement the NN with contextual shape information.

Jackchows/Cangjie5#209 is a summary of incorrect mappings derived from IDS. It is still early work, but it can already contribute errata. I will open a similar PR to this project after this one gets merged.

@LEOYoon-Tsaw
Member

LEOYoon-Tsaw commented Mar 17, 2021

@lotem Any ideas? I’ve forgotten many cangjie5 rules, perhaps you want to have a look. And also I’m ok with using z, but I don’t know if cangjie5 users are generally ok with it

@JLHwung
Author

JLHwung commented Mar 17, 2021

I will come up with a new summary script to show how many characters we have added in this PR, so we can confirm it covers all the new extensions.

I checked the current PR, and it now covers all the Ext. E - G characters. However, I also found characters without mappings in Ext. A, Ext. B and URO+. I will open a new issue.
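
A coverage summary of this kind could be sketched as follows. This is not the script from the linked commit: the function names are hypothetical, entries are assumed to be `{ char, code }` objects, and the ranges are Unicode block boundaries, which include a few unassigned code points at the end of each block.

```javascript
// Hypothetical sketch of a per-block coverage summary.
const BLOCKS = [
  ["URO", 0x4e00, 0x9fff],
  ["Ext. A", 0x3400, 0x4dbf],
  ["Ext. B", 0x20000, 0x2a6df],
  ["Ext. C", 0x2a700, 0x2b73f],
  ["Ext. D", 0x2b740, 0x2b81f],
  ["Ext. E", 0x2b820, 0x2ceaf],
  ["Ext. F", 0x2ceb0, 0x2ebef],
  ["Ext. G", 0x30000, 0x3134f],
];

// Count distinct mapped characters per block (a character may have several codes).
function countByBlock(entries) {
  const counts = Object.fromEntries(BLOCKS.map(([name]) => [name, 0]));
  for (const char of new Set(entries.map((e) => e.char))) {
    const cp = char.codePointAt(0);
    const block = BLOCKS.find(([, lo, hi]) => cp >= lo && cp <= hi);
    if (block) counts[block[0]] += 1;
  }
  return counts;
}

// List code points in [lo, hi] that have no mapping at all.
// The caller supplies the assigned range it wants to check.
function missingInRange(entries, lo, hi) {
  const seen = new Set(entries.map((e) => e.char.codePointAt(0)));
  const missing = [];
  for (let cp = lo; cp <= hi; cp++) {
    if (!seen.has(cp)) missing.push(cp);
  }
  return missing;
}
```

Comparing each count with the number of assigned characters in that block, or calling `missingInRange` with the assigned range, shows which characters still lack a mapping.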

Update: see JLHwung@3ef8d73

@LEOYoon-Tsaw Can you reopen the PR? You can convert it to a draft or leave a Request Change review if it is not meant to be merged soon. But a closed PR will not get updates from the source branch.

@LEOYoon-Tsaw
Member

I can't, the reopen button is greyed out, and I don't know why

@JLHwung
Author

JLHwung commented Mar 19, 2021

Thanks. I will open a new PR later.

Successfully merging this pull request may close these issues.

Add encodings for CJK Extension E characters (爲擴展E區漢字編碼)
2 participants