Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode and newline characters break totable #71

Open
benbernard opened this issue Nov 24, 2015 · 15 comments
Open

Unicode and newline characters break totable #71

benbernard opened this issue Nov 24, 2015 · 15 comments
Labels

Comments

@benbernard
Copy link
Owner

When characters like '…' and newlines are in an outputted column, the columns no longer line up. Should probably transform the output or something an also leave a --no-escape option or similar...

Probably also affects toptable. I think those are the only two that require text alignment...

copy from irc discussion:

10:20 <ben_bernard> ugh, just found an annoying bug with totable
10:21 <ben_bernard> if you have a newline in one of the fields, everything goes to hell
10:23 <amling> if you have a lot of things I think you can arrive at hell
10:23 <amling> any interesting unicode characters e.g.
10:23 <ben_bernard> ugh
10:23 <ben_bernard> also
10:23 <ben_bernard> yeah
10:23 <ben_bernard> elipsis
10:23 <ben_bernard> character
10:24 <ben_bernard> https://www.dropbox.com/s/8ea5w1of8g1sm7z/Screenshot%202015-11-24%2010.24.12.png?dl=0
10:24 <ben_bernard> I feel like this is a conquerable problem
10:25 <amling> oh the twittening
10:25 <amling> you can probably super jackass it with pre-xform '{{text}} = some_perl_module_text_escaper({{text}})'
10:25 <amling> ultimately we might like very clear text output sinks to do that internally
10:25 <amling> just like table already automatically encodes non-strings into JSON (IIRC)
10:26 <ben_bernard> yeah, that is what I was thinking
10:26 <amling> uh
10:26 <amling> there's also color
10:26 <amling> and we may want to leave a super insane --no-encode or something around
10:27 <amling> actually color is also fucked by --no-encode
10:27 <amling> hmm
10:27 <amling> okay, maybe --no-encode is useless
10:27 <amling> anyway, should maybe think about color at the same time
10:27 <ben_bernard> yeah
10:28 <amling> some combination of encoding for text output (I think '\n' is unavoidable) and deciding effective width (color is zero, weird unicode stuff is ... something else)
@benbernard
Copy link
Owner Author

Some additional investigation by @amling and me has yielded this:

use open (':std', ':utf8');

(also maybe use utf8;).

That first one will make the input streams be read as utf8, which should fix the ellipsis encoding issue (still need to handle new lines)

@tsibley
Copy link
Collaborator

tsibley commented Nov 25, 2015

Yeah, I've worked around UTF-8 safeness of recs before using env PERL_UNICODE=SAD, which is roughly equivalent to your use open line.

Note that :encoding(UTF-8) is preferred to :utf8 for the open pragma since :utf8 means Perl's internal Unicode representation (lax, not strict) and :encoding(UTF-8) means the actual UTF-8 standard.

use utf8 only serves to inform perl that your file is encoded as UTF-8 and therefore may contain string literals with UTF-8 characters. It doesn't affect I/O directly.

@bokutin
Copy link
Contributor

bokutin commented May 26, 2016

Hello.

I am a Japanese user.

The other day, I encountered what might be same root of the problem.

screenshot

The value of the CORE::length might be different from the width.

You can get the width of the utf8 in module called AAA.
http://search.cpan.org/perldoc?Text::VisualWidth::PP

I am sure that the work well in the patch below.

# diff -u /usr/local/lib/perl5/site_perl/App/RecordStream/Operation/totable.pm.orig /usr/local/lib/perl5/site_perl/App/RecordStream/Operation/totable.pm
--- /usr/local/lib/perl5/site_perl/App/RecordStream/Operation/totable.pm.orig   2016-04-16 16:13:42.948118000 +0900
+++ /usr/local/lib/perl5/site_perl/App/RecordStream/Operation/totable.pm        2016-04-16 16:20:05.791268000 +0900
@@ -41,6 +41,12 @@
   $this->{'CLEAR'}         = $clear;
 }
+use Encode;
+use Text::VisualWidth::PP;
+sub _length {
+  Text::VisualWidth::PP::width( utf8::valid($_[0]) ? decode_utf8($_[0]) : $_[0] );
+}
+
 sub stream_done {
   my $this = shift;
@@ -58,14 +64,14 @@
         push @$fields, $field;
       }
-      $widths{$field} = max($widths{$field}, length($this->extract_field($record, $field)));
+      $widths{$field} = max($widths{$field}, _length($this->extract_field($record, $field)));
     }
   }
   my $no_header = $this->{'NO_HEADER'};
   if(!$no_header) {
     foreach my $field (keys(%widths)) {
-      $widths{$field} = max($widths{$field}, length($field));
+      $widths{$field} = max($widths{$field}, _length($field));
     }
   }
@@ -164,9 +170,9 @@
     }
     if ( (! $this->{'SPREADSHEET'}) &&
-      (length($field_string) < $widthsr->{$field})) {
+      (_length($field_string) < $widthsr->{$field})) {
-      $field_string .= " " x ($widthsr->{$field} - length($field_string));
+      $field_string .= " " x ($widthsr->{$field} - _length($field_string));
     }
     if($first) {

Please use you were probably used.

@benbernard
Copy link
Owner Author

Awesome! thanks for the patch @bokutin I'll try to get it into master soon (needs tests too), or if you want feel free to provide a pull request.

Very much thanks for the contribution

@benbernard
Copy link
Owner Author

A note here for later: this certainly also affects toptable

@tsibley
Copy link
Collaborator

tsibley commented May 26, 2016

Unfortunately the patch above is fixing a symptom, not the root cause. The root cause is that recs doesn't decode/encode its input/output streams and instead treats them as bytes. This is why length() is returning the number of bytes instead of the number of characters.

Additionally, using the utility functions inside the utf8 pragma is generally considered bad practice. Reading the documentation for the utf8::valid() function (and the related function utf8::is_utf8()) should make it clear that they test the internal-to-Perl state of a string and therefore do not mean what it may seem they mean.

The proper solution is to consistently decode/encode I/O within all of recs, most likely in one of the top-level stream classes. Requiring I/O be UTF-8 is not unreasonable this day in age, and folks can either explicitly convert when passing to recs or recs could gain a global --encoding option.

To learn more about Unicode and Perl, I highly suggest reading Tom Christiansen's Perl Unicode Essentials talk.

@benbernard
Copy link
Owner Author

Yeah, good point, I was going to try your suggested route first, and it
seems like doing that in InputStream would be the beset approach... I'm
also not opposed to just getting a fix out there, but yeah, agreed on the
right way to approach this

-Ben

On Thu, May 26, 2016 at 9:44 AM Thomas Sibley [email protected]
wrote:

Unfortunately the patch above is fixing a symptom, not the root cause. The
root cause is that recs doesn't decode/encode its input/output streams and
instead treats them as bytes. This is why length() is returning the
number of bytes instead of the number of characters.

Additionally, using the utility functions inside the utf8 pragma is
generally considered bad practice. Reading the documentation for the
utf8::valid() function (and the related function utf8::is_utf8()) should
make it clear that they test the internal-to-Perl state of a string and
therefore do not mean what it may seem they mean.

The proper solution is to consistently decode/encode I/O within all of
recs, most likely in one of the top-level stream classes. Requiring I/O be
UTF-8 is not unreasonable this day in age, and folks can either explicitly
convert when passing to recs or recs could gain a global --encoding
option.

To learn more about Unicode and Perl, I highly suggest reading Tom
Christiansen's Perl Unicode Essentials talk
https://tsibley.github.io/tchrist-OSCON2011-Unicode/pue.html.


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
#71 (comment)

@tsibley
Copy link
Collaborator

tsibley commented May 26, 2016

Nod. The patch may fix the specific case presented, but it introduces other bugs due to treating Perl's internal SV state as a "UTF-8 sniffer." I don't recommend applying it even for the short term.

@tsibley
Copy link
Collaborator

tsibley commented May 26, 2016

However! Text::VisualWidth::PP may still be useful, and I'm glad @bokutin mentioned it. :-)

@benbernard
Copy link
Owner Author

Yep, agreed on both points, you're obviously the encoding expert here :)

On Thu, May 26, 2016 at 9:50 AM Thomas Sibley [email protected]
wrote:

However! Text::VisualWidth::PP may still be useful, and I'm glad @bokutin
https://github.com/bokutin mentioned it. :-)


You are receiving this because you authored the thread.
Reply to this email directly or view it on GitHub
#71 (comment)

@bokutin
Copy link
Contributor

bokutin commented May 26, 2016

Yes, my patch I think that it is shortsighted.

On the other hand, in order to resolve correctly, it may be difficult.

  • Correctly, you need to specify the encoding of the input and the encoding of the (terminal) output.
  • Than pick up from LC_CTYPE might not be appropriate.
  • When would decoded by the input layer might like recs-grep is affected.
    utf8

I may wish to prepare the App::RecordStream::Operation::totable::UTF8 for myself.

@tsibley
Copy link
Collaborator

tsibley commented May 26, 2016

Your screenshot with Encode misunderstands how string literals work in Perl. You need to use utf8 if you're going to put UTF-8 characters in string literals. Otherwise, the strings are treated as bytes/ASCII characters. In your last example, the one which prints 0, you decode your byte-literal into a character string and then compare it to the byte-literal. Perl is properly saying they don't match.

Look at this example which compares characters to characters: it decodes a byte-string on the left hand side and declares string literals as UTF-8 characters on the right hand side:

$ perl -MEncode -E 'say decode_utf8("♥") eq do { use utf8; "♥" } ? 1 : 0'
1

Encodings must always be handled explicitly. Handling them implicitly is a recipe for bugs and confusion. APIs must declare if they deal in characters (i.e. decoded bytes) or in bytes (i.e. encoded characters).

For recs and other tools, the sanest choice is picking a default that will cause the least amount of surprise for as many users and allowing users who need something other than the default to change it. In this case, I'm suggesting recs default to doing UTF-8 I/O and provide CLI options (+ environment vars?) to change that default.

@bokutin
Copy link
Contributor

bokutin commented May 26, 2016

Thank you for you pointed out.

I agree to the tsibley's suggestions. (^ ^)

If the impact of the encoding is also like to recs-grep and recs-eval
You might need to include the use utf8 to eval code.

@tsibley
Copy link
Collaborator

tsibley commented May 27, 2016

Any Perl snippets that recs evaluates will likely get an implicit use utf8 added by recs. This will make the default cases Just Work, while still allowing people to declare their snippets as non-UTF-8 with no utf8.

@tsibley tsibley added the bug label Feb 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants