-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
offending filename should be logged on Malformed UTF-8 error #18
Comments
+1 for re-opening issue #13 Perl has different ways of treating utf8 stuff: The ackmate_ack that is shipped with ackmate uses Encode::encode_utf8($buffer) instead of Encode::encode('UTF-8',$buffer). The previous malformed erros I was getting disappeared when I used the latter version of encode, although I don't know why (yet). I thought this could help others investigate this issue. |
13 was closed because people were getting sick of getting emails about it. As for using the latter version, I can't recall where I saw it, but I recall seeing a statement that encode_utf8 was the correct call. It's also odd that encode('UTF-8', $buffer) would result in fewer errors if it's more strict... However, it's been a very long time since I burned any hours on this issue. If you can come up with a few test files that prove it works better then I will gratefully apply any pull request. Trev |
Hi Trev, I'll see if I can come up with something and possibly submit a pull request to your ack fork. Best regards. |
+1 for this issue, I guess demanding a fix for the actual error might be too much but at least outputting the erroneous file name should be easy for anyone versed in Perl. |
Alright folks, I've put together a solution for reporting offending files. Tell me what you think: https://github.com/protocool/AckMate/wiki/Unicode-UTF-8-error-message |
Would be really nice if this could be fixed. As far as I'm aware, all my files are encoded as UTF-8, but given that the kinds of directories I need to search over contain upward of 1,000 files, it's hard to track down the offending article. I've downgraded for now, but you know how we developers like to be on the bleeding edge :) Even if someone could put together a script that will report all the "broken" files (or suggest one that already exists) and of course ideally fix them (wishful thinking I know!) that would be brilliant too. |
I added a similar patch in another location of ackmate_ack: https://gist.github.com/1135457 |
Addd a note to #13 just because it's the first thing you find in google, but I thought I should comment here: Adding these lines to the end of needs_line_scan fixed this issue completely for me:
Granted, there may be better places to do this encoding, but I just needed a quick fix. Hopefully this helps someone else out if they find the tickets. |
At the end of my needs_line_scan method (around line 2597) I changed the end of the method to:
And you have to add 'use utf8;' to the beginning of the script... does this seem reasonable? |
Guys, thanks for taking the lead on sorting this stuff out. I've seen someone mention that setting export LC_CTYPE=en_US.UTF-8 in the environment helps. I'd like to experiment with the various options before merging into AM proper, and at the same time updating AM's version of ack to the latest. Can any of you point me to files which cause the UTF explosions so I can test against it? Thanks, Trev |
@yyyc514 solution works for me @protocool I was getting stung on language files in ext.js |
No errors should be logged because AFAIK they aren't errors, they're files not in UTF-8 format being interpreted as if they are. Here's a patch for the ackmate_ack to solve the issue - read the file with :raw format first, then convert to UTF-8 using the Encode lib. Does not complain. It also uses @protocool's (or @airways ?) encode strategy for the other possible error section, with which I had no issues. I have not yet submitted this to the ack project because I am unsure if it's universally applicable, but it should work for our purposes. To apply, first paste into an
--- ackmate_ack 2012-02-15 13:46:32.000000000 -0500
+++ ackmate_ack 2012-02-15 13:59:25.000000000 -0500
@@ -1553,12 +1553,15 @@
# If there's no extension, or we don't recognize it, check the shebang line
my $fh;
- if ( !open( $fh, '<', $filename ) ) {
+ if ( !open( $fh, '<:raw', $filename ) ) {
App::Ack::warn( "$filename: $!" );
return;
}
my $header = <$fh>;
close $fh;
+ use Encode;
+ $header = encode("UTF-8", $header);
if ( $header =~ /^#!/ ) {
return ($1,TEXT) if $header =~ /\b(ruby|p(?:erl|hp|ython))\b/;
@@ -2607,10 +2610,14 @@
App::Ack::warn( "$self->{filename}: $!" );
return 1;
}
+
return 0 unless $rc && ( $rc == $size || length(Encode::encode_utf8($buffer)) == $size );
my $regex = $opt->{regex};
- return $buffer =~ /$regex/m;
+
+ use Encode;
+ $buffer = encode("UTF-8", $buffer);
+ return $buffer =~ /$regex/m;
}
|
This is basically the same as my patch for the malformed UTF-8 errors, which I posted on #13. I feel like this ticket should just be closed and the patch should be applied. No one has pointed out any issues with it, so I'm not sure why this hasn't been resolved already. I agree that the request in this ticket seems pointless, we don't need logging for the error - we just need it to go away since ack and ackmate shouldn't tell us what format our files are supposed to be in. |
Correct. I've forked the project and attempted to apply, but the problem, you see, is that ackmate_ack is actually generated from http://github.com/petdance/ack - see: https://github.com/trisweb/AckMate/blob/master/bundle_extras/ackmate_ack.autogenerated So it's not exactly simple. |
@protocool I wget this file http://www.fiscontex.com.br/legislacao/ICMS/aliquotainternaicms.htm and the problem happens with it. This file seens to be written on Windows FrontPage and have weird windows encoding chars. Hope it helps. |
@trisweb 's patch totally work in my case. |
also, this bug was closed for a bogus reason and should be reopened: https://github.com/protocool/AckMate/issues/closed#issue/13
The text was updated successfully, but these errors were encountered: