#============================================================================
# Enca v1.19-dev (2016-01-07) guess and convert encoding of text files
# Copyright (C) 2000-2003 David Necas (Yeti) <[email protected]>
# Copyright (C) 2009-2016 Michal Cihar <[email protected]>
#============================================================================
Frequently Asked Questions about Enca
* Q: What obscure encoding do the FAQ, THANKS, and other files use?
* A: Run Enca on them and you'll see.
* Q: How do I specify input encoding?
* A: You can't. If you know the encoding, you don't need Enca. Use some
fully-fledged converter.
* Q: Why can't Enca detect both language and encoding?
* A: Because this is impossible in general. Well, it's possible for natural,
long enough texts. But Enca can also detect the encoding of nonsense, and
people are used to it. No program can tell you “ěčřýáíéú” is both Czech
and e.g. ISO-8859-2. Incidentally, interpreting the same bytes as
Win-1251 one gets “мишэбнйъ”, which is equally good Russian nonsense.
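The ambiguity in the answer above is easy to reproduce. The following is
only a small Python sketch, not anything Enca itself does: it encodes the
nonsense string as ISO-8859-2 and decodes the very same bytes as Windows-1251
(codec name cp1251 in Python). Both decodings succeed, so the bytes alone
cannot say which language was meant.

```python
# Encode the Czech nonsense string from the answer above as ISO-8859-2.
raw = "ěčřýáíéú".encode("iso-8859-2")

# Decode the very same bytes under two different 8-bit charsets.
as_latin2 = raw.decode("iso-8859-2")
as_cp1251 = raw.decode("cp1251")

print(raw.hex())     # the eight bytes shared by both readings
print(as_latin2)     # read as ISO-8859-2 (Czech letters)
print(as_cp1251)     # read as Windows-1251 (Cyrillic letters)
```

Neither decode raises an error, which is why a language hint is needed
to choose between the two readings.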
* Q: Why can't Enca recognise the encoding of all-uppercase texts?
* A: Mostly, again, because there's a trade-off between nonsense detection and
low-probability natural text detection. But it's possible to detect
both; I'm working on that.
* Q: Why does Enca need LC_CTYPE to be set?
* A: It doesn't need it. But without it you have to use “-L language-code”.
It's possible to put it into the ENCAOPT environment variable, if you want
to make your life easier. (Newer versions of Enca try to guess your
language from other locale settings too.)
* Q: Why doesn't “enca -x ascii” work?
* A: Unfortunately there are several different things people call “conversion
to ASCII”. Consider the following characters: ě (Latin small letter e with
caron) and ≫ (Much greater-than). By conversion to ASCII you may mean:
1a. Omitting these characters in output, because they are not
representable in ASCII.
1b. Keeping these characters intact in output, because they are not
representable in ASCII.
1c. Failing (possibly damaging the file), because they are not
representable in ASCII.
2. Approximating them with single, close ASCII characters; in this case
probably plain “e” and “>”.
3. Expanding them to sequences of ASCII characters which may be less
readable than the approximations above, but generally allow
reconstruction of the original characters. Such sequences could
be e.g. RFC-1345 mnemonics “e<” and “>>”.
What happens when you run “enca -x ascii” on something depends on
the converter used.
The usual scenario is: Enca uses librecode or libiconv, which do (1),
and you get upset. The only tool doing (2) known to me is cstocs, and
it does so only for Latin2 characters (install cstocs and specify cstocs
as the converter to be used, if you want (2)). AFAIK, there's no tool
that does (3) reasonably, though recode can expand to mnemonics (use rfc1345
as target charset instead of ascii).
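To make the different meanings of “conversion to ASCII” in the answer above
concrete, here is a rough Python sketch of (1a), (1c), (2) and (3) applied to
the two example characters. It does not show what Enca, recode, iconv or
cstocs actually do. The APPROX table is a hypothetical two-entry mapping
written just for this example, and for (3) Python's backslash escapes stand
in for RFC-1345 mnemonics. (1b) is left out because its output is not ASCII
at all.

```python
text = "ě ≫"

# (1a) Omit characters that are not representable in ASCII.
omitted = text.encode("ascii", errors="ignore").decode("ascii")

# (1c) Fail on the first character that is not representable.
try:
    text.encode("ascii")
    failed = False
except UnicodeEncodeError:
    failed = True

# (2) Approximate with single, close ASCII characters.
# APPROX is a hypothetical table covering only the two example characters.
APPROX = {"ě": "e", "≫": ">"}
approximated = "".join(APPROX.get(ch, ch) for ch in text)

# (3) Expand to ASCII sequences that allow reconstruction.
# Python \uXXXX escapes are used here instead of RFC-1345 mnemonics.
expanded = text.encode("ascii", errors="backslashreplace").decode("ascii")

# Reconstruction works here only because the input has no backslashes.
restored = expanded.encode("ascii").decode("unicode_escape")

print(repr(omitted))
print(failed)
print(approximated)
print(expanded)
print(restored == text)
```

The four results differ, which is the point: “ASCII” as a target does not
say which of these behaviours you want.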
* Q: Why doesn't “enca -E cstocs” work in my RedHat/Fedora?
* A: This is a cstocs problem. Cstocs is broken for Perl ≥ 5.8 and UTF-8
locales. Perl Unicode handling changes with every version, so an
advanced charset converter working in multiple Perl versions is something
between impossible and a big mess. Either set your locales to non-UTF-8
(e.g. ISO-8859-2) or don't use cstocs until this is resolved somehow.