Meaningful difference between single-byte and UTF8 strings #8329
Two queries return different results:
SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0
-- 23401 -- wrong
-- Execute time = 562ms
SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI_AI) > 0
-- 65465
-- Execute time = 218ms
Please edit the first comment and explicitly state what the problem is. Don't expect us to infer it from the code and the timings in the comments; someone working from different assumptions may interpret them differently than you do.
And:
So what? Where is the error?
The slow operation is ICU. Do you have a better alternative?
The basic premise is that there is a lot of loss when dealing with strings.
The strings contain only Latin letters and digits, so the first query, with WIN1251_CI, must return the same result as the second. Or am I mistaken, and is the collation being used incorrectly?
My fault, I didn't explain it well at first.
Not all old collations correctly support options like CASE INSENSITIVE and ACCENT INSENSITIVE, and PXW_INTL is one of them. That is why we created the option to use collation names in the format
libre2 does not support accent-insensitive patterns, so we need to call ICU before calling re2. That means more operations, so it is certainly slower. The major problem is that the ICU transform is very slow.
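To illustrate the two-step approach described above, here is a rough Python sketch. It is not Firebird's code: `unicodedata` stands in for the ICU transform, Python's `re` stands in for re2, and the function names `strip_accents`, `contains_ci` and `contains_ci_ai` are made up for this example. It only shows why the accent-insensitive path does extra work on every string before the pattern match runs.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Stand-in for the ICU transform: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def contains_ci(needle: str, haystack: str) -> bool:
    """Case-insensitive search: a single regex pass, no extra transform."""
    return re.search(re.escape(needle), haystack, re.IGNORECASE) is not None


def contains_ci_ai(needle: str, haystack: str) -> bool:
    """Case- and accent-insensitive search: transform both sides, then match."""
    return contains_ci(strip_accents(needle), strip_accents(haystack))
```

In this sketch the two functions agree on strings made only of unaccented Latin letters and digits, and differ only when accented characters are present.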
Do you mean
Should this work, or am I doing something wrong again?
CREATE COLLATION WIN1251_EX_CI
FOR WIN1251
FROM EXTERNAL ('WIN1251_UNICODE')
CASE INSENSITIVE
ACCENT SENSITIVE
-- Latin letters
SELECT
(_WIN1251 'AZ' COLLATE PXW_CYRL = _WIN1251 'az' COLLATE PXW_CYRL)||''
FROM RDB$DATABASE
-- FALSE
SELECT
(_WIN1251 'AZ' COLLATE PXW_CYRL = _WIN1251 'az' COLLATE WIN1251_EX_CI)||''
FROM RDB$DATABASE
-- TRUE
-- Cyrillic letters
SELECT
(_WIN1251 'ФЯ' COLLATE PXW_CYRL = _WIN1251 'фя' COLLATE PXW_CYRL)||''
FROM RDB$DATABASE
-- FALSE
SELECT
(_WIN1251 'ФЯ' COLLATE PXW_CYRL = _WIN1251 'ФЯ' COLLATE WIN1251_EX_CI)||''
FROM RDB$DATABASE
-- TRUE
The main point is precisely the slowness of ICU. ICU is a standard for many projects. Is there any way to change this in future versions of Firebird?
There is no point in comparing the speed of something that works correctly with something that works incorrectly. Speed comparisons only make sense when the results are identical.
test_generate_data.zip