Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Meaningful difference between single-byte and UTF8 strings #8329

Open
AlexBekhtin opened this issue Nov 25, 2024 · 13 comments
Open

Meaningful difference between single-byte and UTF8 strings #8329

AlexBekhtin opened this issue Nov 25, 2024 · 13 comments

Comments

@AlexBekhtin
Copy link

-- Firebird-5.0.1.1469-0-windows-x64
-- Database page size = 16384
-- Connection charset = WIN1251

CREATE TABLE TEST_WIN1251 (
  VAL VARCHAR(1024) CHARACTER SET WIN1251
)

CREATE TABLE TEST_UTF8 (
  VAL VARCHAR(1024) CHARACTER SET UTF8
)

-- Import/Generate 100'000 records with 1024 characters
-- Strings like 'OXIkQmIVcGdGNlQSIW7Pg4Q6FIbImfZct5wWCaMug8Rr7lzjxnAX0CPhKRz...'
-- Use test_generate_data.sql

SELECT COUNT(*) FROM TEST_WIN1251
-- COUNT
-- 100 000
SELECT COUNT(*) FROM TEST_UTF8
-- COUNT
-- 100 000

-- All next queries return the same result ~65465

SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL CONTAINING 'az';
-- Execute time = 938ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL CONTAINING 'az'
-- Execute time = 922ms

SELECT COUNT(*) FROM TEST_WIN1251 WHERE CAST(VAL AS VARCHAR(1024) CHARACTER SET NONE) CONTAINING 'az';
-- Execute time = 203ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE CAST(VAL AS VARCHAR(1024) CHARACTER SET NONE) CONTAINING 'az';
-- Execute time = 204ms

SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL SIMILAR TO '%[Aa][Zz]%'
-- Execute time = 250ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL SIMILAR TO '%[Aa][Zz]%'
-- Execute time = 156ms (!)

SELECT COUNT(*) FROM TEST_WIN1251 WHERE LOWER(VAL) LIKE '%az%'
-- Execute time = 985ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE LOWER(VAL) LIKE '%az%'
-- Execute time = 1s 63ms

SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL LIKE '%az%'
                                     OR VAL LIKE '%Az%'
                                     OR VAL LIKE '%aZ%'
                                     OR VAL LIKE '%AZ%'
-- Execute time = 422ms

SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL LIKE '%az%'
                                  OR VAL LIKE '%Az%'
                                  OR VAL LIKE '%aZ%'
                                  OR VAL LIKE '%AZ%'
-- Execute time = 812ms

CREATE COLLATION WIN1251_CI
  FOR WIN1251
  FROM PXW_CYRL
  CASE INSENSITIVE
  ACCENT SENSITIVE

CREATE COLLATION WIN1251_CI_AI
  FOR WIN1251
  FROM PXW_CYRL
  CASE INSENSITIVE
  ACCENT INSENSITIVE  

SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL COLLATE WIN1251_CI LIKE '%az%'
-- Execute time = 281ms
SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL COLLATE WIN1251_CI_AI LIKE '%az%'
-- Execute time = 235ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL COLLATE UNICODE_CI LIKE '%az%'
-- Execute time = 907ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL COLLATE UNICODE_CI_AI LIKE '%az%'
-- Execute time = 8s 344ms (!)


SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0
-- Execute time = 562ms
SELECT COUNT(*) FROM TEST_WIN
1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI_AI) > 0
-- Execute time = 218ms (!)
SELECT COUNT(*) FROM TEST_UTF8 WHERE POSITION('az', VAL COLLATE UNICODE_CI) > 0
-- Execute time = 891ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE POSITION('az', VAL COLLATE UNICODE_CI_AI) > 0
-- Execute time = 8s 281ms

test_generate_data.zip

@AlexBekhtin
Copy link
Author

Two queries have different results

SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0
-- 23401 -- wrong
-- Execute time = 562ms
SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI_AI) > 0
-- 65465
-- Execute time = 218ms

@mrotteveel
Copy link
Member

Please edit the first comment and explicitly state what the problem is. Don't just expect us to infer things from reading the code and the timings in the comment (someone may interpret it differently than you are because of making different assumptions).

@asfernandes
Copy link
Member

SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL SIMILAR TO '%[Aa][Zz]%'
-- Execute time = 250ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL SIMILAR TO '%[Aa][Zz]%'
-- Execute time = 156ms (!)

SIMILAR TO is done with libre2 using UTF-8. So the first case is converted to UTF-8, it's expected to be slower.

@asfernandes
Copy link
Member

Two queries have different results

SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0
-- 23401 -- wrong
-- Execute time = 562ms
SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI_AI) > 0
-- 65465
-- Execute time = 218ms

And:

SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0 or POSITION('AZ', VAL COLLATE WIN1251_CI) > 0 or POSITION('Az', VAL COLLATE WIN1251_CI) > 0 or POSITION('aZ', VAL COLLATE WIN1251_CI) > 0;
-- 65316

So what? Where is the error?

@asfernandes
Copy link
Member

SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL COLLATE UNICODE_CI_AI LIKE '%az%'
-- Execute time = 8s 344ms (!)
SELECT COUNT(*) FROM TEST_UTF8 WHERE POSITION('az', VAL COLLATE UNICODE_CI_AI) > 0
-- Execute time = 8s 281ms

The slow operation is ICU utrans_transUChars for removal of accents.

Do you have a better alternative?

@AlexBekhtin
Copy link
Author

Please edit the first comment and explicitly state what the problem is. Don't just expect us to infer things from reading the code and the timings in the comment (someone may interpret it differently than you are because of making different assumptions).

The basic premise is that there is a lot of loss when dealing with strings.
The internal DBMS engine is slower than libre2. In my opinion, the regular expression engine is more complex and potentially slower. But it outperforms the internal DBMS engine methods, even the simplest ones like position.

@AlexBekhtin
Copy link
Author

Two queries have different results

SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0
-- 23401 -- wrong
-- Execute time = 562ms
SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI_AI) > 0
-- 65465
-- Execute time = 218ms

And:

SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0 or POSITION('AZ', VAL COLLATE WIN1251_CI) > 0 or POSITION('Az', VAL COLLATE WIN1251_CI) > 0 or POSITION('aZ', VAL COLLATE WIN1251_CI) > 0;
-- 65316

So what? Where is the error?

The strings contain only Latin letters and numbers. First query with WIN1251_CI must be return same result. Or am I mistaken and collate is used incorrectly?

@AlexBekhtin
Copy link
Author

AlexBekhtin commented Nov 28, 2024

My fault, I didn't explain it well at first.

  1. UNICODE_CI_AI is very slow
  2. libre2 outperforms internal DBMS mechanisms, although its template search is formally more complicated

+ POSITION with COLLATE WIN1251_CI produces unexpected results

@asfernandes
Copy link
Member

The strings contain only Latin letters and numbers. First query with WIN1251_CI must be return same result. Or am I mistaken and collate is used incorrectly?

Not all old collations correctly supports options like CASE INSENSITIVE and ACCENT INSENSITIVE, and PXW_INTL is one of them.

It's why we have created option to use collations name in format <charset>_unicode.

@asfernandes
Copy link
Member

2. libre2 outperforms internal DBMS mechanisms, although its template search is formally more complicated

libre2 does not support accent insensitive patterns, so we need to call icu before call re2. It's more operations, so certainly more slow. The major problem is that ICU transform is very slow.

@AlexBekhtin
Copy link
Author

Not all old collations correctly supports options like CASE INSENSITIVE and ACCENT INSENSITIVE, and PXW_INTL is one of them.

It's why we have created option to use collations name in format <charset>_unicode.

Do you mean FOR EXTERNAL clause?

Is this should work or am I doing something wrong again?

CREATE COLLATION WIN1251_EX_CI
  FOR WIN1251
  FROM EXTERNAL ('WIN1251_UNICODE')
  CASE INSENSITIVE
  ACCENT SENSITIVE

-- Latin letters
SELECT
  (_WIN1251 'AZ' COLLATE PXW_CYRL = _WIN1251 'az' COLLATE PXW_CYRL)||''
FROM RDB$DATABASE
-- FALSE

SELECT
  (_WIN1251 'AZ' COLLATE PXW_CYRL = _WIN1251 'az' COLLATE WIN1251_EX_CI)||''
FROM RDB$DATABASE
-- TRUE

-- Cyrillic letters
SELECT
  (_WIN1251 'ФЯ' COLLATE PXW_CYRL = _WIN1251 'фя' COLLATE PXW_CYRL)||''
FROM RDB$DATABASE
-- FALSE

SELECT
  (_WIN1251 'ФЯ' COLLATE PXW_CYRL = _WIN1251 'ФЯ' COLLATE WIN1251_EX_CI)||''
FROM RDB$DATABASE
-- TRUE

Case-Insensitive Searching

For strings in a character set that has a case-insensitive collation available, you can apply the collation, to compare the search argument and the searched strings directly. For example, using the WIN1251 character set, the collation PXW_CYRL is case-insensitive for this purpose:

@AlexBekhtin
Copy link
Author

libre2 does not support accent insensitive patterns, so we need to call icu before call re2. It's more operations, so certainly more slow. The major problem is that ICU transform is very slow.

The main point is precisely the slowness of the ICU. ICU is a standard for many projects. Is there any way to change this in future versions of Firebird?

@sim1984
Copy link

sim1984 commented Nov 29, 2024

There is no point in comparing what works correctly and what works incorrectly in terms of speed. When the result is identical, then comparisons in terms of speed still make sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants