Meaningful difference between single-byte and UTF8 strings #8329

AlexBekhtin · 2024-11-25T18:38:33Z

-- Firebird-5.0.1.1469-0-windows-x64
-- Database page size = 16384
-- Connection charset = WIN1251

CREATE TABLE TEST_WIN1251 (
  VAL VARCHAR(1024) CHARACTER SET WIN1251
)

CREATE TABLE TEST_UTF8 (
  VAL VARCHAR(1024) CHARACTER SET UTF8
)

-- Import/Generate 100'000 records with 1024 characters
-- Strings like 'OXIkQmIVcGdGNlQSIW7Pg4Q6FIbImfZct5wWCaMug8Rr7lzjxnAX0CPhKRz...'
-- Use test_generate_data.sql

SELECT COUNT(*) FROM TEST_WIN1251
-- COUNT
-- 100 000
SELECT COUNT(*) FROM TEST_UTF8
-- COUNT
-- 100 000

-- All next queries return the same result ~65465

SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL CONTAINING 'az';
-- Execute time = 938ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL CONTAINING 'az'
-- Execute time = 922ms

SELECT COUNT(*) FROM TEST_WIN1251 WHERE CAST(VAL AS VARCHAR(1024) CHARACTER SET NONE) CONTAINING 'az';
-- Execute time = 203ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE CAST(VAL AS VARCHAR(1024) CHARACTER SET NONE) CONTAINING 'az';
-- Execute time = 204ms

SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL SIMILAR TO '%[Aa][Zz]%'
-- Execute time = 250ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL SIMILAR TO '%[Aa][Zz]%'
-- Execute time = 156ms (!)

SELECT COUNT(*) FROM TEST_WIN1251 WHERE LOWER(VAL) LIKE '%az%'
-- Execute time = 985ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE LOWER(VAL) LIKE '%az%'
-- Execute time = 1s 63ms

SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL LIKE '%az%'
                                     OR VAL LIKE '%Az%'
                                     OR VAL LIKE '%aZ%'
                                     OR VAL LIKE '%AZ%'
-- Execute time = 422ms

SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL LIKE '%az%'
                                  OR VAL LIKE '%Az%'
                                  OR VAL LIKE '%aZ%'
                                  OR VAL LIKE '%AZ%'
-- Execute time = 812ms

CREATE COLLATION WIN1251_CI
  FOR WIN1251
  FROM PXW_CYRL
  CASE INSENSITIVE
  ACCENT SENSITIVE

CREATE COLLATION WIN1251_CI_AI
  FOR WIN1251
  FROM PXW_CYRL
  CASE INSENSITIVE
  ACCENT INSENSITIVE  

SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL COLLATE WIN1251_CI LIKE '%az%'
-- Execute time = 281ms
SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL COLLATE WIN1251_CI_AI LIKE '%az%'
-- Execute time = 235ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL COLLATE UNICODE_CI LIKE '%az%'
-- Execute time = 907ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL COLLATE UNICODE_CI_AI LIKE '%az%'
-- Execute time = 8s 344ms (!)


SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0
-- Execute time = 562ms
SELECT COUNT(*) FROM TEST_WIN
1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI_AI) > 0
-- Execute time = 218ms (!)
SELECT COUNT(*) FROM TEST_UTF8 WHERE POSITION('az', VAL COLLATE UNICODE_CI) > 0
-- Execute time = 891ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE POSITION('az', VAL COLLATE UNICODE_CI_AI) > 0
-- Execute time = 8s 281ms

test_generate_data.zip

AlexBekhtin · 2024-11-25T19:34:02Z

Two queries have different results

SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0
-- 23401 -- wrong
-- Execute time = 562ms
SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI_AI) > 0
-- 65465
-- Execute time = 218ms

mrotteveel · 2024-11-26T12:40:16Z

Please edit the first comment and explicitly state what the problem is. Don't just expect us to infer things from reading the code and the timings in the comment (someone may interpret it differently than you are because of making different assumptions).

asfernandes · 2024-11-27T11:09:22Z

SELECT COUNT(*) FROM TEST_WIN1251 WHERE VAL SIMILAR TO '%[Aa][Zz]%'
-- Execute time = 250ms
SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL SIMILAR TO '%[Aa][Zz]%'
-- Execute time = 156ms (!)

SIMILAR TO is done with libre2 using UTF-8. So the first case is converted to UTF-8, it's expected to be slower.

asfernandes · 2024-11-27T11:11:55Z

Two queries have different results

SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0
-- 23401 -- wrong
-- Execute time = 562ms
SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI_AI) > 0
-- 65465
-- Execute time = 218ms

And:

SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0 or POSITION('AZ', VAL COLLATE WIN1251_CI) > 0 or POSITION('Az', VAL COLLATE WIN1251_CI) > 0 or POSITION('aZ', VAL COLLATE WIN1251_CI) > 0;
-- 65316

So what? Where is the error?

asfernandes · 2024-11-27T11:39:39Z

SELECT COUNT(*) FROM TEST_UTF8 WHERE VAL COLLATE UNICODE_CI_AI LIKE '%az%'
-- Execute time = 8s 344ms (!)

SELECT COUNT(*) FROM TEST_UTF8 WHERE POSITION('az', VAL COLLATE UNICODE_CI_AI) > 0
-- Execute time = 8s 281ms

The slow operation is ICU utrans_transUChars for removal of accents.

Do you have a better alternative?

AlexBekhtin · 2024-11-28T10:33:12Z

Please edit the first comment and explicitly state what the problem is. Don't just expect us to infer things from reading the code and the timings in the comment (someone may interpret it differently than you are because of making different assumptions).

The basic premise is that there is a lot of loss when dealing with strings.
The internal DBMS engine is slower than libre2. In my opinion, the regular expression engine is more complex and potentially slower. But it outperforms the internal DBMS engine methods, even the simplest ones like position.

AlexBekhtin · 2024-11-28T10:48:00Z

Two queries have different results

SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0
-- 23401 -- wrong
-- Execute time = 562ms
SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI_AI) > 0
-- 65465
-- Execute time = 218ms

And:

SELECT COUNT(*) FROM TEST_WIN1251 WHERE POSITION('az', VAL COLLATE WIN1251_CI) > 0 or POSITION('AZ', VAL COLLATE WIN1251_CI) > 0 or POSITION('Az', VAL COLLATE WIN1251_CI) > 0 or POSITION('aZ', VAL COLLATE WIN1251_CI) > 0;
-- 65316

So what? Where is the error?

The strings contain only Latin letters and numbers. First query with WIN1251_CI must be return same result. Or am I mistaken and collate is used incorrectly?

AlexBekhtin · 2024-11-28T10:53:52Z

My fault, I didn't explain it well at first.

UNICODE_CI_AI is very slow
libre2 outperforms internal DBMS mechanisms, although its template search is formally more complicated

+ POSITION with COLLATE WIN1251_CI produces unexpected results

asfernandes · 2024-11-28T11:07:57Z

The strings contain only Latin letters and numbers. First query with WIN1251_CI must be return same result. Or am I mistaken and collate is used incorrectly?

Not all old collations correctly supports options like CASE INSENSITIVE and ACCENT INSENSITIVE, and PXW_INTL is one of them.

It's why we have created option to use collations name in format <charset>_unicode.

asfernandes · 2024-11-28T11:09:20Z

2. libre2 outperforms internal DBMS mechanisms, although its template search is formally more complicated

libre2 does not support accent insensitive patterns, so we need to call icu before call re2. It's more operations, so certainly more slow. The major problem is that ICU transform is very slow.

AlexBekhtin · 2024-11-28T12:04:02Z

Not all old collations correctly supports options like CASE INSENSITIVE and ACCENT INSENSITIVE, and PXW_INTL is one of them.

It's why we have created option to use collations name in format <charset>_unicode.

Do you mean FOR EXTERNAL clause?

Is this should work or am I doing something wrong again?

CREATE COLLATION WIN1251_EX_CI
  FOR WIN1251
  FROM EXTERNAL ('WIN1251_UNICODE')
  CASE INSENSITIVE
  ACCENT SENSITIVE

-- Latin letters
SELECT
  (_WIN1251 'AZ' COLLATE PXW_CYRL = _WIN1251 'az' COLLATE PXW_CYRL)||''
FROM RDB$DATABASE
-- FALSE

SELECT
  (_WIN1251 'AZ' COLLATE PXW_CYRL = _WIN1251 'az' COLLATE WIN1251_EX_CI)||''
FROM RDB$DATABASE
-- TRUE

-- Cyrillic letters
SELECT
  (_WIN1251 'ФЯ' COLLATE PXW_CYRL = _WIN1251 'фя' COLLATE PXW_CYRL)||''
FROM RDB$DATABASE
-- FALSE

SELECT
  (_WIN1251 'ФЯ' COLLATE PXW_CYRL = _WIN1251 'ФЯ' COLLATE WIN1251_EX_CI)||''
FROM RDB$DATABASE
-- TRUE

Case-Insensitive Searching

For strings in a character set that has a case-insensitive collation available, you can apply the collation, to compare the search argument and the searched strings directly. For example, using the WIN1251 character set, the collation PXW_CYRL is case-insensitive for this purpose:

AlexBekhtin · 2024-11-28T12:09:23Z

libre2 does not support accent insensitive patterns, so we need to call icu before call re2. It's more operations, so certainly more slow. The major problem is that ICU transform is very slow.

The main point is precisely the slowness of the ICU. ICU is a standard for many projects. Is there any way to change this in future versions of Firebird?

sim1984 · 2024-11-29T10:33:15Z

There is no point in comparing what works correctly and what works incorrectly in terms of speed. When the result is identical, then comparisons in terms of speed still make sense.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Meaningful difference between single-byte and UTF8 strings #8329

Meaningful difference between single-byte and UTF8 strings #8329

AlexBekhtin commented Nov 25, 2024

AlexBekhtin commented Nov 25, 2024

mrotteveel commented Nov 26, 2024

asfernandes commented Nov 27, 2024

asfernandes commented Nov 27, 2024

asfernandes commented Nov 27, 2024

AlexBekhtin commented Nov 28, 2024

AlexBekhtin commented Nov 28, 2024

AlexBekhtin commented Nov 28, 2024 •

edited

Loading

asfernandes commented Nov 28, 2024

asfernandes commented Nov 28, 2024

AlexBekhtin commented Nov 28, 2024

AlexBekhtin commented Nov 28, 2024

sim1984 commented Nov 29, 2024 •

edited

Loading

Meaningful difference between single-byte and UTF8 strings #8329

Meaningful difference between single-byte and UTF8 strings #8329

Comments

AlexBekhtin commented Nov 25, 2024

AlexBekhtin commented Nov 25, 2024

mrotteveel commented Nov 26, 2024

asfernandes commented Nov 27, 2024

asfernandes commented Nov 27, 2024

asfernandes commented Nov 27, 2024

AlexBekhtin commented Nov 28, 2024

AlexBekhtin commented Nov 28, 2024

AlexBekhtin commented Nov 28, 2024 • edited Loading

asfernandes commented Nov 28, 2024

asfernandes commented Nov 28, 2024

AlexBekhtin commented Nov 28, 2024

AlexBekhtin commented Nov 28, 2024

sim1984 commented Nov 29, 2024 • edited Loading

AlexBekhtin commented Nov 28, 2024 •

edited

Loading

sim1984 commented Nov 29, 2024 •

edited

Loading