Index navigation for columns that are encoded using "narrow" (single byte per character) charset: performance of some queries can be poor #8369

Open
pavel-zotov opened this issue Dec 28, 2024 · 1 comment


pavel-zotov commented Dec 28, 2024

The number of fetches during navigation over an index built on a text column can vary strongly, depending on which character is searched for.
Below is a description and the results of a test that checks only NARROW character sets (i.e. one byte per character).
Possibly the problem has the same nature as the one described in #6915 or on SourceForge - but both of those deal with the UNICODE charset.

The following character sets are involved: iso8859_1; iso8859_2; win1250; win1251; win1252; win1253; win1254; win1257; koi8r; koi8u.
For each character set, several collations defined in $FB_HOME/intl/fbintl.conf are checked.
Each of them must not match the name of the charset and must not contain '_UNICODE'; e.g. for

    intl_module = fbintl
    collation = WIN1251
    collation = WIN1251_UNICODE
    collation = PXW_CYRL
    collation = WIN1251_UA

-- only two collations will be checked: PXW_CYRL and WIN1251_UA.
A unique combination of {charset, collation} is the PK of table 'tcsets'; this table is used as the source for the 'main loop' in the test.
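The collation-selection rule above can be sketched like this (a hypothetical helper, not the author's actual script):

```python
# Sketch of the filtering rule: for a given charset, keep only collations
# whose name differs from the charset name and does not contain '_UNICODE'.
def pick_collations(charset, collations):
    return [c for c in collations
            if c.upper() != charset.upper() and '_UNICODE' not in c.upper()]

# The WIN1251 example from fbintl.conf above:
print(pick_collations('WIN1251',
                      ['WIN1251', 'WIN1251_UNICODE', 'PXW_CYRL', 'WIN1251_UA']))
# -> ['PXW_CYRL', 'WIN1251_UA']
```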

For each character set, the full list of its characters was taken from www.fileformat.info/info/charset/<charset_name>/list.htm and stored in the table tchars (with PK = cset, symb), using ISQL with connection charset = UTF8.
Besides this, a list of ASCII-only characters (lower case only) is also stored for each handled charset.

Next, the test ran the query select cset, coll from tcsets and generated 42 tables to store symbols, with DDL like this:

recreate table tmp_win1250_win_cz ( id int primary key, f varchar(1) character set win1250 collate win_cz );
create UNIQUE index unq_win1250_win_cz_asc on tmp_win1250_win_cz(f);
create UNIQUE DESCENDING index unq_win1250_win_cz_dec on tmp_win1250_win_cz(f);
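The per-pair DDL generation can be sketched as follows (an illustrative helper, not the code from the attached script):

```python
# Generate the three DDL statements shown above for one {charset, collation} pair.
DDL_TEMPLATE = """\
recreate table tmp_{cs}_{co} ( id int primary key, f varchar(1) character set {cs} collate {co} );
create UNIQUE index unq_{cs}_{co}_asc on tmp_{cs}_{co}(f);
create UNIQUE DESCENDING index unq_{cs}_{co}_dec on tmp_{cs}_{co}(f);"""

def make_ddl(charset, collation):
    # table and index names are built from the lowercased pair
    return DDL_TEMPLATE.format(cs=charset.lower(), co=collation.lower())

print(make_ddl('WIN1250', 'WIN_CZ'))
```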

After this, each such table was filled with the characters suitable for its charset (table tchars served as the source for that).
Note.
Although some characters with accents / diaereses etc. look different to us, they can be considered the SAME from the point of view of some collation.
To prevent a 'unique violation' exception, a MERGE statement is used to store only the first of such characters.
But this cannot prevent errors 335544321 / 335544565 ("Cannot transliterate character...") from being raised, so such gdscodes have to be suppressed while these "TMP_*" tables are filled.
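The dedup problem can be illustrated like this: if a collation maps several visually distinct characters to the same key, inserting all of them into a UNIQUE index fails. Here `fake_collation_key` is only a stand-in for a real collation, folding accents away via Unicode decomposition:

```python
import unicodedata

def fake_collation_key(ch):
    # strip combining marks, then casefold: 'e', 'é', 'É' all map to 'e'
    decomposed = unicodedata.normalize('NFD', ch)
    return ''.join(c for c in decomposed if not unicodedata.combining(c)).casefold()

def keep_first_per_key(chars):
    # MERGE-like behaviour: store only the first character per collation key
    seen, kept = set(), []
    for ch in chars:
        k = fake_collation_key(ch)
        if k not in seen:
            seen.add(k)
            kept.append(ch)
    return kept

print(keep_first_per_key(['e', 'é', 'É', 'f']))   # -> ['e', 'f']
```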

After this, preparation is over and we can measure the performance of various queries related to index scans.

It was decided to check only queries that have an Index Range Scan in the explained plan, e.g.:
Index "UNQ_ISO8859_1_DA_DA_ASC" Range Scan (lower bound: 1/1, upper bound: 1/1)
or
Index "UNQ_ISO8859_1_DA_DA_ASC" Range Scan (full match)

select count(*) from (select 1 x from tmp_iso8859_1_da_da where F >= 'a' and F < 'ª' order by f asc)
select count(*) from (select 1 x from tmp_iso8859_1_da_da where F starting with 'a' order by f asc)
select count(*) from (select 1 x from tmp_iso8859_1_da_da where F like 'a%' order by f asc)
select count(*) from (select 1 x from tmp_iso8859_1_da_da where F similar to 'a%' order by f asc)

The derived table in each of these queries returns only one record.
So, ideally, every such query must produce only one indexed read, and this really does occur (as can be seen in the trace) lots of times:

select /* trace_tag */ count(*) from (select 1 x from tmp_win1251_win1251_ua where F similar to 'ѕ%' /* seq_no=42703 */ order by f desc)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Select Expression
    -> Aggregate
        -> Filter
            -> Table "TMP_WIN1251_WIN1251_UA" Access By ID
                -> Index "UNQ_WIN1251_WIN1251_UA_DEC" Range Scan (full match)
1 records fetched
      0 ms, 4 fetch(es)

Table                              Natural     Index    Update    Insert    Delete   Backout     Purge   Expunge
****************************************************************************************************************
TMP_WIN1251_WIN1251_UA                             1                                                            

But! Unfortunately, this is not so for all queries. There are some weird cases like these:

select /* trace_tag */ count(*) from (select 1 x from tmp_win1252_pxw_span where F starting with 'l' /* seq_no=30669 */ order by f desc)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Select Expression
    -> Aggregate
        -> Filter
            -> Table "TMP_WIN1252_PXW_SPAN" Access By ID
                -> Index "UNQ_WIN1252_PXW_SPAN_DEC" Range Scan (full match)
1 records fetched
      0 ms, 296 fetch(es)

Table                              Natural     Index    Update    Insert    Delete   Backout     Purge   Expunge
****************************************************************************************************************
TMP_WIN1252_PXW_SPAN                             147

(yes, the 'l' in ... starting with 'l' is an ASCII character, lowercase "L"; the result is the same for STARTING WITH, LIKE and SIMILAR TO; for this charset the same outcome can also be seen for the letter 'c').

One more example:

select /* trace_tag */ count(*) from (select 1 x from tmp_win1250_pxw_hundc where F starting with 'z' /* seq_no=20477 */ order by f desc)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Select Expression
    -> Aggregate
        -> Filter
            -> Table "TMP_WIN1250_PXW_HUNDC" Access By ID
                -> Index "UNQ_WIN1250_PXW_HUNDC_DEC" Range Scan (full match)
1 records fetched
      0 ms, 294 fetch(es)

Table                              Natural     Index    Update    Insert    Delete   Backout     Purge   Expunge
****************************************************************************************************************
TMP_WIN1250_PXW_HUNDC                            146

(same for starting with 'z', like 'z%' and similar to 'z%'; same for the ASCII-only characters c, d, g, l, n, t, s; same for both ascending and descending indices).
As explained here, the Hungarian alphabet has several digraphs: cs, dz, dzs, gy, ly, ny, sz, ty, zs - so maybe this somehow matters.

But one can see such trouble not only for alphabets with digraphs.
For example, charset ISO8859_2 has the collation ISO_PLK (Polish collation), and it also shows such cases, e.g.:

select /* trace_tag */ count(*) from (select 1 x from tmp_iso8859_2_iso_plk where F starting with 'z' /* seq_no=17269 */ order by f desc)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Select Expression
    -> Aggregate
        -> Filter
            -> Table "TMP_ISO8859_2_ISO_PLK" Access By ID
                -> Index "UNQ_ISO8859_2_ISO_PLK_DEC" Range Scan (full match)
1 records fetched
      0 ms, 244 fetch(es)

Table                              Natural     Index    Update    Insert    Delete   Backout     Purge   Expunge
****************************************************************************************************************
TMP_ISO8859_2_ISO_PLK                            121

(same for STARTING WITH, LIKE and SIMILAR TO; same for the characters c, d, l, n, s, t and z; same for both ascending and descending indices).

Ordering the results by the number of fetches (descending) shows that indexed navigation works ideally in all cases only when a GREATER_OR_EQUAL + LESS_THAN comparison is used, i.e. when we use a query with where F >= 'h' and F < 'i' (to search for rows starting with 'h'). In this case the number of fetches is in the range 4...6, as expected.
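For a narrow (single-byte) charset, the well-behaving half-open range can be derived from the prefix character by incrementing its byte code; a minimal sketch (my helper, not part of the attached scripts; the codec choice is an assumption):

```python
def starts_with_as_range(ch, codec='cp1251'):
    """Return (lower, upper) so that  lower <= F < upper  matches F STARTING WITH ch.
    Works for narrow charsets by incrementing the character's single-byte code."""
    code = ch.encode(codec)[0]
    if code == 0xFF:
        raise ValueError('no upper bound exists for the last code point')
    upper = bytes([code + 1]).decode(codec)
    return ch, upper

print(starts_with_as_range('h'))   # -> ('h', 'i'), i.e. F >= 'h' and F < 'i'
```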

I've checked on fresh 4.x, 5.x and 6.x builds.
5.x and 6.x have the same behaviour.
4.x has more oddities, e.g.:

select /* trace_tag */ count(*) from (select 1 x from tmp_win1251_pxw_cyrl where F similar to 'v%' /* seq_no=40967 */ order by f desc)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Select Expression
    -> Aggregate
        -> Filter
            -> Table "TMP_WIN1251_PXW_CYRL" Access By ID
                -> Index "UNQ_WIN1251_PXW_CYRL_DEC" Full Scan
1 records fetched
      0 ms, 306 fetch(es)

Table                              Natural     Index    Update    Insert    Delete   Backout     Purge   Expunge
****************************************************************************************************************
TMP_WIN1251_PXW_CYRL                             152

(win1251 / pxw_cyrl is used for the Russian language, which for sure has no digraphs)

Here are the compressed scripts and results:
fb-charsets-idx-scans_-_check-preformance.zip

  1. fb-charsets-idx-scans-prepare.sql -- this is the STARTING script. Open it in an editor and change the name of the used DB ("R:\temp\tmp4test.fdb"). It is assumed that the ISC_USER and ISC_PASSWORD environment variables are defined;

  2. fb-charsets-idx-scans-runme.py -- this must be launched after fb-charsets-idx-scans-prepare.sql completes. One needs to open it in an editor and adjust this: DB_DSN = r'localhost:R:\temp\tmp4test.fdb';
    example of a run:
    c:\python3x\python.exe fb-charsets-idx-scans-runme.py C:\FB\FB60\fbclient.dll 1>fb-charsets-idx-scans-runme-py.6x.console.log 2>&1

  3. fb-charsets-idx-scans-runme-py.NN.console.log (NN = 4x; 5x; 6x) -- output produced by fb-charsets-idx-scans-runme.py; to see the most problematic cases, open any of these logs and search for the text: Saving data in TPERF table. The lines after it show the queries with the greatest number of fetches (294 ... 306).

  4. fb-charsets-idx-scans-runme-py.NN.trace.log (NN = 4x; 5x; 6x) -- content of the trace logs that were running while the script fb-charsets-idx-scans-runme.py was working;

  5. fb-charsets-idx-scans-perf.NN.fbk (NN = 4x; 5x; 6x) -- backups of the databases that store the source data and the results.
    It is easy to get the 'top problematic' queries using any of these databases.
    Just connect to it in IBE or any other IDE and run:
    select * from tperf p order by p.fetches_cnt desc, cset_coll;
    You will see something like this:
    [screenshot]
    Click on the SEARCH_SQL column in some row and copy its content to the clipboard (e.g. select count(*) from ( select 1 x from tmp_win1252_pxw_span where F starting with 'l' /* seq_no=15137 */ )).
    Then run it and look at the 'Performance' tab:
    [screenshot]
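The fetch counts can also be pulled straight out of the trace logs (the "N fetch(es)" lines shown in the examples above); the regex and sample text below are illustrative, not the author's script:

```python
import re

# matches trace lines like "      0 ms, 296 fetch(es)"
FETCHES = re.compile(r'(\d+)\s+fetch\(es\)')

def fetch_counts(trace_text):
    return [int(m) for m in FETCHES.findall(trace_text)]

sample = """\
1 records fetched
      0 ms, 296 fetch(es)
1 records fetched
      0 ms, 4 fetch(es)
"""
print(fetch_counts(sample))   # -> [296, 4]
```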

@pavel-zotov (Author) commented:

PS.
BTW, select count(*) from tmp_win1252_pxw_span shows: 147 :-)
Something forces the engine to traverse all nodes of the index leaf level.
