The number of fetches during navigation on an index built on a text column can vary strongly, depending on which character is searched.
Below are the description and results of a test that checks only NARROW character sets (i.e. one byte per character).
Possibly this problem has the same nature as the one described in #6915 or on SourceForge, but both of those deal with the UNICODE charset.
The following character sets are involved: iso8859_1; iso8859_2; win1250; win1251; win1252; win1253; win1254; win1257; koi8r; koi8u.
For each character set, a few collations defined in $FB_HOME/intl/fbintl.conf are checked.
Each of these collations must not match the name of its charset and must not contain '_UNICODE'; e.g. for win1251 only two collations will be checked: PXW_CYRL and WIN1251_UA.
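Just for illustration, this selection rule could be expressed against the system tables like this (a sketch only; the test itself takes the list of collations from fbintl.conf):
select cs.rdb$character_set_name as cset, co.rdb$collation_name as coll
from rdb$collations co
join rdb$character_sets cs using (rdb$character_set_id)
where co.rdb$collation_name <> cs.rdb$character_set_name -- skip the default collation of each charset
and co.rdb$collation_name not containing '_UNICODE';     -- skip UNICODE-based collations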
A unique combination of {charset, collation} is the PK of the table 'tcsets'; this table is used as the source for the 'main loop' of the test.
For each character set, the full list of its characters was taken from www.fileformat.info/info/charset/<charset_name>/list.htm and stored in the table tchars (with PK = {cset, symb}), using ISQL with connection charset = UTF8.
Besides this, a list of ASCII-only characters (lower case only) is also stored for each handled charset.
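The actual DDL is in the attached script; a minimal sketch of these two helper tables, assuming the column names mentioned above, could look like:
recreate table tcsets (
    cset varchar(31) not null, -- character set name
    coll varchar(31) not null, -- collation name
    constraint pk_tcsets primary key (cset, coll)
);
recreate table tchars (
    cset varchar(31) not null,                   -- character set name
    symb varchar(1) character set utf8 not null, -- single character, loaded via a UTF8 connection
    constraint pk_tchars primary key (cset, symb)
);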
Next, the test ran the query select cset, coll from tcsets and generated 42 tables to store symbols, with DDL like this:
recreate table tmp_win1250_win_cz ( id int primary key, f varchar(1) character set win1250 collate win_cz );
create UNIQUE index unq_win1250_win_cz_asc on tmp_win1250_win_cz(f);
create UNIQUE DESCENDING index unq_win1250_win_cz_dec on tmp_win1250_win_cz(f);
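A hedged sketch of how these 42 tables could be generated from that query (the attached script may do this differently; in isql the block must be wrapped in SET TERM):
set term ^;
execute block as
    declare v_cset varchar(31);
    declare v_coll varchar(31);
begin
    for select cset, coll from tcsets into v_cset, v_coll do
    begin
        -- run the DDL shown above for every {charset, collation} pair;
        -- similar EXECUTE STATEMENT calls would create the ASC/DESC unique indices
        execute statement
            'recreate table tmp_' || trim(v_cset) || '_' || trim(v_coll) ||
            ' ( id int primary key, f varchar(1) character set ' ||
            trim(v_cset) || ' collate ' || trim(v_coll) || ' )';
    end
end^
set term ;^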
After this, each such table was filled with the characters suitable for its charset (the table tchars was the source for that).
Note.
Although some characters with accents / diaereses etc. look different to us, they can be considered the SAME from some collation's point of view.
In order to prevent a 'unique violation' exception, a MERGE statement is used so that only the first of such characters is stored.
But this cannot prevent errors 335544321 / 335544565 ("Cannot transliterate character... ") from being raised, so such gdscodes have to be suppressed while the "TMP_*" tables are being filled.
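For one of the tables such a MERGE could look like this (a sketch; the parameter placeholders and table name follow the example above, the real statements are generated per table):
merge into tmp_win1250_win_cz t
using (
    select cast(? as int) as id,
           cast(? as varchar(1) character set win1250) as f
    from rdb$database
) s
on t.f = s.f -- collation-aware comparison: characters that are equal under WIN_CZ match here
when not matched then
    insert (id, f) values (s.id, s.f);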
After this, preparation is over and we can measure the performance of various queries related to index scans.
It was decided to check only queries that have an INDEX RANGE scan in the explained plan, e.g.: Index "UNQ_ISO8859_1_DA_DA_ASC" Range Scan (lower bound: 1/1, upper bound: 1/1)
or Index "UNQ_ISO8859_1_DA_DA_ASC" Range Scan (full match)
select count(*) from (select 1 x from tmp_iso8859_1_da_da where F >= 'a' and F < 'ª' order by f asc)
select count(*) from (select 1 x from tmp_iso8859_1_da_da where F starting with 'a' order by f asc)
select count(*) from (select 1 x from tmp_iso8859_1_da_da where F like 'a%' order by f asc)
select count(*) from (select 1 x from tmp_iso8859_1_da_da where F similar to 'a%' order by f asc)
The derived table in each of these queries will return only one record.
So, every such query should (ideally) produce only one indexed read, and this really does occur (as can be seen in the trace) lots of times:
select /* trace_tag */ count(*) from (select 1 x from tmp_win1251_win1251_ua where F similar to 'ѕ%' /* seq_no=42703 */ order by f desc)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Select Expression
-> Aggregate
-> Filter
-> Table "TMP_WIN1251_WIN1251_UA" Access By ID
-> Index "UNQ_WIN1251_WIN1251_UA_DEC" Range Scan (full match)
1 records fetched
0 ms, 4 fetch(es)
Table Natural Index Update Insert Delete Backout Purge Expunge
****************************************************************************************************************
TMP_WIN1251_WIN1251_UA 1
But unfortunately, this is not so for all queries. There are some weird cases like these:
select /* trace_tag */ count(*) from (select 1 x from tmp_win1252_pxw_span where F starting with 'l' /* seq_no=30669 */ order by f desc)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Select Expression
-> Aggregate
-> Filter
-> Table "TMP_WIN1252_PXW_SPAN" Access By ID
-> Index "UNQ_WIN1252_PXW_SPAN_DEC" Range Scan (full match)
1 records fetched
0 ms, 296 fetch(es)
Table Natural Index Update Insert Delete Backout Purge Expunge
****************************************************************************************************************
TMP_WIN1252_PXW_SPAN 147
(yes, the l in ... starting with 'l' is an ASCII character, lowercase "L"; the result is the same for 'starting with', 'like' and 'similar to'; for this charset such an outcome can also be seen for the letter 'c').
One more example:
select /* trace_tag */ count(*) from (select 1 x from tmp_win1250_pxw_hundc where F starting with 'z' /* seq_no=20477 */ order by f desc)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Select Expression
-> Aggregate
-> Filter
-> Table "TMP_WIN1250_PXW_HUNDC" Access By ID
-> Index "UNQ_WIN1250_PXW_HUNDC_DEC" Range Scan (full match)
1 records fetched
0 ms, 294 fetch(es)
Table Natural Index Update Insert Delete Backout Purge Expunge
****************************************************************************************************************
TMP_WIN1250_PXW_HUNDC 146
(same for starting with 'z', like 'z%' and similar to 'z%'; same for the ASCII-only characters c, d, g, l, n, t, s; same for both ascending and descending indices).
As explained here, the Hungarian alphabet has several digraphs: cs, dz, dzs, gy, ly, ny, sz, ty, zs - so maybe this somehow matters.
But one may see such trouble not only for alphabets with digraphs.
For example, the charset ISO8859_2 has the collation ISO_PLK (Polish collation), and it also shows such cases, e.g.:
select /* trace_tag */ count(*) from (select 1 x from tmp_iso8859_2_iso_plk where F starting with 'z' /* seq_no=17269 */ order by f desc)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Select Expression
-> Aggregate
-> Filter
-> Table "TMP_ISO8859_2_ISO_PLK" Access By ID
-> Index "UNQ_ISO8859_2_ISO_PLK_DEC" Range Scan (full match)
1 records fetched
0 ms, 244 fetch(es)
Table Natural Index Update Insert Delete Backout Purge Expunge
****************************************************************************************************************
TMP_ISO8859_2_ISO_PLK 121
(same for STARTING WITH, LIKE, SIMILAR TO; same for characters c, d, l, n, s, t and z; same for both ascending and descending indices).
Ordering the results by number of fetches (descending) shows that indexed navigation works ideally in all cases only when a GREATER_OR_EQUALS + LESS_THAN comparison is used, e.g. a query with where F >= 'h' and F < 'i' (to search for rows which start with 'h'). In this case the number of fetches is in the range 4...6 (as expected).
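E.g., the 'starting with' search from the queries above, rewritten in this well-behaving form:
select count(*) from (select 1 x from tmp_iso8859_1_da_da where F >= 'h' and F < 'i' order by f asc)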
I've checked on fresh 4.x, 5.x and 6.x builds.
5.x and 6.x have the same behaviour.
4.x has more weird cases, e.g.:
select /* trace_tag */ count(*) from (select 1 x from tmp_win1251_pxw_cyrl where F similar to 'v%' /* seq_no=40967 */ order by f desc)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Select Expression
-> Aggregate
-> Filter
-> Table "TMP_WIN1251_PXW_CYRL" Access By ID
-> Index "UNQ_WIN1251_PXW_CYRL_DEC" Full Scan
1 records fetched
0 ms, 306 fetch(es)
Table Natural Index Update Insert Delete Backout Purge Expunge
****************************************************************************************************************
TMP_WIN1251_PXW_CYRL 152
(win1251 / pxw_cyrl is used for the Russian language, which for sure has no digraphs)
Here are the compressed scripts and results: fb-charsets-idx-scans_-_check-preformance.zip
fb-charsets-idx-scans-prepare.sql -- this is the STARTING script. Open it in an editor and change the name of the DB to be used ("R:\temp\tmp4test.fdb"). It is assumed that the ISC_USER and ISC_PASSWORD environment variables are defined;
fb-charsets-idx-scans-runme.py -- this must be launched after fb-charsets-idx-scans-prepare.sql completes. One needs to open it in an editor and adjust this: DB_DSN = r'localhost:R:\temp\tmp4test.fdb';
Example of a run: c:\python3x\python.exe fb-charsets-idx-scans-runme.py C:\FB\FB60\fbclient.dll 1>fb-charsets-idx-scans-runme-py.6x.console.log 2>&1
fb-charsets-idx-scans-runme-py.NN.console.log (NN = 4x; 5x; 6x) -- output produced by fb-charsets-idx-scans-runme.py; to see the most problematic cases, open any of these logs and search for the text: Saving data in TPERF table. The lines after this show the queries with the greatest number of fetches (294 ... 306).
fb-charsets-idx-scans-runme-py.NN.trace.log (NN = 4x; 5x; 6x) -- content of the trace logs that were running while fb-charsets-idx-scans-runme.py was working.
fb-charsets-idx-scans-perf.NN.fbk (NN = 4x; 5x; 6x) -- backups of the databases which stored the source data and results.
It is easy to get the 'top problematic' queries using any of these databases.
Just connect to one in IBE or any other IDE and run: select * from tperf p order by p.fetches_cnt desc, cset_coll;
You will see something like this:
Click on the SEARCH_SQL column in some line and copy its content to the clipboard (e.g. select count(*) from ( select 1 x from tmp_win1252_pxw_span where F starting with 'l' /* seq_no=15137 */ )).
Then run it and look at the 'Performance' tab: