Optimizing SQL with LIKE expression – 3

So far I’ve written two number of blog notes about how to optimize sql with LIKE expression (here and here). A couple of days ago I built another funny case including two different columns in the WHERE clause of SQL statement, the former is a like expression and the latter is a constant expression. Hence this post will focus on talking about how to optimize this kind of SQL query.

Here is a demo (same as my previous two posts) creating a test table and quickly inserting some random data with 3e5 lines.

CREATE TABLE person
( id   NUMBER GENERATED ALWAYS AS IDENTITY
, name VARCHAR2(6) NOT NULL
, sex  VARCHAR2(1) NOT NULL
, flag VARCHAR2(1)
, pwd  VARCHAR2(6)
, CONSTRAINT person_pk PRIMARY KEY(id)
);

EXEC DBMS_RANDOM.seed(0);

INSERT /*+APPEND*/ INTO person (name, sex, flag, pwd)
SELECT DBMS_RANDOM.string('A', 6) AS name
     , CASE MOD(ROWNUM, 2) WHEN 0 THEN 'F'
                           WHEN 1 THEN 'M'
       END sex
     , CASE MOD(ROWNUM, 3) WHEN 0 THEN 'T'
                           WHEN 1 THEN NULL
                           WHEN 2 THEN 'F'
       END flag
     , DBMS_RANDOM.string ('p', 6) AS pwd
FROM dual
CONNECT BY level <= 3e5;

COMMIT;

EXEC DBMS_STATS.gather_table_stats(ownname => NULL, tabname => 'PERSON');

As known as wondering a running SQL query is whether more efficient or not, it’s best to check its row-source execution plan that acquired from memory. The simple SQL query is as below.

SELECT *
  FROM person
 WHERE name LIKE 'qw%'
   AND sex = 'F'
;

Now I will use the following three type of cases to optimize the preceding SQL query.

  • creating two seprarate normal indexes for two columns
  • creating a normal index and a bitmap index for two columns
  • creating a composite index for two columns

case 1 : two separate normal indexes

Firstly let us create these two normal indexes respectively on column “name” and “sex” of table “person”.

CREATE INDEX idx_person_name ON person (name);
CREATE INDEX idx_person_sex  ON person (sex);

SET LINESIZE 300
SET PAGESIZE 150
COLUMN index_name FORMAT a25
COLUMN index_type FORMAT a22

SELECT index_name
     , index_type
     , visibility
     , clustering_factor
     , num_rows
FROM   user_indexes
WHERE  table_name = 'PERSON'
;

INDEX_NAME                INDEX_TYPE             VISIBILIT CLUSTERING_FACTOR   NUM_ROWS
------------------------- ---------------------- --------- ----------------- ----------
PERSON_PK                 NORMAL                 VISIBLE                1148     300000
IDX_PERSON_NAME           NORMAL                 VISIBLE              299764     300000
IDX_PERSON_SEX            NORMAL                 VISIBLE                2296     300000

Secondly I will run the SQL mentioned previously twice (so that cbo is able to get an awesome statistics feedback) and pulled their execution plan from memory as well as observe their slight difference.

SET SERVEROUTPUT OFF
ALTER SESSION SET statistics_level = all;

SELECT *
  FROM person
 WHERE name LIKE 'qw%'
   AND sex = 'F'
;

SELECT *
  FROM table(DBMS_XPLAN.display_cursor(null, null, 'cost allstats last'))
;

-- ------------------------------------------------------------------------------------------------------------------------------------
-- SQL_ID  90t4k1zb3fjx2, child number 0
-- -------------------------------------
-- SELECT *   FROM person  WHERE name LIKE 'qw%'    AND sex = 'F'
-- 
-- Plan hash value: 1105672549
-- 
-- ------------------------------------------------------------------------------------------------------------------------------
-- | Id  | Operation                           | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-- ------------------------------------------------------------------------------------------------------------------------------
-- |   0 | SELECT STATEMENT                    |                 |      1 |        |     9 (100)|     53 |00:00:00.01 |     115 |
-- |*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |      1 |      3 |     9   (0)|     53 |00:00:00.01 |     115 |
-- |*  2 |   INDEX RANGE SCAN                  | IDX_PERSON_NAME |      1 |      5 |     3   (0)|    108 |00:00:00.01 |       7 |
-- ------------------------------------------------------------------------------------------------------------------------------
-- 
-- Predicate Information (identified by operation id):
-- ---------------------------------------------------
-- 
--    1 - filter("SEX"='F')
--    2 - access("NAME" LIKE 'qw%')
--        filter("NAME" LIKE 'qw%')
-- 
-- -------------------------------------
-- SQL_ID  90t4k1zb3fjx2, child number 1
-- -------------------------------------
-- SELECT *   FROM person  WHERE name LIKE 'qw%'    AND sex = 'F'
-- 
-- Plan hash value: 1105672549
-- 
-- ------------------------------------------------------------------------------------------------------------------------------
-- | Id  | Operation                           | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-- ------------------------------------------------------------------------------------------------------------------------------
-- |   0 | SELECT STATEMENT                    |                 |      1 |        |   111 (100)|     53 |00:00:00.01 |     115 |
-- |*  1 |  TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |      1 |     53 |   111   (0)|     53 |00:00:00.01 |     115 |
-- |*  2 |   INDEX RANGE SCAN                  | IDX_PERSON_NAME |      1 |    108 |     3   (0)|    108 |00:00:00.01 |       7 |
-- ------------------------------------------------------------------------------------------------------------------------------
-- 
-- Predicate Information (identified by operation id):
-- ---------------------------------------------------
-- 
--    1 - filter("SEX"='F')
--    2 - access("NAME" LIKE 'qw%')
--        filter("NAME" LIKE 'qw%')
-- 
-- Note
-- -----
--    - statistics feedback used for this statement

As you can see from the previous two number of execution plan (child number 0 and 1) cbo both uses INDEX RANGE SCAN (on the highlighting line 26 and 48) using the index “idx_person_name” with single column “name” as well as firstly filter and secondly access “qw%” from that index and then retrieve the required data of “sex” is equal to “F” from real table (on the highlighting line “32-34” and “54-56”). Unfortunately the former execution plan has a big difference on E-Rows (“5”) and A-Rows (“108”), but the latter one they are both “108” (because optimizer does a statistics feedback which you can see on the highlighting line “60”). Hence the cost value “9” (the highlighting line “24”) is incorrect on the former execution plan and “111” (the highlighting line “46”) is relatively accurate on the latter one.

Although oracle optimizer is enough clever to only use the index with single column “name”. You might notice I have also created an index with single column “sex”. Why cbo doesn’t use this index? Perhaps its cost is more expensive. Let us use the similar hint “index_combine(alias index1 index2)” to force cbo to concurrently choose those two indexes. The following is the SQL query adding the hint “index_combine” and its execution plan.

SELECT /*+ index_combine(person idx_person_name idx_person_sex) */ *
  FROM person
 WHERE name LIKE 'qw%'
   AND sex = 'F'
;

SELECT *
  FROM table(DBMS_XPLAN.display_cursor(null, null, 'cost allstats last'))
;

-------------------------------------
SQL_ID  c3y0jbbq9928r, child number 0
-------------------------------------
SELECT /*+ index_combine(person idx_person_name idx_person_sex) */ *
FROM person  WHERE name LIKE 'qw%'    AND sex = 'F'

Plan hash value: 3157233861

---------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
---------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                 |      1 |        |   283 (100)|     53 |00:00:00.05 |     329 |       |       |          |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |      1 |      3 |   283   (2)|     53 |00:00:00.05 |     329 |       |       |          |
|   2 |   BITMAP CONVERSION TO ROWIDS       |                 |      1 |        |            |     53 |00:00:00.05 |     277 |       |       |          |
|   3 |    BITMAP AND                       |                 |      1 |        |            |      1 |00:00:00.05 |     277 |       |       |          |
|   4 |     BITMAP CONVERSION FROM ROWIDS   |                 |      1 |        |            |      1 |00:00:00.01 |       3 |       |       |          |
|   5 |      SORT ORDER BY                  |                 |      1 |        |            |    108 |00:00:00.01 |       3 | 13312 | 13312 |12288  (0)|
|*  6 |       INDEX RANGE SCAN              | IDX_PERSON_NAME |      1 |        |     3   (0)|    108 |00:00:00.01 |       3 |       |       |          |
|   7 |     BITMAP CONVERSION FROM ROWIDS   |                 |      1 |        |            |      3 |00:00:00.05 |     274 |       |       |          |
|*  8 |      INDEX RANGE SCAN               | IDX_PERSON_SEX  |      1 |        |   276   (1)|    150K|00:00:00.03 |     274 |       |       |          |
---------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   6 - access("NAME" LIKE 'qw%')
       filter(("NAME" LIKE 'qw%' AND "NAME" LIKE 'qw%'))
   8 - access("SEX"='F')

At this moment optimizer eventually uses these two indexes “idx_person_name” and “idx_person_sex” but total steps on “id” is much too complex (such as, sort order by, bitmap conversion from rowids, bitmap and, bitmap conversion to rowids) especially the cost value is “283” that is obviously bigger than the previous “111”. In other words cbo was actually doing too much work for too little benefit. By the way there appears the filter predicate (“NAME” LIKE ‘qw%’) twice on id “6” (you can see the highlighting line “37”). This is a real oddity.

case 2 : a normal index and a bitmap index

The same step as case 1, this time I’ll create a bitmap index.

-- CREATE INDEX idx_person_sex ON person (sex);
DROP INDEX idx_person_sex;

CREATE BITMAP INDEX idx_person_sex ON person (sex);

COLUMN index_name FORMAT a25
COLUMN index_type FORMAT a22

SELECT index_name
     , index_type
     , visibility
     , clustering_factor
     , num_rows
FROM   user_indexes
WHERE  table_name = 'PERSON'
;

INDEX_NAME                INDEX_TYPE             VISIBILIT CLUSTERING_FACTOR   NUM_ROWS
------------------------- ---------------------- --------- ----------------- ----------
PERSON_PK                 NORMAL                 VISIBLE                1148     300000
IDX_PERSON_NAME           NORMAL                 VISIBLE              299764     300000
IDX_PERSON_SEX            BITMAP                 VISIBLE                  26         26

As you can see from the prior index information the clustering_factor of bitmap index is smallest amongst those three number of indexes and the distinct value of column “sex” has only 2 (“F” and “M”) so it’s suitable to create a bitmap index.

Here I omit the execution plan of the SQL statement no adding hint “index_combine” because it is the same as the case 1 (still using the index “idx_person_name” and cost value is “111”). Nevertheless its real execution plan looks like very well when adding the hint “index_combine” into that SQL query.

-------------------------------------
SQL_ID  c3y0jbbq9928r, child number 1
-------------------------------------
SELECT /*+ index_combine(person idx_person_name idx_person_sex) */ *
FROM person  WHERE name LIKE 'qw%'    AND sex = 'F

Plan hash value: 3364957711

------------------------------------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |  OMem |  1Mem | Used-Mem |
------------------------------------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                 |      1 |        |    12 (100)|     53 |00:00:00.04 |      63 |     12 |       |       |          |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |      1 |      3 |    12   (9)|     53 |00:00:00.04 |      63 |     12 |       |       |          |
|   2 |   BITMAP CONVERSION TO ROWIDS       |                 |      1 |        |            |     53 |00:00:00.04 |      11 |     12 |       |       |          |
|   3 |    BITMAP AND                       |                 |      1 |        |            |      1 |00:00:00.04 |      11 |     12 |       |       |          |
|   4 |     BITMAP CONVERSION FROM ROWIDS   |                 |      1 |        |            |      1 |00:00:00.01 |       3 |      0 |       |       |          |
|   5 |      SORT ORDER BY                  |                 |      1 |        |            |    108 |00:00:00.01 |       3 |      0 | 13312 | 13312 |12288  (0)|
|*  6 |       INDEX RANGE SCAN              | IDX_PERSON_NAME |      1 |        |     3   (0)|    108 |00:00:00.01 |       3 |      0 |       |       |          |
|*  7 |     BITMAP INDEX SINGLE VALUE       | IDX_PERSON_SEX  |      1 |        |            |     13 |00:00:00.04 |       8 |     12 |       |       |          |
------------------------------------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   6 - access("NAME" LIKE 'qw%')
       filter(("NAME" LIKE 'qw%' AND "NAME" LIKE 'qw%'))
   7 - access("SEX"='F')

As you can see above execution plan cbo certainly chooses “BITMAP INDEX SINGLE VALUE” on id “7” (on the highlighting line “19”) and the value of A-Rows is “13” (but it is “150K” in case 1). As you can also see the cost value is “12” that is apparently smaller than “283” in case 1.

Since the execution plan using the previous hint is more better (cost value “12” is smaller than “111”) than the one with no hint why cbo still prefers to use the index with single column “name”? Of course, they are the different SQL statement you might notice that their SQL_ID is really different – the former is c3y0jbbq9928r (with hint) and the latter is 90t4k1zb3fjx2, so it doesn’t make sense to compare their execution plan. We just said that if adding this hint will make the original SQL statement choose more better execution plan in this case but the situation of case 1 is more worse.

case 3 : a composite index

I will create a composite index “idx_person_ns” with two columns “name” and “sex” on the test table “person” in this case. Please take a look at the following processes creating and checking index and the corresponding execution plan of SQL query.

CREATE INDEX idx_person_ns ON person (name, sex);

COLUMN index_name FORMAT a25
COLUMN index_type FORMAT a22

SELECT index_name
     , index_type
     , visibility
     , clustering_factor
     , num_rows
FROM   user_indexes
WHERE  table_name = 'PERSON'
;

INDEX_NAME                INDEX_TYPE             VISIBILIT CLUSTERING_FACTOR   NUM_ROWS
------------------------- ---------------------- --------- ----------------- ----------
PERSON_PK                 NORMAL                 VISIBLE                1148     300000
IDX_PERSON_NAME           NORMAL                 VISIBLE              299764     300000
IDX_PERSON_SEX            BITMAP                 VISIBLE                  26         26
IDX_PERSON_NS             NORMAL                 VISIBLE              299764     300000

SELECT *
  FROM person
 WHERE name LIKE 'qw%'
   AND sex = 'F'
;

SELECT *
  FROM table(DBMS_XPLAN.display_cursor(null, null, 'cost allstats last'))
;

PLAN_TABLE_OUTPUT
-------------------------------------
SQL_ID  90t4k1zb3fjx2, child number 2
-------------------------------------
SELECT *   FROM person  WHERE name LIKE 'qw%'    AND sex = 'F'

Plan hash value: 2016045479

-------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name          | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |
-------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |               |      1 |        |     6 (100)|     53 |00:00:00.01 |      61 |      3 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| PERSON        |      1 |      3 |     6   (0)|     53 |00:00:00.01 |      61 |      3 |
|*  2 |   INDEX RANGE SCAN                  | IDX_PERSON_NS |      1 |      3 |     3   (0)|     53 |00:00:00.01 |       8 |      3 |
-------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("NAME" LIKE 'qw%' AND "SEX"='F')
       filter(("SEX"='F' AND "NAME" LIKE 'qw%'))

Very nice! As you can see from the previous execution plan optimizer uses the composite index “idx_person_ns” created previously. But the value of E-Rows (“3”) and A-Rows (“53”) on id “2” (the highlighting line “45”) has a big difference. As you can also see above the predicate information about the filter section (on the highlighting line “52”) firstly filter “sex” and next filter “name” but it’s opposite in the access section of predicate information.

Since the composite index contains two number of columns we should create an extended statistics of column group (‘name’, ‘sex’) to help optimizer better estimate the value of E-Rows on id “2” on the execution plan. The following is my operation steps.

COLUMN column_name FORMAT a20

SELECT column_name
     , histogram
FROM   user_tab_col_statistics
WHERE  table_name = 'PERSON'
AND    column_name IN ('NAME', 'SEX')
;

COLUMN_NAME          HISTOGRAM
-------------------- ---------------
NAME                 HYBRID
SEX                  FREQUENCY

SELECT DBMS_STATS.create_extended_stats(ownname => NULL, tabname => 'PERSON', extension => '(name, sex)')
FROM dual
;

DBMS_STATS.CREATE_EXTENDED_STATS(OWNNAME=>NULL,TABNAME=>'PERSON',EXTENSION=>'(NAME,SEX)')
-----------------------------------------------------------------------------------------
SYS_STUO7RHV9FY611QRHL3IZ_EI$T

COLUMN column_name FORMAT a35

SELECT column_name
     , histogram
FROM   user_tab_col_statistics
WHERE  table_name = 'PERSON'
;

COLUMN_NAME                         HISTOGRAM
----------------------------------- ---------------
ID                                  NONE
NAME                                HYBRID
SEX                                 FREQUENCY
FLAG                                NONE
PWD                                 NONE
SYS_STUO7RHV9FY611QRHL3IZ_EI$T      NONE

BEGIN
  DBMS_STATS.gather_table_stats (
       ownname => NULL,
       tabname => 'PERSON',
       method_opt => 'FOR ALL COLUMNS SIZE AUTO FOR COLUMNS SIZE 254 SYS_STUO7RHV9FY611QRHL3IZ_EI$T'
  );
END;
/

COLUMN column_name FORMAT a35

SELECT column_name
     , histogram
FROM   user_tab_col_statistics
WHERE  table_name = 'PERSON'
;

COLUMN_NAME                         HISTOGRAM
----------------------------------- ---------------
ID                                  NONE
NAME                                HYBRID
SEX                                 FREQUENCY
FLAG                                NONE
PWD                                 NONE
SYS_STUO7RHV9FY611QRHL3IZ_EI$T      HYBRID

SELECT *
  FROM person
 WHERE name LIKE 'qw%'
   AND sex = 'F'
;

SELECT *
  FROM table(DBMS_XPLAN.display_cursor(null, null, 'cost allstats last'))
;

-------------------------------------
SQL_ID  90t4k1zb3fjx2, child number 3
-------------------------------------
SELECT *   FROM person  WHERE name LIKE 'qw%'    AND sex = 'F'

Plan hash value: 2016045479

----------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name          | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |               |      1 |        |     6 (100)|     53 |00:00:00.01 |      61 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| PERSON        |      1 |      3 |     6   (0)|     53 |00:00:00.01 |      61 |
|*  2 |   INDEX RANGE SCAN                  | IDX_PERSON_NS |      1 |      3 |     3   (0)|     53 |00:00:00.01 |       8 |
----------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("NAME" LIKE 'qw%' AND "SEX"='F')
       filter(("SEX"='F' AND "NAME" LIKE 'qw%'))

So weird. As you can see above steps I have created extended statistics unfortunately cbo still inexactly estimate the value of E-Rows of real execution plan.

Cheers, hope this 3 number of funny cases can benefit you.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.