Optimizing SQL with LIKE expression

This post is focused on talking about how to optimize several SQL statements with LIKE expression in oracle database. Now let me quote some important knowledge point about the pattern of LIKE condition on 20c.

The LIKE conditions specify a test involving pattern matching. Whereas the equality operator (=) exactly matches one character value to another, the LIKE conditions match a portion of one character value to another by searching the first value for the pattern specified by the second.

The pattern can contain special pattern-matching characters:

An underscore (_) in the pattern matches exactly one character (as opposed to one byte in a multibyte character set) in the value.

A percent sign (%) in the pattern can match zero or more characters (as opposed to bytes in a multibyte character set) in the value. The pattern ‘%’ cannot match a null.

In general a LIKE expression has four number of situation in a SQL query statement. Take, for example, some random combinations of between ‘%’ and character strings ‘qw’.

  • qw%
  • %qw
  • %qw%
  • q%w

For better simulating this case I quickly build a demo on oracle 19.3, it is as follows.

CREATE TABLE person
( id   NUMBER GENERATED ALWAYS AS IDENTITY
, name VARCHAR2(6) NOT NULL
, sex  VARCHAR2(1) NOT NULL
, flag VARCHAR2(1)
, pwd  VARCHAR2(6)
, CONSTRAINT person_pk PRIMARY KEY(id)
);

EXEC DBMS_RANDOM.seed(0); -- In order to ensure that the random data generated each time is completely consistent. Thanks Jonathan Lewis prompts me on Twitter.

INSERT /*+APPEND*/ INTO person (name, sex, flag, pwd)
SELECT DBMS_RANDOM.string('A', 6) AS name
     , CASE MOD(ROWNUM, 2) WHEN 0 THEN 'F'
                           WHEN 1 THEN 'M'
       END sex
     , CASE MOD(ROWNUM, 3) WHEN 0 THEN 'T'
                           WHEN 1 THEN NULL
                           WHEN 2 THEN 'F'
       END flag
     , DBMS_RANDOM.string ('p', 6) AS pwd
FROM dual
CONNECT BY level <= 3e5;

COMMIT;

EXEC DBMS_STATS.gather_table_stats(ownname => NULL, tabname => 'PERSON');

Now I will respectively write 4 number of SQL query statements with 4 different LIKE expression mentioned previously and run them and next check their real execution plan that pulled from memory as well as try my best to adjust/optimize those execution plan in particular when optimizer chooses poor access path.

Case 1 – “qw%”

I go to select all columns of table “person” but column “name” only starts with “qw” and next what follows up is any number of characters. Hence this SQL is like this.

SELECT *
  FROM person
 WHERE name LIKE 'qw%'
;

As you can see above SQL I need filter out “qw%” from column “name” of table “person”. If creating an index based on “name” perhaps optimizer will choose the access path that is related to INDEX rather than FULL TABLE SCAN. Let’s try it as below.

CREATE INDEX idx_person_name ON person (name);

In the meantime I run the prior SQL statement (here omitting the query result) as well as check its row-source execution plan.

SET LINESIZE 300
SET PAGESIZE 150
SET SERVEROUTPUT OFF
  
ALTER SESSION SET statistics_level = all;

SELECT *
  FROM person
 WHERE name LIKE 'qw%'
;

SELECT *
  FROM table(DBMS_XPLAN.display_cursor(null, null, 'cost allstats last'))
;

-- Plan hash value: 1105672549
-- 
-- ---------------------------------------------------------------------------------------------------------------------------------------
-- | Id  | Operation                           | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |
-- ---------------------------------------------------------------------------------------------------------------------------------------
-- |   0 | SELECT STATEMENT                    |                 |      1 |        |    25 (100)|    108 |00:00:00.03 |     119 |      2 |
-- |   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |      1 |     21 |    25   (0)|    108 |00:00:00.03 |     119 |      2 |
-- |*  2 |   INDEX RANGE SCAN                  | IDX_PERSON_NAME |      1 |     21 |     3   (0)|    108 |00:00:00.03 |      11 |      2 |
-- ---------------------------------------------------------------------------------------------------------------------------------------
-- 
-- Predicate Information (identified by operation id):
-- ---------------------------------------------------
-- 
--    2 - access("NAME" LIKE 'qw%')
--        filter("NAME" LIKE 'qw%')

Very perfect. As you can see from above highlighting line “23” oracle optimizer exactly chooses INDEX RANGE SCAN – that’s to say, adopting the previous index “IDX_PERSON_NAME” I’ve pre-defined.

Case 2 – “%qw”

This case is similar to the former one just it is “%qw” on LIKE expression.

SELECT *
  FROM person
 WHERE name LIKE '%qw'
;

But the execution plan that acquired from memory won’t use INDEX scan even though I have pre-created the INDEX on column “name” because optimizer doesn’t know the first character (non-deterministic) in that LIKE expression where is located in the prior SQL statement. The following is this real execution plan.

Plan hash value: 1493655343

---------------------------------------------------------------------------------------------------
| Id  | Operation         | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |        |      1 |        |   327 (100)|    113 |00:00:00.05 |    1160 |
|*  1 |  TABLE ACCESS FULL| PERSON |      1 |  15000 |   327   (3)|    113 |00:00:00.05 |    1160 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("NAME" LIKE '%qw')

Let us do a comparison between “qw%” and “%qw” – firstly observing themselves’ difference and secondly their execution plan. Since the SQL with “qw%” is able to use INDEX SCAN, but why not “%qw”? At this very moment maybe you have a good idea about how to better optimize the previous SQL. Of course, there is a REVERSE index starting from oracle 8.0.4, if we create a reverse REVERSE() index on column “name” for this case with “%qw” optimizer will automactically convert the LIKE expression “%qw” to be “wq%” in the background. To make a long story short, please take a (closer) look at the following operation steps and the corresponding execution plan.

CREATE INDEX idx_person_name_reverse ON person (REVERSE(name));

SELECT *
  FROM person
 WHERE REVERSE(name) LIKE REVERSE('%qw')
;

SELECT *
  FROM table(DBMS_XPLAN.display_cursor(null, null, 'cost allstats last'))
;

-- Plan hash value: 3646499062
-- 
-- -----------------------------------------------------------------------------------------------------------------------------------------------
-- | Id  | Operation                           | Name                    | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |
-- -----------------------------------------------------------------------------------------------------------------------------------------------
-- |   0 | SELECT STATEMENT                    |                         |      1 |        |   321 (100)|    113 |00:00:00.11 |     124 |      2 |
-- |   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| PERSON                  |      1 |  15000 |   321   (1)|    113 |00:00:00.11 |     124 |      2 |
-- |*  2 |   INDEX RANGE SCAN                  | IDX_PERSON_NAME_REVERSE |      1 |   2700 |     9   (0)|    113 |00:00:00.11 |      11 |      2 |
-- -----------------------------------------------------------------------------------------------------------------------------------------------
-- 
-- Predicate Information (identified by operation id):
-- ---------------------------------------------------
-- 
--    2 - access("PERSON"."SYS_NC00006$" LIKE 'wq%')
--        filter("PERSON"."SYS_NC00006$" LIKE 'wq%')

As you can see from my previous SQL statement (on the highlighting line “5”) I separately added the very important keyword REVERSE() function whose parameter is “name” and “%qw”. You can also see there appears INDEX RANGE SCAN on the highlighting line “19” but E-Rows and A-Rows‘ value has a big difference (the former is 2700 and the latter is 113). Don’t worry about it, I rerun that SQL query and check its execution plan again, which is as below.

Plan hash value: 3646499062

--------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name                    | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
--------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                         |      1 |        |   122 (100)|    113 |00:00:00.01 |     124 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| PERSON                  |      1 |    113 |   122   (0)|    113 |00:00:00.01 |     124 |
|*  2 |   INDEX RANGE SCAN                  | IDX_PERSON_NAME_REVERSE |      1 |    113 |     9   (0)|    113 |00:00:00.01 |      11 |
--------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("PERSON"."SYS_NC00006$" LIKE 'wq%')
       filter("PERSON"."SYS_NC00006$" LIKE 'wq%')

Note
-----
   - statistics feedback used for this statement

This time it’s ok, E-Rows and A-Rows are both “113” (very accurate, really there is total 113 rows to return). By the way optimizer has used a statistics feedback for this statement as you can see from the highlighting line “19” on the Note section.

Notice: If you still use the original SQL query (the 1st one in this case) then there is also still FULL TABLE SCAN in the execution plan because optimizer is not able to recognise the function-base index (fbi) “idx_person_name_reverse” with keyword REVERSE except you modify that LIKE expression to invoke the same REVERSE function on the WHERE clause.

Case 3 – “%qw%”

It is the similar method based on the preceding case. The following is the original SQL query.

SELECT *
  FROM person
 WHERE name LIKE '%qw%'
;

Certainly its execution plan that pulled from memory is as follows (absolute FTS).

Plan hash value: 1493655343

---------------------------------------------------------------------------------------------------
| Id  | Operation         | Name   | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |        |      1 |        |   327 (100)|    547 |00:00:00.06 |    1189 |
|*  1 |  TABLE ACCESS FULL| PERSON |      1 |  15000 |   327   (3)|    547 |00:00:00.06 |    1189 |
---------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("NAME" LIKE '%qw%')

In fact for this case with LIKE expression “%qw%” oracle specially uses an internal function INSTR to optimize it. Since we are not sure that the specific location in which “qw” is on column “name” we can firstly regard “name” as a source string and similarly regard “qw” as a target string, secondly calling INSTR function (whose parameters are source string and target) and will return a number larger than zero (in other words it is the exact location). Now we’re able to rewrite the prior SQL that is as below.

SELECT *
  FROM person
 WHERE INSTR(name, 'qw') > 0
;

Here we can assume that we have pre-created this funny index “idx_person_name_instr”. Otherwise the following execution plan won’t choose INDEX RANGE SCAN because of non-existing the related Function-Based Index.

CREATE INDEX idx_person_name_instr ON person (INSTR(name, 'qw'));

SELECT *
  FROM person
 WHERE INSTR(name, 'qw') > 0
; 

-- here omitting the query result of above SQL.

SELECT *
  FROM table(DBMS_XPLAN.display_cursor(null, null, 'cost allstats last'))
;

-- Plan hash value: 2185790021
-- 
-- ---------------------------------------------------------------------------------------------------------------------------------------------
-- | Id  | Operation                           | Name                  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |
-- ---------------------------------------------------------------------------------------------------------------------------------------------
-- |   0 | SELECT STATEMENT                    |                       |      1 |        |    23 (100)|    547 |00:00:00.04 |     565 |      5 |
-- |   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| PERSON                |      1 |  15000 |    23   (0)|    547 |00:00:00.04 |     565 |      5 |
-- |*  2 |   INDEX RANGE SCAN                  | IDX_PERSON_NAME_INSTR |      1 |   2700 |     7   (0)|    547 |00:00:00.04 |      41 |      5 |
-- ---------------------------------------------------------------------------------------------------------------------------------------------
-- 
-- Predicate Information (identified by operation id):
-- ---------------------------------------------------
-- 
--    2 - access("PERSON"."SYS_NC00007$">0)

As you can see from the preceding execution plan E-Rows is “2700” (inaccurate) and A-Rows is “547” (exactly right) on id “2” (the highlighting line “21”). Let’s run that SQL statement again, which will be okay. In other words optimizer will automatically enable statistics feedback. The following is the new execution plan.

Plan hash value: 2185790021

------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name                  | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                       |      1 |        |    11 (100)|    547 |00:00:00.01 |     565 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| PERSON                |      1 |    547 |    11   (0)|    547 |00:00:00.01 |     565 |
|*  2 |   INDEX RANGE SCAN                  | IDX_PERSON_NAME_INSTR |      1 |    547 |     7   (0)|    547 |00:00:00.01 |      41 |
------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("PERSON"."SYS_NC00007$">0)

Note
-----
   - statistics feedback used for this statement

Very well. This time the value of E-Rows and A-Rows are both “547” (pretty accurate) just because of the feature “statistics feedback” as you can see from the Note section (the last line).

Case 4 – “q%w”

This case’s SQL statement is very simple like this.

SELECT *
  FROM person
 WHERE name LIKE 'q%w'
;

According to the characteristic of previous SQL and Case -1 / 2 we might be able to rewrite the following form.

SELECT *
  FROM person
 WHERE name LIKE 'q%'
   AND name LIKE '%w'
;

Of course I deliberately omit above those two SQL statements’ execution plan – optimizer will use FTS.

Since we finished rewriting SQL, then you might notice that we can give “%w” to add a REVERSE function – making optimizer internally converts “%w” to be “w%”. So final SQL query should be as follows.

SELECT *
  FROM person
 WHERE name LIKE 'q%'
   AND REVERSE(name) LIKE REVERSE('%w')
;

Then its real execution plan is like this as below.

Plan hash value: 1105672549

---------------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers | Reads  |
---------------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                 |      1 |        |   278 (100)|    114 |00:00:00.02 |     139 |     19 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |      1 |    262 |   278   (0)|    114 |00:00:00.02 |     139 |     19 |
|*  2 |   INDEX RANGE SCAN                  | IDX_PERSON_NAME |      1 |    262 |    16   (0)|    114 |00:00:00.02 |      25 |     19 |
---------------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("NAME" LIKE 'q%')
       filter(("NAME" LIKE 'q%' AND REVERSE("NAME") LIKE 'w%'))

As you can see above execution plan optimizer nicely uses the index “IDX_PERSON_NAME” with column “name” and you can also see from the highlighting line “15” optimizer has done a REVERSE operation to “%w” (now which shows “w%”), but E-Rows and A-Rows’ value on id “2” is a bit different. Actually we rerun this SQL, optimizer will automatically correct this problem.

Plan hash value: 1105672549

------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                           | Name            | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                    |                 |      1 |        |   130 (100)|    114 |00:00:00.01 |     139 |
|   1 |  TABLE ACCESS BY INDEX ROWID BATCHED| PERSON          |      1 |    114 |   130   (0)|    114 |00:00:00.01 |     139 |
|*  2 |   INDEX RANGE SCAN                  | IDX_PERSON_NAME |      1 |    114 |    16   (0)|    114 |00:00:00.01 |      25 |
------------------------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("NAME" LIKE 'q%')
       filter(("NAME" LIKE 'q%' AND REVERSE("NAME") LIKE 'w%'))

Note
-----
   - statistics feedback used for this statement

At this moment E-Rows and A-Rows are both “114” and optimizer also does a statistics feedback.

P.S: If adding a hint “/*+ index(person idx_person_name) */” on the 2nd SQL of this case, whose execution plan is also very well. I would like to leave it give interesting readers to test.

Summary

Overall, optimizing SQL with LIKE expression usually has 4 number of cases, we are able to adjust their execution plan by creating related FBI (reverse function) and rewriting LIKE expression.

As for all of those cases’ SQL demo you can read my this SQL script on Github.

Cheers, really hope this help you.

Update Mar, 30, 2020

A few days ago Jonathan Lewis prompts me a good idea on Twitter – before invoking dbms_random package to insert some random data into a test table it’s best to firstly call a seed function to ensure that the random data generated each time is completely consistent. Now I have added seed function into my demo like this: “EXEC DBMS_RANDOM.seed(0);” and subsequently also replaced with new execution plan shows up in my each case.

Notice: You might observe that I use a function-based index with REVERSE() to optimize that SQL in case 2. Unfortunately you always find the REVERSE key index rather than REVERSE() on oracle manuals. So reverse() should haven’t been used on the production system.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.