Can ICP Be Helpful in the SQL Optimization Phase? | SQLFlash

Can ICP Be Helpful in the SQL Optimization Phase?

HIfox2012
8 min read
Can ICP Be Helpful in the SQL Optimization Phase?

Can ICP Be Helpful in the SQL Optimization Phase?

What is ICP (Index Condition Pushdown)?

ICP, the full name of which is Index Condition Pushdown, is also commonly known as index condition pushdown.

When looking up data using a secondary index, for the conditions in the WHERE clause that are part of the index but cannot use the index, MySQL will push this part of the condition down to the storage engine layer, filter it, and then do the table lookup. In this way, the number of table lookups is reduced.

For example, for such an index idx_test(birth_date,first_name,hire_date) and the query statement select * from employees where birth_date >= '1957-05-23' and birth_date <='1960-06-01' and hire_date>'1998-03-22';, the execution process is as follows:

  1. Look up data from the idx_test index based on the condition birth_date >= '1957-05-23' and birth_date <='1960-06-01'. Assume 100,000 rows of data are returned;

  2. The 100,000 rows of data found contain the hire_date field. MySQL will push the condition hire_date >'1998-03-22' down to the storage engine to further filter the data. Assume there are 1,000 rows left;

  3. Since all field values need to be queried, and the previous 1,000 rows of data only contain the three fields birth_date,first_name,hire_date, it is necessary to do a table lookup to find the values of all fields. The process of table lookup is to take out the primary key values of these 1,000 rows of data, and look them up one by one on the primary key index (you can also enable MRR and take a batch of primary key values for table lookup). The number of table lookups is 1,000. Without ICP, the number of table lookups would be 100,000.

Obviously, in the execution stage, ICP can reduce the number of table lookups. In the cost-based optimizer, it can reduce the execution cost. However, when the optimizer selects the optimal execution plan in the optimization stage, can it really consider that ICP can reduce the cost? Let’s answer this question through an experiment.

Experiment

First, prepare some data. Download the Employees Sample Database and import it into MySQL:

Still taking the above example, create a composite index:

1
alter table employees add index idx_test(birth_date,first_name,hire_date);

Execute the following SQL:

1
2
3
4
5
SELECT *
FROM employees
WHERE birth_date >= '1957-05-23'
    AND birth_date <= '1960-06-01'
    AND hire_date > '1998-03-22';

EXPLAIN:

1
2
3
4
5
6
mysql [localhost:5735] {msandbox} (employees) > explain select * from employees where birth_date >= '1957-05-23' and birth_date <='1960-06-01' and hire_date>'1998-03-22';
+----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+
| id | select_type | table     | partitions | type | possible_keys | key  | key_len | ref  | rows   | filtered | Extra       |
+----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+
|  1 | SIMPLE      | employees | NULL       | ALL  | idx_test      | NULL | NULL    | NULL | 298980 |    15.74 | Using where |
+----+-------------+-----------+------------+------+---------------+------+---------+------+--------+----------+-------------+

It can be seen that the idx_test index is not used. However, if a hint is added to force the use of the idx_test index, we know that ICP can be used. The execution plan is as follows:

1
2
3
4
5
6
7
mysql [localhost:5735] {msandbox} (employees) > explain select * from employees force index(idx_test) where birth_date >= '1957-05-23' and birth_date <='1960-06-01' and hire_date>'1998-03-22';
+----+-------------+-----------+------------+-------+---------------+----------+---------+------+--------+----------+-----------------------+
| id | select_type | table     | partitions | type  | possible_keys | key      | key_len | ref  | rows   | filtered | Extra                 |
+----+-------------+-----------+------------+-------+---------------+----------+---------+------+--------+----------+-----------------------+
|  1 | SIMPLE      | employees | NULL       | range | idx_test      | idx_test | 3       | NULL | 141192 |    33.33 | Using index condition |
+----+-------------+-----------+------------+-------+---------------+----------+---------+------+--------+----------+-----------------------+
1 row in set, 1 warning (0.00 sec)

Let’s open the slow log to check the real execution efficiency: For full table scan, 300,024 rows need to be scanned, and the execution time is 0.15 seconds.

For using the idx_test index, 141,192 rows need to be scanned (Rows_examined: 1065 is a bug. This is obviously not the number of scanned rows. The number of scanned rows can be seen from the execution plan. In this example, the rows in the execution plan are the real number of scanned rows, not an estimate. This knowledge point does not affect the understanding of this article). Because there are no other conditions, we can also know from the number of returned result rows that the number of table lookups is 1065, and the execution time is only 0.037 seconds.

1
2
3
4
5
6
7
8
# Time: 2022-11-24T18:02:01.001734+08:00
# Query_time: 0.146939  Lock_time: 0.000850 Rows_sent: 1065  Rows_examined: 300024
SET timestamp=1669284095;
select * from employees where birth_date >= '1957-05-23' and birth_date <='1960-06-01' and hire_date>'1998-03-22';
# Time: 2022-11-24T18:01:09.001223+08:00
# Query_time: 0.037211  Lock_time: 0.001649 Rows_sent: 1065  Rows_examined: 1065
SET timestamp=1669284032;
select * from employees force index(idx_test) where birth_date >= '1957-05-23' and birth_date <='1960-06-01' and hire_date>'1998-03-22';

Obviously, using the idx_test index is more efficient than full table scan. Then why doesn’t the optimizer choose to use the idx_test index? A fail-safe statement is that the optimizer has its algorithm and does not make selections based on the speed of time as perceived by humans. This time, we’ll get to the bottom of it. What is the algorithm of the optimizer?

The answer is cost. When choosing the optimal execution plan, the optimizer will calculate the cost of all available execution plans and then select the one with the lowest cost. There is a clear calculation method for the cost, and the cost of the execution plan can also be displayed through explain format=json. Therefore, we use this to prove whether ICP can affect the cost of the execution plan. Regarding explain format=json, this article will not elaborate too much.

Cost Calculation

1. I/O Cost

Both the data and indexes of the table are stored on the disk. When we want to query the records in the table, we need to load the data or index into the memory first and then operate. The time consumed in this loading process from the disk to the memory is called the I/O cost.

2. CPU Cost

The time consumed in reading and detecting whether the records meet the corresponding search conditions, sorting the result set, etc. is called the CPU cost.

3. Cost Constants

For the InnoDB storage engine, the page is the basic unit of interaction between the disk and the memory. In MySQL 5.7, the default cost of reading a page is 1.0, and the default cost of reading and detecting whether a record meets the search condition is 0.2. These numbers such as 1.0 and 0.2 are called cost constants (they may be different in different versions and can be viewed through mysql.server_cost and mysql.engine_cost).

Without interference, the optimizer chooses full table scan, and the total cost is “query_cost”: “60725.00”, and the calculation formula is:

  • I/O Cost: 929 * 1 = 929 (929 is the number of pages of the primary key index, obtained through Data_length/pagesize in the table’s statistical information)

  • CPU Cost: 298980 * 0.2 = 59796 (298980 is the number of scanned rows. This is an estimate when performing full table scan, that is, Rows in the table’s statistical information)

Total Cost = I/O Cost + CPU Cost = 929 + 59796 = 60725

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
mysql [localhost:5735] {msandbox} (employees) > explain format=json select * from employees  where birth_date >= '1957-05-23' and birth_date <='1960-06-01' and hire_date>'1998-03-22'\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "60725.00"
    },
    "table": {
      "table_name": "employees",
      "access_type": "ALL",
      "possible_keys": [
        "idx_test"
      ],
      "rows_examined_per_scan": 298980,
      "rows_produced_per_join": 47059,
      "filtered": "15.74",
      "cost_info": {
        "read_cost": "51313.14",
        "eval_cost": "9411.86",
        "prefix_cost": "60725.00",
        "data_read_per_join": "6M"
      },
      "used_columns": [
        "emp_no",
        "birth_date",
        "first_name",
        "last_name",
        "gender",
        "hire_date"
      ],
      "attached_condition": "((`employees`.`employees`.`birth_date` >= '1957-05-23') and (`employees`.`employees`.`birth_date` <= '1960-06-01') and (`employees`.`employees`.`hire_date` > '1998-03-22'))"
    }
  }
}
1 row in set, 1 warning (0.00 sec)

Summary

From the cost results of the previous step, the cost of full table scan is 60725, while the cost of using the idx_test index is 197669.81. Therefore, the optimizer chooses full table scan. In fact, ICP can reduce the number of table lookups. The actual number of table lookups when using the idx_test index is 1065, and the cost should be:

  • I/O Cost: 1065 * 1 = 1065
  • CPU Cost: 1065 * 0.2 = 213

However, when the optimizer calculates the cost of table lookups, it obviously does not consider ICP and directly takes the number of rows scanned in the index, 141192, as the number of table lookups. Therefore, the obtained cost of table lookups is huge, and the total cost is much greater than the cost of full table scan.

Therefore, the conclusion we can draw is: ICP can improve the execution efficiency at the execution stage, but it cannot improve the execution plan at the optimization stage.

The SQL Optimization Course

The SQL optimization knowledge

👋 See you in the next lesson.

What is SQLFlash?

SQLFlash is your AI-powered SQL Optimization Partner.

Based on AI models, we accurately identify SQL performance bottlenecks and optimize query performance, freeing you from the cumbersome SQL tuning process so you can fully focus on developing and implementing business logic.

How to use SQLFlash in a database?

Ready to elevate your SQL performance?

Join us and experience the power of SQLFlash today!.