关于mysql索引的一些探索

复习索引的时候看了小林coding的博客,造了一些数据数据,方法如下:

CREATE TABLE person(
    id int(10) NOT NULL AUTO_INCREMENT PRIMARY KEY comment '主键',
    person_id tinyint not null comment '用户id',
    person_name VARCHAR(200) comment '用户名称',
    gmt_create datetime comment '创建时间',
    gmt_modified datetime comment '修改时间'
) comment '人员信息表';

insert into person values(1, 1,'user_1', NOW(), now());

select (@i:=@i+1) as rownum, person_name from person, (select @i:=100) as init; 
set @i=1;

insert into person(id, person_id, person_name, gmt_create, gmt_modified)
select @i:=@i+1,
left(rand()*10,10) as person_id,
concat('user_',@i%2048),
date_add(gmt_create,interval + @i*cast(rand()*100 as signed) SECOND),
date_add(date_add(gmt_modified,interval +@i*cast(rand()*100 as signed) SECOND), interval + cast(rand()*1000000 as signed) SECOND)
from person;

# 可以成倍的增加数据,用这种方法制造了200w行的记录
mysql> select count(*) from person ;
+----------+
| count(*) |
+----------+
|  2097152 |
+----------+

问题一:explain是如何判断rows:

mysql> explain select count(*) from person ;
+----+-------------+--------+------------+-------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table  | partitions | type  | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+--------+------------+-------+---------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | person | NULL       | index | NULL          | name | 803     | NULL | 3581169 |   100.00 | Using index |
+----+-------------+--------+------------+-------+---------------+------+---------+------+---------+----------+-------------+

既然只有209w行记录,那为什么会预估需要扫358w行呢

这个问题很快得到了解答:使用 ANALYZE TABLE 让优化器重新分析一下表就可以了。重新分析过后很快得到了正确答案。

问题二:关于索引失效的问题

mysql> show keys from person\G;
*************************** 1. row ***************************
        Table: person
   Non_unique: 0
     Key_name: PRIMARY
 Seq_in_index: 1
  Column_name: id
    Collation: A
  Cardinality: 3258015
     Sub_part: NULL
       Packed: NULL
         Null: 
   Index_type: BTREE
      Comment: 
Index_comment: 
      Visible: YES
   Expression: NULL
*************************** 2. row ***************************
        Table: person
   Non_unique: 1
     Key_name: name
 Seq_in_index: 1
  Column_name: person_name
    Collation: A
  Cardinality: 2049
     Sub_part: NULL
       Packed: NULL
         Null: YES
   Index_type: BTREE
      Comment: 
Index_comment: 
      Visible: YES
   Expression: NULL

在name列上建了一个索引。

  • 建索引前:
mysql> select * from person where person_name like 'user_1915%';
1024 rows in set (1.66 sec)

mysql> select * from person where person_name like '%user_1915%';
1024 rows in set (1.76 sec)

mysql> select * from person where person_name = 'user_1915';
1024 rows in set (1.55 sec)


#可以看得出,四种查询耗时几乎一样,也符合理解:此时是全表扫描,大部分耗时用在了扫描身上,所以也能看出来尽量减少扫描是很重要的
mysql> explain select * from person where person_name = 'user_1915';
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table  | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | person | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 3581169 |    10.00 | Using where |
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-------------+
  • 建索引后:
#在person_name字段建立索引
mysql> create index name on person (person_name);

#重复上面的实验
mysql> select * from person where person_name like 'user_1915%';
1024 rows in set (1.61 sec)
mysql> explain select * from person where person_name like 'user_1915%';
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table  | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | person | NULL       | ALL  | name          | NULL | NULL    | NULL | 3581169 |    50.00 | Using where |
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-------------+

mysql> select * from person where person_name like '%user_1915%';
1024 rows in set (1.89 sec)
mysql> explain select * from person where person_name like '%user_1915%';
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-------------+
| id | select_type | table  | partitions | type | possible_keys | key  | key_len | ref  | rows    | filtered | Extra       |
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-------------+
|  1 | SIMPLE      | person | NULL       | ALL  | NULL          | NULL | NULL    | NULL | 3581169 |    11.11 | Using where |
+----+-------------+--------+------------+------+---------------+------+---------+------+---------+----------+-------------+

mysql> select * from person where person_name = 'user_1915';
1024 rows in set (0.01 sec)
mysql> explain select * from person where person_name = 'user_1915';
+----+-------------+--------+------------+------+---------------+------+---------+-------+------+----------+-------+
| id | select_type | table  | partitions | type | possible_keys | key  | key_len | ref   | rows | filtered | Extra |
+----+-------------+--------+------------+------+---------------+------+---------+-------+------+----------+-------+
|  1 | SIMPLE      | person | NULL       | ref  | name          | name | 803     | const | 1024 |   100.00 | NULL  |
+----+-------------+--------+------------+------+---------------+------+---------+-------+------+----------+-------+

#可以看得出,只有第三句使用了索引,ref指使用非唯一索引,这才是正常的。可是问题在于为什么第一句不会走索引呢。理论来讲第一个不应该使用name索引吗?

番外:
mysql> select * from person where person_name like 'user_1915%';
  ~ docker exec -it mysql bash
Error response from daemon: container 467bcb6435a831ac52c9b0c72391e5a45971a81c29151a8e4628c15f9b240180 is not running
不知道为什么,测试的时候查着查着mysql就g了
更新:破案了,是内存莫名其妙占满了,reboot就行了,但具体问题还没找出来

在提出上面的问题后问了许多学长,更奇怪的是使用了force index后速度也很慢,所以这并不完全是优化器的锅,更深层的原因需要figuare out为什么遍历索引的时候会如此慢。

这个表主要有这样的特征:所有的name数据都是以user_开头,所以一个猜测是可能因为有共同的前缀导致以上问题,但其实并不能完全说通,干扰优化器可以理解,但force之后还这么慢就不能理解了(force之后仅仅快了300-400ms左右)

其实衍生出来的还有一些问题:

  • select * from person where person_name like 'user_1915';为什么不会直接被优化成=呢。

    随后换了一张表进行测试,600w数据,完全随机数据。

    得到的结果是这样的:

    mysql> explain select * from test where name like 'Zhu Lan';
    +----+-------------+-------+------------+-------+---------------+------+---------+------+------+----------+-----------------------+
    | id | select_type | table | partitions | type  | possible_keys | key  | key_len | ref  | rows | filtered | Extra                 |
    +----+-------------+-------+------------+-------+---------------+------+---------+------+------+----------+-----------------------+
    |  1 | SIMPLE      | test  | NULL       | range | name          | name | 258     | NULL | 1046 |   100.00 | Using index condition |
    +----+-------------+-------+------------+-------+---------------+------+---------+------+------+----------+-----------------------+
    1 row in set, 1 warning (0.01 sec)
    
    mysql> explain select * from test where name = 'Zhu Lan';
    +----+-------------+-------+------------+------+---------------+------+---------+-------+------+----------+-------+
    | id | select_type | table | partitions | type | possible_keys | key  | key_len | ref   | rows | filtered | Extra |
    +----+-------------+-------+------------+------+---------------+------+---------+-------+------+----------+-------+
    |  1 | SIMPLE      | test  | NULL       | ref  | name          | name | 258     | const | 1046 |   100.00 | NULL  |
    +----+-------------+-------+------------+------+---------------+------+---------+-------+------+----------+-------+
    1 row in set, 1 warning (0.00 sec)
    

    发现正常情况下也会不一样,like还是走的range索引,而且不会有索引下推。

在有上面的猜想之后,新建了一个600w完全随机数据的表,重复了上面的实验,完全符合预期。所以目前上面猜测可能性还是很大的,目前唯一的问题是强制使用索引后为什么依然速度很差。希望有一天能update解决这个问题,先写一下实验经过记录一下。


分割线,2024-07-15更新

在看一个关于mysql慢查询博客的时候突然想到了这个问题,突然发现之前没有考虑回表查询的损耗。因为查询的是*,而且区分度不明显,可能会产生大量回表查询,所以不如直接全表扫描。