MYSQL : JOIN vs IN clause performance

I work on an analytics platform and recently I wanted to create test database from our live database which is around 600GB. We are using mysql version 5.6. Actions table contains billions of records. I wanted to take slice of that table. I wrote a query using IN clause. For some reason IN queries look very simple to comprehend compared to a similar query written using JOINs. For the first time I’ve understood, IN clause may come with a cost.

CREATE TABLE test SELECT * FROM actions_live_table where id IN (SELECT id from <reference_table created from union>);

The reference table contained only 38328 records with list of action ids that I was interested in. For example I wanted my test table to only contain id 3,4,5..so on 38328 records.

It took forever to run so I ran explain on it. Turns out, mysql is scanning 17977082 rows.

*************************** 1. row ***************************id: 1select_type: PRIMARYtable: actions_live_tabletype: ALLpossible_keys: NULLkey: NULLkey_len: NULLref: NULLrows: 17977082Extra: Using where*************************** 2. row ***************************id: 2select_type: DEPENDENT SUBQUERYtable: <table created from union>type: ALLpossible_keys: NULLkey: NULLkey_len: NULLref: NULLrows: 38328Extra: Using where

Then I rewrote the query to use a LEFT JOIN as follows.

CREATE TABLE test SELECT * FROM actions_live_table LEFT JOIN 
(SELECT id from <table created from union>) as ref_table on ref_table.id=actions_live_table.id where ref_table.id is not null

The number of scan reduces to 38328. The query ran within seconds.

*************************** 1. row ***************************id: 1select_type: PRIMARYtable: visits_tabletype: ALLpossible_keys: NULLkey: NULLkey_len: NULLref: NULLrows: 38328Extra: Using where*************************** 2. row ***************************id: 1select_type: PRIMARYtable: actions_live_tabletype: eq_refpossible_keys: PRIMARYkey: PRIMARYkey_len: 4ref: action_table.idrows: 1Extra: NULL

I rewrote the query because I read about how IN clause can be very slow in certain situations. This happened to be one of those cases. I’m not sure why mysql does this. In case of LEFT JOIN it goes and just picks out the rows we are looking for. In case of IN clause, it matches a billion rows again IN clause. This whole finding has forced me to consider every select query I’ve written while building the application. I better rewrite them if they are slow.