Tuesday, April 24, 2007

You Should Never Use IN In SQL To JOIN With Another Table

Do you use the following syntax?

SELECT *
FROM TABLE
WHERE CoLumn IN (SELECT Column FROM TABLE)

OR this?

SELECT *
FROM TABLE
WHERE CoLumn NOT IN (SELECT Column FROM TABLE)

Do NOT use that, it will cause problems sooner or later. Don't believe me? Let's take a look

First create these 2 tables and populate them with some sample data

CREATE TABLE TestTable1 (id1 int)
CREATE TABLE TestTable2 (id2 int)

INSERT TestTable1 VALUES(1)
INSERT TestTable1 VALUES(2)
INSERT TestTable1 VALUES(3)


INSERT TestTable2 VALUES(1)
INSERT TestTable2 VALUES(2)


Now let's run the IN query

SELECT *
FROM TestTable1
WHERE id1 IN (SELECT id2 FROM TestTable2)
-----
id1

1
2


No problems here right?


What if by mistake you wrote id1 instead of id2?

SELECT *
FROM TestTable1
WHERE id1 IN (SELECT id1 FROM TestTable2)
-----
id1
1
2
3


Oops all 3 rows are returned, if you just run this SELECT id1 FROM TestTable2 you will get this error
Server: Msg 207, Level 16, State 3, Line 1
Invalid column name 'id1'.

So what happens? SQL Server sees column id1 and says "yes I have that it is in the TestTable1 table, I can use that"
What can we do? Use EXISTS because you will get an error instead of a wrong resultset

SELECT *
FROM t1
WHERE EXISTS (SELECT * FROM TestTable2 t2 WHERE t2.id2 = t1.id1 )
-----
id1

1
2



A JOIN will do the same as EXISTS

SELECT t1.*
FROM TestTable1 t1
JOIN TestTable2 t2 ON t2.id2 = t1.id1
-----
id1

1
2




Now let's try NOT IN


SELECT *
FROM TestTable1
WHERE id1 NOT IN (SELECT id2 FROM TestTable2)
-----
id1
3


No problem right?

Add a NULL value to the TestTable2 table

INSERT TestTable2 VALUES(NULL)


Let's try running it again

SELECT *
FROM TestTable1
WHERE id1 NOT IN (SELECT id2 FROM TestTable2)

Where are my rows? Nowhere, since NULL is not equal to anything including another NULL SQL just returns nothing

What happens when you use NOT EXISTS?
SELECT *
FROM TestTable1 t1
WHERE NOT EXISTS (SELECT * FROM TestTable2 t2 WHERE t2.id2 = t1.id1 )
-----
id1
3


That works without a problem



What about a LEFT JOIN?

SELECT t1.*
FROM TestTable1 t1
LEFT JOIN TestTable2 t2 ON t2.id2 = t1.id1
WHERE t2.id2 IS NULL
-----
id1
3



That works without a problem also



So from now on use EXISTS, NOT EXISTS, JOIN and LEFT JOIN

DO NOT use IN or NOT IN ever again. Okay?

19 comments:

leszek said...

I beg to differ. The examples you gave are not very convincing - one is an obvious mistake nobody should make, and the second - null value in an a key column - is a mistake even bigger.

The *real* reason to use NOT EXISTS over NOT IN is performance: with NOT IN the database engine cannot use indexes, and needs to perform a full table scan.

Cavallo said...

Nice tips! Thanks!

Matthew said...

......or, instead of abandoning a useful feature, write proper sql.

Anonymous said...

Wow, this has to be one of the worst sql suggestions I've ever seen. IN is there for a reason, and the reasons you've given against are ridiculous. You could argue for some kind of JOIN purity or from perf, but instead came up with what?

Anonymous said...

What's the point of this? Sorry, I don't get it..

Anonymous said...

Ehr... interesting title, disappointing post :-(

Anonymous said...

hmm..i'll have to try this tomorrow.. i had just refactored a group of stored procedures from using a join to using IN..

the join was causing lots and lots of disk I/O as the joined table is very very large but the # of records needed was very small..

converting to IN allowed more efficient use of the existing indexes and lessened disk i/o..the resulting query plan used a left-semi join if i remember correctly..

i'll have to see if there is a difference with EXISTS..

SQL said...

leszek, true not exists will perform better than not in however it will be an index scan not a table scan. You can't have NULLs in a PK column

>>one is an obvious mistake nobody should make
Yes but if you do make it you are not warned (no error is generated)

Anonymous said...

I think you should avoid using ALL the SQL keywords... that would eliminate a whole host of analagous errors. Might want to avoid using all keywords in all languages, actually.

Erik Eckhardt said...

Use aliases and it will protect you against typing errors:

SELECT *
FROM TestTable1 t1
WHERE t1.id1 IN (SELECT t2.id1 FROM TestTable2 t2)

Now you get an error instead of a silent wrong operation.

Anonymous said...

I'm sorry, but this article is absolutely absurd.

Your argument is because you might type something wrong? Please. Remove this article, it's pointless.

oh noes! I typed SLEGRPPP instead of SELECT, never use 'SELECT' again!

SQL said...

>>oh noes! I typed SLEGRPPP instead of SELECT, never use 'SELECT' again!

You completely missed the point, when you mistype select you will get an error, when you mistype the column name you won't

Anonymous said...

Geez, these developers are friggin' morons. Such a simple article, and they don't seem to get it.

The issue, as stated numerous times in the article, is that no error is generated, NOT THAT THE SQL IS "POOR".

I RARELY use the "IN" syntax when joining, but sometimes just as a quick one-off I do. That's when I noticed this behaviour, and found this great article.

Anonymous said...

pretty interesting article..

Anonymous said...

The suggestion to use LEFT JOIN and IS NULL is extremely bad in SQL server, since SQL Server will not recognize this as an anti-join. NOT EXISTS/NOT IN should be identical performance-wise, DBMSs have improved since the 80s

Anonymous said...

Not in vs. Not Exists is an interesting discussion. In vs. Exists is a non-issue other than the parser "error" which is so trivial it is insignificant.

CJ Randolph said...

Thanks for this, the idea that you get a 'compile time' error is as equally compelling to me as performance. In a large result set an extra row value being returned could be easily missed and end up skewing production data.

We all make mistakes, having the syntax parser catch as many as possible is a good thing in my estimation.

Erik's suggestion to always use aliases is one to remember as well.

Futile said...

Good article. Blunt on the summary but a great argument and good advice nonetheless. I think a very interesting point is when the NULL value was inserted. Yes, it should never happen in an "ID" field but that's only happening here because of the simplicity of the example (though it can happen in the real world too, but that's a topic for another day). It's one of those "gotcha" situations if you aren't careful and really supports SQL's post on this blog. Here's a great article that goes into a little more depth with Three-valued logic and the NULL value.

http://www.sqlservercentral.com/articles/Advanced+Querying/fourrulesfornulls/1915/

Futile said...

Good article. Blunt on the summary but a great argument and good advice nonetheless. I think a very interesting point is when the NULL value was inserted. Yes, it should never happen in an "ID" field but that's only happening here because of the simplicity of the example (though it can happen in the real world too, but that's a topic for another day). It's one of those "gotcha" situations if you aren't careful and really supports SQL's post on this blog. Here's a great article that goes into a little more depth with Three-valued logic and the NULL value.

http://www.sqlservercentral.com/articles/Advanced+Querying/fourrulesfornulls/1915/