Google
Google Interview Questions | SQL Server Software | Microsoft SQL Server Books | Denis Gobo's personal blog | Pro SQL server 2005 (Apress)
Top 10 Articles | Add Yourself To My Frappr Map | Ken Henderson Interview | Louis Davidson Interview

Tuesday, April 24, 2007

You Should Never Use IN In SQL To JOIN With Another Table

Do you use the following syntax?

SELECT *
FROM TABLE
WHERE CoLumn IN (SELECT Column FROM TABLE)

OR this?

SELECT *
FROM TABLE
WHERE CoLumn NOT IN (SELECT Column FROM TABLE)

Do NOT use that, it will cause problems sooner or later. Don't believe me? Let's take a look

First create these 2 tables and populate them with some sample data

CREATE TABLE TestTable1 (id1 int)
CREATE TABLE TestTable2 (id2 int)

INSERT TestTable1 VALUES(1)
INSERT TestTable1 VALUES(2)
INSERT TestTable1 VALUES(3)


INSERT TestTable2 VALUES(1)
INSERT TestTable2 VALUES(2)


Now let's run the IN query

SELECT *
FROM TestTable1
WHERE id1 IN (SELECT id2 FROM TestTable2)
-----
id1

1
2


No problems here right?


What if by mistake you wrote id1 instead of id2?

SELECT *
FROM TestTable1
WHERE id1 IN (SELECT id1 FROM TestTable2)
-----
id1
1
2
3


Oops all 3 rows are returned, if you just run this SELECT id1 FROM TestTable2 you will get this error
Server: Msg 207, Level 16, State 3, Line 1
Invalid column name 'id1'.

So what happens? SQL Server sees column id1 and says "yes I have that it is in the TestTable1 table, I can use that"
What can we do? Use EXISTS because you will get an error instead of a wrong resultset

SELECT *
FROM t1
WHERE EXISTS (SELECT * FROM TestTable2 t2 WHERE t2.id2 = t1.id1 )
-----
id1

1
2



A JOIN will do the same as EXISTS

SELECT t1.*
FROM TestTable1 t1
JOIN TestTable2 t2 ON t2.id2 = t1.id1
-----
id1

1
2




Now let's try NOT IN


SELECT *
FROM TestTable1
WHERE id1 NOT IN (SELECT id2 FROM TestTable2)
-----
id1
3


No problem right?

Add a NULL value to the TestTable2 table

INSERT TestTable2 VALUES(NULL)


Let's try running it again

SELECT *
FROM TestTable1
WHERE id1 NOT IN (SELECT id2 FROM TestTable2)

Where are my rows? Nowhere, since NULL is not equal to anything including another NULL SQL just returns nothing

What happens when you use NOT EXISTS?
SELECT *
FROM TestTable1 t1
WHERE NOT EXISTS (SELECT * FROM TestTable2 t2 WHERE t2.id2 = t1.id1 )
-----
id1
3


That works without a problem



What about a LEFT JOIN?

SELECT t1.*
FROM TestTable1 t1
LEFT JOIN TestTable2 t2 ON t2.id2 = t1.id1
WHERE t2.id2 IS NULL
-----
id1
3



That works without a problem also



So from now on use EXISTS, NOT EXISTS, JOIN and LEFT JOIN

DO NOT use IN or NOT IN ever again. Okay?

Labels:

18 Comments:

Blogger leszek said...

I beg to differ. The examples you gave are not very convincing - one is an obvious mistake nobody should make, and the second - null value in an a key column - is a mistake even bigger.

The *real* reason to use NOT EXISTS over NOT IN is performance: with NOT IN the database engine cannot use indexes, and needs to perform a full table scan.

12:49 AM  
Blogger Cavallo said...

Nice tips! Thanks!

1:06 AM  
Blogger Matthew said...

......or, instead of abandoning a useful feature, write proper sql.

10:24 AM  
Anonymous Anonymous said...

Wow, this has to be one of the worst sql suggestions I've ever seen. IN is there for a reason, and the reasons you've given against are ridiculous. You could argue for some kind of JOIN purity or from perf, but instead came up with what?

7:50 PM  
Anonymous Anonymous said...

What's the point of this? Sorry, I don't get it..

10:53 PM  
Anonymous Anonymous said...

Ehr... interesting title, disappointing post :-(

11:05 PM  
Anonymous Anonymous said...

hmm..i'll have to try this tomorrow.. i had just refactored a group of stored procedures from using a join to using IN..

the join was causing lots and lots of disk I/O as the joined table is very very large but the # of records needed was very small..

converting to IN allowed more efficient use of the existing indexes and lessened disk i/o..the resulting query plan used a left-semi join if i remember correctly..

i'll have to see if there is a difference with EXISTS..

11:20 PM  
Blogger SQL said...

leszek, true not exists will perform better than not in however it will be an index scan not a table scan. You can't have NULLs in a PK column

>>one is an obvious mistake nobody should make
Yes but if you do make it you are not warned (no error is generated)

3:04 AM  
Anonymous Anonymous said...

I think you should avoid using ALL the SQL keywords... that would eliminate a whole host of analagous errors. Might want to avoid using all keywords in all languages, actually.

10:08 AM  
Anonymous Erik Eckhardt said...

Use aliases and it will protect you against typing errors:

SELECT *
FROM TestTable1 t1
WHERE t1.id1 IN (SELECT t2.id1 FROM TestTable2 t2)

Now you get an error instead of a silent wrong operation.

12:21 PM  
Anonymous Anonymous said...

I'm sorry, but this article is absolutely absurd.

Your argument is because you might type something wrong? Please. Remove this article, it's pointless.

oh noes! I typed SLEGRPPP instead of SELECT, never use 'SELECT' again!

10:16 PM  
Blogger SQL said...

>>oh noes! I typed SLEGRPPP instead of SELECT, never use 'SELECT' again!

You completely missed the point, when you mistype select you will get an error, when you mistype the column name you won't

3:17 AM  
Anonymous Anonymous said...

Geez, these developers are friggin' morons. Such a simple article, and they don't seem to get it.

The issue, as stated numerous times in the article, is that no error is generated, NOT THAT THE SQL IS "POOR".

I RARELY use the "IN" syntax when joining, but sometimes just as a quick one-off I do. That's when I noticed this behaviour, and found this great article.

2:24 PM  
Anonymous Anonymous said...

pretty interesting article..

10:28 PM  
Anonymous Anonymous said...

The suggestion to use LEFT JOIN and IS NULL is extremely bad in SQL server, since SQL Server will not recognize this as an anti-join. NOT EXISTS/NOT IN should be identical performance-wise, DBMSs have improved since the 80s

4:45 AM  
Anonymous Anonymous said...

Not in vs. Not Exists is an interesting discussion. In vs. Exists is a non-issue other than the parser "error" which is so trivial it is insignificant.

9:38 AM  
Anonymous CJ Randolph said...

Thanks for this, the idea that you get a 'compile time' error is as equally compelling to me as performance. In a large result set an extra row value being returned could be easily missed and end up skewing production data.

We all make mistakes, having the syntax parser catch as many as possible is a good thing in my estimation.

Erik's suggestion to always use aliases is one to remember as well.

8:11 AM  
Blogger Futile said...

Good article. Blunt on the summary but a great argument and good advice nonetheless. I think a very interesting point is when the NULL value was inserted. Yes, it should never happen in an "ID" field but that's only happening here because of the simplicity of the example (though it can happen in the real world too, but that's a topic for another day). It's one of those "gotcha" situations if you aren't careful and really supports SQL's post on this blog. Here's a great article that goes into a little more depth with Three-valued logic and the NULL value.

http://www.sqlservercentral.com/articles/Advanced+Querying/fourrulesfornulls/1915/

6:25 AM  

Post a Comment

<< Home