Showing posts with label BackToBasics. Show all posts
Showing posts with label BackToBasics. Show all posts

Monday, February 29, 2016

#BackToBasics returning rows with maximum values from a table



In the A challenge for 2016....... accepted! post I said that I would write a post once every month that would explain very basic stuff about SQL Server. Today is the second post, in this post we will take a look at how you can return the row with the maximum value for a group.

In order to explain the code in this post, I decided to grab the top largest cities in the world as the data. You can find the data here List of cities proper by population

 Below is the SQL code to create a table and also the data to populate this table

create table #Temp( Rank int not null, 
 City varchar(100) not null,
 Population int not null,
 Density int not null,
 Country varchar(100))

INSERT #Temp
SELECT 1,'Shanghai',24256800,3826,'China' union all
SELECT 2,'Karachi',23500000,6663,'Pakistan' union all
SELECT 3,'Beijing',21516000,1311,'China' union all
SELECT 4,'Lagos',21324000,18206,'Nigeria' union all
SELECT 5,'Delhi',16787941,11320,'India' union all
SELECT 6,'Tianjin',15200000,1293,'China' union all
SELECT 7,'Istanbul',14160467,2593,'Turkey' union all
SELECT 8,'Tokyo',13297629,6075,'Japan' union all
SELECT 9,'Guangzhou',13080500,1759,'China' union all
SELECT 10,'Mumbai',12478447,20680,'India' union all
SELECT 11,'Moscow',12197596,4859,'Russia' union all
SELECT 12,'São Paulo',11895893,7821,'Brazil' union all
SELECT 13,'Shenzhen',10467400,5256,'China' union all
SELECT 14,'Jakarta',10075310,15171,'Indonesia' union all
SELECT 15,'Seoul',10048850,17134,'South Korea' 

Now that we have the data in our table, let's run a simple query to return the row with the most populous city in the world. You can accomplish this in SQL Server by using the TOP operator and ordering the data by population in descending order.

SELECT TOP 1 * FROM #Temp
ORDER BY population DESC

Here is the output from that query


Rank City  Population Density Country
1 Shanghai 24256800 3826 China


What we want to do next is return one row the most populous city per country.

So let's take it a step at a time, the query below will give you the maximum population by country

SELECT MAX(population) as BiggestPopulation,Country
FROM #Temp
GROUP BY Country
ORDER BY BiggestPopulation DESC

Here is the output

BiggestPopulation Country
24256800  China
23500000  Pakistan
21324000  Nigeria
16787941  India
14160467  Turkey
13297629  Japan
12197596  Russia
11895893  Brazil
10075310  Indonesia
10048850  South Korea

What we can do is join the query we have above with the table itself on the country and population columns. The query will look like this

SELECT T.* FROM(
SELECT MAX(population) as BiggestPopulation,Country
FROM #Temp
GROUP BY Country) X
JOIN #Temp t on x.BiggestPopulation = t.Population
AND t.Country = x.Country
ORDER BY t.Population DESC

Here is the output

+------+-----------+------------+---------+-------------+
| Rank | City      | Population | Density | Country     |
+------+-----------+------------+---------+-------------+
| 1    | Shanghai  | 24256800   | 3826    | China       |
+------+-----------+------------+---------+-------------+
| 2    | Karachi   | 23500000   | 6663    | Pakistan    |
+------+-----------+------------+---------+-------------+
| 4    | Lagos     | 21324000   | 18206   | Nigeria     |
+------+-----------+------------+---------+-------------+
| 5    | Delhi     | 16787941   | 11320   | India       |
+------+-----------+------------+---------+-------------+
| 7    | Istanbul  | 14160467   | 2593    | Turkey      |
+------+-----------+------------+---------+-------------+
| 8    | Tokyo     | 13297629   | 6075    | Japan       |
+------+-----------+------------+---------+-------------+
| 11   | Moscow    | 12197596   | 4859    | Russia      |
+------+-----------+------------+---------+-------------+
| 12   | São Paulo | 11895893   | 7821    | Brazil      |
+------+-----------+------------+---------+-------------+
| 14   | Jakarta   | 10075310   | 15171   | Indonesia   |
+------+-----------+------------+---------+-------------+
| 15   | Seoul     | 10048850   | 17134   | South Korea |
+------+-----------+------------+---------+-------------+


SQL Server 2005 introduced a bunch of ranking/windowing functions which make writing these queries a little simpler. We are going to use ROW_NUMBER for this post. Here is what the query will look like


SELECT ROW_NUMBER() OVER (PARTITION By [what do you want it grouped/partitioned by?] ORDER BY [what do you want it ordered by?]) AS Row,* 
FROM Table

In our case, we want to partition the result set by country, we also want to order each partition by population descending. The query looks like this


SELECT ROW_NUMBER() OVER (PARTITION By Country ORDER BY population DESC) AS Row,*
FROM #Temp
ORDER BY Row,Population DESC

Here is what that query returns

+-----+------+-----------+------------+---------+-------------+
| Row | Rank | City      | Population | Density | Country     |
+-----+------+-----------+------------+---------+-------------+
| 1   | 1    | Shanghai  | 24256800   | 3826    | China       |
+-----+------+-----------+------------+---------+-------------+
| 1   | 2    | Karachi   | 23500000   | 6663    | Pakistan    |
+-----+------+-----------+------------+---------+-------------+
| 1   | 4    | Lagos     | 21324000   | 18206   | Nigeria     |
+-----+------+-----------+------------+---------+-------------+
| 1   | 5    | Delhi     | 16787941   | 11320   | India       |
+-----+------+-----------+------------+---------+-------------+
| 1   | 7    | Istanbul  | 14160467   | 2593    | Turkey      |
+-----+------+-----------+------------+---------+-------------+
| 1   | 8    | Tokyo     | 13297629   | 6075    | Japan       |
+-----+------+-----------+------------+---------+-------------+
| 1   | 11   | Moscow    | 12197596   | 4859    | Russia      |
+-----+------+-----------+------------+---------+-------------+
| 1   | 12   | São Paulo | 11895893   | 7821    | Brazil      |
+-----+------+-----------+------------+---------+-------------+
| 1   | 14   | Jakarta   | 10075310   | 15171   | Indonesia   |
+-----+------+-----------+------------+---------+-------------+
| 1   | 15   | Seoul     | 10048850   | 17134   | South Korea |
+-----+------+-----------+------------+---------+-------------+
| 2   | 3    | Beijing   | 21516000   | 1311    | China       |
+-----+------+-----------+------------+---------+-------------+
| 2   | 10   | Mumbai    | 12478447   | 20680   | India       |
+-----+------+-----------+------------+---------+-------------+
| 3   | 6    | Tianjin   | 15200000   | 1293    | China       |
+-----+------+-----------+------------+---------+-------------+
| 4   | 9    | Guangzhou | 13080500   | 1759    | China       |
+-----+------+-----------+------------+---------+-------------+
| 5   | 13   | Shenzhen  | 10467400   | 5256    | China       |
+-----+------+-----------+------------+---------+-------------+


What we really want is the top 10 rows. You might think that you can just add WHERE Row = 1 to the query. This is not allowed since the WHERE clause gets executed before the ROW_Number function as part of the SELECT. What you can do is use a common table expression and then specify WHERE Row = 1 when selecting from the CTE

Here is what it looks like

;WITH CTE AS(
SELECT ROW_NUMBER() OVER (PARTITION By Country ORDER BY population DESC) AS Row,*
FROM #Temp)

SELECT Rank, City,Population,Density,Country
FROM CTE
WHERE Row = 1
ORDER BY Population DESC

And here is our output, for each country we have only one row, you can also see that 4 cities from the top 10 are not in this list, those cities are in either China or India

+------+-----------+------------+---------+-------------+--+
| Rank | City      | Population | Density | Country     |  |
+------+-----------+------------+---------+-------------+--+
| 1    | Shanghai  | 24256800   | 3826    | China       |  |
+------+-----------+------------+---------+-------------+--+
| 2    | Karachi   | 23500000   | 6663    | Pakistan    |  |
+------+-----------+------------+---------+-------------+--+
| 4    | Lagos     | 21324000   | 18206   | Nigeria     |  |
+------+-----------+------------+---------+-------------+--+
| 5    | Delhi     | 16787941   | 11320   | India       |  |
+------+-----------+------------+---------+-------------+--+
| 7    | Istanbul  | 14160467   | 2593    | Turkey      |  |
+------+-----------+------------+---------+-------------+--+
| 8    | Tokyo     | 13297629   | 6075    | Japan       |  |
+------+-----------+------------+---------+-------------+--+
| 11   | Moscow    | 12197596   | 4859    | Russia      |  |
+------+-----------+------------+---------+-------------+--+
| 12   | São Paulo | 11895893   | 7821    | Brazil      |  |
+------+-----------+------------+---------+-------------+--+
| 14   | Jakarta   | 10075310   | 15171   | Indonesia   |  |
+------+-----------+------------+---------+-------------+--+
| 15   | Seoul     | 10048850   | 17134   | South Korea |  |
+------+-----------+------------+---------+-------------+--+



That is all for this month, see you next month. In the meantime, make sure to explore windowing/ranking functions as well as Common Table Expressions

Monday, January 18, 2016

#BackToBasics: Returning all rows from one table that do not exists in another table



In the A challenge for 2016....... accepted! post I said that I would write a post once every month that would explain very basic stuff about SQL Server. Today is the first post, in this post we will take a look at how you can return rows from one table which do not exist in another table.

First we need to create our two tables, these tables will be very simple, each will have only 1 column

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
CREATE TABLE MainData (ID int)
GO

INSERT MainData values(1)
INSERT MainData values(2)
INSERT MainData values(4)
INSERT MainData values(5)
INSERT MainData values(7)
GO


CREATE TABLE SomeData (ID2 int)
GO

INSERT SomeData values(1)
INSERT SomeData values(2)
INSERT SomeData values(3)
INSERT SomeData values(4)
INSERT SomeData values(6)
INSERT SomeData values(8)
GO

Now that our tables are set up it is time to look at the various queries.

NOT IN

This is one of the simpler ways, it is almost a direct translation from English,  select all from this table which is not in the other table
Here is what the query looks like

1
2
SELECT t2.* FROM SomeData t2
WHERE ID2 NOT IN(SELECT ID FROM MainData)

This is the output

ID2
--------
3
6
8

The output is correct, our MainData table has only these values: 1,2,4,5 and 7. The SomeData table has the values 3, 6 and 8, these values do not exist in our MainData table. As you can see the NOT IN query is very simple and will work most of the time, there are however two scenarios where it could be problematic using NOT IN. We will look at these two scenarios at the end of this post.

NOT EXISTS

Using NOT EXISTS is very similar to NOT IN, one addition you have to make is adding a WHERE clause, you are in essence doing a JOIN condition but not returning anything that satisfied this condition. Here is what the query looks like

1
2
SELECT t2.* FROM SomeData t2
WHERE NOT EXISTS (SELECT ID FROM MainData t1 WHERE t1.ID = t2.ID2)

The results for this query are the same as the NOT IN query

LEFT JOIN

A LEFT JOIN query returns all the data from both table, it will return NULL values for the rows that don't have a matched in the outer joined table. If you run the following query


1
2
SELECT * FROM SomeData t2
LEFT JOIN MainData t1 ON t1.ID = t2.ID2

Here is what the results look like

ID2 ID
----    -----
1 1
2 2
3 NULL
4 4
6 NULL
8 NULL


As you can see the rows with the values 3,6 and 8 in the 1st column have a NULL value in the 2nd column.

The LEFT JOIN query is more complex compared to NOT IN and NOT EXISTS. You also need to know  that this query will return data from both tables, this is why we need to specify SELECT t2.*. I also specified t2.* in the other two queries, this was however not needed since  NOT IN and NOT EXISTS only return data from one table. Like in the NOT EXISTS query, you also specify a JOIN condition in the LEFT JOIN query. Finally in the WHERE clause you are filtering out the data which does exists by asking for all the rows where t1.ID is NULL
Here is what the query looks like and the results are the same as for the NOT IN and the NOT EXISTS queries

1
2
3
SELECT t2.* FROM SomeData t2
LEFT JOIN MainData t1 ON t1.ID = t2.ID2
WHERE t1.ID IS NULL


EXCEPT

Using EXCEPT is very easy, the query is basically return every from one table EXCEPT what is returned by the bottom query. Here is what the query looks like

1
2
3
SELECT t2.* FROM SomeData t2
EXCEPT
SELECT ID FROM MainData

This query will return the same data as the other queries. Using EXCEPT is pretty simple and straightforward but I have to warn you, EXCEPT does a sort and is the worst performing query of all the ones mentioned here. I don't recommend using EXCEPT for any big tables


Some problems you might encounter when using NOT IN 


 Take a look at the query below, do you see anything wrong? Run the following query

1
2
SELECT t2.* FROM SomeData t2
WHERE ID2 NOT IN(SELECT ID2 FROM MainData)

You don't get any errors but the query returns nothing. The problem is that the MainData table does not contain a column named ID2, it is named ID. Since the ID2 table does exists in the SomeData table SQL Server does not throw an error. If you use LEFT JOIN or NOT EXISTS, you cannot make this mistake. Run the following two queries to see what happens


1
2
3
4
5
6
SELECT t2.* FROM SomeData t2
WHERE NOT EXISTS (SELECT ID FROM MainData t1 WHERE t1.ID2 = t2.ID2)

SELECT t2.* FROM SomeData t2
LEFT JOIN MainData t1 ON t1.ID2 = t2.ID2
WHERE t1.ID IS NULL
Here is the output

Msg 207, Level 16, State 1, Line 75
Invalid column name 'ID2'.
Msg 207, Level 16, State 1, Line 79
Invalid column name 'ID2'.

As you can see you got an error.

NULL values will also cause a problem when using NOT IN. Add the following NULL value to the MainData table

1
INSERT MainData values(NULL)

If you run all the 3 different queries again, you will notice something

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
SELECT t2.ID2 as 'Not in' FROM SomeData t2
WHERE ID2 NOT IN(SELECT ID FROM MainData)

SELECT t2.ID2 as 'Not Exists' FROM SomeData t2
WHERE NOT EXISTS (SELECT ID FROM MainData t1 WHERE t1.ID = t2.ID2)

SELECT t2.ID2 as 'Left Join' FROM SomeData t2
LEFT JOIN MainData t1 ON t1.ID = t2.ID2
WHERE t1.ID IS NULL

SELECT t2.ID2 as 'Except' FROM SomeData t2
EXCEPT
SELECT ID FROM MainData
Here are the results

Not in
-----------

Not Exists
-----------
3
6
8

Left Join
-----------
3
6
8

Except
-----------
3
6
8

Do you see what happened? The NOT IN query is not returning anything, a NULL value is not equal to anything, not even to another NULL value.

So that's it for this post, I showed you four ways to return values from one table which do not exists in another table. I also showed you why NOT IN might cause some problems. In general I like to use EXISTS and NOT EXISTS unless I need data from both table, in that case I will use a JOIN.

Till next month. If you want me to cover a topic leave me a comment.