Wednesday, March 02, 2016

Easy running totals with windowing functions

Back in the pre-SQL Server 2012 days, if you wanted a running total, you either had to use a correlated subquery or a variable. This was slow because the query that did the sum was executed once for every row. With the additions to the windowing functions in SQL Server 2012, the same calculation now runs much faster.
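
For reference, the correlated subquery variant of the old approach looked something like this (a rough sketch, written against the #test table created below; the subquery is re-evaluated for every row):

SELECT t.Id, t.SomeDate, t.Charge,
       (SELECT SUM(t2.Charge)
          FROM #test t2
         WHERE t2.Id = t.Id
           AND t2.SomeDate <= t.SomeDate) AS RunningTotal
FROM #test t
ORDER BY t.Id, t.SomeDate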

Let's take a look. First, create the following table


CREATE TABLE #test(Id tinyint,SomeDate date, Charge decimal(20,10))

insert #test
SELECT 1,'20120101',1000
UNION ALL
SELECT 1,'20120401',200
UNION ALL
SELECT 1,'20120501',300
UNION ALL
SELECT 1,'20120601',600
UNION ALL
SELECT 2,'20120101',100
UNION ALL
SELECT 2,'20130101',500
UNION ALL
SELECT 2,'20140101',-800
UNION ALL
SELECT 3,'20120101',100


Let's check the data we just inserted into the temporary table


SELECT * FROM #test


The output looks like this

Id SomeDate Charge
1 2012-01-01 1000.0000000000
1 2012-04-01 200.0000000000
1 2012-05-01 300.0000000000
1 2012-06-01 600.0000000000
2 2012-01-01 100.0000000000
2 2013-01-01 500.0000000000
2 2014-01-01 -800.0000000000
3 2012-01-01 100.0000000000


What we want is the following

id StartDate Enddate         Charge         RunningTotal
1 2012-01-01 2012-03-31 1000.0000000000 1000.0000000000
1 2012-04-01 2012-04-30 200.0000000000 1200.0000000000
1 2012-05-01 2012-05-31 300.0000000000 1500.0000000000
1 2012-06-01 9999-12-31 600.0000000000 2100.0000000000
2 2012-01-01 2012-12-31 100.0000000000 100.0000000000
2 2013-01-01 2013-12-31 500.0000000000 600.0000000000
2 2014-01-01 9999-12-31 -800.0000000000 -200.0000000000
3 2012-01-01 9999-12-31 100.0000000000 100.0000000000

For each row, we want the date that the row starts on and the date that it ends on, as well as a running total. If there is no row after the current row for that id, we want the end date to be 9999-12-31.

So we will use a couple of functions. The first one is LEAD: LEAD accesses data from a subsequent row in the same result set without the use of a self-join. The LEAD part looks like this

LEAD(dateadd(dd,-1,SomeDate),1,'99991231')  OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,

What we are doing is subtracting 1 day from the date in the subsequent row (dateadd(dd,-1,SomeDate)).
We are using 1 as the offset since we want the value from the next row. Finally, if there is no subsequent row, we want to use the date 9999-12-31 instead of NULL

To do the running total, we will do the following

SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
          AS RunningTotal

What this means in English is: for each id, ordered by date, sum up the charge values from the first row of the partition up to and including the current row. Here is what each part of the frame clause means.

ROWS BETWEEN
Specifies the rows that make up the window frame, delimited by the start and end points that follow.

UNBOUNDED PRECEDING
Specifies that the window starts at the first row of the partition. UNBOUNDED PRECEDING can only be specified as the window starting point.

CURRENT ROW
Specifies that the window starts or ends at the current row when used with ROWS or the current value when used with RANGE.
CURRENT ROW can be specified as both a starting and ending point.
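
One thing worth knowing: if you specify ORDER BY in the OVER clause but leave the frame out entirely, the default frame is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, which lumps together rows that share the same ORDER BY value and is generally slower. A quick sketch against the #test table shows the two side by side:

SELECT Id, SomeDate, Charge,
       SUM(Charge) OVER (PARTITION BY Id ORDER BY SomeDate) AS DefaultRangeTotal, -- ties share one value
       SUM(Charge) OVER (PARTITION BY Id ORDER BY SomeDate
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RowsTotal        -- strictly row by row
FROM #test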

And here is the query


SELECT id, someDate as StartDate,
LEAD(dateadd(dd,-1,SomeDate),1,'99991231') 
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM #test
  ORDER BY id, SomeDate


Running that query gives us the running total as well as the start and end dates

id StartDate Enddate         Charge         RunningTotal
1 2012-01-01 2012-03-31 1000.0000000000 1000.0000000000
1 2012-04-01 2012-04-30 200.0000000000 1200.0000000000
1 2012-05-01 2012-05-31 300.0000000000 1500.0000000000
1 2012-06-01 9999-12-31 600.0000000000 2100.0000000000
2 2012-01-01 2012-12-31 100.0000000000 100.0000000000
2 2013-01-01 2013-12-31 500.0000000000 600.0000000000
2 2014-01-01 9999-12-31 -800.0000000000 -200.0000000000
3 2012-01-01 9999-12-31 100.0000000000 100.0000000000
That's all for this post...

Monday, February 29, 2016

#BackToBasics returning rows with maximum values from a table



In the A challenge for 2016....... accepted! post I said that I would write a post once every month that would explain very basic stuff about SQL Server. Today is the second post; in this post we will take a look at how you can return the row with the maximum value for a group.

In order to explain the code in this post, I decided to use the 15 largest cities in the world as the data. You can find the data here: List of cities proper by population

 Below is the SQL code to create a table and also the data to populate this table

create table #Temp( Rank int not null, 
 City varchar(100) not null,
 Population int not null,
 Density int not null,
 Country varchar(100))

INSERT #Temp
SELECT 1,'Shanghai',24256800,3826,'China' union all
SELECT 2,'Karachi',23500000,6663,'Pakistan' union all
SELECT 3,'Beijing',21516000,1311,'China' union all
SELECT 4,'Lagos',21324000,18206,'Nigeria' union all
SELECT 5,'Delhi',16787941,11320,'India' union all
SELECT 6,'Tianjin',15200000,1293,'China' union all
SELECT 7,'Istanbul',14160467,2593,'Turkey' union all
SELECT 8,'Tokyo',13297629,6075,'Japan' union all
SELECT 9,'Guangzhou',13080500,1759,'China' union all
SELECT 10,'Mumbai',12478447,20680,'India' union all
SELECT 11,'Moscow',12197596,4859,'Russia' union all
SELECT 12,'São Paulo',11895893,7821,'Brazil' union all
SELECT 13,'Shenzhen',10467400,5256,'China' union all
SELECT 14,'Jakarta',10075310,15171,'Indonesia' union all
SELECT 15,'Seoul',10048850,17134,'South Korea' 

Now that we have the data in our table, let's run a simple query to return the row with the most populous city in the world. You can accomplish this in SQL Server by using the TOP operator and ordering the data by population in descending order.

SELECT TOP 1 * FROM #Temp
ORDER BY population DESC

Here is the output from that query


Rank City  Population Density Country
1 Shanghai 24256800 3826 China


What we want to do next is return one row per country: the most populous city in that country.

So let's take it one step at a time; the query below will give you the maximum population by country

SELECT MAX(population) as BiggestPopulation,Country
FROM #Temp
GROUP BY Country
ORDER BY BiggestPopulation DESC

Here is the output

BiggestPopulation Country
24256800  China
23500000  Pakistan
21324000  Nigeria
16787941  India
14160467  Turkey
13297629  Japan
12197596  Russia
11895893  Brazil
10075310  Indonesia
10048850  South Korea

What we can do is join the query we have above with the table itself on the country and population columns. The query will look like this

SELECT t.* FROM (
SELECT MAX(population) as BiggestPopulation, Country
FROM #Temp
GROUP BY Country) x
JOIN #Temp t on x.BiggestPopulation = t.Population
AND t.Country = x.Country
ORDER BY t.Population DESC

Here is the output

+------+-----------+------------+---------+-------------+
| Rank | City      | Population | Density | Country     |
+------+-----------+------------+---------+-------------+
| 1    | Shanghai  | 24256800   | 3826    | China       |
+------+-----------+------------+---------+-------------+
| 2    | Karachi   | 23500000   | 6663    | Pakistan    |
+------+-----------+------------+---------+-------------+
| 4    | Lagos     | 21324000   | 18206   | Nigeria     |
+------+-----------+------------+---------+-------------+
| 5    | Delhi     | 16787941   | 11320   | India       |
+------+-----------+------------+---------+-------------+
| 7    | Istanbul  | 14160467   | 2593    | Turkey      |
+------+-----------+------------+---------+-------------+
| 8    | Tokyo     | 13297629   | 6075    | Japan       |
+------+-----------+------------+---------+-------------+
| 11   | Moscow    | 12197596   | 4859    | Russia      |
+------+-----------+------------+---------+-------------+
| 12   | São Paulo | 11895893   | 7821    | Brazil      |
+------+-----------+------------+---------+-------------+
| 14   | Jakarta   | 10075310   | 15171   | Indonesia   |
+------+-----------+------------+---------+-------------+
| 15   | Seoul     | 10048850   | 17134   | South Korea |
+------+-----------+------------+---------+-------------+


SQL Server 2005 introduced a bunch of ranking/windowing functions which make writing these queries a little simpler. We are going to use ROW_NUMBER for this post. Here is what the query will look like


SELECT ROW_NUMBER() OVER (PARTITION By [what do you want it grouped/partitioned by?] ORDER BY [what do you want it ordered by?]) AS Row,* 
FROM Table

In our case, we want to partition the result set by country, we also want to order each partition by population descending. The query looks like this


SELECT ROW_NUMBER() OVER (PARTITION By Country ORDER BY population DESC) AS Row,*
FROM #Temp
ORDER BY Row,Population DESC

Here is what that query returns

+-----+------+-----------+------------+---------+-------------+
| Row | Rank | City      | Population | Density | Country     |
+-----+------+-----------+------------+---------+-------------+
| 1   | 1    | Shanghai  | 24256800   | 3826    | China       |
+-----+------+-----------+------------+---------+-------------+
| 1   | 2    | Karachi   | 23500000   | 6663    | Pakistan    |
+-----+------+-----------+------------+---------+-------------+
| 1   | 4    | Lagos     | 21324000   | 18206   | Nigeria     |
+-----+------+-----------+------------+---------+-------------+
| 1   | 5    | Delhi     | 16787941   | 11320   | India       |
+-----+------+-----------+------------+---------+-------------+
| 1   | 7    | Istanbul  | 14160467   | 2593    | Turkey      |
+-----+------+-----------+------------+---------+-------------+
| 1   | 8    | Tokyo     | 13297629   | 6075    | Japan       |
+-----+------+-----------+------------+---------+-------------+
| 1   | 11   | Moscow    | 12197596   | 4859    | Russia      |
+-----+------+-----------+------------+---------+-------------+
| 1   | 12   | São Paulo | 11895893   | 7821    | Brazil      |
+-----+------+-----------+------------+---------+-------------+
| 1   | 14   | Jakarta   | 10075310   | 15171   | Indonesia   |
+-----+------+-----------+------------+---------+-------------+
| 1   | 15   | Seoul     | 10048850   | 17134   | South Korea |
+-----+------+-----------+------------+---------+-------------+
| 2   | 3    | Beijing   | 21516000   | 1311    | China       |
+-----+------+-----------+------------+---------+-------------+
| 2   | 10   | Mumbai    | 12478447   | 20680   | India       |
+-----+------+-----------+------------+---------+-------------+
| 3   | 6    | Tianjin   | 15200000   | 1293    | China       |
+-----+------+-----------+------------+---------+-------------+
| 4   | 9    | Guangzhou | 13080500   | 1759    | China       |
+-----+------+-----------+------------+---------+-------------+
| 5   | 13   | Shenzhen  | 10467400   | 5256    | China       |
+-----+------+-----------+------------+---------+-------------+


What we really want is only the rows where Row = 1. You might think that you can just add WHERE Row = 1 to the query above, but that is not allowed, since the WHERE clause is evaluated before the ROW_NUMBER function in the SELECT. What you can do is use a common table expression and then specify WHERE Row = 1 when selecting from the CTE

Here is what it looks like

;WITH CTE AS(
SELECT ROW_NUMBER() OVER (PARTITION By Country ORDER BY population DESC) AS Row,*
FROM #Temp)

SELECT Rank, City,Population,Density,Country
FROM CTE
WHERE Row = 1
ORDER BY Population DESC

And here is our output. For each country we have only one row; you can also see that 4 cities from the top 10 are not in this list, since those cities are in either China or India

+------+-----------+------------+---------+-------------+
| Rank | City      | Population | Density | Country     |
+------+-----------+------------+---------+-------------+
| 1    | Shanghai  | 24256800   | 3826    | China       |
+------+-----------+------------+---------+-------------+
| 2    | Karachi   | 23500000   | 6663    | Pakistan    |
+------+-----------+------------+---------+-------------+
| 4    | Lagos     | 21324000   | 18206   | Nigeria     |
+------+-----------+------------+---------+-------------+
| 5    | Delhi     | 16787941   | 11320   | India       |
+------+-----------+------------+---------+-------------+
| 7    | Istanbul  | 14160467   | 2593    | Turkey      |
+------+-----------+------------+---------+-------------+
| 8    | Tokyo     | 13297629   | 6075    | Japan       |
+------+-----------+------------+---------+-------------+
| 11   | Moscow    | 12197596   | 4859    | Russia      |
+------+-----------+------------+---------+-------------+
| 12   | São Paulo | 11895893   | 7821    | Brazil      |
+------+-----------+------------+---------+-------------+
| 14   | Jakarta   | 10075310   | 15171   | Indonesia   |
+------+-----------+------------+---------+-------------+
| 15   | Seoul     | 10048850   | 17134   | South Korea |
+------+-----------+------------+---------+-------------+
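
As a side note, you can get the same one-row-per-country result without a CTE by combining TOP with ORDER BY ROW_NUMBER (a sketch; note that the final sort order differs from the query above):

SELECT TOP (1) WITH TIES Rank, City, Population, Density, Country
FROM #Temp
ORDER BY ROW_NUMBER() OVER (PARTITION BY Country ORDER BY Population DESC)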



That is all for this month, see you next month. In the meantime, make sure to explore windowing/ranking functions as well as Common Table Expressions

Sunday, February 21, 2016

How to store twitter search results from R into SQL Server

I was messing around with R, grabbing twitter searches from within R and storing the results in a csv file. But instead of storing the results in a file, what if I want to store that info in a SQL Server database? This is pretty easy to do. If you want to try this out yourself, first head on over to my other post Using twitter from the R console. There you will see how to set up OAuth so that you can use twitter within R, as well as how to install the R twitteR package.

Time to get started....

First let's see how to connect to SQL Server from R
To load data from SQL Server into R, you first need to create an ODBC Data Source. I created one named LocalSqlServer; this DSN points to my local SQL Server 2016 instance.


After you have created your ODBC data source, you need to load the RODBC package, create a connection, execute a query, display the results and close the connection. Here is what it looks like in the R console



> require(RODBC)

> db <- odbcConnect("LocalSqlServer")

> Sample <- sqlQuery(db, "select * from sysobjects", stringsAsFactors=FALSE)

> head(Sample)
                        name          id xtype uid info status base_schema_ver
1 sp_MSalreadyhavegeneration -1073624922     P   4    0      0               0
2 sp_MSwritemergeperfcounter -1072815163     P   4    0      0               0
3           TABLE_PRIVILEGES -1072372588     V   3    0      4
4       sp_replsetsyncstatus -1071944761     X   4    0      0               0
5            sp_replshowcmds -1070913306     P   4    0      0               0
6               sp_publishdb -1070573756     P   4    0      0               0

> close(db)  # close RODBC connection

Now that we know how to connect to SQL Server, let's see if we can get the twitteR package to work. Head on over to my other post Using twitter from the R console to see how to install twitteR and how some queries are executed.


Assuming the twitteR package is installed, we need to load it to use it


library(twitteR)

The next command enables OAuth; to see how to set this up, go here: Using twitter from the R console

setup_twitter_oauth("API key", "API secret", "Access token", "Access secret")
[1] "Using direct authentication"

Now it is time to do a simple search

tweets <- searchTwitter('#rstats', n=6) 
tweets





Running that command should have given you some results. However, what we need as output is a data frame, so that we can easily load it into a table in SQL Server. You can use the twListToDF function for that


tweets <- searchTwitter('#rstats', n=6) 
twListToDF(tweets)


The output now looks different



1                             RT @rquintino: @Mairos_B #sqlsatportugal session: all about R in #SqlServer 2016 #rstats https://t.co/DHrqIZrz1e
2                                                      a quick script to use imgcat in #rstats https://t.co/fpUlgWNX33 https://t.co/AhCCMLewCH
3 RT @KirkDBorne: Useful packages (libraries) for Data Analysis in R: https://t.co/haRKopFyly #DataScience #Rstats by @analyticsvidhya https:…
4                      Hey #rstats tweeps, do you have any readings to recommend on sensitivity analysis? Books/articles/websites all welcome.
5 RT @KirkDBorne: A Complete Tutorial on Time Series Modeling in R: https://t.co/7oI6JKyU4E #MachineLearning #DataScience #Rstats by @Analyti…
6 RT @KirkDBorne: A Complete Tutorial on Time Series Modeling in R: https://t.co/7oI6JKyU4E #MachineLearning #DataScience #Rstats by @Analyti…
  favorited favoriteCount replyToSN             created truncated replyToSID
1     FALSE             0        NA 2016-02-20 20:29:54     FALSE         NA
2     FALSE             0        NA 2016-02-20 20:24:50     FALSE         NA
3     FALSE             0        NA 2016-02-20 20:16:25     FALSE         NA
4     FALSE             0        NA 2016-02-20 20:11:08     FALSE         NA
5     FALSE             0        NA 2016-02-20 20:11:06     FALSE         NA
6     FALSE             0        NA 2016-02-20 20:02:05     FALSE         NA
                  id replyToUID
1 701141750161784834         NA
2 701140474019577856         NA
3 701138356466483204         NA
4 701137026075140096         NA
5 701137018508722176         NA
6 701134750296227840         NA
         statusSource
1      Mobile Web (M5)
2     Tweetbot for Mac
3  Twitter for Android
4   Twitter Web Client
5   Twitter for iPhone
6   Twitter Web Client
     screenName retweetCount isRetweet retweeted longitude latitude
1      psousa75            3      TRUE     FALSE        NA       NA
2      millerdl            0     FALSE     FALSE        NA       NA
3   diana_nario           50      TRUE     FALSE        NA       NA
4    emjonaitis            0     FALSE     FALSE        NA       NA
5       caryden           41      TRUE     FALSE        NA       NA
6 ArkangelScrap           41      TRUE     FALSE        NA       NA



The output now has a lot more information; you can see if a tweet has been retweeted or favorited, as well as the latitude, longitude and more.

Time to send the result to SQL Server

There is a problem with how the sqlSave function creates the table: it makes the created column a float, and then all inserts fail.


Run the following DDL statement in your database
CREATE TABLE [dbo].[tweets](
	[rownames] [nvarchar](255) NULL,
	[text] [nvarchar](255) NULL,
	[favorited] [nvarchar](5) NULL,
	[favoriteCount] [float] NULL,
	[replyToSN] [nvarchar](5) NULL,
	[created] [datetime] NULL,
	[truncated] [nvarchar](5) NULL,
	[replyToSID] [nvarchar](5) NULL,
	[id] [nvarchar](255) NULL,
	[replyToUID] [nvarchar](5) NULL,
	[statusSource] [nvarchar](255) NULL,
	[screenName] [nvarchar](255) NULL,
	[retweetCount] [float] NULL,
	[isRetweet] [nvarchar](5) NULL,
	[retweeted] [nvarchar](5) NULL,
	[longitude] [nvarchar](255) NULL,
	[latitude] [nvarchar](255) NULL
) ON [PRIMARY]
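
As a possible alternative (an untested sketch), RODBC's sqlSave has a varTypes argument that lets you specify the SQL data type for individual columns, so you may be able to let sqlSave create the table itself and just force the created column to datetime:

# untested sketch: force the created column to datetime when sqlSave creates the table
sqlSave(db, twListToDF(tweets), tablename = "tweets",
        varTypes = c(created = "datetime"),
        rownames = TRUE, addPK = FALSE)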

Now we are ready to save our twitter data in SQL Server; here is how you do it.


> sqlSave(db,append = TRUE,colnames = FALSE,verbose = FALSE, addPK = FALSE, twListToDF(tweets))

Here is what the results look like in SSMS



Make sure that you have append = TRUE. If you set it to FALSE, the table will be recreated every time, but since sqlSave creates the wrong data type for the created column, you can't use that option here.


So to recap, here are the commands you need after you have loaded the twitteR package and set up OAuth


> tweets <- searchTwitter('#rstats', n=6) #store the twitter search results
> twListToDF(tweets)  #convert to dataframe

> require(RODBC)  # load RODBC package
> db <- odbcConnect("LocalSqlServer")  #connect
> sqlSave(db,append = TRUE,colnames = FALSE,verbose = FALSE, addPK = FALSE, twListToDF(tweets))  #insert data into your table
> close(db)  # close RODBC connection
 

Saturday, February 13, 2016

Quickly change m/d/yyyy dates to mm/dd/yyyy dates in Excel

Sometimes you get data in an Excel or csv file and it will be in an m/d/yyyy format.


Here is an example

1/1/2005
2/1/2012
2/12/2012
12/5/2015

This will look fine if you open the csv file in Excel but if you need to import this data into a database you might get an error that the format is not correct.

 What you want is the data to be in this format
01/01/2005
02/01/2012
02/12/2012
12/05/2015


If you use Format Cells and then pick Date, you will see that there is no mm/dd/yyyy format. What you have to do is pick the Custom category and click on the m/d/yyyy format



Now change m/d/yyyy to mm/dd/yyyy



Click OK and you are all set, you will see that the dates are now in mm/dd/yyyy format

Let me just mention that the problem was not the conversion in SQL Server itself; rather, the tool that did the import choked on the format

BTW SQL Server can do this fairly easily

If I run the following

;with dates (date)
as (select '1/1/2001' union all
 select '11/1/2001' union all
 select '1/12/2001' union all
 select '5/5/2001' union all
 select '1/22/2001' union all
 select '12/1/2001' union all
 select '01/01/2001' )

 select convert(date,date)
 from dates

I will get this as output

2001-01-01
2001-11-01
2001-01-12
2001-05-05
2001-01-22
2001-12-01
2001-01-01

That of course is just a date, if you want mm/dd/yyyy format, you can use convert and style 101...

;with dates (date)
as (select '1/1/2001' union all
 select '11/1/2001' union all
 select '1/12/2001' union all
 select '5/5/2001' union all
 select '1/22/2001' union all
 select '12/1/2001' union all
 select '01/01/2001' )

 select convert(varchar(10),convert(date,date),101)
 from dates

 select convert(date,date)
 from dates

And here is that output

01/01/2001
11/01/2001
01/12/2001
05/05/2001
01/22/2001
12/01/2001
01/01/2001
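
As a side note, on SQL Server 2012 and later you could also use the FORMAT function for this (a sketch; FORMAT is more flexible but noticeably slower than CONVERT on large result sets):

;with dates (date)
as (select '1/1/2001' union all
 select '11/1/2001' union all
 select '1/12/2001' )

 select format(convert(date,date),'MM/dd/yyyy')
 from dates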

Monday, January 25, 2016

The one change I make to each tSQLt database

Whenever I put the tSQLt database on a server, I make sure that I make one change to a stored procedure that ships with tSQLt. The reason for this is that we run our tSQLt tests as part of our automated build process. If any of the tests fail, the developers who are part of that particular project will get an email telling them what the results of the tests were.

The main proc that runs the unit tests always returns 0, but I need a way to know whether any tests failed. If we had failures, we want the email to go out; if there were no failures, we don't want the email to go out.

Here is how I modified the RunAll stored procedure


ALTER PROCEDURE [tSQLt].[RunAll]
AS
BEGIN
  DECLARE @TestResultFormatter NVARCHAR(MAX);
  SELECT @TestResultFormatter = tSQLt.GetTestResultFormatter();
  
  EXEC tSQLt.Private_RunAll @TestResultFormatter;

  -- Added to return -1 instead of the default 0
  IF EXISTS (SELECT * FROM tSQLt.TestResult WHERE result = 'Failure')
    RETURN -1

END;

I added the IF EXISTS check just before the END. It checks whether any tests failed; those tests will have a result of Failure. If any tests failed, the stored procedure will return -1. Now we can easily check this value to determine if we should send the email or not.
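
For completeness, here is a rough sketch of how a calling script could pick up that return value (the RAISERROR message is just an example):

DECLARE @rc int;
EXEC @rc = tSQLt.RunAll;

IF @rc <> 0
   RAISERROR('One or more tSQLt tests failed', 16, 1); -- or trigger the email from here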

I could have also checked for any tests whose result was not Success instead. A test can also have a result of Error; you will get that status if, for example, an object that the test references doesn't exist in the database.
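
That stricter check would look something like this (a sketch):

  -- fail on anything that is not Success (catches both Failure and Error)
  IF EXISTS (SELECT * FROM tSQLt.TestResult WHERE result <> 'Success')
    RETURN -1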

There you have it, this is the change I made. Maybe you had the same problem and this will help you as well.



Monday, January 18, 2016

#BackToBasics: Returning all rows from one table that do not exist in another table



In the A challenge for 2016....... accepted! post I said that I would write a post once every month that would explain very basic stuff about SQL Server. Today is the first post; in this post we will take a look at how you can return rows from one table which do not exist in another table.

First we need to create our two tables. These tables will be very simple; each will have only one column

CREATE TABLE MainData (ID int)
GO

INSERT MainData values(1)
INSERT MainData values(2)
INSERT MainData values(4)
INSERT MainData values(5)
INSERT MainData values(7)
GO


CREATE TABLE SomeData (ID2 int)
GO

INSERT SomeData values(1)
INSERT SomeData values(2)
INSERT SomeData values(3)
INSERT SomeData values(4)
INSERT SomeData values(6)
INSERT SomeData values(8)
GO

Now that our tables are set up it is time to look at the various queries.

NOT IN

This is one of the simpler ways; it is almost a direct translation from English: select all from this table which is not in the other table.
Here is what the query looks like

SELECT t2.* FROM SomeData t2
WHERE ID2 NOT IN(SELECT ID FROM MainData)

This is the output

ID2
--------
3
6
8

The output is correct: our MainData table has only the values 1, 2, 4, 5 and 7. The SomeData table has the values 3, 6 and 8, which do not exist in our MainData table. As you can see, the NOT IN query is very simple and will work most of the time; there are, however, two scenarios where using NOT IN could be problematic. We will look at these two scenarios at the end of this post.

NOT EXISTS

Using NOT EXISTS is very similar to NOT IN; one addition you have to make is a correlated WHERE clause. You are in essence specifying a join condition, but only returning the rows for which no match satisfies that condition. Here is what the query looks like

SELECT t2.* FROM SomeData t2
WHERE NOT EXISTS (SELECT ID FROM MainData t1 WHERE t1.ID = t2.ID2)

The results for this query are the same as the NOT IN query

LEFT JOIN

A LEFT JOIN query returns all the rows from the left table along with the matching data from the right table; it returns NULL values for the rows that don't have a match in the outer joined table. If you run the following query


SELECT * FROM SomeData t2
LEFT JOIN MainData t1 ON t1.ID = t2.ID2

Here is what the results look like

ID2 ID
----    -----
1 1
2 2
3 NULL
4 4
6 NULL
8 NULL


As you can see the rows with the values 3,6 and 8 in the 1st column have a NULL value in the 2nd column.

The LEFT JOIN query is more complex compared to NOT IN and NOT EXISTS. You also need to know that this query can return data from both tables, which is why we need to specify SELECT t2.*. I also specified t2.* in the other two queries, but that was not needed since NOT IN and NOT EXISTS only return data from one table. As in the NOT EXISTS query, you also specify a join condition in the LEFT JOIN query. Finally, in the WHERE clause you filter out the data that does exist by asking for all the rows where t1.ID is NULL
Here is what the query looks like and the results are the same as for the NOT IN and the NOT EXISTS queries

SELECT t2.* FROM SomeData t2
LEFT JOIN MainData t1 ON t1.ID = t2.ID2
WHERE t1.ID IS NULL


EXCEPT

Using EXCEPT is very easy; the query basically says return everything from one table EXCEPT what is returned by the bottom query. Here is what the query looks like

SELECT t2.* FROM SomeData t2
EXCEPT
SELECT ID FROM MainData

This query will return the same data as the other queries. Using EXCEPT is pretty simple and straightforward, but I have to warn you: EXCEPT does a sort (it removes duplicates) and was the worst performing query of all the ones mentioned here. I don't recommend using EXCEPT for big tables


Some problems you might encounter when using NOT IN 


Take a look at the query below. Do you see anything wrong? Run the following query

SELECT t2.* FROM SomeData t2
WHERE ID2 NOT IN(SELECT ID2 FROM MainData)

You don't get any errors, but the query returns nothing. The problem is that the MainData table does not contain a column named ID2; its column is named ID. Since the ID2 column does exist in the outer SomeData table, SQL Server resolves it there and does not throw an error. If you use LEFT JOIN or NOT EXISTS, you cannot make this mistake. Run the following two queries to see what happens


SELECT t2.* FROM SomeData t2
WHERE NOT EXISTS (SELECT ID FROM MainData t1 WHERE t1.ID2 = t2.ID2)

SELECT t2.* FROM SomeData t2
LEFT JOIN MainData t1 ON t1.ID2 = t2.ID2
WHERE t1.ID IS NULL

Here is the output

Msg 207, Level 16, State 1, Line 75
Invalid column name 'ID2'.
Msg 207, Level 16, State 1, Line 79
Invalid column name 'ID2'.

As you can see you got an error.

NULL values will also cause a problem when using NOT IN. Add the following NULL value to the MainData table

INSERT MainData values(NULL)

If you run all four queries again, you will notice something

SELECT t2.ID2 as 'Not in' FROM SomeData t2
WHERE ID2 NOT IN(SELECT ID FROM MainData)

SELECT t2.ID2 as 'Not Exists' FROM SomeData t2
WHERE NOT EXISTS (SELECT ID FROM MainData t1 WHERE t1.ID = t2.ID2)

SELECT t2.ID2 as 'Left Join' FROM SomeData t2
LEFT JOIN MainData t1 ON t1.ID = t2.ID2
WHERE t1.ID IS NULL

SELECT t2.ID2 as 'Except' FROM SomeData t2
EXCEPT
SELECT ID FROM MainData

Here are the results

Not in
-----------

Not Exists
-----------
3
6
8

Left Join
-----------
3
6
8

Except
-----------
3
6
8

Do you see what happened? The NOT IN query is not returning anything: once the subquery contains a NULL, ID2 NOT IN (...) can never evaluate to true, because a NULL value is not equal (or unequal) to anything, not even to another NULL value.
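
If you do want to keep using NOT IN, one way to guard against this (a sketch) is to filter the NULLs out of the subquery yourself:

SELECT t2.* FROM SomeData t2
WHERE ID2 NOT IN(SELECT ID FROM MainData WHERE ID IS NOT NULL)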

So that's it for this post. I showed you four ways to return values from one table which do not exist in another table. I also showed you why NOT IN might cause some problems. In general I like to use EXISTS and NOT EXISTS, unless I need data from both tables, in which case I will use a JOIN.

Till next month. If you want me to cover a topic leave me a comment.

Sunday, January 03, 2016

A challenge for 2016....... accepted!

I was reading the A challenge for 2016 post by Aaron Bertrand, and in that post Aaron mentioned that Tim Ford challenged the SQL community to publish one introductory post per month. The tweet in question is below


I am not an MVP anymore, but I decided to still jump on the bandwagon as well. I noticed that with some of the new hires we had, some of the stuff I take for granted is sometimes not very clear/obvious to them.

Here are some of the things I will blog about

Selecting all the rows from one table that don't exist in another table
How to make SSMS rock so that you are more productive
Get the max/first/latest/oldest row based on some criteria

I will think of more as time passes. The summer, when we get our interns in the office, will also be a good time to get some additional ideas.


Saturday, October 31, 2015

TempDB Database changes in SQL Server 2016

According to Books Online, there are several changes in SQL Server 2016 regarding TempDB. Here is what Books Online has on the topic

There are several enhancements to TempDB:
  • Trace Flags 1117 and 1118 are not required for tempdb anymore. If there are multiple tempdb database files all files will grow at the same time depending on growth settings. In addition, all allocations in tempdb will use uniform extents.
  • By default, setup adds as many tempdb files as the CPU count or 8, whichever is lower.
  • During setup, you can configure the number of tempdb database files, initial size, autogrowth and directory placement using the new UI input control on the Database Engine Configuration - TempDB section of SQL Server Installation Wizard.
  • The default initial size is 8MB and the default autogrowth is 64MB.
  • You can specify multiple volumes for tempdb database files. If multiple directories are specified tempdb data files will be spread across the directories in a round-robin fashion.

I decided to check this out, so I downloaded and installed SQL Server 2016 Community Technology Preview 3.0. During the install I did not make any changes to TempDB.

Once the install was done, I executed the following to see how many files TempDB had on my laptop

exec sp_helpdb 'tempdb'

Here is what I saw in the output

name fileid filename filegroup size maxsize growth usage
tempdev 1 C:\DATA\tempdb.mdf PRIMARY 8192 KB Unlimited 65536 KB data only
templog 2 C:\DATA\templog.ldf NULL 8192 KB Unlimited 65536 KB log only
temp2 3 C:\DATA\tempdb_mssql_2.ndf PRIMARY 8192 KB Unlimited 65536 KB data only
temp3 4 C:\DATA\tempdb_mssql_3.ndf PRIMARY 8192 KB Unlimited 65536 KB data only
temp4 5 C:\DATA\tempdb_mssql_4.ndf PRIMARY 8192 KB Unlimited 65536 KB data only


As you can see, there are 4 data files; when I looked at my SQL Server 2014 instance, which was installed with defaults, there is only 1 data file. Another difference is that SQL Server 2014 uses 10% growth as the default, while SQL Server 2016 uses a fixed 65536 KB (64 MB).
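
If you want to double-check the file layout and growth settings yourself, a query like this against sys.master_files should show the same information (a sketch):

SELECT name, physical_name,
       size * 8 / 1024 AS size_mb,
       CASE WHEN is_percent_growth = 1
            THEN CAST(growth AS varchar(10)) + '%'
            ELSE CAST(growth * 8 / 1024 AS varchar(10)) + ' MB'
       END AS growth
FROM sys.master_files
WHERE database_id = DB_ID('tempdb')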

This is a good change for all the casual installs by people who just want to experiment and learn; however, I would think most shops would already have changed TempDB to have more than 1 file. But then again, maybe Microsoft noticed that many of the support calls they get involve instances with only 1 data file for TempDB, and thus they decided to implement this change.


Thursday, October 29, 2015

SQL Server 2016 Community Technology Preview 3.0 is now available for download

Public service announcement!!!!!

SQL Server 2016 Community Technology Preview 3.0 is now available for download. The download is 2.6 GB, you can get it here SQL Server 2016 Community Technology Preview 3.0


I am already downloading mine.... what are you waiting for?


The stuff I am most excited about in this CTP is the SQL Server R Services in SQL Server 2016....


To accompany CTP3, the SQL Server team also created an updated version of AdventureWorks, called AdventureWorks2016CTP3, and they have created a number of samples to showcase some of the new functionality in SQL2016. They recommend you download and install the sample databases and play with the sample scripts. Download link: AdventureWorks Sample Databases and Scripts for SQL Server 2016 CTP3

SQL Server 2016 Community Technology Preview 3.0 will be available this week

SQL Server 2016 Community Technology Preview 3.0 will be available this week.

Some of the new and exciting stuff......


Advanced Analytics (RRE integration)

With this release, we are very excited to announce the public availability SQL Server R Services in SQL Server 2016, an Advanced Analytics capability which supports enterprise-scale data science, significantly reducing the friction for adopting machine learning in your business. SQL Server R Services is all about helping customers embrace the highly popular open source R language in their business. R is the most popular programming language for Advanced Analytics. You can use it to analyze data, uncover patterns and trends and build predictive models. It offers an incredibly rich set of packages and a vibrant and fast-growing developer community. At the same time, embracing R in an enterprise setting presents certain challenges, especially as the volume of data rises and with the switch from modeling to production environments. Microsoft SQL Server R Services with in-database analytics helps customers embrace this technology by supporting several scenarios. 
Transactional replication from SQL Server to Azure SQL DB
This is new in CTP3. Now you can set up Azure SQL DB as a subscriber of transactional replication, allowing you to migrate data from a SQL Server instance on-premises or in IaaS to Azure SQL database without downtime. The replication is one way in this release, and works with SQL Server 2016, SQL Server 2014 and SQL Server 2012. This is the same Transactional Replication technology you have been using for many years on premise. As you configure a subscriber (from SSMS or by script), instead of entering an instance name, you enter the name of your Azure SQL DB subscription along with the associated login and password. A snapshot (as in a Replication Snapshot) will be used to initialize the subscription and subsequent data changes will be replicated to your Azure SQL DB in the same transactionally consistent way you are used to. A transactional publication can deliver changes to subscribers both in Azure SQL DB and/or on premise/Azure VM. There is no Replication service hosted in Azure for this. Everything is driven from on-premise distribution agents. To use this feature, you just need to set it up the way you do to replicate on-premises: Install the Replication components, configure the Distributor, the Publisher and create the Publication, the Articles and the Subscriptions. In this case, one of the subscriptions will be your Azure SQL DB.
In-Memory improvements in this release:
  • In-Memory OLTP
    • FOREIGN KEY constraints
    • CHECK constraints
    • UNIQUE constraints
    • DML triggers (AFTER only)
    • EXECUTE AS CALLER
    • Inline table-valued functions
    • Security built-ins and increased math function support
  • Real-time Operational Analytics
    • Support for in-memory tables
    • Existing nonclustered columnstore index (NCCI) are updateable without requiring index rebuild
    • Parallel Index build of nonclustered columnstore index (NCCI)
    • Performance improvements (INSERT, String pushdown, bypassing delete buffer when processing deleted rows)
  • In-Memory Analytics
    • You can upgrade databases with nonclustered columnstore index and have it updateable without requiring rebuild of the index
    • General performance improvements for analytics queries with columnstore index especially involving aggregates and string predicates
    • Improved supportability with DMVs and XEvents
Stretch Database updates in this release:
  • Engine Update
    • Create/Drop index support
    • AlwaysEncrypted support
    • Improved credential management for remote Stretch database stores
    • Improved performance for joins between stretched tables
    • New external data source integration
  • SSMS Wizard updates
    • Database and Table level fly out menu options were updated to reflect new Stretch functionality
    • Stretch monitor functionality added to allow users to monitor current migration status, including the ability to pause the migration at the table level
    • XEVENT support for diagnostics session support in monitor
    • Updated and simplified stretch wizard flow to reduce the amount of steps required to enable or reconfigure Stretch
    • Help icon links Updated to point to new MSDN content focusing specifically on wizard topic
    • Added functionality that allows users to pause or disable migration at the table level
    • Added ability to Stretch individual tables
    • Added database scoped credential support - for AlwaysOn
    • Ability to enabling stretch on the server using the wizard
    • Updated table level validation error/warning messaging
    • The ability to Stretch to new SQL Azure or existing SQL Azure server
    • Updated SSMS Object Explorer Stretch Databases icons
    • SMO model for Stretch status query and updates
Polybase in CTP3 includes the following new capabilities:
  • Improved PolyBase query performance with scale-out computation on external data (PolyBase scale-out groups)
  • Improved PolyBase query performance with faster data movement from HDFS to SQL Server and between PolyBase Engine and SQL Server
  • Support for exporting data to external data source via INSERT INTO EXTERNAL TABLE SELECT FROM TABLE
  • Support for push-down computation to Hadoop for string operations (compare, LIKE)
  • Support for ALTER EXTERNAL DATA SOURCE statement
Built-in JSON support improvements in this release include:
  • OPENJSON - Table value function that parses JSON text and returns rowset view of JSON. By default, OPENJSON returns properties of object or elements of array that is parsed. Advanced version of OPENJSON function with defined schema allows user to define schema of resulting rowset, and mapping rules that define where can be found values in the parsed JSON text that will be returned in the resulting rowset. It enables developers to easily parse JSON text and import it into relational tables.
  • JSON_VALUE - Scalar function that returns a value from JSON on the specified path. It can be used in any query, view, computed column. It can be also used to define indexes on properties of JSON text stored in table columns.
  • ISJSON - function that validates that JSON is properly formatted. It can be used to define check constraints on the columns that contain JSON text. It is not supported in check constraints defined on in-memory tables.
  • JSON_QUERY - Scalar function that returns a fragment from the JSON text. Unlike JSON_VALUE that returns scalar values, JSON_QUERY returns complex object (i.e. JSON arrays and objects).
Temporal support improvements in this release include:
  • Support for using temporal system-versioning with In-Memory OLTP
    • Combining disk-based table for cost-effective storing of history data with memory-optimized tables for storing latest (actual) data
    • Super-fast DML and current data querying supported from natively compiled code
    • Temporal querying supported from interop mode
    • Internal in-memory table created to minimally impact performance of DML operations
    • Background process that flushes the data from internal in-memory to permanent disk-based history table
  • Direct ALTER for system-versioned temporal tables enables modifying table schema without introducing maintenance window
    • Support for adding/altering/dropping columns while SYSTEM_VERSIONING is ON
    • Support for ADD/DROP HIDDEN for period columns while SYSTEM_VERSIONING is ON
  • Support for temporal querying clause FOR SYSTEM_TIME ALL that enables users to query entire data history easily without specifying period boundaries
  • Optimized CONTAINED IN implementation with minimized locking on current table. If your main case is analysis on historical data only, use CONTAINED IN.
Query Store improvements in this release include:
  • Performance monitoring supported for natively compiled code from In-Memory OLTP workloads:
    • Collecting queries, plans and compile time statistics enabled for natively compiled queries when Query Store is ON
    • Stored plan is semantically equivalent to one that is produced when SET SHOWPLAN_XML is set to ON with one difference: plans in Query Store are always split and stored per individual statement
    • Runtime statistics collection is controlled with sys.sp_xtp_control_query_exec_stats (not enabled by default)
    • is_natively_compiled field added to sys.query_store_plan to help finding queries  generated by the native code compilation
    • Plan forcing for queries from natively compiled modules is available and forced plans are honored during module recompilation. As for disk-based workloads, Query Store does not guarantee success of plan forcing operation as some plan shapes cannot be forced
    • Memory grants metrics within sys.query_store_runtime_stats are not populated for natively compiled queries – their values are always 0
  • Improving implementation of time-based cleanup (configured with STALE_QUERY_THRESHOLD_DAYS) to run in multiple transactions, holding database lock for a shorter period of time and thus minimize impact on customer workload
  • Hadoop Connector: SSIS Hadoop connector allows customer to copy data to/from HDFS and trigger Hive/Pig job on Hadoop cluster. This brings in following components: Hadoop Connection Manager, H
SQL Server Integration Services (SSIS) improvements in this release include:
  • SSIS control flow template enables customers to save a commonly used control flow task or container to a standalone template file and reuse it multiple times in a package or multiple packages in a project. This reusability introduced by template makes SSIS packages easier to design and maintain.
  • Added Azure blob source support for the Import/Export wizard; user can use Azure blob source as source or destination during the transformation.
  • Relaxed Max Buffer Size of Data Flow Task. The max Default Buffer Size of Data Flow Task is relaxed to 2G-1 from 100M. A new attribute 'AutoAdjustBufferSize' is added to Data Flow Task, which can be set in SSDT. If it is set to true, the Default Buffer Size will be set automatically in runtime according to Default Buffer Max Rows.
SQL Server Analysis Services (SSAS) improvements in this release include the following; please visit the SSAS team blog to learn more.
  • DBCC support
  • The Microsoft.AnalysisServices library has been re-factored to include a second namespace, Microsoft.AnalysisServices.Core. The new namespace separates out common classes like Server, Database, and Role that have broad application in Analysis Services, irrespective of server mode.
  • SSMS and SSDT updates for Tabular
SQL Server Reporting Services (SSRS) improvements in this release include the following; please visit the SSRS team blog to learn more.
  • Pin Reporting Services report items - Including charts, gauge panels, maps, and images - to Power BI dashboards. Dashboard tiles always show up-to-date data thanks to scheduled refresh. Click a dashboard tile to drill through to the complete Reporting Services report.
  • Design reports using Visual Studio 2015 with an updated version of SQL Server Data Tools.
  • Uses .NET Framework 4.x code in report expressions, report code, referenced assemblies, and extensions for report security, data processing, rendering, or delivery.
Master Data Services (MDS) improvements in this release include:
  • Entity Change Approve Flow. Admin can mark an entity requiring approvals for changes. Examples include:
    • Admin marks an entity requiring approval for changes in the entity administrator page
    • User needs to save pending change to change set and submit to admin for approval
    • Admin approves or rejects the pending changes
    • The approved pending changes will be committed to the master data services
  • Domain Based Attribute Constrained List. For domain-based attributes, optionally, user can select a parent attribute whose value will constrain the allowed values for this attribute. Examples include:
    • Model has State, City, Account entity
    • Account has a City DBA to City entity and State DBA to State entity
    • City has a State DBA to State entity and a derived hierarchy from State to City
    • A constrain can be added on Account; City attribute which parent is Account.State. so the City attribute dropdown list is constrained by State value
Query memory grant enhancement improvement in CTP3:
  • Updated sort_warning and hash_warning XEvent to include spill IO stats
  • New hash_spill_details XEvent for detailed spill information
  • Statistics XML and SSMS updated to include spill warning
  • sys.dm_exec_query_stats DMV: updated with new memory grant and parallelism info
  • sys.dm_exec_requests DMV: updated with new memory grant and parallelism info
  • sys.dm_exec_query_parallel_workers: new DMV to show worker threads status for parallel queries
  • sys.dm_exec_query_optimizer_memory_gateways: new DMV to show query optimizer compile gates




To see when SQL Server 2016 Community Technology Preview 3.0 will be available, visit this page: http://www.microsoft.com/en-us/server-cloud/products/sql-server-2016/default.aspx