SQL Server Code,Tips and Tricks, Performance Tuning

Thursday, May 19, 2016

Living in the NYC tri-state area? Attend the SQL 2016 Launch Event in NYC on June 2nd

Got this info from my co-worker Daryl earlier today.

On June 2nd Microsoft will hold a SQL 2016 Launch Event in New York City. The Microsoft Offices at 11 Times Square (at 8th and 42nd) in New York is where this event will be held.

WHEN
Thursday, June 2, 2016 from 8:00 AM to 2:00 PM (EDT)

WHERE
Microsoft Offices - 11 Times Square, New York, NY 10036

I might be there but I don't want to promise anything. You can sign up here for this event: https://www.eventbrite.com/e/sql-2016-launch-event-nyc-ny-tickets-25477835893

Wednesday, May 11, 2016

How to disable tests in tSQLt

The tSQLt unit test framework that we use to do our DB tests does not allow for disabled tests, this is because each test is just a stored proc. Right now the only way is to either rename the proc so it doesn’t start with test or to remove the test

I was thinking about this in the shower this morning….here is what I came up with

Here is a quick way how to prevent a test from running when the tSQLt tests run.
Put the code below into each unit test that you want disabled, you should put this code right after the CREATE PROCEDURE ProcName AS part

This code will pick the correct name and schema, so you can use the same exact code in each test

DECLARE @IsTestDisabled bit = 1 -- set this to 1 if you don't want the test to run

IF @IsTestDisabled =1
BEGIN
 DECLARE @SchemaName varchar(1000)
 SELECT @SchemaName = QUOTENAME(SCHEMA_NAME(schema_id)) +'.'
 FROM sys.procedures WHERE object_id = @@procid

 PRINT 'Disabled Test: ' + @SchemaName +  QUOTENAME(OBJECT_NAME(@@procid))
 RETURN
END

So if we grab the example test from the tSQLt website

We would change it from this

CREATE PROCEDURE testFinancialApp.[test that ConvertCurrency converts using given conversion rate]
AS


BEGIN
    DECLARE @actual MONEY;
    DECLARE @rate DECIMAL(10,4); SET @rate = 1.2;
    DECLARE @amount MONEY; SET @amount = 2.00;

    SELECT @actual = FinancialApp.ConvertCurrency(@rate, @amount);

    DECLARE @expected MONEY; SET @expected = 2.4;   --(rate * amount)
    EXEC tSQLt.AssertEquals @expected, @actual;

END;
GO

To this

CREATE PROCEDURE testFinancialApp.[test that ConvertCurrency converts using given conversion rate]
AS

DECLARE @IsTestDisabled bit = 1 -- set this to 1 if you don't want the test to run

IF @IsTestDisabled =1
BEGIN
 DECLARE @SchemaName varchar(1000)
 SELECT @SchemaName = QUOTENAME(SCHEMA_NAME(schema_id)) +'.'
 FROM sys.procedures WHERE object_id = @@procid

 PRINT 'Disabled Test: ' + @SchemaName +  QUOTENAME(OBJECT_NAME(@@procid))
 RETURN
END


BEGIN
    DECLARE @actual MONEY;
    DECLARE @rate DECIMAL(10,4); SET @rate = 1.2;
    DECLARE @amount MONEY; SET @amount = 2.00;

    SELECT @actual = FinancialApp.ConvertCurrency(@rate, @amount);

    DECLARE @expected MONEY; SET @expected = 2.4;   --(rate * amount)
    EXEC tSQLt.AssertEquals @expected, @actual;

END;
GO

Now you will get the following in your tSQLt test run output

Disabled Test: [testFinancialApp].[test that ConvertCurrency converts using given conversion rate]

This will be right above this text

+----------------------+
|Test Execution Summary|
+----------------------+

I want this printed so we know we have disabled tests when we get a email whenever Jenkins kicks off these tests…..

To enable the test again, all you have to do is change the value from 1 to 0 here. So instead of

DECLARE @IsTestDisabled bit = 1

You would make it

DECLARE @IsTestDisabled bit = 0

To see all my tSQLt posts, visit this link: http://sqlservercode.blogspot.com/search/label/tSQLt

Monday, May 02, 2016

SQL Server 2016 will be generally available on June 1, 2016

Mark your calendars, SQL Server will be generally available less than a month from now, June 1st 2016.

You don't have a reason not to play around with SQL Server 2016, in March it was announced that the SQL Server 2016 Developer edition will also be available as a free download.

Also today, Lenovo published a new #1 TPC-H 30 TB world record using SQL Server 2016 and Windows Server 2016 on Lenovo System x2950 X6

To find out about all the difference between the various editions, download this pdf: SQL Server 2016 editions –what’s new

For the official announcement, see here: Get ready, SQL Server 2016 coming on June 1st

Thursday, March 31, 2016

Microsoft releases SQL Server Horizontica, a Vertica killer

Amsterdam, Noord Holland, Nederlands, European Union. — April 1, 2016 — Microsoft Corp. on Friday morning announced that it has released the next version of its flagship Relational (and beyond) Database Management System SQL Server Horizontica.

Here is a quick rundown of all the new, exciting and shiny things......

Goes to eleven.
What we have done here is, if you select this option, we will offload very CPU intensive operations to the GPU. Think about it, GPUs are many times faster than CPUs… hookup a bunch of graphic card to you server and your CPU utilization goes down by 99%. Here is what an external GPU rig look like from one of our biggest beta testers

Mr Stein who is a member of that team will explain what he likes the most about Goes To Eleven.

Frank E. Stein

The best part about Goes To Eleven is that if the GPU used capacity is below 50%, SQL Server will automatically use the excess capacity to start bitcoin mining. This in turn will lower the cost of running SQL Server, it really is a no brainer. When you need capacity it is there, when you don't need it, you earn money. Being a former Oracle customer where the licensing model is "How much money do you have?" this is really welcome.

Some more new goodies

BSON
Every vendor has JSON by this point, we took it s step further and added Binary JSON also known as BSON support

MINDOP
Minimum degree of parallelism. Brand new in this version, you can now specify the minimum number of CPUs/GPUs that a query should use. If you know your query is going to return a lot of data...instead of adding indexes...think big and use a minimum of 128 CPUs/GPUs for your query

Rowstore
When we looked at Vertica and Sybase IQ customers we noticed a lot of queries executed were using SELECT *, this is very slow with a columnar data storage. With our patent pending rowstore, it is as fast as the traditional page store. The ROI is immense, no more do programmers and brogrammers have to find all the code where a table is used to add the additional column, SELECT * is always correct. Rowstore is also more natural, when getting the data back people read from left to right or right to left, people don't read from top to bottom, since the data is already stored in rows...no transform needs to happen, this also makes data display faster.

Variable page sizes
Instead of just 8K page sizes, SQL Server now also offers the following page sizes: 2K, 4K, 8K, 16K, 32K, 64K and 640K. The reason we have 640K is because that is the max page size anyone will ever need, you probably will not use it now, but you might in 10 years, we are already prepared for that

Smallishint
Smallishint is a 3 byte integer data type, bigger than tinyint and smaller than smallint. You can store values up to 4096

Myriaton
We took the concepts of Hekaton and got another 100x improvement

One of our customers will explaining how they are using it and how it has improved their applications

Dr A. Cula

I am Dr A. Cula and I work for Ichor, a company specializing in eternal life. We are running a big blood bank, we need to keep this stocked for our mmm inner circle members. We use tagging of subjects so that we can go and mmm get more blood when needed. We were always running into problems when summer came along..our mmm dispatchers and collectors don't like to work when it is sunny. So we needed really good performance during summer hours. After switching SQL Server Horizontica and using Myriaton we now are so efficient that we opened up 4 more blood banks. We could never have envisioned something like this was possible, because of this we also expanded our inner circle by 200%

Universal Stretch and Universal Availability Groups
This is really exciting, instead of a regular stretch database on Azure, Universal Stretch will copy your data on one of our data centers located in the universe. This year it is just on the Moon where we have a data center already, in 2 years you can also use the one we are currently sending to Mars. Then when we start sending people to Mars, you are already open for business...the data is already there and it will be synchronized between Earth, Moon and Mars. Universal Availability Groups work the same way, you plug in the planet or moon and we take care of the rest.

ChandarUma ChandranJaya (also known as CU) program manager on the SQL Server programmability team will fill us in on some additional new features not mentioned yet.

New JOINS

ANTI JOIN
This is a brand new JOIN and it is ANSI SQL 2022 compliant. It is basically a JOIN between 2 tables on a column where the values are not the same (Think like NOT EXISTS)

SELECT *
FROM Table1 t1
ANTI JOIN Table2 t2
ON t1.ID = T2

CUBE JOIN
The CUBE JOIN takes the CROSS JOIN to whole other level...instead of returning all rows in one table for each row in the other tables.... the CUBE JOIN returns all values in all columns in all rows in one table for each value in all columns in all rows in another table... it is like a hypercube...

SELECT * 
FROM Table1 t1
CUBE JOIN Table2 t2

NATURAL JOIN
The NATURAL JOIN joins two tables on columns with the same name

SELECT * 
FROM Table1 t1
NATURAL JOIN Table2 t2

KEY JOIN
The KEY JOIN is really cool, what it does it joins tables based on primary and foreign keys, no need to specify these in the join condition anymore

SELECT * 
FROM Table1 t1
KEY JOIN Table2 t2

New Traceflags
Traceflag 666
This is a brand new traceflag, it will give your SQL Server instance hellish performance. It acts like a Double Distributed Denial Of Service (DDDOS) attack against you SQL Server instance...... IO, Memory.... everything will be completely saturated... however you can still use the DAC (Dedicated Administrative Connection) to get in and run queries against DMVs to check the Subsystem poerformance

New functions
We have implemented the following new functions
IsDaylightSavingsTime
IsNotWeekend
Trim
sp_MsForEachServer
DBCC CheckCrappyCode
DBCC
Quartile
DCOUNT
DAYS360
IsOdd
IsNontext
Gigo
Lifo
Fifo
Fido

Specialized indexes
We now have the following new indexes

DateIndex designed to work with all the date data types

FunctionIndex, finally you can index for example the first 5 characters of a column only, you can use all the functions available in SQL Server

Swift
Swift is a brand new storage engine, you can use this when you don't need constraints, keys..after all real developers manage those in the app tier. You enable this per database and you can make the same face as Taylor when thinking how you don't have to deal with all those nasty constrains that just get in the way. The beauty of Swift is that if you do have constraints in your DDL scripts, they will be just silently ignored. No more scripts that have to run in a specific order...it always runs fine..., your manager will love, no adore you..

There are many more things that are new, you can visit the SQL Server homepage to find out even more exciting stuff

Monday, March 28, 2016

The most interesting Build sessions for people working with data

Microsoft's Build conference is happening this week, the schedule for all the sessions was just made available. Below are all the sessions that will be most interesting to people who work with data. Clicking on the session's title will bring you to the Channel 9 page that will have the content. These session recording are of course not available yet, you will need to wait till after the conference is done to see them.

Intelligent Data Driven Applications that Learn and Adapt
Shawn Bice, Pablo Castro
Azure, Data, Intelligent App, Machine Learning
200 - Intermediate

Applications show intelligence when they can spot trends, react to unusual events, predict outcomes or recommend choices. Learn how to introduce intelligence traits into your apps including; establishing feedback loops, applying big data and machine learning techniques to classify, predict or otherwise analyze explicit and implicit signals, and operationalizing the full stack into the regular usage flow of the app. Most every day apps, from consumer to enterprise can deliver greater customer or business benefit by learning from user behavior and other signals. In this session we’ll take a pragmatic look at introducing real, useful data-driven intelligence into apps by walking through services, code and data needed to make it happen.

SQL Database Technologies for Developers
Lindsey Allen, Tony Petrossian
Azure, Data, Intelligent App
200 - Intermediate

Microsoft offers SQL Server and Azure SQL Database to help you develop great relational database applications. In this session you will learn about the top developer features coming in SQL Server 2016 which are already in Azure SQL Database. Additionally, you will see the latest investments in Azure SQL Database that enable you to easily manage thousands of databases and get the performance and security insight needed to build robust and secure applications in the cloud.

Adding Power BI Data Experiences to Your Applications
Josh Caplan, Lukasz Pawlowski
Analytics, Data
300 - Experienced

Bring stunning and interactive Power BI data experience to your applications. Learn how to integrate to Power BI using the Power BI REST APIs and how to build your own custom visualizations.

Delivering Applications at Scale with DocumentDB, Azure's NoSQL Document Database
John Macintyre, Dharma Shukla
Azure, Data, Intelligent App
200 - Intermediate

What does a #1 app on the iTunes App Store and an installment of one of the most successful gaming franchises in history have in common? They both use Azure DocumentDB to ingest massive volumes of data and serve low latency queries to provide great user experiences. Come learn about the business goals and technical challenges faced by two real-world, immensely popular applications and why they chose to use Azure DocumentDB, as well as the application patterns they used to achieve their massive scale requirements.

Azure IoT: Complete Cloud Offerings for the IoT Revolution
Sam George, Cameron Skinner
Azure, Data, Internet of Things
200 - Intermediate

IoT is the next revolution in computing and an incredibly exciting space. A critical part of IoT is cloud based solutions that enable you to connect, secure and manage IoT devices as well as providing deep insights from IoT data. Azure is all-in on IoT, this session will cover our industry leading offerings. Come be part of the next revolution!

Data Science for Developers
Danielle Dean, Daniel Grecoe
Analytics, Data
200 - Intermediate

In this session you will learn about Azure Machine Learning through a comprehensive end-to-end example that he builds during the session and that encompasses: •Problem detection •Algorithm selection •Machine learning model creation and deployment as a RESTful web service •Consumption of the machine learning model The session is intended for engineers, and Dan himself is an engineer, so he does not delve into a deep understanding of complex mathematical models behind machine learning, but instead focuses on the concepts of machine learning to demystify cloud machine learning.

Azure Data Lake and Azure Data Warehouse: Applying Modern Practices to Your App
Tim Mallalieu, Lara Rubbelke
Azure, Data
200 - Intermediate

Azure Data Warehouse and Azure Data Lake are two new services designed to work with all of your data no matter how big or complex. With two strong options to store, process and analyze large volumes of data, you may be curious about which service is right for your application needs. This session will drill into the strengths of each service and walk through the common application patterns for integration of large scale data analysis in the cloud using these services.

Advanced Analytics with R and SQL
Nagesh Pabbisetty, Tobias Ternstrom
Analytics, Data, Intelligent App
200 - Intermediate

R is the lingua franca of Analytics. SQL is the world’s most popular database language. What magic can you make happen by combining the power of R and SQL for Data Science and Advanced Analytics? Imagine the power of exploring, transforming, modeling, and scoring data at scale from the comfort of your favorite R environment. Now, imagine operationalizing the models you create directly in SQL Server, allowing your applications to use them from T-SQL, executed right where your data resides. Come learn how to build and deploy intelligent applications that combine the power of R, SQL Server, thousands of open source R extension packages, and high-performance implementations of the most popular machine learning algorithms at scale.

Insight from Exhaust, Enriching Your Applications with Data
Matthew Winkler
Azure, Analytics, Data, Analytics/Telemetry, .NET, C#, Machine Learning
200 - Intermediate

Modern cloud applications provide new opportunities every day for greater insights into our customers, code and quality using the signals in the application’s exhaust. Using the data produced by users and applications, developers can easily improve the experience for users and increase their engagement within the apps they are using. Starting with a typical application deployed in Azure, this talk will walk through the steps required to build a more intelligent application, customized for each user to provide a more engaging experience. We’ll add more engaging recommendations tailored to a user, more relevant in-application search results, and gain deeper understanding the application’s users. Additionally, we’ll cover how to process this telemetry as it is created to enable us to react in real time to changes in the application. No knowledge of machine learning, data science, or big data is required, by the end you’ll learn to use all three to create a richer application.

Tuesday, March 08, 2016

How to get SQL Server on Linux?

In order to get SQL Server on Linux you need to go to the following page: https://www.microsoft.com/en-us/server-cloud/sql-server-on-linux.aspx. Click on the part where is says Sign up to stay informed and to apply for the preview. It will bring you to a signup page that looks like this

After you filled out all the information required you will have to wait for a reply. Be aware that currently the SQL Server on Linux preview is limited to Ubuntu or as a Docker image

Monday, March 07, 2016

SQL Server 2016 Release Candidate 0 now available for download

The SQL Server engineering team just announced the availability of SQL Server 2016 Release Candidate 0.

To learn more about the release, visit the SQL Server 2016 preview page. To experience the new, exciting features in SQL Server 2016 and the new rapid release model, download the preview and start evaluating the impact these new innovations can have for your business.

For more information on this release, see SQL Server 2016 Release Notes and What’s New in SQL Server 2016.

Some of the more interesting things added is that instead of a trace flag, you can now execute a

ALTER DATABASE SCOPED CONFIGURATION command.

It looks like this

ALTER DATABASE SCOPED CONFIGURATION

{

[FOR SECONDARY]

{ SET | CLEAR PROCEDURE_CACHE }

}[;]

< set_options > ::=

{ MAXDOP = { | PRIMARY}

| LEGACY_CARDINALITY_ESTIMATION = { ON | OFF | PRIMARY}

| PARAMETER_SNIFFING = { ON | OFF | PRIMARY}

| QUERY_OPTIMIZER_HOTFIXES = { ON | OFF | PRIMARY}

| ONLINE_OPERATION_DEFAULT = {ON | OFF} }

There is also finally a built in split string function.

It looks like this

SELECT * FROM STRING_SPLIT('Hi how are you doing?',' ')

The result is:

value

—–

how

are

you

doing?

Finally, there will be an event on March 10, 2016: SQL Server 2016 — Take the lead, watch the live stream

Get the RC 0 here: SQL Server 2016 Release Candidate 0

Friday, March 04, 2016

Gotcha with the LEAD analytical function and date datatype

The other day some code that used the LEAD analytical function stopped working suddenly. It started to throw an error.

Msg 206, Level 16, State 2, Line 1

Operand type clash: date is incompatible with int

The code wasn't touched in a long time... so what could be the issue. Let's reproduce it here

First create the following table

CREATE TABLE testRunningTotal(Id tinyint,SomeDate datetime, Charge decimal(20,2))

INSERT testRunningTotal
SELECT 1,'20120101',1000
UNION ALL
SELECT 1,'20120401',200
UNION ALL
SELECT 1,'20120501',300
UNION ALL
SELECT 1,'20120601',600
UNION ALL
SELECT 2,'20120101',100
UNION ALL
SELECT 2,'20130101',500
UNION ALL
SELECT 2,'20140101',-800
UNION ALL
SELECT 3,'20120101',100

Running this query to see what is in this table

SELECT * FROM testRunningTotal

Id SomeDate Charge

1 2012-01-01 00:00:00.000 1000.00

1 2012-04-01 00:00:00.000 200.00

1 2012-05-01 00:00:00.000 300.00

1 2012-06-01 00:00:00.000 600.00

2 2012-01-01 00:00:00.000 100.00

2 2013-01-01 00:00:00.000 500.00

2 2014-01-01 00:00:00.000 -800.00

3 2012-01-01 00:00:00.000 100.00

Now run our fancy query...

SELECT id, someDate as StartDate,
LEAD(SomeDate -1,1,'99991231') 
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM testRunningTotal
  ORDER BY id, SomeDate

Here are our results

id StartDate Enddate Charge RunningTotal

1 2012-01-01 00:00:00.000 2012-03-31 00:00:00.000 1000.00 1000.00

1 2012-04-01 00:00:00.000 2012-04-30 00:00:00.000 200.00 1200.00

1 2012-05-01 00:00:00.000 2012-05-31 00:00:00.000 300.00 1500.00

1 2012-06-01 00:00:00.000 9999-12-31 00:00:00.000 600.00 2100.00

2 2012-01-01 00:00:00.000 2012-12-31 00:00:00.000 100.00 100.00

2 2013-01-01 00:00:00.000 2013-12-31 00:00:00.000 500.00 600.00

2 2014-01-01 00:00:00.000 9999-12-31 00:00:00.000 -800.00 -200.00

3 2012-01-01 00:00:00.000 9999-12-31 00:00:00.000 100.00 100.00

To see what the query actually does, visit this post Easy running totals with windowing functions

Let's take a closer look at the LEAD analytical function, this is what Books On Line has on its page

LEAD ( scalar_expression [ ,offset ] , [ default ] )

OVER ( [ partition_by_clause ] order_by_clause )

offset

The number of rows forward from the current row from which to obtain a value. If not specified, the default is 1. offset can be a column, subquery, or other expression that evaluates to a positive integer or can be implicitly converted to bigint. offset cannot be a negative value or an analytic function.

Looks all cool to me..nothing that stands out that could cause a problem...

What will happen if someone changes the column from a datetime to a date? Maybe they looked at the data and saw that all the values had only dates and no time components... why use 8 bytes of storage when you can use 3 instead?

To find out exactly what will happen, run the following DDL statement....

ALTER TABLE testRunningTotal ALTER COLUMN SomeDate date

Now go run your query again

SELECT id, someDate as StartDate,
LEAD(SomeDate -1,1,'99991231') 
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM testRunningTotal
  ORDER BY id, SomeDate

Here is the error you get

Msg 206, Level 16, State 2, Line 1

Operand type clash: date is incompatible with int

The error is the same as if you would run the following

SELECT CONVERT(DATE,GETDATE() )-1

Instead what you have to do is use the DATEADD function to do arithmetic with dates

SELECT DATEADD(dd,-1,CONVERT(DATE,GETDATE()))

So to fix our query from before, we want to change the -1 to be DATEADD(dd,-1 instead, here is what our query looks like

 SELECT id, someDate as StartDate,
LEAD(dateadd(dd,-1,SomeDate),1,'99991231') 
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM testRunningTotal
  ORDER BY id, SomeDate

And here are our results, it's all good now

id StartDate Enddate Charge RunningTotal

1 2012-01-01 2012-03-31 1000.00 1000.00

1 2012-04-01 2012-04-30 200.00 1200.00

1 2012-05-01 2012-05-31 300.00 1500.00

1 2012-06-01 9999-12-31 600.00 2100.00

2 2012-01-01 2012-12-31 100.00 100.00

2 2013-01-01 2013-12-31 500.00 600.00

2 2014-01-01 9999-12-31 -800.00 -200.00

3 2012-01-01 9999-12-31 100.00 100.00

I still think that maybe the LEAD function should do this conversion internally since all we are doing is specifying the OFFSET...what is your opinion?

Thursday, March 03, 2016

Did you know I have a github repository for stuff I run over and over?

There are queries or commands that I run every now and then. I typically don’t have these stored on my computer; to find this code I usually find the blog post I made and copy paste it from there. This becomes a pain in the neck sometimes because I don’t quite remember the title of the post, so I might be spending a good amount of time looking for it. I decided to put some of the code that I run frequently up on github.

You will see code to find index fragmentation as well code to concatenate rows into columns. So there are two kinds of code here. Code that you can run as is and code that you will plug into a stored procedure after you make changes to the table names. You can find that github repository here https://github.com/SQLMenace/CommonTasks

Besides the Common Tasks repository, I also have a Today I Learned repository. A collection of concise write-ups on small things I learn day to day across a variety of languages and technologies. These are things that don't really warrant a full blog post.

You can find that project here https://github.com/SQLMenace/til

With the Today I Learned repository you will find that there are not many things yet. The reason is that I just started this repository, but since SQl Server 2016 is just around the corner, I will start playing around with SQL Server 2016 more and more, thus the Today I Learned repository will become bigger soon.

Wednesday, March 02, 2016

Easy running totals with windowing functions

Back in the pre SQL Server 2012 days, if you wanted to do a running count, you either had to run a subquery or you could use a variable. This was slow because for each row the query that did the sum would be executed. With the additions to the windowing functions in SQL Server 2012, this is now running much faster.

Let's take a look, first create the following table

CREATE TABLE #test(Id tinyint,SomeDate date, Charge decimal(20,10))

insert #test
SELECT 1,'20120101',1000
UNION ALL
SELECT 1,'20120401',200
UNION ALL
SELECT 1,'20120501',300
UNION ALL
SELECT 1,'20120601',600
UNION ALL
SELECT 2,'20120101',100
UNION ALL
SELECT 2,'20130101',500
UNION ALL
SELECT 2,'20140101',-800
UNION ALL
SELECT 3,'20120101',100

let's check that data we just inserted into the temporary table

SELECT * FROM #test

The output looks like this

Id SomeDate Charge
1 2012-01-01 1000.0000000000
1 2012-04-01 200.0000000000
1 2012-05-01 300.0000000000
1 2012-06-01 600.0000000000
2 2012-01-01 100.0000000000
2 2013-01-01 500.0000000000
2 2014-01-01 -800.0000000000
3 2012-01-01 100.0000000000

What we want is the following

id StartDate Enddate         Charge         RunningTotal
1 2012-01-01 2012-03-31 1000.0000000000 1000.0000000000
1 2012-04-01 2012-04-30 200.0000000000 1200.0000000000
1 2012-05-01 2012-05-31 300.0000000000 1500.0000000000
1 2012-06-01 9999-12-31 600.0000000000 2100.0000000000
2 2012-01-01 2012-12-31 100.0000000000 100.0000000000
2 2013-01-01 2013-12-31 500.0000000000 600.0000000000
2 2014-01-01 9999-12-31 -800.0000000000 -200.0000000000
3 2012-01-01 9999-12-31 100.0000000000 100.0000000000

For each row, we want to have the date that the row starts on and also the date when it end, we also want a running total as well. If there is no row after the current row for that id, we want the end date to be 9999-12-31.

So we will use a couple of functions. The first one is LEAD, LEAD accesses data from a subsequent row in the same result set without the use of a self-join. So the LEAD part looks like this

LEAD(dateadd(dd,-1,SomeDate),1,'99991231') OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,

What we are doing is subtracting 1 from the date in the subsequent row (ateadd(dd,-1,SomeDate))
We are using 1 as the offset since we want to apply this to the next row. Finally if there is no subsequent row, we want to use the date 9999-12-31 instead of NULL

To do the running count, we will do the following

SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
AS RunningTotal

What this means in English is for each id ordered by date, sum up the charge values for the rows between the preceding rows and the current row. Here is what all that stuff means.

ROWS BETWEEN
Specifies the rows that make up the range to use as implied by

UNBOUNDED PRECEDING
Specifies that the window starts at the first row of the partition. UNBOUNDED PRECEDING can only be specified as window starting point.

CURRENT ROW
Specifies that the window starts or ends at the current row when used with ROWS or the current value when used with RANGE.
CURRENT ROW can be specified as both a starting and ending point.

And here is the query

SELECT id, someDate as StartDate,
LEAD(dateadd(dd,-1,SomeDate),1,'99991231') 
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM #test
  ORDER BY id, SomeDate

And running that query, gives us the running count as well as the dates

id StartDate Enddate         Charge         RunningTotal
1 2012-01-01 2012-03-31 1000.0000000000 1000.0000000000
1 2012-04-01 2012-04-30 200.0000000000 1200.0000000000
1 2012-05-01 2012-05-31 300.0000000000 1500.0000000000
1 2012-06-01 9999-12-31 600.0000000000 2100.0000000000
2 2012-01-01 2011-12-31 100.0000000000 100.0000000000
2 2012-01-01 2011-13-31 500.0000000000 600.0000000000
2 2012-01-01 9999-12-31 -800.0000000000 -200.0000000000
3 2012-01-01 9999-12-31 100.0000000000 100.0000000000

That's all for this post...

Monday, February 29, 2016

#BackToBasics returning rows with maximum values from a table

In the A challenge for 2016....... accepted! post I said that I would write a post once every month that would explain very basic stuff about SQL Server. Today is the second post, in this post we will take a look at how you can return the row with the maximum value for a group.

In order to explain the code in this post, I decided to grab the top largest cities in the world as the data. You can find the data here List of cities proper by population

Below is the SQL code to create a table and also the data to populate this table

create table #Temp( Rank int not null, 
 City varchar(100) not null,
 Population int not null,
 Density int not null,
 Country varchar(100))

INSERT #Temp
SELECT 1,'Shanghai',24256800,3826,'China' union all
SELECT 2,'Karachi',23500000,6663,'Pakistan' union all
SELECT 3,'Beijing',21516000,1311,'China' union all
SELECT 4,'Lagos',21324000,18206,'Nigeria' union all
SELECT 5,'Delhi',16787941,11320,'India' union all
SELECT 6,'Tianjin',15200000,1293,'China' union all
SELECT 7,'Istanbul',14160467,2593,'Turkey' union all
SELECT 8,'Tokyo',13297629,6075,'Japan' union all
SELECT 9,'Guangzhou',13080500,1759,'China' union all
SELECT 10,'Mumbai',12478447,20680,'India' union all
SELECT 11,'Moscow',12197596,4859,'Russia' union all
SELECT 12,'São Paulo',11895893,7821,'Brazil' union all
SELECT 13,'Shenzhen',10467400,5256,'China' union all
SELECT 14,'Jakarta',10075310,15171,'Indonesia' union all
SELECT 15,'Seoul',10048850,17134,'South Korea'

Now that we have the data in our table, let's run a simple query to return the row with the most populous city in the world. You can accomplish this in SQL Server by using the TOP operator and ordering the data by population in descending order.

SELECT TOP 1 * FROM #Temp
ORDER BY population DESC

Here is the output from that query

Rank City  Population Density Country
1 Shanghai 24256800 3826 China

What we want to do next is return one row the most populous city per country.

So let's take it a step at a time, the query below will give you the maximum population by country

SELECT MAX(population) as BiggestPopulation,Country
FROM #Temp
GROUP BY Country
ORDER BY BiggestPopulation DESC

Here is the output

BiggestPopulation Country
24256800  China
23500000  Pakistan
21324000  Nigeria
16787941  India
14160467  Turkey
13297629  Japan
12197596  Russia
11895893  Brazil
10075310  Indonesia
10048850  South Korea

What we can do is join the query we have above with the table itself on the country and population columns. The query will look like this

SELECT T.* FROM(
SELECT MAX(population) as BiggestPopulation,Country
FROM #Temp
GROUP BY Country) X
JOIN #Temp t on x.BiggestPopulation = t.Population
AND t.Country = x.Country
ORDER BY t.Population DESC

Here is the output

+------+-----------+------------+---------+-------------+
| Rank | City      | Population | Density | Country     |
+------+-----------+------------+---------+-------------+
| 1    | Shanghai  | 24256800   | 3826    | China       |
+------+-----------+------------+---------+-------------+
| 2    | Karachi   | 23500000   | 6663    | Pakistan    |
+------+-----------+------------+---------+-------------+
| 4    | Lagos     | 21324000   | 18206   | Nigeria     |
+------+-----------+------------+---------+-------------+
| 5    | Delhi     | 16787941   | 11320   | India       |
+------+-----------+------------+---------+-------------+
| 7    | Istanbul  | 14160467   | 2593    | Turkey      |
+------+-----------+------------+---------+-------------+
| 8    | Tokyo     | 13297629   | 6075    | Japan       |
+------+-----------+------------+---------+-------------+
| 11   | Moscow    | 12197596   | 4859    | Russia      |
+------+-----------+------------+---------+-------------+
| 12   | São Paulo | 11895893   | 7821    | Brazil      |
+------+-----------+------------+---------+-------------+
| 14   | Jakarta   | 10075310   | 15171   | Indonesia   |
+------+-----------+------------+---------+-------------+
| 15   | Seoul     | 10048850   | 17134   | South Korea |
+------+-----------+------------+---------+-------------+

SQL Server 2005 introduced a bunch of ranking/windowing functions which make writing these queries a little simpler. We are going to use ROW_NUMBER for this post. Here is what the query will look like

SELECT ROW_NUMBER() OVER (PARTITION By [what do you want it grouped/partitioned by?] ORDER BY [what do you want it ordered by?]) AS Row,*
FROM Table

In our case, we want to partition the result set by country, we also want to order each partition by population descending. The query looks like this

SELECT ROW_NUMBER() OVER (PARTITION By Country ORDER BY population DESC) AS Row,*
FROM #Temp
ORDER BY Row,Population DESC

Here is what that query returns

+-----+------+-----------+------------+---------+-------------+
| Row | Rank | City      | Population | Density | Country     |
+-----+------+-----------+------------+---------+-------------+
| 1   | 1    | Shanghai  | 24256800   | 3826    | China       |
+-----+------+-----------+------------+---------+-------------+
| 1   | 2    | Karachi   | 23500000   | 6663    | Pakistan    |
+-----+------+-----------+------------+---------+-------------+
| 1   | 4    | Lagos     | 21324000   | 18206   | Nigeria     |
+-----+------+-----------+------------+---------+-------------+
| 1   | 5    | Delhi     | 16787941   | 11320   | India       |
+-----+------+-----------+------------+---------+-------------+
| 1   | 7    | Istanbul  | 14160467   | 2593    | Turkey      |
+-----+------+-----------+------------+---------+-------------+
| 1   | 8    | Tokyo     | 13297629   | 6075    | Japan       |
+-----+------+-----------+------------+---------+-------------+
| 1   | 11   | Moscow    | 12197596   | 4859    | Russia      |
+-----+------+-----------+------------+---------+-------------+
| 1   | 12   | São Paulo | 11895893   | 7821    | Brazil      |
+-----+------+-----------+------------+---------+-------------+
| 1   | 14   | Jakarta   | 10075310   | 15171   | Indonesia   |
+-----+------+-----------+------------+---------+-------------+
| 1   | 15   | Seoul     | 10048850   | 17134   | South Korea |
+-----+------+-----------+------------+---------+-------------+
| 2   | 3    | Beijing   | 21516000   | 1311    | China       |
+-----+------+-----------+------------+---------+-------------+
| 2   | 10   | Mumbai    | 12478447   | 20680   | India       |
+-----+------+-----------+------------+---------+-------------+
| 3   | 6    | Tianjin   | 15200000   | 1293    | China       |
+-----+------+-----------+------------+---------+-------------+
| 4   | 9    | Guangzhou | 13080500   | 1759    | China       |
+-----+------+-----------+------------+---------+-------------+
| 5   | 13   | Shenzhen  | 10467400   | 5256    | China       |
+-----+------+-----------+------------+---------+-------------+

What we really want is the top 10 rows. You might think that you can just add WHERE Row = 1 to the query. This is not allowed since the WHERE clause gets executed before the ROW_Number function as part of the SELECT. What you can do is use a common table expression and then specify WHERE Row = 1 when selecting from the CTE

Here is what it looks like

;WITH CTE AS(
SELECT ROW_NUMBER() OVER (PARTITION By Country ORDER BY population DESC) AS Row,*
FROM #Temp)

SELECT Rank, City,Population,Density,Country
FROM CTE
WHERE Row = 1
ORDER BY Population DESC

And here is our output, for each country we have only one row, you can also see that 4 cities from the top 10 are not in this list, those cities are in either China or India

+------+-----------+------------+---------+-------------+--+
| Rank | City      | Population | Density | Country     |  |
+------+-----------+------------+---------+-------------+--+
| 1    | Shanghai  | 24256800   | 3826    | China       |  |
+------+-----------+------------+---------+-------------+--+
| 2    | Karachi   | 23500000   | 6663    | Pakistan    |  |
+------+-----------+------------+---------+-------------+--+
| 4    | Lagos     | 21324000   | 18206   | Nigeria     |  |
+------+-----------+------------+---------+-------------+--+
| 5    | Delhi     | 16787941   | 11320   | India       |  |
+------+-----------+------------+---------+-------------+--+
| 7    | Istanbul  | 14160467   | 2593    | Turkey      |  |
+------+-----------+------------+---------+-------------+--+
| 8    | Tokyo     | 13297629   | 6075    | Japan       |  |
+------+-----------+------------+---------+-------------+--+
| 11   | Moscow    | 12197596   | 4859    | Russia      |  |
+------+-----------+------------+---------+-------------+--+
| 12   | São Paulo | 11895893   | 7821    | Brazil      |  |
+------+-----------+------------+---------+-------------+--+
| 14   | Jakarta   | 10075310   | 15171   | Indonesia   |  |
+------+-----------+------------+---------+-------------+--+
| 15   | Seoul     | 10048850   | 17134   | South Korea |  |
+------+-----------+------------+---------+-------------+--+

That is all for this month, see you next month. In the meantime, make sure to explore windowing/ranking functions as well as Common Table Expressions

Sunday, February 21, 2016

How to store twitter search results from R into SQL Server

I was messing around with R and grabbing twitter searches from within R, after that I decided to store the result in a csv file. Instead of stroring the results in a file, what if I want to store that info into a SQL Server database? This is pretty easy to do. If you want to try this out yourself, first head on over to my other post Using twitter from the R console. There you will be able to see how to setup OAuth so that you can use twitter within R, you will also see how to install the R twitteR package as well.

Time to get started....

First let's see how to connect to SQL Server from R
To load data from SQL Server into R, you first need to create an ODBC Data Source, I created one named LocalSqlServer, this DSN Data Source points to my local SQL Server 2016 instance.

After you have created your ODBC data source, you need to load the (RODBC) package, create a connection, execute a query, display the results and close the connection.. Here is what it looks like in the R Console

> require(RODBC)

>db <- odbcConnect("LocalSqlServer")

> Sample <- sqlQuery(db, "select * from sysobjects", stringsAsFactors=FALSE)

> head(Sample)

> close(db)  #close ORDBC connection
     name          id xtype uid info status base_schema_ver
1      sp_MSalreadyhavegeneration -1073624922    P    4    0      0               0
2      sp_MSwritemergeperfcounter -1072815163    P    4    0      0               0
3       TABLE_PRIVILEGES -1072372588    V    3    0         
4       sp_replsetsyncstatus -1071944761    X    4    0      0               0
5       sp_replshowcmds -1070913306    P    4    0      0               0
6       sp_publishdb -1070573756    P    4    0      0               0

Now that we know how to connect to SQL Server, let's see if we can get the twitteR package to work, head on over to my other post Using twitter from the R console to see how to install twitterR and how some queries are executed.

Assuming the twitteR package is installed, we need to load it to use it

> library(twitteR)

The next command is the command to enable OAuth, to see how to set this up go here Using twitter from the R console

> setup_twitter_oauth("API key", "API secret", "Access token", "Access secret")

[1] "Using direct authentication"

Now it is time to do a simple search

> tweets <- searchTwitter('#rstats', n=6)
> tweets

Running that command should have given you some results. However what we need as output is a dataframe, this way we can easily load this into a table in SQL Server. You can use the twListToDF function

> tweets <- searchTwitter('#rstats', n=6)
> twListToDF(tweets)

The output now looks different

1 RT @rquintino: @Mairos_B #sqlsatportugal session: all about R in #SqlServer 2016 #rstats https://t.co/DHrqIZrz1e
2 a quick script to use imgcat in #rstats https://t.co/fpUlgWNX33 https://t.co/AhCCMLewCH
3 RT @KirkDBorne: Useful packages (libraries) for Data Analysis in R: https://t.co/haRKopFyly #DataScience #Rstats by @analyticsvidhya https:…
4 Hey #rstats tweeps, do you have any readings to recommend on sensitivity analysis? Books/articles/websites all welcome.
5 RT @KirkDBorne: A Complete Tutorial on Time Series Modeling in R: https://t.co/7oI6JKyU4E #MachineLearning #DataScience #Rstats by @Analyti…
6 RT @KirkDBorne: A Complete Tutorial on Time Series Modeling in R: https://t.co/7oI6JKyU4E #MachineLearning #DataScience #Rstats by @Analyti…
favorited favoriteCount replyToSN created truncated replyToSID
1 FALSE 0 NA 2016-02-20 20:29:54 FALSE NA
2 FALSE 0 NA 2016-02-20 20:24:50 FALSE NA
3 FALSE 0 NA 2016-02-20 20:16:25 FALSE NA
4 FALSE 0 NA 2016-02-20 20:11:08 FALSE NA
5 FALSE 0 NA 2016-02-20 20:11:06 FALSE NA
6 FALSE 0 NA 2016-02-20 20:02:05 FALSE NA
id replyToUID
1 701141750161784834 NA
2 701140474019577856 NA
3 701138356466483204 NA
4 701137026075140096 NA
5 701137018508722176 NA
6 701134750296227840 NA
statusSource
1 Mobile Web (M5)
2 Tweetbot for Mac
3   Twitter for Android
4   Twitter Web Client
5   Twitter for iPhone
6   Twitter Web Client
screenName retweetCount isRetweet retweeted longitude latitude
1 psousa75 3 TRUE FALSE NA NA
2 millerdl 0 FALSE FALSE NA NA
3 diana_nario 50 TRUE FALSE NA NA
4 emjonaitis 0 FALSE FALSE NA NA
5 caryden 41 TRUE FALSE NA NA
6 ArkangelScrap 41 TRUE FALSE NA NA

The output now has a lot more stuff, you can see if it has been retweeted or favorited as well as the latitude, longtitude and more.

Time to send the result to SQL Server

There is a problem with how the sqlSave method creates the table, it makes the created column a float and then all inserts fail.

Run the following DDL statement in your database

CREATE TABLE [dbo].[tweets](
	[rownames] [nvarchar](255) NULL,
	[text] [nvarchar](255) NULL,
	[favorited] [nvarchar](5) NULL,
	[favoriteCount] [float] NULL,
	[replyToSN] [nvarchar](5) NULL,
	[created] [datetime] NULL,
	[truncated] [nvarchar](5) NULL,
	[replyToSID] [nvarchar](5) NULL,
	[id] [nvarchar](255) NULL,
	[replyToUID] [nvarchar](5) NULL,
	[statusSource] [nvarchar](255) NULL,
	[screenName] [nvarchar](255) NULL,
	[retweetCount] [float] NULL,
	[isRetweet] [nvarchar](5) NULL,
	[retweeted] [nvarchar](5) NULL,
	[longitude] [nvarchar](255) NULL,
	[latitude] [nvarchar](255) NULL
) ON [PRIMARY]

Now we are ready to save our twitter data in SQL Server, here is how you do it.

> sqlSave(db,append = TRUE,colnames = FALSE,verbose = FALSE, addPK = FALSE, twListToDF(tweets))

Here is what the results look in SSMS

Make sure that you have append = TRUE, if you make it FALSE, the table will be recreated every time but since the command issued creates the wrong data type for the created column, you can't use it

So to recap, here are the commands you need after you loaded the twitterR package and have setup OAuth

> tweets <- searchTwitter('#rstats', n=6) #store the twitter search results
> twListToDF(tweets)  #convert to dataframe

> require(RODBC)  # load RODBC package
> db <- odbcConnect("LocalSqlServer")  #connect
> sqlSave(db,append = TRUE,colnames = FALSE,verbose = FALSE, addPK = FALSE, twListToDF(tweets))  #insert data into your table
> close(db)  #close ORDBC connection