Monday, March 28, 2016

The most interesting Build sessions for people working with data



Microsoft's Build conference is happening this week, the schedule for all the sessions was just made available. Below are all the sessions that will be most interesting to people who work with data. Clicking on the session's title will bring you to the Channel 9 page that will have the content. These session recording are of course not available yet, you will need to wait till after the conference is done to see them.



Intelligent Data Driven Applications that Learn and Adapt
Shawn Bice, Pablo Castro
Azure, Data, Intelligent App, Machine Learning
200 - Intermediate

Applications show intelligence when they can spot trends, react to unusual events, predict outcomes or recommend choices. Learn how to introduce intelligence traits into your apps including; establishing feedback loops, applying big data and machine learning techniques to classify, predict or otherwise analyze explicit and implicit signals, and operationalizing the full stack into the regular usage flow of the app. Most every day apps, from consumer to enterprise can deliver greater customer or business benefit by learning from user behavior and other signals. In this session we’ll take a pragmatic look at introducing real, useful data-driven intelligence into apps by walking through services, code and data needed to make it happen.


SQL Database Technologies for Developers
Lindsey Allen, Tony Petrossian
Azure, Data, Intelligent App
200 - Intermediate

Microsoft offers SQL Server and Azure SQL Database to help you develop great relational database applications. In this session you will learn about the top developer features coming in SQL Server 2016 which are already in Azure SQL Database. Additionally, you will see the latest investments in Azure SQL Database that enable you to easily manage thousands of databases and get the performance and security insight needed to build robust and secure applications in the cloud.



Adding Power BI Data Experiences to Your Applications
Josh Caplan, Lukasz Pawlowski
Analytics, Data
300 - Experienced

Bring stunning and interactive Power BI data experience to your applications. Learn how to integrate to Power BI using the Power BI REST APIs and how to build your own custom visualizations.


Delivering Applications at Scale with DocumentDB, Azure's NoSQL Document Database
John Macintyre, Dharma Shukla
Azure, Data, Intelligent App
200 - Intermediate

What does a #1 app on the iTunes App Store and an installment of one of the most successful gaming franchises in history have in common? They both use Azure DocumentDB to ingest massive volumes of data and serve low latency queries to provide great user experiences. Come learn about the business goals and technical challenges faced by two real-world, immensely popular applications and why they chose to use Azure DocumentDB, as well as the application patterns they used to achieve their massive scale requirements.



Azure IoT: Complete Cloud Offerings for the IoT Revolution
Sam George, Cameron Skinner
Azure, Data, Internet of Things
200 - Intermediate

IoT is the next revolution in computing and an incredibly exciting space. A critical part of IoT is cloud based solutions that enable you to connect, secure and manage IoT devices as well as providing deep insights from IoT data. Azure is all-in on IoT, this session will cover our industry leading offerings. Come be part of the next revolution!


Data Science for Developers
Danielle Dean, Daniel Grecoe
Analytics, Data
200 - Intermediate

In this session you will learn about Azure Machine Learning through a comprehensive end-to-end example that he builds during the session and that encompasses: •Problem detection •Algorithm selection •Machine learning model creation and deployment as a RESTful web service •Consumption of the machine learning model The session is intended for engineers, and Dan himself is an engineer, so he does not delve into a deep understanding of complex mathematical models behind machine learning, but instead focuses on the concepts of machine learning to demystify cloud machine learning.


Azure Data Lake and Azure Data Warehouse: Applying Modern Practices to Your App
Tim Mallalieu, Lara Rubbelke
Azure, Data
200 - Intermediate

Azure Data Warehouse and Azure Data Lake are two new services designed to work with all of your data no matter how big or complex. With two strong options to store, process and analyze large volumes of data, you may be curious about which service is right for your application needs. This session will drill into the strengths of each service and walk through the common application patterns for integration of large scale data analysis in the cloud using these services.


Advanced Analytics with R and SQL
Nagesh Pabbisetty, Tobias Ternstrom
Analytics, Data, Intelligent App
200 - Intermediate

R is the lingua franca of Analytics. SQL is the world’s most popular database language. What magic can you make happen by combining the power of R and SQL for Data Science and Advanced Analytics? Imagine the power of exploring, transforming, modeling, and scoring data at scale from the comfort of your favorite R environment. Now, imagine operationalizing the models you create directly in SQL Server, allowing your applications to use them from T-SQL, executed right where your data resides. Come learn how to build and deploy intelligent applications that combine the power of R, SQL Server, thousands of open source R extension packages, and high-performance implementations of the most popular machine learning algorithms at scale.


Insight from Exhaust, Enriching Your Applications with Data
Matthew Winkler
Azure, Analytics, Data, Analytics/Telemetry, .NET, C#, Machine Learning
200 - Intermediate

Modern cloud applications provide new opportunities every day for greater insights into our customers, code and quality using the signals in the application’s exhaust. Using the data produced by users and applications, developers can easily improve the experience for users and increase their engagement within the apps they are using. Starting with a typical application deployed in Azure, this talk will walk through the steps required to build a more intelligent application, customized for each user to provide a more engaging experience. We’ll add more engaging recommendations tailored to a user, more relevant in-application search results, and gain deeper understanding the application’s users. Additionally, we’ll cover how to process this telemetry as it is created to enable us to react in real time to changes in the application. No knowledge of machine learning, data science, or big data is required, by the end you’ll learn to use all three to create a richer application.

Tuesday, March 08, 2016

How to get SQL Server on Linux?


In order to get SQL Server on Linux you need to go to the following page: https://www.microsoft.com/en-us/server-cloud/sql-server-on-linux.aspx. Click on the part where is says Sign up to stay informed and to apply for the preview. It will bring you to a signup page that looks like this



After you filled out all the information required you will have to wait for a reply. Be aware that currently the SQL Server on Linux preview is limited to Ubuntu or as a Docker image

Monday, March 07, 2016

SQL Server 2016 Release Candidate 0 now available for download


The SQL Server engineering team just announced the availability of SQL Server 2016 Release Candidate 0.


To learn more about the release, visit the SQL Server 2016 preview page. To experience the new, exciting features in SQL Server 2016 and the new rapid release model, download the preview and start evaluating the impact these new innovations can have for your business.


For more information on this release, see SQL Server 2016 Release Notes and What’s New in SQL Server 2016.

Some of the more interesting things added is that instead of a trace flag, you can now execute a 
ALTER DATABASE SCOPED CONFIGURATION command.

It looks like this

ALTER DATABASE SCOPED CONFIGURATION
{
[FOR SECONDARY]
{ SET  |  CLEAR PROCEDURE_CACHE }
}[;]
< set_options > ::=
{ MAXDOP = { | PRIMARY}
| LEGACY_CARDINALITY_ESTIMATION = { ON | OFF | PRIMARY}
| PARAMETER_SNIFFING = { ON | OFF | PRIMARY}
| QUERY_OPTIMIZER_HOTFIXES = { ON | OFF | PRIMARY}
| ONLINE_OPERATION_DEFAULT = {ON | OFF} }


There is also finally a built in split string function.

It looks like this

SELECT * FROM STRING_SPLIT('Hi how are you doing?',' ')

The result is:

value
—–
Hi
how
are
you
doing?

Finally, there will be an event on March 10, 2016: SQL Server 2016 — Take the lead, watch the live stream

Get the RC 0  here: SQL Server 2016 Release Candidate 0 

Friday, March 04, 2016

Gotcha with the LEAD analytical function and date datatype

The other day some code that used the LEAD analytical function stopped working suddenly. It started to throw an error.

Msg 206, Level 16, State 2, Line 1
Operand type clash: date is incompatible with int

The code wasn't touched in a long time... so what could be the issue. Let's reproduce it here


First create the following table


CREATE TABLE testRunningTotal(Id tinyint,SomeDate datetime, Charge decimal(20,2))

INSERT testRunningTotal
SELECT 1,'20120101',1000
UNION ALL
SELECT 1,'20120401',200
UNION ALL
SELECT 1,'20120501',300
UNION ALL
SELECT 1,'20120601',600
UNION ALL
SELECT 2,'20120101',100
UNION ALL
SELECT 2,'20130101',500
UNION ALL
SELECT 2,'20140101',-800
UNION ALL
SELECT 3,'20120101',100

Running this query to see what is in this table


SELECT * FROM testRunningTotal

Id SomeDate Charge
1 2012-01-01 00:00:00.000 1000.00
1 2012-04-01 00:00:00.000 200.00
1 2012-05-01 00:00:00.000 300.00
1 2012-06-01 00:00:00.000 600.00
2 2012-01-01 00:00:00.000 100.00
2 2013-01-01 00:00:00.000 500.00
2 2014-01-01 00:00:00.000 -800.00
3 2012-01-01 00:00:00.000 100.00


Now run our fancy query...


SELECT id, someDate as StartDate,
LEAD(SomeDate -1,1,'99991231') 
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM testRunningTotal
  ORDER BY id, SomeDate


Here are our results

id StartDate Enddate Charge RunningTotal
1 2012-01-01 00:00:00.000 2012-03-31 00:00:00.000 1000.00 1000.00
1 2012-04-01 00:00:00.000 2012-04-30 00:00:00.000 200.00 1200.00
1 2012-05-01 00:00:00.000 2012-05-31 00:00:00.000 300.00 1500.00
1 2012-06-01 00:00:00.000 9999-12-31 00:00:00.000 600.00 2100.00
2 2012-01-01 00:00:00.000 2012-12-31 00:00:00.000 100.00 100.00
2 2013-01-01 00:00:00.000 2013-12-31 00:00:00.000 500.00 600.00
2 2014-01-01 00:00:00.000 9999-12-31 00:00:00.000 -800.00 -200.00
3 2012-01-01 00:00:00.000 9999-12-31 00:00:00.000 100.00 100.00


To see what the query actually does, visit this post Easy running totals with windowing functions

Let's take a closer look at the LEAD analytical function, this is what Books On Line has on its page

  LEAD ( scalar_expression [ ,offset ] , [ default ] ) 
    OVER ( [ partition_by_clause ] order_by_clause )

offset
The number of rows forward from the current row from which to obtain a value. If not specified, the default is 1. offset can be a column, subquery, or other expression that evaluates to a positive integer or can be implicitly converted to bigint. offset cannot be a negative value or an analytic function.

Looks all cool to me..nothing that stands out that could cause a problem...



What will happen if someone changes the column from a datetime to a date? Maybe they looked at the data and saw that all the values had only dates and no time components... why use 8 bytes of storage when you can use 3 instead?



To find out exactly what will happen, run the following DDL statement....


ALTER TABLE testRunningTotal ALTER COLUMN SomeDate date


Now go run your query again


SELECT id, someDate as StartDate,
LEAD(SomeDate -1,1,'99991231') 
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM testRunningTotal
  ORDER BY id, SomeDate

Here is the error you get

  Msg 206, Level 16, State 2, Line 1
Operand type clash: date is incompatible with int


The error is the same as if you would run the following


SELECT CONVERT(DATE,GETDATE() )-1

Instead what you have to do is use the DATEADD function to do arithmetic with dates


SELECT DATEADD(dd,-1,CONVERT(DATE,GETDATE()))


So to fix our query from before, we want to change the -1 to be DATEADD(dd,-1 instead, here is what our query looks like


 SELECT id, someDate as StartDate,
LEAD(dateadd(dd,-1,SomeDate),1,'99991231') 
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM testRunningTotal
  ORDER BY id, SomeDate

And here are our results, it's all good now

id StartDate Enddate Charge RunningTotal
1 2012-01-01 2012-03-31 1000.00 1000.00
1 2012-04-01 2012-04-30 200.00 1200.00
1 2012-05-01 2012-05-31 300.00 1500.00
1 2012-06-01 9999-12-31 600.00 2100.00
2 2012-01-01 2012-12-31 100.00 100.00
2 2013-01-01 2013-12-31 500.00 600.00
2 2014-01-01 9999-12-31 -800.00 -200.00
3 2012-01-01 9999-12-31 100.00 100.00


I still think that maybe the LEAD function should do this conversion internally since all we are doing is specifying the OFFSET...what is your opinion?


Thursday, March 03, 2016

Did you know I have a github repository for stuff I run over and over?



There are queries or commands that I run every now and then. I typically don’t have these stored on my computer; to find this code I usually find the blog post I made and copy paste it from there. This becomes a pain in the neck sometimes because I don’t quite remember the title of the post, so I might be spending a good amount of time looking for it. I decided to put some of the code that I run frequently up on github.

You will see code to find index fragmentation as well code to concatenate rows into columns. So there are two kinds of code here. Code that you can run as is and code that you will plug into a stored procedure after you make changes to the table names. You can find that github repository here https://github.com/SQLMenace/CommonTasks

Besides the Common Tasks repository, I also have a Today I Learned repository. A collection of concise write-ups on small things I learn day to day across a variety of languages and technologies. These are things that don't really warrant a full blog post.

You can find that project here https://github.com/SQLMenace/til

With the Today I Learned repository you will find that there are not many things yet. The reason is that I just started this repository, but since SQl Server 2016 is just around the corner, I will start playing around with SQL Server 2016 more and more, thus the Today I Learned repository will become bigger soon.

Wednesday, March 02, 2016

Easy running totals with windowing functions

Back in the pre SQL Server 2012 days, if you wanted to do a running count, you either had to run a subquery or you could use a variable. This was slow because for each row the query that did the sum would be executed. With the additions to the windowing functions in SQL Server 2012, this is now running much faster. 

Let's take a look, first create the following table


CREATE TABLE #test(Id tinyint,SomeDate date, Charge decimal(20,10))

insert #test
SELECT 1,'20120101',1000
UNION ALL
SELECT 1,'20120401',200
UNION ALL
SELECT 1,'20120501',300
UNION ALL
SELECT 1,'20120601',600
UNION ALL
SELECT 2,'20120101',100
UNION ALL
SELECT 2,'20130101',500
UNION ALL
SELECT 2,'20140101',-800
UNION ALL
SELECT 3,'20120101',100


let's check that data we just inserted into the temporary table


SELECT * FROM #test


The output looks like this

Id SomeDate Charge
1 2012-01-01 1000.0000000000
1 2012-04-01 200.0000000000
1 2012-05-01 300.0000000000
1 2012-06-01 600.0000000000
2 2012-01-01 100.0000000000
2 2013-01-01 500.0000000000
2 2014-01-01 -800.0000000000
3 2012-01-01 100.0000000000


What we want is the following

id StartDate Enddate         Charge         RunningTotal
1 2012-01-01 2012-03-31 1000.0000000000 1000.0000000000
1 2012-04-01 2012-04-30 200.0000000000 1200.0000000000
1 2012-05-01 2012-05-31 300.0000000000 1500.0000000000
1 2012-06-01 9999-12-31 600.0000000000 2100.0000000000
2 2012-01-01 2012-12-31 100.0000000000 100.0000000000
2 2013-01-01 2013-12-31 500.0000000000 600.0000000000
2 2014-01-01 9999-12-31 -800.0000000000 -200.0000000000
3 2012-01-01 9999-12-31 100.0000000000 100.0000000000

For each row, we want to have the date that the row starts on and also the date when it end, we also want a running total as well. If there is no row after the current row for that id, we want the end date to be 9999-12-31.

So we will use a couple of functions. The first one is LEAD, LEAD accesses data from a subsequent row in the same result set without the use of a self-join. So the LEAD part looks like this

LEAD(dateadd(dd,-1,SomeDate),1,'99991231')  OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,

What we are doing is subtracting 1 from the date in the subsequent row (ateadd(dd,-1,SomeDate))
We are using 1 as the offset since we want to apply this to the next row. Finally if there is no subsequent row, we want to use the date 9999-12-31 instead of NULL

To do the running count, we will do the following

SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
          AS RunningTotal

What this means in English is for each id ordered by date, sum up the charge values for the rows between the preceding rows and the current row. Here is what all that stuff means.

ROWS BETWEEN
Specifies the rows that make up the range to use as implied by

UNBOUNDED PRECEDING
Specifies that the window starts at the first row of the partition. UNBOUNDED PRECEDING can only be specified as window starting point.

CURRENT ROW
Specifies that the window starts or ends at the current row when used with ROWS or the current value when used with RANGE.
CURRENT ROW can be specified as both a starting and ending point.

And here is the query


SELECT id, someDate as StartDate,
LEAD(dateadd(dd,-1,SomeDate),1,'99991231') 
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM #test
  ORDER BY id, SomeDate


And running that query, gives us the running count as well as the dates

id StartDate Enddate         Charge         RunningTotal
1 2012-01-01 2012-03-31 1000.0000000000 1000.0000000000
1 2012-04-01 2012-04-30 200.0000000000 1200.0000000000
1 2012-05-01 2012-05-31 300.0000000000 1500.0000000000
1 2012-06-01 9999-12-31 600.0000000000 2100.0000000000
2 2012-01-01 2011-12-31 100.0000000000 100.0000000000
2 2012-01-01 2011-13-31 500.0000000000 600.0000000000
2 2012-01-01 9999-12-31 -800.0000000000 -200.0000000000
3 2012-01-01 9999-12-31 100.0000000000 100.0000000000
That's all for this post...

Monday, February 29, 2016

#BackToBasics returning rows with maximum values from a table



In the A challenge for 2016....... accepted! post I said that I would write a post once every month that would explain very basic stuff about SQL Server. Today is the second post, in this post we will take a look at how you can return the row with the maximum value for a group.

In order to explain the code in this post, I decided to grab the top largest cities in the world as the data. You can find the data here List of cities proper by population

 Below is the SQL code to create a table and also the data to populate this table

create table #Temp( Rank int not null, 
 City varchar(100) not null,
 Population int not null,
 Density int not null,
 Country varchar(100))

INSERT #Temp
SELECT 1,'Shanghai',24256800,3826,'China' union all
SELECT 2,'Karachi',23500000,6663,'Pakistan' union all
SELECT 3,'Beijing',21516000,1311,'China' union all
SELECT 4,'Lagos',21324000,18206,'Nigeria' union all
SELECT 5,'Delhi',16787941,11320,'India' union all
SELECT 6,'Tianjin',15200000,1293,'China' union all
SELECT 7,'Istanbul',14160467,2593,'Turkey' union all
SELECT 8,'Tokyo',13297629,6075,'Japan' union all
SELECT 9,'Guangzhou',13080500,1759,'China' union all
SELECT 10,'Mumbai',12478447,20680,'India' union all
SELECT 11,'Moscow',12197596,4859,'Russia' union all
SELECT 12,'São Paulo',11895893,7821,'Brazil' union all
SELECT 13,'Shenzhen',10467400,5256,'China' union all
SELECT 14,'Jakarta',10075310,15171,'Indonesia' union all
SELECT 15,'Seoul',10048850,17134,'South Korea' 

Now that we have the data in our table, let's run a simple query to return the row with the most populous city in the world. You can accomplish this in SQL Server by using the TOP operator and ordering the data by population in descending order.

SELECT TOP 1 * FROM #Temp
ORDER BY population DESC

Here is the output from that query


Rank City  Population Density Country
1 Shanghai 24256800 3826 China


What we want to do next is return one row the most populous city per country.

So let's take it a step at a time, the query below will give you the maximum population by country

SELECT MAX(population) as BiggestPopulation,Country
FROM #Temp
GROUP BY Country
ORDER BY BiggestPopulation DESC

Here is the output

BiggestPopulation Country
24256800  China
23500000  Pakistan
21324000  Nigeria
16787941  India
14160467  Turkey
13297629  Japan
12197596  Russia
11895893  Brazil
10075310  Indonesia
10048850  South Korea

What we can do is join the query we have above with the table itself on the country and population columns. The query will look like this

SELECT T.* FROM(
SELECT MAX(population) as BiggestPopulation,Country
FROM #Temp
GROUP BY Country) X
JOIN #Temp t on x.BiggestPopulation = t.Population
AND t.Country = x.Country
ORDER BY t.Population DESC

Here is the output

+------+-----------+------------+---------+-------------+
| Rank | City      | Population | Density | Country     |
+------+-----------+------------+---------+-------------+
| 1    | Shanghai  | 24256800   | 3826    | China       |
+------+-----------+------------+---------+-------------+
| 2    | Karachi   | 23500000   | 6663    | Pakistan    |
+------+-----------+------------+---------+-------------+
| 4    | Lagos     | 21324000   | 18206   | Nigeria     |
+------+-----------+------------+---------+-------------+
| 5    | Delhi     | 16787941   | 11320   | India       |
+------+-----------+------------+---------+-------------+
| 7    | Istanbul  | 14160467   | 2593    | Turkey      |
+------+-----------+------------+---------+-------------+
| 8    | Tokyo     | 13297629   | 6075    | Japan       |
+------+-----------+------------+---------+-------------+
| 11   | Moscow    | 12197596   | 4859    | Russia      |
+------+-----------+------------+---------+-------------+
| 12   | São Paulo | 11895893   | 7821    | Brazil      |
+------+-----------+------------+---------+-------------+
| 14   | Jakarta   | 10075310   | 15171   | Indonesia   |
+------+-----------+------------+---------+-------------+
| 15   | Seoul     | 10048850   | 17134   | South Korea |
+------+-----------+------------+---------+-------------+


SQL Server 2005 introduced a bunch of ranking/windowing functions which make writing these queries a little simpler. We are going to use ROW_NUMBER for this post. Here is what the query will look like


SELECT ROW_NUMBER() OVER (PARTITION By [what do you want it grouped/partitioned by?] ORDER BY [what do you want it ordered by?]) AS Row,* 
FROM Table

In our case, we want to partition the result set by country, we also want to order each partition by population descending. The query looks like this


SELECT ROW_NUMBER() OVER (PARTITION By Country ORDER BY population DESC) AS Row,*
FROM #Temp
ORDER BY Row,Population DESC

Here is what that query returns

+-----+------+-----------+------------+---------+-------------+
| Row | Rank | City      | Population | Density | Country     |
+-----+------+-----------+------------+---------+-------------+
| 1   | 1    | Shanghai  | 24256800   | 3826    | China       |
+-----+------+-----------+------------+---------+-------------+
| 1   | 2    | Karachi   | 23500000   | 6663    | Pakistan    |
+-----+------+-----------+------------+---------+-------------+
| 1   | 4    | Lagos     | 21324000   | 18206   | Nigeria     |
+-----+------+-----------+------------+---------+-------------+
| 1   | 5    | Delhi     | 16787941   | 11320   | India       |
+-----+------+-----------+------------+---------+-------------+
| 1   | 7    | Istanbul  | 14160467   | 2593    | Turkey      |
+-----+------+-----------+------------+---------+-------------+
| 1   | 8    | Tokyo     | 13297629   | 6075    | Japan       |
+-----+------+-----------+------------+---------+-------------+
| 1   | 11   | Moscow    | 12197596   | 4859    | Russia      |
+-----+------+-----------+------------+---------+-------------+
| 1   | 12   | São Paulo | 11895893   | 7821    | Brazil      |
+-----+------+-----------+------------+---------+-------------+
| 1   | 14   | Jakarta   | 10075310   | 15171   | Indonesia   |
+-----+------+-----------+------------+---------+-------------+
| 1   | 15   | Seoul     | 10048850   | 17134   | South Korea |
+-----+------+-----------+------------+---------+-------------+
| 2   | 3    | Beijing   | 21516000   | 1311    | China       |
+-----+------+-----------+------------+---------+-------------+
| 2   | 10   | Mumbai    | 12478447   | 20680   | India       |
+-----+------+-----------+------------+---------+-------------+
| 3   | 6    | Tianjin   | 15200000   | 1293    | China       |
+-----+------+-----------+------------+---------+-------------+
| 4   | 9    | Guangzhou | 13080500   | 1759    | China       |
+-----+------+-----------+------------+---------+-------------+
| 5   | 13   | Shenzhen  | 10467400   | 5256    | China       |
+-----+------+-----------+------------+---------+-------------+


What we really want is the top 10 rows. You might think that you can just add WHERE Row = 1 to the query. This is not allowed since the WHERE clause gets executed before the ROW_Number function as part of the SELECT. What you can do is use a common table expression and then specify WHERE Row = 1 when selecting from the CTE

Here is what it looks like

;WITH CTE AS(
SELECT ROW_NUMBER() OVER (PARTITION By Country ORDER BY population DESC) AS Row,*
FROM #Temp)

SELECT Rank, City,Population,Density,Country
FROM CTE
WHERE Row = 1
ORDER BY Population DESC

And here is our output, for each country we have only one row, you can also see that 4 cities from the top 10 are not in this list, those cities are in either China or India

+------+-----------+------------+---------+-------------+--+
| Rank | City      | Population | Density | Country     |  |
+------+-----------+------------+---------+-------------+--+
| 1    | Shanghai  | 24256800   | 3826    | China       |  |
+------+-----------+------------+---------+-------------+--+
| 2    | Karachi   | 23500000   | 6663    | Pakistan    |  |
+------+-----------+------------+---------+-------------+--+
| 4    | Lagos     | 21324000   | 18206   | Nigeria     |  |
+------+-----------+------------+---------+-------------+--+
| 5    | Delhi     | 16787941   | 11320   | India       |  |
+------+-----------+------------+---------+-------------+--+
| 7    | Istanbul  | 14160467   | 2593    | Turkey      |  |
+------+-----------+------------+---------+-------------+--+
| 8    | Tokyo     | 13297629   | 6075    | Japan       |  |
+------+-----------+------------+---------+-------------+--+
| 11   | Moscow    | 12197596   | 4859    | Russia      |  |
+------+-----------+------------+---------+-------------+--+
| 12   | São Paulo | 11895893   | 7821    | Brazil      |  |
+------+-----------+------------+---------+-------------+--+
| 14   | Jakarta   | 10075310   | 15171   | Indonesia   |  |
+------+-----------+------------+---------+-------------+--+
| 15   | Seoul     | 10048850   | 17134   | South Korea |  |
+------+-----------+------------+---------+-------------+--+



That is all for this month, see you next month. In the meantime, make sure to explore windowing/ranking functions as well as Common Table Expressions

Sunday, February 21, 2016

How to store twitter search results from R into SQL Server

I was messing around with R and grabbing twitter searches from within R, after that I decided to store the result in a csv file. Instead of stroring the results in a file, what if I want to store that info into a SQL Server database? This is pretty easy to do. If you want to try this out yourself, first head on over to my other post Using twitter from the R console. There you will be able to see how to setup OAuth so that you can use twitter within R, you will also see how to install the R twitteR package as well.

Time to get started....

First let's see how to connect to SQL Server from R
To load data from SQL Server into R, you first need to create an ODBC Data Source, I created one named LocalSqlServer, this DSN Data Source points to my local SQL Server 2016 instance.


After you have created your ODBC data source, you need to load the (RODBC) package, create a connection, execute a query, display the results and close the connection.. Here is what it looks like in the R Console



> require(RODBC)

>db <- odbcConnect("LocalSqlServer")

> Sample <- sqlQuery(db, "select * from sysobjects", stringsAsFactors=FALSE)

> head(Sample)
> close(db)  #close ORDBC connection
name id xtype uid info status base_schema_ver 1 sp_MSalreadyhavegeneration -1073624922 P 4 0 0 0 2 sp_MSwritemergeperfcounter -1072815163 P 4 0 0 0 3 TABLE_PRIVILEGES -1072372588 V 3 0 4 sp_replsetsyncstatus -1071944761 X 4 0 0 0 5 sp_replshowcmds -1070913306 P 4 0 0 0 6 sp_publishdb -1070573756 P 4 0 0 0

Now that we know how to connect to SQL Server, let's see if we can get the twitteR package to work, head on over to my other post Using twitter from the R console to see how to install twitterR and how some queries are executed.


Assuming the twitteR package is installed, we need to load it to use it


library(twitteR)

The next command is the command to enable OAuth, to see how to set this up go here Using twitter from the R console

setup_twitter_oauth("API key", "API secret", "Access token", "Access secret")
[1] "Using direct authentication"

Now it is time to do a simple search

tweets <- searchTwitter('#rstats', n=6) 
tweets





Running that command should have given you some results. However what we need as output is a dataframe, this way we can easily load this into a table in SQL Server. You can use the twListToDF function


tweets <- searchTwitter('#rstats', n=6) 
twListToDF(tweets)


The output now looks different



1                             RT @rquintino: @Mairos_B #sqlsatportugal session: all about R in #SqlServer 2016 #rstats https://t.co/DHrqIZrz1e
2                                                      a quick script to use imgcat in #rstats https://t.co/fpUlgWNX33 https://t.co/AhCCMLewCH
3 RT @KirkDBorne: Useful packages (libraries) for Data Analysis in R: https://t.co/haRKopFyly #DataScience #Rstats by @analyticsvidhya https:…
4                      Hey #rstats tweeps, do you have any readings to recommend on sensitivity analysis? Books/articles/websites all welcome.
5 RT @KirkDBorne: A Complete Tutorial on Time Series Modeling in R: https://t.co/7oI6JKyU4E #MachineLearning #DataScience #Rstats by @Analyti…
6 RT @KirkDBorne: A Complete Tutorial on Time Series Modeling in R: https://t.co/7oI6JKyU4E #MachineLearning #DataScience #Rstats by @Analyti…
  favorited favoriteCount replyToSN             created truncated replyToSID
1     FALSE             0        NA 2016-02-20 20:29:54     FALSE         NA
2     FALSE             0        NA 2016-02-20 20:24:50     FALSE         NA
3     FALSE             0        NA 2016-02-20 20:16:25     FALSE         NA
4     FALSE             0        NA 2016-02-20 20:11:08     FALSE         NA
5     FALSE             0        NA 2016-02-20 20:11:06     FALSE         NA
6     FALSE             0        NA 2016-02-20 20:02:05     FALSE         NA
                  id replyToUID
1 701141750161784834         NA
2 701140474019577856         NA
3 701138356466483204         NA
4 701137026075140096         NA
5 701137018508722176         NA
6 701134750296227840         NA
                                                                            statusSource
1                Mobile Web (M5)
Tweetbot for Mac
3   Twitter for Android
4                     Twitter Web Client
5     Twitter for iPhone
6                     Twitter Web Client
     screenName retweetCount isRetweet retweeted longitude latitude
1      psousa75            3      TRUE     FALSE        NA       NA
2      millerdl            0     FALSE     FALSE        NA       NA
3   diana_nario           50      TRUE     FALSE        NA       NA
4    emjonaitis            0     FALSE     FALSE        NA       NA
5       caryden           41      TRUE     FALSE        NA       NA
6 ArkangelScrap           41      TRUE     FALSE        NA       NA



The output now has a lot more stuff, you can see if it has been retweeted or favorited as well as the latitude, longtitude and more.

Time to send the result to SQL Server

There is a problem with how the sqlSave method creates the table, it makes the created column a float and then all inserts fail.


Run the following DDL statement in your database
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
CREATE TABLE [dbo].[tweets](
	[rownames] [nvarchar](255) NULL,
	[text] [nvarchar](255) NULL,
	[favorited] [nvarchar](5) NULL,
	[favoriteCount] [float] NULL,
	[replyToSN] [nvarchar](5) NULL,
	[created] [datetime] NULL,
	[truncated] [nvarchar](5) NULL,
	[replyToSID] [nvarchar](5) NULL,
	[id] [nvarchar](255) NULL,
	[replyToUID] [nvarchar](5) NULL,
	[statusSource] [nvarchar](255) NULL,
	[screenName] [nvarchar](255) NULL,
	[retweetCount] [float] NULL,
	[isRetweet] [nvarchar](5) NULL,
	[retweeted] [nvarchar](5) NULL,
	[longitude] [nvarchar](255) NULL,
	[latitude] [nvarchar](255) NULL
) ON [PRIMARY]

Now we are ready to save our twitter data in SQL Server,  here is how you do it.


> sqlSave(db,append = TRUE,colnames = FALSE,verbose = FALSE, addPK = FALSE, twListToDF(tweets))
Here is what the results look in SSMS



Make sure that you have append = TRUE, if you make it FALSE, the table will be recreated every time but since the command issued creates the wrong data type for the created column, you can't use it


So to recap, here are the commands you need after you loaded the twitterR package and have setup OAuth


> tweets <- searchTwitter('#rstats', n=6) #store the twitter search results
> twListToDF(tweets)  #convert to dataframe

> require(RODBC)  # load RODBC package
> db <- odbcConnect("LocalSqlServer")  #connect
> sqlSave(db,append = TRUE,colnames = FALSE,verbose = FALSE, addPK = FALSE, twListToDF(tweets))  #insert data into your table
> close(db)  #close ORDBC connection
 

Saturday, February 13, 2016

Quickly change m/d/yyyy dates to mm/dd/yyyy dates in Excel

Sometimes you get data in an Excel or csv file and it will be in a m/d/yyyy format.


Here is an example

1/1/2005
2/1/2012
2/12/2012
12/5/2015

This will look fine if you open the csv file in Excel but if you need to import this data into a database you might get an error that the format is not correct.

 What you want is the data to be in this format
01/01/2005
02/01/2012
02/12/2012
12/05/2015


If you use Format Cells and then pick Date you will see that there is no mm/dd/yyyy format What you have to do is pick the Custom Category and click on the m/d/yyyy format



Now change m/d/yyyy to mm/dd/yyyy



Click OK and you are all set, you will see that the dates are now in mm/dd/yyyy format

Let me just mention that the problem was not the conversion in SQL itself but a tool that did the import choked

BTW SQL Server can do this fairly easily

If I run the following

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
;with dates (date)
as (select '1/1/2001' union all
 select '11/1/2001' union all
 select '1/12/2001' union all
 select '5/5/2001' union all
 select '1/22/2001' union all
 select '12/1/2001' union all
 select '01/01/2001' )

 select convert(date,date)
 from dates

I will get this as output

2001-01-01
2001-11-01
2001-01-12
2001-05-05
2001-01-22
2001-12-01
2001-01-01

That of course is just a date, if you want mm/dd/yyyy format, you can use convert and style 101...

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
;with dates (date)
as (select '1/1/2001' union all
 select '11/1/2001' union all
 select '1/12/2001' union all
 select '5/5/2001' union all
 select '1/22/2001' union all
 select '12/1/2001' union all
 select '01/01/2001' )

 select convert(varchar(10),convert(date,date),101)
 from dates

 select convert(date,date)
 from dates

And here is that output

01/01/2001
11/01/2001
01/12/2001
05/05/2001
01/22/2001
12/01/2001
01/01/2001

Monday, January 25, 2016

The one change I make to each tSQLt database

Whenever I put the tSQLt database on a server, I make sure that I make one change to a stored procedure that ships with tSQLt. The reason for this is that we run our tSQLT tests as part of our automated build process. If any of the tests fail, we developers who are part of that particular project will get an email telling us what the results of the tests were.

The main proc that runs the unit tests always returns 0, I need a way to know if any tests failed or not, if we had failures, we want the email to go out, if there were no failures then we don't want the email to go out.

Here is how I modified the RunAll stored procedure


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
ALTER PROCEDURE [tSQLt].[RunAll]
AS
BEGIN
  DECLARE @TestResultFormatter NVARCHAR(MAX);
  SELECT @TestResultFormatter = tSQLt.GetTestResultFormatter();
  
  EXEC tSQLt.Private_RunAll @TestResultFormatter;

  -- Added to return -1 instead  of the default 0
  IF EXISTS (SELECT * FROM tSQLt.TestResult WHERE result = 'Failure')
 RETURN -1

END;

I added the stuff on lines 9 till 11. I am checking if there were any tests that failed, these tests will have a result of Failure. If any tests failed, the stored procedure will return -1. Now we can easily check this value to determine if we should send the email or not.

I could have also checked if there were any tests which had a result which was not Success instead. A test could also have a status which is Error, you will get such a status if for example an object doesn't exist in the database.

There you have it, this is the change I made. Maybe you had the same problem and this will help you as well.