Thursday, March 31, 2016

Microsoft releases SQL Server Horizontica, a Vertica killer



Amsterdam, Noord Holland, Nederlands, European Union. — April 1, 2016 — Microsoft Corp. on Friday morning announced that it has released the next version of its flagship Relational (and beyond) Database Management System SQL Server Horizontica.

Here is a quick rundown of all the new, exciting and shiny things......

Goes to eleven.
What we have done here is, if you select this option, we will offload very CPU intensive operations to the GPU. Think about it, GPUs are many times faster than CPUs… hookup a bunch of graphic card to you server and your CPU utilization goes down by 99%. Here is what an external GPU rig look like from one of our biggest beta testers



Mr Stein who is a member of that team will explain what he likes the most about Goes To Eleven.

Frank E. Stein

The best part about Goes To Eleven is that if the GPU used capacity is below 50%, SQL Server will automatically use the excess capacity to start bitcoin mining. This in turn will lower the cost of running SQL Server, it really is a no brainer. When you need capacity it is there, when you don't need it, you earn money. Being a former Oracle customer where the licensing model is "How much money do you have?" this is really welcome.

Some more new goodies

BSON
Every vendor has JSON by this point, we took it s step further and added Binary JSON also known as BSON support


MINDOP
Minimum degree of parallelism. Brand new in this version, you can now specify the minimum number of CPUs/GPUs that a query should use. If you know your query is going to return a lot of data...instead of adding indexes...think big and use a minimum of 128 CPUs/GPUs for your query

Rowstore
When we looked at Vertica and Sybase IQ customers we noticed a lot of queries executed were using SELECT *, this is very slow with a columnar data storage. With our patent pending rowstore, it is as fast as the traditional page store. The ROI is immense, no more do programmers and brogrammers have to find all the code where a table is used to add the additional column, SELECT * is always correct. Rowstore is also more natural, when getting the data back people read from left to right or right to left, people don't read from top to bottom, since the data is already stored in rows...no transform needs to happen, this also makes data display faster.

Variable page sizes
Instead of just 8K page sizes, SQL Server now also offers the following page sizes: 2K, 4K, 8K, 16K, 32K, 64K and 640K. The reason we have 640K is because that is the max page size anyone will ever need, you probably will not use it now, but you might in 10 years, we are already prepared for that

Smallishint
Smallishint is a 3 byte integer data type, bigger than tinyint and smaller than smallint. You can store values up to 4096

Myriaton
We took the concepts of Hekaton and got another 100x improvement

One of our customers will explaining how they are using it and how it has improved their applications

Dr A. Cula


I am Dr A. Cula and I work for Ichor, a company specializing in eternal life. We are running a big blood bank, we need to keep this stocked for our mmm inner circle members. We use tagging of subjects so that we can go and mmm get more blood when needed. We were always running into problems when summer came along..our mmm dispatchers and collectors don't like to work when it is sunny. So we needed really good performance during summer hours. After switching SQL Server Horizontica and using Myriaton we now are so efficient that we opened up 4 more blood banks. We could never have envisioned something like this was possible, because of this we also expanded our inner circle by 200%


Universal Stretch and Universal Availability Groups
This is really exciting, instead of a regular stretch database on Azure, Universal Stretch will copy your data on one of our data centers located in the universe. This year it is just on the Moon where we have a data center already, in 2 years you can also use the one we are currently sending to Mars. Then when we start sending people to Mars, you are already open for business...the data is already there and it will be synchronized between Earth, Moon and Mars. Universal Availability Groups work the same way, you plug in the planet or moon and we take care of the rest.


ChandarUma ChandranJaya (also known as CU) program manager on the SQL Server programmability team will fill us in on some additional new features not mentioned yet.

New JOINS

ANTI JOIN
This is a brand new JOIN and it is ANSI SQL 2022 compliant. It is basically a JOIN between 2 tables on a column where the values are not the same (Think like NOT EXISTS)


SELECT *
FROM Table1 t1
ANTI JOIN Table2 t2
ON t1.ID = T2


CUBE JOIN
The CUBE JOIN takes the CROSS JOIN to whole other level...instead of returning all rows in one table for each row in the other tables.... the CUBE JOIN returns all values in all columns in all rows in one table for each value in all columns in all rows in another table... it is like a hypercube...


SELECT * 
FROM Table1 t1
CUBE JOIN Table2 t2

NATURAL JOIN
The  NATURAL JOIN joins two tables on columns with the same name


SELECT * 
FROM Table1 t1
NATURAL JOIN Table2 t2

KEY JOIN
The KEY JOIN is really cool, what it does it joins tables based on primary and foreign keys, no need to specify these in the join condition anymore


SELECT * 
FROM Table1 t1
KEY JOIN Table2 t2


New Traceflags
Traceflag 666
This is a brand new traceflag, it will give your SQL Server instance hellish performance. It acts like a Double Distributed Denial Of Service (DDDOS) attack against you SQL Server instance...... IO, Memory.... everything will be completely saturated... however you can still use the DAC (Dedicated Administrative Connection) to get in and run queries against DMVs to check the Subsystem poerformance

New functions
We have implemented the following new functions
IsDaylightSavingsTime
IsNotWeekend
Trim
sp_MsForEachServer
DBCC CheckCrappyCode
DBCC
Quartile
DCOUNT
DAYS360
IsOdd
IsNontext
Gigo
Lifo
Fifo
Fido

Specialized indexes
We now have the following new indexes

DateIndex  designed to work with all the date data types

FunctionIndex, finally you can index for example the first 5 characters of a column only, you can use all the functions available in SQL Server

Swift
Swift is a brand new storage engine, you can use this when you don't need constraints, keys..after all real developers manage those in the app tier. You enable this per database and you can make the same face as Taylor when thinking how you don't have to deal with all those nasty constrains that just get in the way. The beauty of Swift is that if you do have constraints in your DDL scripts, they will be just silently ignored. No more scripts that have to run in a specific order...it always runs fine..., your manager will love, no adore you..



There are many more things that are new, you can visit the SQL Server homepage to find out even more exciting stuff









Monday, March 28, 2016

The most interesting Build sessions for people working with data



Microsoft's Build conference is happening this week, the schedule for all the sessions was just made available. Below are all the sessions that will be most interesting to people who work with data. Clicking on the session's title will bring you to the Channel 9 page that will have the content. These session recording are of course not available yet, you will need to wait till after the conference is done to see them.



Intelligent Data Driven Applications that Learn and Adapt
Shawn Bice, Pablo Castro
Azure, Data, Intelligent App, Machine Learning
200 - Intermediate

Applications show intelligence when they can spot trends, react to unusual events, predict outcomes or recommend choices. Learn how to introduce intelligence traits into your apps including; establishing feedback loops, applying big data and machine learning techniques to classify, predict or otherwise analyze explicit and implicit signals, and operationalizing the full stack into the regular usage flow of the app. Most every day apps, from consumer to enterprise can deliver greater customer or business benefit by learning from user behavior and other signals. In this session we’ll take a pragmatic look at introducing real, useful data-driven intelligence into apps by walking through services, code and data needed to make it happen.


SQL Database Technologies for Developers
Lindsey Allen, Tony Petrossian
Azure, Data, Intelligent App
200 - Intermediate

Microsoft offers SQL Server and Azure SQL Database to help you develop great relational database applications. In this session you will learn about the top developer features coming in SQL Server 2016 which are already in Azure SQL Database. Additionally, you will see the latest investments in Azure SQL Database that enable you to easily manage thousands of databases and get the performance and security insight needed to build robust and secure applications in the cloud.



Adding Power BI Data Experiences to Your Applications
Josh Caplan, Lukasz Pawlowski
Analytics, Data
300 - Experienced

Bring stunning and interactive Power BI data experience to your applications. Learn how to integrate to Power BI using the Power BI REST APIs and how to build your own custom visualizations.


Delivering Applications at Scale with DocumentDB, Azure's NoSQL Document Database
John Macintyre, Dharma Shukla
Azure, Data, Intelligent App
200 - Intermediate

What does a #1 app on the iTunes App Store and an installment of one of the most successful gaming franchises in history have in common? They both use Azure DocumentDB to ingest massive volumes of data and serve low latency queries to provide great user experiences. Come learn about the business goals and technical challenges faced by two real-world, immensely popular applications and why they chose to use Azure DocumentDB, as well as the application patterns they used to achieve their massive scale requirements.



Azure IoT: Complete Cloud Offerings for the IoT Revolution
Sam George, Cameron Skinner
Azure, Data, Internet of Things
200 - Intermediate

IoT is the next revolution in computing and an incredibly exciting space. A critical part of IoT is cloud based solutions that enable you to connect, secure and manage IoT devices as well as providing deep insights from IoT data. Azure is all-in on IoT, this session will cover our industry leading offerings. Come be part of the next revolution!


Data Science for Developers
Danielle Dean, Daniel Grecoe
Analytics, Data
200 - Intermediate

In this session you will learn about Azure Machine Learning through a comprehensive end-to-end example that he builds during the session and that encompasses: •Problem detection •Algorithm selection •Machine learning model creation and deployment as a RESTful web service •Consumption of the machine learning model The session is intended for engineers, and Dan himself is an engineer, so he does not delve into a deep understanding of complex mathematical models behind machine learning, but instead focuses on the concepts of machine learning to demystify cloud machine learning.


Azure Data Lake and Azure Data Warehouse: Applying Modern Practices to Your App
Tim Mallalieu, Lara Rubbelke
Azure, Data
200 - Intermediate

Azure Data Warehouse and Azure Data Lake are two new services designed to work with all of your data no matter how big or complex. With two strong options to store, process and analyze large volumes of data, you may be curious about which service is right for your application needs. This session will drill into the strengths of each service and walk through the common application patterns for integration of large scale data analysis in the cloud using these services.


Advanced Analytics with R and SQL
Nagesh Pabbisetty, Tobias Ternstrom
Analytics, Data, Intelligent App
200 - Intermediate

R is the lingua franca of Analytics. SQL is the world’s most popular database language. What magic can you make happen by combining the power of R and SQL for Data Science and Advanced Analytics? Imagine the power of exploring, transforming, modeling, and scoring data at scale from the comfort of your favorite R environment. Now, imagine operationalizing the models you create directly in SQL Server, allowing your applications to use them from T-SQL, executed right where your data resides. Come learn how to build and deploy intelligent applications that combine the power of R, SQL Server, thousands of open source R extension packages, and high-performance implementations of the most popular machine learning algorithms at scale.


Insight from Exhaust, Enriching Your Applications with Data
Matthew Winkler
Azure, Analytics, Data, Analytics/Telemetry, .NET, C#, Machine Learning
200 - Intermediate

Modern cloud applications provide new opportunities every day for greater insights into our customers, code and quality using the signals in the application’s exhaust. Using the data produced by users and applications, developers can easily improve the experience for users and increase their engagement within the apps they are using. Starting with a typical application deployed in Azure, this talk will walk through the steps required to build a more intelligent application, customized for each user to provide a more engaging experience. We’ll add more engaging recommendations tailored to a user, more relevant in-application search results, and gain deeper understanding the application’s users. Additionally, we’ll cover how to process this telemetry as it is created to enable us to react in real time to changes in the application. No knowledge of machine learning, data science, or big data is required, by the end you’ll learn to use all three to create a richer application.

Tuesday, March 08, 2016

How to get SQL Server on Linux?


In order to get SQL Server on Linux you need to go to the following page: https://www.microsoft.com/en-us/server-cloud/sql-server-on-linux.aspx. Click on the part where is says Sign up to stay informed and to apply for the preview. It will bring you to a signup page that looks like this



After you filled out all the information required you will have to wait for a reply. Be aware that currently the SQL Server on Linux preview is limited to Ubuntu or as a Docker image

Monday, March 07, 2016

SQL Server 2016 Release Candidate 0 now available for download


The SQL Server engineering team just announced the availability of SQL Server 2016 Release Candidate 0.


To learn more about the release, visit the SQL Server 2016 preview page. To experience the new, exciting features in SQL Server 2016 and the new rapid release model, download the preview and start evaluating the impact these new innovations can have for your business.


For more information on this release, see SQL Server 2016 Release Notes and What’s New in SQL Server 2016.

Some of the more interesting things added is that instead of a trace flag, you can now execute a 
ALTER DATABASE SCOPED CONFIGURATION command.

It looks like this

ALTER DATABASE SCOPED CONFIGURATION
{
[FOR SECONDARY]
{ SET  |  CLEAR PROCEDURE_CACHE }
}[;]
< set_options > ::=
{ MAXDOP = { | PRIMARY}
| LEGACY_CARDINALITY_ESTIMATION = { ON | OFF | PRIMARY}
| PARAMETER_SNIFFING = { ON | OFF | PRIMARY}
| QUERY_OPTIMIZER_HOTFIXES = { ON | OFF | PRIMARY}
| ONLINE_OPERATION_DEFAULT = {ON | OFF} }


There is also finally a built in split string function.

It looks like this

SELECT * FROM STRING_SPLIT('Hi how are you doing?',' ')

The result is:

value
—–
Hi
how
are
you
doing?

Finally, there will be an event on March 10, 2016: SQL Server 2016 — Take the lead, watch the live stream

Get the RC 0  here: SQL Server 2016 Release Candidate 0 

Friday, March 04, 2016

Gotcha with the LEAD analytical function and date datatype

The other day some code that used the LEAD analytical function stopped working suddenly. It started to throw an error.

Msg 206, Level 16, State 2, Line 1
Operand type clash: date is incompatible with int

The code wasn't touched in a long time... so what could be the issue. Let's reproduce it here


First create the following table


CREATE TABLE testRunningTotal(Id tinyint,SomeDate datetime, Charge decimal(20,2))

INSERT testRunningTotal
SELECT 1,'20120101',1000
UNION ALL
SELECT 1,'20120401',200
UNION ALL
SELECT 1,'20120501',300
UNION ALL
SELECT 1,'20120601',600
UNION ALL
SELECT 2,'20120101',100
UNION ALL
SELECT 2,'20130101',500
UNION ALL
SELECT 2,'20140101',-800
UNION ALL
SELECT 3,'20120101',100

Running this query to see what is in this table


SELECT * FROM testRunningTotal

Id SomeDate Charge
1 2012-01-01 00:00:00.000 1000.00
1 2012-04-01 00:00:00.000 200.00
1 2012-05-01 00:00:00.000 300.00
1 2012-06-01 00:00:00.000 600.00
2 2012-01-01 00:00:00.000 100.00
2 2013-01-01 00:00:00.000 500.00
2 2014-01-01 00:00:00.000 -800.00
3 2012-01-01 00:00:00.000 100.00


Now run our fancy query...


SELECT id, someDate as StartDate,
LEAD(SomeDate -1,1,'99991231') 
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM testRunningTotal
  ORDER BY id, SomeDate


Here are our results

id StartDate Enddate Charge RunningTotal
1 2012-01-01 00:00:00.000 2012-03-31 00:00:00.000 1000.00 1000.00
1 2012-04-01 00:00:00.000 2012-04-30 00:00:00.000 200.00 1200.00
1 2012-05-01 00:00:00.000 2012-05-31 00:00:00.000 300.00 1500.00
1 2012-06-01 00:00:00.000 9999-12-31 00:00:00.000 600.00 2100.00
2 2012-01-01 00:00:00.000 2012-12-31 00:00:00.000 100.00 100.00
2 2013-01-01 00:00:00.000 2013-12-31 00:00:00.000 500.00 600.00
2 2014-01-01 00:00:00.000 9999-12-31 00:00:00.000 -800.00 -200.00
3 2012-01-01 00:00:00.000 9999-12-31 00:00:00.000 100.00 100.00


To see what the query actually does, visit this post Easy running totals with windowing functions

Let's take a closer look at the LEAD analytical function, this is what Books On Line has on its page

  LEAD ( scalar_expression [ ,offset ] , [ default ] ) 
    OVER ( [ partition_by_clause ] order_by_clause )

offset
The number of rows forward from the current row from which to obtain a value. If not specified, the default is 1. offset can be a column, subquery, or other expression that evaluates to a positive integer or can be implicitly converted to bigint. offset cannot be a negative value or an analytic function.

Looks all cool to me..nothing that stands out that could cause a problem...



What will happen if someone changes the column from a datetime to a date? Maybe they looked at the data and saw that all the values had only dates and no time components... why use 8 bytes of storage when you can use 3 instead?



To find out exactly what will happen, run the following DDL statement....


ALTER TABLE testRunningTotal ALTER COLUMN SomeDate date


Now go run your query again


SELECT id, someDate as StartDate,
LEAD(SomeDate -1,1,'99991231') 
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM testRunningTotal
  ORDER BY id, SomeDate

Here is the error you get

  Msg 206, Level 16, State 2, Line 1
Operand type clash: date is incompatible with int


The error is the same as if you would run the following


SELECT CONVERT(DATE,GETDATE() )-1

Instead what you have to do is use the DATEADD function to do arithmetic with dates


SELECT DATEADD(dd,-1,CONVERT(DATE,GETDATE()))


So to fix our query from before, we want to change the -1 to be DATEADD(dd,-1 instead, here is what our query looks like


 SELECT id, someDate as StartDate,
LEAD(dateadd(dd,-1,SomeDate),1,'99991231') 
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM testRunningTotal
  ORDER BY id, SomeDate

And here are our results, it's all good now

id StartDate Enddate Charge RunningTotal
1 2012-01-01 2012-03-31 1000.00 1000.00
1 2012-04-01 2012-04-30 200.00 1200.00
1 2012-05-01 2012-05-31 300.00 1500.00
1 2012-06-01 9999-12-31 600.00 2100.00
2 2012-01-01 2012-12-31 100.00 100.00
2 2013-01-01 2013-12-31 500.00 600.00
2 2014-01-01 9999-12-31 -800.00 -200.00
3 2012-01-01 9999-12-31 100.00 100.00


I still think that maybe the LEAD function should do this conversion internally since all we are doing is specifying the OFFSET...what is your opinion?


Thursday, March 03, 2016

Did you know I have a github repository for stuff I run over and over?



There are queries or commands that I run every now and then. I typically don’t have these stored on my computer; to find this code I usually find the blog post I made and copy paste it from there. This becomes a pain in the neck sometimes because I don’t quite remember the title of the post, so I might be spending a good amount of time looking for it. I decided to put some of the code that I run frequently up on github.

You will see code to find index fragmentation as well code to concatenate rows into columns. So there are two kinds of code here. Code that you can run as is and code that you will plug into a stored procedure after you make changes to the table names. You can find that github repository here https://github.com/SQLMenace/CommonTasks

Besides the Common Tasks repository, I also have a Today I Learned repository. A collection of concise write-ups on small things I learn day to day across a variety of languages and technologies. These are things that don't really warrant a full blog post.

You can find that project here https://github.com/SQLMenace/til

With the Today I Learned repository you will find that there are not many things yet. The reason is that I just started this repository, but since SQl Server 2016 is just around the corner, I will start playing around with SQL Server 2016 more and more, thus the Today I Learned repository will become bigger soon.

Wednesday, March 02, 2016

Easy running totals with windowing functions

Back in the pre SQL Server 2012 days, if you wanted to do a running count, you either had to run a subquery or you could use a variable. This was slow because for each row the query that did the sum would be executed. With the additions to the windowing functions in SQL Server 2012, this is now running much faster. 

Let's take a look, first create the following table


CREATE TABLE #test(Id tinyint,SomeDate date, Charge decimal(20,10))

insert #test
SELECT 1,'20120101',1000
UNION ALL
SELECT 1,'20120401',200
UNION ALL
SELECT 1,'20120501',300
UNION ALL
SELECT 1,'20120601',600
UNION ALL
SELECT 2,'20120101',100
UNION ALL
SELECT 2,'20130101',500
UNION ALL
SELECT 2,'20140101',-800
UNION ALL
SELECT 3,'20120101',100


let's check that data we just inserted into the temporary table


SELECT * FROM #test


The output looks like this

Id SomeDate Charge
1 2012-01-01 1000.0000000000
1 2012-04-01 200.0000000000
1 2012-05-01 300.0000000000
1 2012-06-01 600.0000000000
2 2012-01-01 100.0000000000
2 2013-01-01 500.0000000000
2 2014-01-01 -800.0000000000
3 2012-01-01 100.0000000000


What we want is the following

id StartDate Enddate         Charge         RunningTotal
1 2012-01-01 2012-03-31 1000.0000000000 1000.0000000000
1 2012-04-01 2012-04-30 200.0000000000 1200.0000000000
1 2012-05-01 2012-05-31 300.0000000000 1500.0000000000
1 2012-06-01 9999-12-31 600.0000000000 2100.0000000000
2 2012-01-01 2012-12-31 100.0000000000 100.0000000000
2 2013-01-01 2013-12-31 500.0000000000 600.0000000000
2 2014-01-01 9999-12-31 -800.0000000000 -200.0000000000
3 2012-01-01 9999-12-31 100.0000000000 100.0000000000

For each row, we want to have the date that the row starts on and also the date when it end, we also want a running total as well. If there is no row after the current row for that id, we want the end date to be 9999-12-31.

So we will use a couple of functions. The first one is LEAD, LEAD accesses data from a subsequent row in the same result set without the use of a self-join. So the LEAD part looks like this

LEAD(dateadd(dd,-1,SomeDate),1,'99991231')  OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,

What we are doing is subtracting 1 from the date in the subsequent row (ateadd(dd,-1,SomeDate))
We are using 1 as the offset since we want to apply this to the next row. Finally if there is no subsequent row, we want to use the date 9999-12-31 instead of NULL

To do the running count, we will do the following

SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)
          AS RunningTotal

What this means in English is for each id ordered by date, sum up the charge values for the rows between the preceding rows and the current row. Here is what all that stuff means.

ROWS BETWEEN
Specifies the rows that make up the range to use as implied by

UNBOUNDED PRECEDING
Specifies that the window starts at the first row of the partition. UNBOUNDED PRECEDING can only be specified as window starting point.

CURRENT ROW
Specifies that the window starts or ends at the current row when used with ROWS or the current value when used with RANGE.
CURRENT ROW can be specified as both a starting and ending point.

And here is the query


SELECT id, someDate as StartDate,
LEAD(dateadd(dd,-1,SomeDate),1,'99991231') 
 OVER (PARTITION BY id ORDER BY SomeDate) as Enddate,
  Charge,
  SUM(Charge) OVER (PARTITION BY id ORDER BY SomeDate 
     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) 
          AS RunningTotal
  FROM #test
  ORDER BY id, SomeDate


And running that query, gives us the running count as well as the dates

id StartDate Enddate         Charge         RunningTotal
1 2012-01-01 2012-03-31 1000.0000000000 1000.0000000000
1 2012-04-01 2012-04-30 200.0000000000 1200.0000000000
1 2012-05-01 2012-05-31 300.0000000000 1500.0000000000
1 2012-06-01 9999-12-31 600.0000000000 2100.0000000000
2 2012-01-01 2011-12-31 100.0000000000 100.0000000000
2 2012-01-01 2011-13-31 500.0000000000 600.0000000000
2 2012-01-01 9999-12-31 -800.0000000000 -200.0000000000
3 2012-01-01 9999-12-31 100.0000000000 100.0000000000
That's all for this post...

Monday, February 29, 2016

#BackToBasics returning rows with maximum values from a table



In the A challenge for 2016....... accepted! post I said that I would write a post once every month that would explain very basic stuff about SQL Server. Today is the second post, in this post we will take a look at how you can return the row with the maximum value for a group.

In order to explain the code in this post, I decided to grab the top largest cities in the world as the data. You can find the data here List of cities proper by population

 Below is the SQL code to create a table and also the data to populate this table

create table #Temp( Rank int not null, 
 City varchar(100) not null,
 Population int not null,
 Density int not null,
 Country varchar(100))

INSERT #Temp
SELECT 1,'Shanghai',24256800,3826,'China' union all
SELECT 2,'Karachi',23500000,6663,'Pakistan' union all
SELECT 3,'Beijing',21516000,1311,'China' union all
SELECT 4,'Lagos',21324000,18206,'Nigeria' union all
SELECT 5,'Delhi',16787941,11320,'India' union all
SELECT 6,'Tianjin',15200000,1293,'China' union all
SELECT 7,'Istanbul',14160467,2593,'Turkey' union all
SELECT 8,'Tokyo',13297629,6075,'Japan' union all
SELECT 9,'Guangzhou',13080500,1759,'China' union all
SELECT 10,'Mumbai',12478447,20680,'India' union all
SELECT 11,'Moscow',12197596,4859,'Russia' union all
SELECT 12,'São Paulo',11895893,7821,'Brazil' union all
SELECT 13,'Shenzhen',10467400,5256,'China' union all
SELECT 14,'Jakarta',10075310,15171,'Indonesia' union all
SELECT 15,'Seoul',10048850,17134,'South Korea' 

Now that we have the data in our table, let's run a simple query to return the row with the most populous city in the world. You can accomplish this in SQL Server by using the TOP operator and ordering the data by population in descending order.

SELECT TOP 1 * FROM #Temp
ORDER BY population DESC

Here is the output from that query


Rank City  Population Density Country
1 Shanghai 24256800 3826 China


What we want to do next is return one row the most populous city per country.

So let's take it a step at a time, the query below will give you the maximum population by country

SELECT MAX(population) as BiggestPopulation,Country
FROM #Temp
GROUP BY Country
ORDER BY BiggestPopulation DESC

Here is the output

BiggestPopulation Country
24256800  China
23500000  Pakistan
21324000  Nigeria
16787941  India
14160467  Turkey
13297629  Japan
12197596  Russia
11895893  Brazil
10075310  Indonesia
10048850  South Korea

What we can do is join the query we have above with the table itself on the country and population columns. The query will look like this

SELECT T.* FROM(
SELECT MAX(population) as BiggestPopulation,Country
FROM #Temp
GROUP BY Country) X
JOIN #Temp t on x.BiggestPopulation = t.Population
AND t.Country = x.Country
ORDER BY t.Population DESC

Here is the output

+------+-----------+------------+---------+-------------+
| Rank | City      | Population | Density | Country     |
+------+-----------+------------+---------+-------------+
| 1    | Shanghai  | 24256800   | 3826    | China       |
+------+-----------+------------+---------+-------------+
| 2    | Karachi   | 23500000   | 6663    | Pakistan    |
+------+-----------+------------+---------+-------------+
| 4    | Lagos     | 21324000   | 18206   | Nigeria     |
+------+-----------+------------+---------+-------------+
| 5    | Delhi     | 16787941   | 11320   | India       |
+------+-----------+------------+---------+-------------+
| 7    | Istanbul  | 14160467   | 2593    | Turkey      |
+------+-----------+------------+---------+-------------+
| 8    | Tokyo     | 13297629   | 6075    | Japan       |
+------+-----------+------------+---------+-------------+
| 11   | Moscow    | 12197596   | 4859    | Russia      |
+------+-----------+------------+---------+-------------+
| 12   | São Paulo | 11895893   | 7821    | Brazil      |
+------+-----------+------------+---------+-------------+
| 14   | Jakarta   | 10075310   | 15171   | Indonesia   |
+------+-----------+------------+---------+-------------+
| 15   | Seoul     | 10048850   | 17134   | South Korea |
+------+-----------+------------+---------+-------------+


SQL Server 2005 introduced a bunch of ranking/windowing functions which make writing these queries a little simpler. We are going to use ROW_NUMBER for this post. Here is what the query will look like


SELECT ROW_NUMBER() OVER (PARTITION By [what do you want it grouped/partitioned by?] ORDER BY [what do you want it ordered by?]) AS Row,* 
FROM Table

In our case, we want to partition the result set by country, we also want to order each partition by population descending. The query looks like this


SELECT ROW_NUMBER() OVER (PARTITION By Country ORDER BY population DESC) AS Row,*
FROM #Temp
ORDER BY Row,Population DESC

Here is what that query returns

+-----+------+-----------+------------+---------+-------------+
| Row | Rank | City      | Population | Density | Country     |
+-----+------+-----------+------------+---------+-------------+
| 1   | 1    | Shanghai  | 24256800   | 3826    | China       |
+-----+------+-----------+------------+---------+-------------+
| 1   | 2    | Karachi   | 23500000   | 6663    | Pakistan    |
+-----+------+-----------+------------+---------+-------------+
| 1   | 4    | Lagos     | 21324000   | 18206   | Nigeria     |
+-----+------+-----------+------------+---------+-------------+
| 1   | 5    | Delhi     | 16787941   | 11320   | India       |
+-----+------+-----------+------------+---------+-------------+
| 1   | 7    | Istanbul  | 14160467   | 2593    | Turkey      |
+-----+------+-----------+------------+---------+-------------+
| 1   | 8    | Tokyo     | 13297629   | 6075    | Japan       |
+-----+------+-----------+------------+---------+-------------+
| 1   | 11   | Moscow    | 12197596   | 4859    | Russia      |
+-----+------+-----------+------------+---------+-------------+
| 1   | 12   | São Paulo | 11895893   | 7821    | Brazil      |
+-----+------+-----------+------------+---------+-------------+
| 1   | 14   | Jakarta   | 10075310   | 15171   | Indonesia   |
+-----+------+-----------+------------+---------+-------------+
| 1   | 15   | Seoul     | 10048850   | 17134   | South Korea |
+-----+------+-----------+------------+---------+-------------+
| 2   | 3    | Beijing   | 21516000   | 1311    | China       |
+-----+------+-----------+------------+---------+-------------+
| 2   | 10   | Mumbai    | 12478447   | 20680   | India       |
+-----+------+-----------+------------+---------+-------------+
| 3   | 6    | Tianjin   | 15200000   | 1293    | China       |
+-----+------+-----------+------------+---------+-------------+
| 4   | 9    | Guangzhou | 13080500   | 1759    | China       |
+-----+------+-----------+------------+---------+-------------+
| 5   | 13   | Shenzhen  | 10467400   | 5256    | China       |
+-----+------+-----------+------------+---------+-------------+


What we really want is the top 10 rows. You might think that you can just add WHERE Row = 1 to the query. This is not allowed since the WHERE clause gets executed before the ROW_Number function as part of the SELECT. What you can do is use a common table expression and then specify WHERE Row = 1 when selecting from the CTE

Here is what it looks like

;WITH CTE AS(
SELECT ROW_NUMBER() OVER (PARTITION By Country ORDER BY population DESC) AS Row,*
FROM #Temp)

SELECT Rank, City,Population,Density,Country
FROM CTE
WHERE Row = 1
ORDER BY Population DESC

And here is our output, for each country we have only one row, you can also see that 4 cities from the top 10 are not in this list, those cities are in either China or India

+------+-----------+------------+---------+-------------+--+
| Rank | City      | Population | Density | Country     |  |
+------+-----------+------------+---------+-------------+--+
| 1    | Shanghai  | 24256800   | 3826    | China       |  |
+------+-----------+------------+---------+-------------+--+
| 2    | Karachi   | 23500000   | 6663    | Pakistan    |  |
+------+-----------+------------+---------+-------------+--+
| 4    | Lagos     | 21324000   | 18206   | Nigeria     |  |
+------+-----------+------------+---------+-------------+--+
| 5    | Delhi     | 16787941   | 11320   | India       |  |
+------+-----------+------------+---------+-------------+--+
| 7    | Istanbul  | 14160467   | 2593    | Turkey      |  |
+------+-----------+------------+---------+-------------+--+
| 8    | Tokyo     | 13297629   | 6075    | Japan       |  |
+------+-----------+------------+---------+-------------+--+
| 11   | Moscow    | 12197596   | 4859    | Russia      |  |
+------+-----------+------------+---------+-------------+--+
| 12   | São Paulo | 11895893   | 7821    | Brazil      |  |
+------+-----------+------------+---------+-------------+--+
| 14   | Jakarta   | 10075310   | 15171   | Indonesia   |  |
+------+-----------+------------+---------+-------------+--+
| 15   | Seoul     | 10048850   | 17134   | South Korea |  |
+------+-----------+------------+---------+-------------+--+



That is all for this month, see you next month. In the meantime, make sure to explore windowing/ranking functions as well as Common Table Expressions

Sunday, February 21, 2016

How to store twitter search results from R into SQL Server

I was messing around with R and grabbing twitter searches from within R, after that I decided to store the result in a csv file. Instead of stroring the results in a file, what if I want to store that info into a SQL Server database? This is pretty easy to do. If you want to try this out yourself, first head on over to my other post Using twitter from the R console. There you will be able to see how to setup OAuth so that you can use twitter within R, you will also see how to install the R twitteR package as well.

Time to get started....

First let's see how to connect to SQL Server from R
To load data from SQL Server into R, you first need to create an ODBC Data Source, I created one named LocalSqlServer, this DSN Data Source points to my local SQL Server 2016 instance.


After you have created your ODBC data source, you need to load the (RODBC) package, create a connection, execute a query, display the results and close the connection.. Here is what it looks like in the R Console



> require(RODBC)

>db <- odbcConnect("LocalSqlServer")

> Sample <- sqlQuery(db, "select * from sysobjects", stringsAsFactors=FALSE)

> head(Sample)
> close(db)  #close ORDBC connection
name id xtype uid info status base_schema_ver 1 sp_MSalreadyhavegeneration -1073624922 P 4 0 0 0 2 sp_MSwritemergeperfcounter -1072815163 P 4 0 0 0 3 TABLE_PRIVILEGES -1072372588 V 3 0 4 sp_replsetsyncstatus -1071944761 X 4 0 0 0 5 sp_replshowcmds -1070913306 P 4 0 0 0 6 sp_publishdb -1070573756 P 4 0 0 0

Now that we know how to connect to SQL Server, let's see if we can get the twitteR package to work, head on over to my other post Using twitter from the R console to see how to install twitterR and how some queries are executed.


Assuming the twitteR package is installed, we need to load it to use it


library(twitteR)

The next command is the command to enable OAuth, to see how to set this up go here Using twitter from the R console

setup_twitter_oauth("API key", "API secret", "Access token", "Access secret")
[1] "Using direct authentication"

Now it is time to do a simple search

tweets <- searchTwitter('#rstats', n=6) 
tweets





Running that command should have given you some results. However what we need as output is a dataframe, this way we can easily load this into a table in SQL Server. You can use the twListToDF function


tweets <- searchTwitter('#rstats', n=6) 
twListToDF(tweets)


The output now looks different



1                             RT @rquintino: @Mairos_B #sqlsatportugal session: all about R in #SqlServer 2016 #rstats https://t.co/DHrqIZrz1e
2                                                      a quick script to use imgcat in #rstats https://t.co/fpUlgWNX33 https://t.co/AhCCMLewCH
3 RT @KirkDBorne: Useful packages (libraries) for Data Analysis in R: https://t.co/haRKopFyly #DataScience #Rstats by @analyticsvidhya https:…
4                      Hey #rstats tweeps, do you have any readings to recommend on sensitivity analysis? Books/articles/websites all welcome.
5 RT @KirkDBorne: A Complete Tutorial on Time Series Modeling in R: https://t.co/7oI6JKyU4E #MachineLearning #DataScience #Rstats by @Analyti…
6 RT @KirkDBorne: A Complete Tutorial on Time Series Modeling in R: https://t.co/7oI6JKyU4E #MachineLearning #DataScience #Rstats by @Analyti…
  favorited favoriteCount replyToSN             created truncated replyToSID
1     FALSE             0        NA 2016-02-20 20:29:54     FALSE         NA
2     FALSE             0        NA 2016-02-20 20:24:50     FALSE         NA
3     FALSE             0        NA 2016-02-20 20:16:25     FALSE         NA
4     FALSE             0        NA 2016-02-20 20:11:08     FALSE         NA
5     FALSE             0        NA 2016-02-20 20:11:06     FALSE         NA
6     FALSE             0        NA 2016-02-20 20:02:05     FALSE         NA
                  id replyToUID
1 701141750161784834         NA
2 701140474019577856         NA
3 701138356466483204         NA
4 701137026075140096         NA
5 701137018508722176         NA
6 701134750296227840         NA
                                                                            statusSource
1                Mobile Web (M5)
Tweetbot for Mac
3   Twitter for Android
4                     Twitter Web Client
5     Twitter for iPhone
6                     Twitter Web Client
     screenName retweetCount isRetweet retweeted longitude latitude
1      psousa75            3      TRUE     FALSE        NA       NA
2      millerdl            0     FALSE     FALSE        NA       NA
3   diana_nario           50      TRUE     FALSE        NA       NA
4    emjonaitis            0     FALSE     FALSE        NA       NA
5       caryden           41      TRUE     FALSE        NA       NA
6 ArkangelScrap           41      TRUE     FALSE        NA       NA



The output now has a lot more stuff, you can see if it has been retweeted or favorited as well as the latitude, longtitude and more.

Time to send the result to SQL Server

There is a problem with how the sqlSave method creates the table, it makes the created column a float and then all inserts fail.


Run the following DDL statement in your database
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
CREATE TABLE [dbo].[tweets](
	[rownames] [nvarchar](255) NULL,
	[text] [nvarchar](255) NULL,
	[favorited] [nvarchar](5) NULL,
	[favoriteCount] [float] NULL,
	[replyToSN] [nvarchar](5) NULL,
	[created] [datetime] NULL,
	[truncated] [nvarchar](5) NULL,
	[replyToSID] [nvarchar](5) NULL,
	[id] [nvarchar](255) NULL,
	[replyToUID] [nvarchar](5) NULL,
	[statusSource] [nvarchar](255) NULL,
	[screenName] [nvarchar](255) NULL,
	[retweetCount] [float] NULL,
	[isRetweet] [nvarchar](5) NULL,
	[retweeted] [nvarchar](5) NULL,
	[longitude] [nvarchar](255) NULL,
	[latitude] [nvarchar](255) NULL
) ON [PRIMARY]

Now we are ready to save our twitter data in SQL Server,  here is how you do it.


> sqlSave(db,append = TRUE,colnames = FALSE,verbose = FALSE, addPK = FALSE, twListToDF(tweets))
Here is what the results look in SSMS



Make sure that you have append = TRUE, if you make it FALSE, the table will be recreated every time but since the command issued creates the wrong data type for the created column, you can't use it


So to recap, here are the commands you need after you loaded the twitterR package and have setup OAuth


> tweets <- searchTwitter('#rstats', n=6) #store the twitter search results
> twListToDF(tweets)  #convert to dataframe

> require(RODBC)  # load RODBC package
> db <- odbcConnect("LocalSqlServer")  #connect
> sqlSave(db,append = TRUE,colnames = FALSE,verbose = FALSE, addPK = FALSE, twListToDF(tweets))  #insert data into your table
> close(db)  #close ORDBC connection
 

Saturday, February 13, 2016

Quickly change m/d/yyyy dates to mm/dd/yyyy dates in Excel

Sometimes you get data in an Excel or csv file and it will be in a m/d/yyyy format.


Here is an example

1/1/2005
2/1/2012
2/12/2012
12/5/2015

This will look fine if you open the csv file in Excel but if you need to import this data into a database you might get an error that the format is not correct.

 What you want is the data to be in this format
01/01/2005
02/01/2012
02/12/2012
12/05/2015


If you use Format Cells and then pick Date you will see that there is no mm/dd/yyyy format What you have to do is pick the Custom Category and click on the m/d/yyyy format



Now change m/d/yyyy to mm/dd/yyyy



Click OK and you are all set, you will see that the dates are now in mm/dd/yyyy format

Let me just mention that the problem was not the conversion in SQL itself but a tool that did the import choked

BTW SQL Server can do this fairly easily

If I run the following

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
;with dates (date)
as (select '1/1/2001' union all
 select '11/1/2001' union all
 select '1/12/2001' union all
 select '5/5/2001' union all
 select '1/22/2001' union all
 select '12/1/2001' union all
 select '01/01/2001' )

 select convert(date,date)
 from dates

I will get this as output

2001-01-01
2001-11-01
2001-01-12
2001-05-05
2001-01-22
2001-12-01
2001-01-01

That of course is just a date, if you want mm/dd/yyyy format, you can use convert and style 101...

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
;with dates (date)
as (select '1/1/2001' union all
 select '11/1/2001' union all
 select '1/12/2001' union all
 select '5/5/2001' union all
 select '1/22/2001' union all
 select '12/1/2001' union all
 select '01/01/2001' )

 select convert(varchar(10),convert(date,date),101)
 from dates

 select convert(date,date)
 from dates

And here is that output

01/01/2001
11/01/2001
01/12/2001
05/05/2001
01/22/2001
12/01/2001
01/01/2001