Wednesday, October 04, 2017

Sizing database files

It has been a while since I wrote some of my best practices posts. I decided to revisit these posts again to see if anything has changed, I also wanted to see if I could add some additional info.

This post will demonstrate that there is a difference in performance if you don't size your database file accordingly. It is a good practice to have your database sized correctly for the next 6 to 12 months, you don't want your server wasting cycles with growing files all the time.

Figure out how big your files are now, figure out how much they will grow in the next year and size your files accordingly, check back every month or so to see if your estimates were correct.

By default SQL Server will create databases with very small files when you create a database and you don't specify the sizes. If you have people creating databases on your servers, consider adding a DDL trigger to notify you when a new DB is added so that you can talk to the database creator and size the files. You also can change the defaults on the server so that you don't have the 10% growth either.

First let's see what the difference is when we have a database where the files will have to grow versus one where the files are big enough for the data that will be inserted.
Here we are creating two databases, one with much bigger files than the other one


The TestBigger database is correctly sized for the data that will be inserted

CREATE DATABASE [TestBigger]
 ON  PRIMARY 
( NAME = N'TestBigger', FILENAME = N'f:\TempTestBigger.mdf' , 
SIZE = 509600KB , FILEGROWTH = 1024KB )
 LOG ON 
( NAME = N'TestBigger_log', FILENAME = N'f:\TempTestBigger_log.ldf' , 
SIZE = 502400KB , FILEGROWTH = 10%)
GO


The TestSmaller database is very small, files will have to be expanded many times to accommodate all the data I will be inserting

CREATE DATABASE [TestSmaller]
 ON  PRIMARY 
( NAME = N'TestSmaller', FILENAME = N'f:\TempTestSmaller.mdf' , 
SIZE = 5280KB , FILEGROWTH = 1024KB )
 LOG ON 
( NAME = N'TestSmaller_log', FILENAME = N'f:\TempTestSmaller_log.ldf' , 
SIZE = 504KB , FILEGROWTH = 10%)
GO



These two stored proc calls are just to verify that the files match with what we specified, you can use sp_helpdb to check the size of a database that you created when you don't specify the file sizes

EXEC sp_helpdb 'TestBigger'

name         filename             filegroup SIZE
TestBigger f:\TempTestBigger.mdf      PRIMARY 509632 KB
TestBigger_log f:\TempTestBigger_log.ldf NULL 502400 KB
EXEC sp_helpdb 'TestSmaller'

name         filename              filegroup SIZE
TestSmaller f:\TempTestSmaller.mdf      PRIMARY 5280 KB
TestSmaller_log f:\TempTestSmaller_log.ldf NULL  512 KB


Next, we are creating two identical tables, one in each database


USE TestSmaller
GO
CREATE TABLE test (SomeName VARCHAR(100), 
SomeID VARCHAR(36), SomeOtherID VARCHAR(100), SomeDate DATETIME)

USE TestBigger
GO
CREATE TABLE test (SomeName VARCHAR(100),
 SomeID VARCHAR(36), SomeOtherID VARCHAR(100), SomeDate DATETIME)

This query is just used so that the data is cached for the two inserts later on, this way the data doesn't have to be fetched from disk for either inserts, you can discard the results after the query is done

USE master
GO


SELECT TOP 1000000 c1.name,NEWID(),NEWID(),GETDATE() 
FROM sys.sysobjects c1
CROSS JOIN sys.sysobjects c2
CROSS JOIN sys.sysobjects c3
CROSS JOIN sys.sysobjects c4


Here is the first insert into the bigger database

INSERT TestBigger.dbo.test
SELECT TOP 1000000 c1.name,NEWID(),NEWID(),GETDATE() 
FROM sys.sysobjects c1
CROSS JOIN sys.sysobjects c2
CROSS JOIN sys.sysobjects c3
CROSS JOIN sys.sysobjects c4

Here is the second insert into the smaller database

INSERT TestSmaller.dbo.test
SELECT TOP 1000000 c1.name,NEWID(),NEWID(),GETDATE() 
FROM sys.sysobjects c1
CROSS JOIN sys.sysobjects c2
CROSS JOIN sys.sysobjects c3
CROSS JOIN sys.sysobjects c4


On several machines I tested on, it takes half the time or less to insert the data in the bigger database compared to the smaller database. On some machines it is almost 5 times faster to insert into the bigger database.

How about on your machine, do you see that the insert into the bigger database takes less than half the time it takes to insert into the smaller database?


Check the sizes of the databases again

EXEC sp_helpdb 'TestBigger'

name         filename            filegroup size
TestBigger f:\TempTestBigger.mdf      PRIMARY 509632 KB
TestBigger_log f:\TempTestBigger_log.ldf NULL 502400 KB
EXEC sp_helpdb 'TestSmaller'

name         filename              filegroup size
TestSmaller f:\TempTestSmaller.mdf      PRIMARY 215296 KB
TestSmaller_log f:\TempTestSmaller_log.ldf NULL 427392 KB




As you can see, the bigger database did not expand, the smaller database expanded a lot.


Autogrow
If you do use autogrow, then make sure you don't use the default 10%, take a look at this message
Date 10/03/2017 12:57:56 PM
Log SQL Server (Current - 11/25/2012 5:00:00 AM)
Source spid62
Message
Autogrow of file 'MyDB_Log' in database 'MyDB' took 104381 milliseconds. Consider using ALTER DATABASE to set a smaller FILEGROWTH for this file.
See that, it took a long time, you don't want to grow a one terabyte file by ten percent, that would be one hundred gigabytes, that is huge. Use something smaller and don't use percent, the bigger the file gets the longer it will take to expand the file.


File placement
Separate the log files from the data files by placing them on separate hard drives. Placing the files on separate drives allows I/O activity to occur at the same time for both the data and log files. Instead of having huge files consider having smaller files in separate filegroups. Put different tables used in the same join queries in different filegroups as well. This will improve performance, because of parallel disk I/O searching for joined data.

Put heavily accessed tables and the nonclustered indexes that belong to those tables on different filegroups. This will improve performance, because of parallel I/O if the files are located on different physical disks. Just remember that you can't separate the clustered indexes from the base table, you can only do this for non clustered indexes. Of course people can get very creative, I have worked with a database once where each table was placed in its own filegroups, there were hundreds of files....what a mess


Tempdb
There are all kinds of recommendations about how many data files you should have for tempdb. Start with 4 files and add more files if you see contention. Paul Randal, has a detailed post here: A SQL Server DBA myth a day: (12/30) tempdb should always have one data file per processor core.

If you can, place tempdb on its own physical drive as well, separated from the user databases.

Consider Solid State hard drives or flash storage for tempdb

See also http://support.microsoft.com/kb/2154845 for recommendations by Microsoft Customer Service and Support


Test, test, test
Never ever blindly follow what you read on the internet, make sure that you test it out first on a QA server before promoting the changes to production!!



Sunday, August 27, 2017

Why would anyone use a global temporary table?



SQL Azure has added something called database scoped global temporary tables.

Azure SQL Database supports global temporary tables that are also stored in tempdb and scoped to the database level. This means that global temporary tables are shared for all users’ sessions within the same Azure SQL database. User sessions from other Azure SQL databases cannot access global temporary tables.

They way you add these is by using a double pound sign

They way you create a database scoped global temporary table is like this

CREATE TABLE ##test ( a int, b int);

You are probably thinking...wait a minute, how is this different from a global temporary table? It is pretty much the same but on SQL Azure, it is scoped to the database level, while on prem it is scoped to the instance level


I was listening to the latest SQL Server Radio podcast and the hosts were talking about why anyone would ever need a global temporary table. Why not use a real table instead

I can come up with one answer....

Let's take a look

First create a database and then add a user who has read and write permissions


CREATE DATABASE Test
GO

CREATE LOGIN DenisTest   
    WITH PASSWORD = 'DenisTest', CHECK_EXPIRATION=OFF, CHECK_POLICY=OFF; 

USE Test
GO

CREATE USER DenisTest FOR LOGIN DenisTest;  
GO   

ALTER ROLE [db_datareader] ADD MEMBER DenisTest
GO
ALTER ROLE [db_datawriter] ADD MEMBER DenisTest
GO

Now, let say we want to run a query several times and store the results



USE Test
GO

SELECT name,type_desc,state_desc,size
FROM sys.database_files
WHERE name ='Test'

So I need a table to store the results right?
Let's try creating one......

CREATE TABLE SomeTable (name sysname,
type_desc nvarchar(120),
state_desc nvarchar(120),
size int)

Msg 262, Level 14, State 1, Line 8
CREATE TABLE permission denied in database 'Test'.

That is right, we don't have permissions

Now let's add a double pound sign in front of that table

CREATE TABLE ##SomeTable (name sysname,
type_desc nvarchar(120),
state_desc nvarchar(120),
size int)

And this works

Now we can insert into the table we just created

INSERT ##SomeTable
SELECT name,type_desc,state_desc,size
FROM sys.database_files
WHERE name ='Test'

Also, we can insert and select from other sessions.query windows as well

Of course if the server restarts, this table will be gone. However if you want to capture some query output and look at it, you don't have to wait for someone to give you ddl_admin permission, create the table for you or make you db_owner.


Now why everyone would need a temporary stored procedure...that is another story and I can't really think of a reason









Thursday, August 24, 2017

PASS tv has added a boatload of videos



While I was away on vacation in San Francisco,  I noticed that PASS tv has added dozens of videos to their YouTube channel. These videos were added 2 days ago. There should be something there for everyone interested in working with SQL Server as well as interested in presenting

There are videos by Kalen Delaney, Bob Ward, Buck Woody, Kendra Little and many more


You can find all the videos here: https://www.youtube.com/user/SQLPASSTV/videos


Sunday, August 06, 2017

I don't always use a RIGHT JOIN, but when I do, it's because I am lazy



Do you use RIGHT JOINs? I myself rarely use a RIGHT JOIN, I think in the last 17 years or so I have only used a RIGHT JOIN once or twice. I think that RIGHT JOINs confuse people who are new to databases, everything that you can do with a RIGHT JOIN, you can also do with a LEFT JOIN, you just have to flip the query around

So why did I use a RIGHT JOIN then?

Well the answer is in the image below



That is right..it was pure laziness, if you ever met a programmer, you will find out pretty quickly that programmers are pretty lazy. I guess lazy is not the right word to use here, it is better to call programmers efficient or people who don't want to reinvent the wheel.

So what actually happened? I was checking some data in a table, I already had written the following query

SELECT number 
FROM SomeTableWithIDs c

Now I just needed to see how many numbers were not used

So what did I do, I just continued below the query I already had, I ended up with this

SELECT number 
FROM SomeTableWithIDs c
RIGHT JOIN(SELECT number 
FROM master..spt_values 
WHERE type = 'p' 
AND number BETWEEN 0 and 255) x on c.SomeID = x.number
WHERE c.SomeID is null


Let's take a look at this again with some DDL and DML you can run

First create the following table, it has only one column, this column has a tinyint data type (don't ask why it is tinyint, this is the reason I had to take a look at how many unused IDs we had to begin with)

CREATE TABLE SomeTableWithIDs (SomeID tinyint not null )

ALTER TABLE SomeTableWithIDs
ADD CONSTRAINT PK_SomeTableWithIDs PRIMARY KEY CLUSTERED (SomeID)


Now that we have the table created, let's populate it with 250 rows

INSERT SomeTableWithIDs
SELECT TOP 250 ROW_NUMBER() OVER (ORDER BY s.ID)
FROM sys.sysobjects s
CROSS JOIN sys.sysobjects s2

Let's delete a couple of rows

DELETE SomeTableWithIDs WHERE SomeID IN (2,4,6,11)


Running the RIGHT JOIN query will give us the missing rows

SELECT number 
FROM SomeTableWithIDs c
RIGHT JOIN(SELECT number 
FROM master..spt_values 
WHERE type = 'p' 
AND number BETWEEN 0 and 255) x on c.SomeID = x.number
WHERE c.SomeID is null

Here is the output

0
2
4
6
11
251
252
253
254
255



Now you have to realize that this was just an ad-hoc query, if I were to put this in a proc or script, I would have done it like this

SELECT number 
FROM master..spt_values s
WHERE type = 'p' 
AND number BETWEEN 0 and 255
AND NOT EXISTS (SELECT SomeID 
   FROM SomeTableWithIDs c 
   WHERE c.SomeID = s.number)


That query is less confusing for beginners than a RIGHT JOIN



Wednesday, July 26, 2017

Why I am learning Python and why you might need to learn it as well


If you follow my non SQL Server blog, you might have noticed that I have been posting about Python lately. There are a couple of reasons for this


  • Python will be included alongside R in SQL Server 2017, I might need to support this in the future
  • Python is used more and more by the data science community
  • My youngest son is learning Python, in order to help him, I need to know it too


I already posted  several blog posts, here are some of these posts

Summer of code 2017: Python

I am not going to list all the posts here, you can find them by clicking on the Python label,  or by going to this link http://denisgobo.blogspot.com/search/label/Python

I must admit that after messing around with Python for a month or so that I actually like the language. I don't know why but it reminds me of classic VB in the same sense that ColdFusion reminded me of classic VB back in 2000/2001.  Maybe it is the luck of curly braces {}  :-)

Visual Studio is now also a top notch Python development tool, so if you are familiar with Visual Studio you can keep using the same tools.

Here is also a presentation about Python in SQL Server if you want to know a little more about the way it is integrated.



Are you going to learn Python or are you going to stay away as far as possible from it?

Thursday, July 13, 2017

SSMS: When did the query finish, when did the query start?



This is a quick post but it might be an item you can add to your Today I Learned list  :-)

Let's say you get to a SSMS query window and you see the following



You can tell it ran really fast and you can tell that 26080 rows were affected.  Can you tell when it started and when it finished?

Yes, you can!!!

Hit the F4 button or select View-->Properties Window from the menu bar

Here is what I see



As you can see there are a bunch of properties about the query

For example

Start time: 7/13/2017 9:58:01 AM
Finish time: 7/13/2017 9:58:02 AM
Elapsed time: 00:00:00.766


There you have it, a quick and easy way to tell when something ran in your query windows

Monday, July 10, 2017

T-SQL Tuesday #92, Lessons learned the hard way

This month's T-SQL Tuesday is hosted by Raul Gonzalez, he proposed the following:  For this month, I want you peers to write about those important lessons that you learned the hard way, for instance something you did and put your systems down or maybe something you didn’t do and took your systems down. It can be also a bad decision you or someone else took back in the day and you’re still paying for it…

There are so many things to share here so everybody can learn from each others mistakes, because all of us were once a beginner and no one is born with any knowledge about SQL Server.

Please do not be ashamed of sharing your experiences, you can anonymize the whole story if you want but remember all people make mistakes, the important is to learn from them and try not to repeat them in the future.


Before we start, do you ever wonder why in job postings they ask for a minimum of n years of experience? Well what is experience but the accumulation of mistakes you have made over time. Once you drop a table, I guarantee you, you won't do it again  :-)  So when they ask for n years of experience what they are saying is.... hey I know you messed up, and I know you did it somewhere else and not here, we are glad you did it somewhere else,  but I think you won't make those mistakes again, so we are willing to hire you now.....



So here are my stories

Script out proc with drop dependent object

I still remember this one,  even though it has been over 15 years ago. In Enterprise Manager/ Query Analyzer on SQL Server 7, when scripting out a proc there was an option you could check, this was the drop dependent objects option. Somehow that was checked and the table used in the proc was also dropped
I don't have Query Analyzer installed anywhere but even in SSMS there is an option to script the dependent objects...luckily the default is false..



So I got a nice call from one of the New York City agencies that data wasn't there. I was just about to start my lunch. I lost my appetite real quick. SO what I did was take a full backup and then did a restore with stop at to 2 minutes before I dropped the table. This worked and all the data was there. I however lost my appetite and did not eat my lunch that day. But I can tell you that I have never dropped a table again.




Script out table to make history table

This is one of my favorite lessons, lucky for me a co-worker managed to do this
We needed a history table, this table would be populated each month with the current data, the main table then would be cleaned and repopulated via a DTS package. So what do you do? You script out the current table, change the table name and run the script right? Seems easy enough.....

So this was done....  an hour or 2 passes and we get a call that the original table doesn't exist.

What happened?  Well the script had a drop table and a create table part, the table name change was only done in the create part...ooops.... not good



Staging server, test server..what's that?

I worked in at least 2 companies that did not have any test/staging or QA environment. need a change..right click DTS package right on production, make changes and save.... how productive people were.....  I am still amazed that nobody did any real bad things... But I believe we did use source control, but it was Visual SourceSafe 6.0




Start transaction... walk away casually

One time I had a query window open in SSMS and there was an insert or update statement and nothing else.  hit F5 and casually walked away. After a couple of minutes some people told me things started to take a long time and timing out.  Alright I executed sp_who2 and saw a SID in the BlkBy column in many rows. Ready to go find the person and give this person a nice lecture, I noticed that the SPID was mine?  WTF, I looked at my query window..nothing. I then scrolled up and as you can guess by now there was a nice BEGIN TRANSACTION statement at the first line....  Argh... so I commited the thing and that was the end of that. At least I didn't go home for the day yet.




Insert SELECT after DELETE and before WHERE clause

Sometimes, you have code like this

DELETE Sometable
WHERE a= b

you want to see what that will affect, so you do this and run the bottom 2 lines

DELETE Sometable
SELECT * FROM Sometable
WHERE a= b

Of course if you leave it like that, for example in a SQL Agent job, you will learn the hard way that all your data is gone, not just the stuff that the WHERE clause applies to


Use SQL to bring down other things in the company
The request was that support wanted to know immediately when an order was placed. No problem, we can add a trigger that populates a table and then we have a job that runs once a minute to email the support group. Sounds reasonable right? Someone decided to do a little testing.... this person created 100,000 entries.... I guess you can imagine what happens next. Yep, the email server was overloaded. We had to quickly delete these rows and truncate the table that was used to send the emails.


Change the user without telling anyone
Another one brought to you by a co-worker. This person needed some stored procedures to be deployed to production. No problem, put in ticket and we will get it done. So stored procedures were deployed and the front end code was getting some errors. It turns out that this person decided to create a new user and use that instead of the user used in all the other connection strings. At least this was easy to fix

I will leave you with this horror story from the microsoft.public.sqlserver newsgroup, I tried to find this thread but I was unable to find it.

Some person tried to do a backup but this person got confused and somehow managed to do a restore from the only backup they had, this backup was 1 year old.  The advice this person got? Update resume.

Not good, not good at all


Monday, June 19, 2017

Five great SQL Server GitHub repos that every SQL Server person should check out



Did you know that there are some greats repos on github with awesome SQL scripts that you can use or even contribute to?

I decided to make you aware of some of these.

Tigertoolbox


The Microsoft SQL Server Tiger team has a GitHub repository where they host a bunch of SQL Server scripts

It is called the tigertoolbox. In the image below, you can see what is in this repository



GitHub Repo is here: https://github.com/Microsoft/tigertoolbox


Also make sure to follow the @mssqltiger twitter account to stay up to date with any news from the MS Tiger Team.


The SQL Server Maintenance Solution

This is a GitHub repo by Ola Hallengren
The SQL Server Maintenance Solution comprises scripts for running backups, integrity checks, and index and statistics maintenance on all editions of Microsoft SQL Server 2005, SQL Server 2008, SQL Server 2008 R2, SQL Server 2012, SQL Server 2014, and SQL Server 2016. The solution is based on stored procedures, the sqlcmd utility, and SQL Server Agent jobs

GitHub Repo is here:  https://github.com/olahallengren/sql-server-maintenance-solution

Also make sure to visit the Ola's website for more info here: https://ola.hallengren.com/



dbatools

dbatools is a free PowerShell module with over 200 SQL Server administration, best practice and migration commands included.

GitHub Repo is here: https://github.com/sqlcollaborative/dbatools

Also make sure to visit their website: https://dbatools.io/




Microsoft SQL Server Samples Repository


This GitHub repository contains code samples that demonstrate how to use Microsoft's SQL products including SQL Server, Azure SQL Database, and Azure SQL Data Warehouse. Each sample includes a README file that explains how to run and use the sample.

Repo is here: https://github.com/Microsoft/sql-server-samples



SQL-Server-First-Responder-Kit

This repo by the nice folks from Brent Ozar Unlimited®.has a collection of procs to help you with health checks, maintenance and performance checks. Procs like sp_Blitz, sp_BlitzCache, sp_BlitzFirst, sp_BlitzIndex, and other SQL Server scripts.


GitHub Repo is here: https://github.com/BrentOzarULTD/SQL-Server-First-Responder-Kit

Make sure to bookmark this page on the BrentOzar.com site to learn how to use these scripts/procs and to keep up to date on any additions/changes: https://www.brentozar.com/first-aid/

Wednesday, June 14, 2017

Running SQL Queries Against Baby Names from Social Security Card Applications



I decided to look around to see if I can find some interesting data sets
I came across the Baby Names from Social Security Card Applications-National Level Data data set and decided it would be a fun exercise to import this data into SQL Server and run some queries

You can download the zip file from the link above, you will see 137 files with a txt extension in the zip file

Here is what the description is for these files

National Data on the relative frequency of given names in the population of U.S. births where the individual has a Social Security Number 

(Tabulated based on Social Security records as of March 5, 2017)

For each year of birth YYYY after 1879, we created a comma-delimited file called yobYYYY.txt. Each record in the individual annual files has the format "name,sex,number," where name is 2 to 15 characters, sex is M (male) or F (female) and "number" is the number of occurrences of the name. Each file is sorted first on sex and then on number of occurrences in descending order. When there is a tie on the number of occurrences, names are listed in alphabetical order. This sorting makes it easy to determine a name's rank. The first record for each sex has rank 1, the second record for each sex has rank 2, and so forth.
To safeguard privacy, we restrict our list of names to those with at least 5 occurrences.



The format of the files looks like this

Mary,F,7065
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578

So these files are simple comma separated values files

Download the zip file from this url: https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data

Unzip the files in a folder on your hard drive, in my case I unzipped it to C:\Names

Let's get started with the fun part

First I created a new DB


CREATE DATABASE Scratch
GO

USE Scratch
GO

Now that we have our database, let's create this very simple table, it will store all the data from the file, I also added a year column to this table



CREATE TABLE Names (FirstName varchar(500), Gender char(1), YearInFile date)
GO

Because the file doesn't have a year column, I created this view so that I can bulk insert into the view instead of having to mess around with SSIS or format files

CREATE VIEW myView
AS SELECT FirstName,Gender,NameCount
FROM Names

I extracted all the files into a folder with the name Names, I placed this on the C drive

Now that you have the table and view created, try running the following

BULK INSERT myView FROM 'c:\Names\yob1932.txt' WITH (FIELDTERMINATOR =',')

It should insert 9382 rows

If that is working, it is time to import all 137 files

I am presenting you a solution here so that you can run this all in SSMS, of course you could have as easy used PowerShell, Python or any other language to import these files. Nothing prevents you from doing that :-)

Attention/warning!!
We will be using xp_cmdshell

It is not a best practice to have xp_cmdshell enabled. As a matter of fact beginning with SQL Server 2005, the product ships with xp_cmdshell disabled. If you try to run xp_cmdshell you will get the following message if it is not enabled
Server: Msg 15281, Level 16, State 1, Procedure xp_cmdshell, Line 1

SQL Server blocked access to procedure ‘sys.xp_cmdshell’ of component ‘xp_cmdshell’ because this component is turned off as part of the security configuration for this server. A system administrator can enable the use of ‘xp_cmdshell’ by using sp_configure. For more information about enabling ‘xp_cmdshell’, see “Surface Area Configuration” in SQL Server Books Online.

To enable xp_cmdshell execute the following code

EXECUTE SP_CONFIGURE 'show advanced options', 1
RECONFIGURE WITH OVERRIDE
GO
 
EXECUTE SP_CONFIGURE 'xp_cmdshell', '1'
RECONFIGURE WITH OVERRIDE
GO
 
EXECUTE SP_CONFIGURE 'show advanced options', 0
RECONFIGURE WITH OVERRIDE
GO

Now it is time to import all the files. First we want to store all the files that we want to import in a table

We are creating a temp table, we then call a DIR command with the B switch. we delete everything that doesn't end in txt. Finally we add an identity column, we will use this to loop over this table


IF OBJECT_ID('tempdb..#tempList') IS NOT NULL
DROP TABLE #tempList
 
CREATE TABLE #tempList ([FileName] VARCHAR(500))
 
--plain vanilla dos dir command with /B switch (bare format)
INSERT INTO #tempList
EXEC MASTER..XP_CMDSHELL 'dir c:\Names /B'
 
 
--delete the null values
DELETE #tempList WHERE [FileName] IS NULL
 
-- Delete all the files that don't have txt extension
DELETE #tempList WHERE [FileName] NOT LIKE '%.txt'
 
--this will be used to loop over the table
ALTER TABLE #tempList ADD id int IDENTITY
go

Let's see what we have in the table

SELECT * FROM #tempList

You will see something like this

FileName id
yob1880.txt 1
yob1881.txt 2
yob1882.txt 3
yob1883.txt 4
yob1884.txt 5
yob1885.txt 6
....................

Now it is time to import all the files, here is the code that will import all 137 files, I put comments in the code so you know what it does


DECLARE @LoopID int, @MaxID int
SELECT @LoopID = min(id),@MaxID = max(ID)
FROM #tempList
 

SET NOCOUNT ON

-- if you want to run this multiple time, wipe out the data
TRUNCATE TABLE Names

-- Change this to your directory
DECLARE @Directory varchar(50) = 'c:\Names\'

 

DECLARE @FileName varchar(500),@FullFileName varchar(500)
 
DECLARE @SQL NVARCHAR(1000)
Declare @year varchar(10)


-- loop over all the file
WHILE @LoopID <= @MaxID BEGIN SELECT @FileNAme = filename FROM #tempList WHERE id = @LoopID -- concatenate the directory and filename SELECT @FullFileName = @Directory + @FileName BEGIN SELECT @SQL = N'BULK INSERT myView FROM ''' + @FullFileName +''' WITH (FIELDTERMINATOR ='','')' -- you can remove this..this basically prints what will be executed PRINT @SQL
    -- run the command
    EXEC SP_EXECUTESQL @SQL
     
  -- this grabs the year part from the file name
 SELECT @year = right(parsename(@filename,2),4)

  -- This will update the rows we just inserted to have the year value
  UPDATE Names 
  SET YearInFile = @year
  WHERE YearInFile is null
    
    END
 
    --Get the next id, instead of +1 we grab the next value in case of skipped id values
    SELECT @LoopID = min(id)
    FROM #tempList
    WHERE id > @LoopID
END


And now that we have the data we can run some queries

Let's see if we have all the data

SELECT COUNT(*) FROM names
--1,891,894

That should return 1,891,894 rows

let's see..did the baby boomers parents really get it on after WW II?

SELECT  SUM(nameCount) as SumCount,YearInFile FROM names
group by YearInFile
ORDER BY SumCount DESC


SumCount YearInFile
-----------   
4200097 1957
4156576 1959
4154809 1960
4140160 1961
4131802 1958
4121130 1956
4035499 1962
4012933 1955
3993206 2007
3980407 1954

Looks like it... but what happened in 2007, top of the housing bubble was 2006/2007 so people started buying houses and needed to populate those rooms with humans  :-)

What are the top 10 most popular names over time?

SELECT TOP 10 FirstName, gender,SUM(Namecount) as Alltime
FROM names
GROUP BY FirstName,gender
ORDER BY 3 DESC


Here are the results, looks like males don't have as many unique names

FirstName gender Alltime
James M 5136026
John M 5105834
Robert M 4809176
Michael M 4338144
Mary F 4120692
William M 4087525
David M 3601032
Joseph M 2592856
Richard M 2560771
Charles M 2378806

What were the most popular names in 1880?

SELECT TOP 10  FirstName, gender,SUM(Namecount)  as [1880]
FROM names
where YearInFile = 1880
GROUP BY FirstName,gender
ORDER BY 3 DESC

Here are those results


FirstName gender 1880
John M 9655
William M 9532
Mary F 7065
James M 5927
Charles M 5348
George M 5126
Frank M 3242
Joseph M 2632
Anna F 2604
Thomas M 2534


For 2016

SELECT TOP 10 FirstName, gender,SUM(Namecount)  as [2016]
FROM names
where YearInFile = 2016
GROUP BY FirstName,gender
ORDER BY 3 DESC

Now we have some names which I recognize since my kids friends are named like hte ones on this list


FirstName gender 2016
Emma F 19414
Olivia F 19246
Noah M 19015
Liam M 18138
Ava F 16237
Sophia F 16070
William M 15668
Mason M 15192
James M 14776
Isabella F 14722


Anybody named after months?


I guess nobody likes February

SELECT FirstName , count(*)
FROM names
WHERE FirstName in('January','February','March','April','May','June',
   'July','August','September','October','November','December')
GROUP BY FirstName

FirstName Count
April 135
August 206
December 56
January 59
July 73
June 273
March 41
May 213
November 37
October 34
September 62

I wonder how many of these names are the result of someone putting the month in the first name field  :-)


How many names are both male and female?

;WITH cte as(
SELECT FirstName 
FROM names
GROUP BY FirstName
HAVING (COUNT( DISTINCT Gender )>1))

SELECT FirstName, gender, count(*) FROM Names n
WHERE EXISTS (SELECT * FROM cte c where c.FirstName = n.FirstName)
GROUP BY FirstName, Gender
ORDER BY 1

As you can see, there are a lot

Here are some where number of males and females doesn't differ that much

FirstName gender Count
Courtney F 123
Courtney M 126
Cruz F 121
Cruz M 114
Tom F 62
Tom M 137


One more query and that is it for this post..

How many people are named after weekdays?

SELECT FirstName , count(*)
FROM names
WHERE FirstName in('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday',
   'Sunday')
GROUP BY FirstName

Here are the results

FirstName Count
-------------       -----------------
Friday          4
Monday          5
Sunday        89
Tuesday        59
Wednesday     51


Last Monday I saw Tuesday who told me that she was going to see Friday on Sunday....

Yep, crystal clear...


That is it for this post.... feel free to run your own queries and if you find some interesting things, feel free to leave me a comment with the results or the query


One more query for you, I added this at the bottom since there are 52 rows in the result

What is the most popular letter a name starts with?

Here is that query, there are a couple of counts here:, the total count, the occurance and the distinct count of a name starting with the letter across all years



SELECT  LEFT(FirstName,1) AS Initial, gender,
SUM(Namecount) AS SumCount, 
COUNT(NameCount) AS TotalCount ,
COUNT(DISTINCT firstname) AS DistinctCount
FROM names
GROUP BY left(FirstName,1),gender
ORDER BY 3 DESC

Here are the results



Initial gender SumCount TotalCount DistinctCount
J M 29,396,710 63,543 3,995
M F 19,855,157 105,326 5,583
A F 16,744,520 117,270 7,501
R M 16,329,781 46,772 2,135
J F 14,887,120 79,019 4,622
D M 14,375,824 62,369 3,380
S F 13,587,753 90,382 6,059
M M 12,717,485 51,078 2,643
C M 12,652,140 51,820 2,255
C F 12,640,899 80,401 4,159
L F 12,099,095 86,872 4,473
A M 11,611,302 66,139 3,594
K F 11,065,934 74,797 4,967
D F 9,723,142 66,592 3,916
E F 9,644,797 49,929 2,477
B F 8,485,706 39,199 2,251
T M 8,331,252 42,525 2,592
B M 8,038,739 37,772 1,701
S M 7,586,177 42,524 2,524
R F 7,216,686 51,892 2,837
E M 7,122,968 36,781 1,542
W M 7,073,357 17,052 577
L M 6,594,743 42,492 1,799
G M 6,587,004 26,930 1,095
T F 5,913,655 61,915 4,277
K M 5,744,651 44,110 3,096
P F 5,024,379 19,759 1,090
N F 4,315,987 36,667 2,507
H F 4,203,255 19,717 1,159
P M 4,181,853 15,150 666
G F 4,083,933 28,566 1,398
H M 3,962,710 26,661 1,128
V F 3,546,832 23,367 1,083
N M 3,354,035 20,738 1,180
F M 3,048,688 15,174 592
F F 2,084,083 17,670 842
I F 1,966,942 17,719 1,066
I M 1,350,255 11,261 573
W F 1,235,983 9,462 569
V M 1,183,415 11,609 559
O M 1,112,184 15,521 724
O F 951,873 12,914 718
Z M 926,015 9,284 833
Y F 656,093 11,343 1,072
Z F 525,515 13,263 1,139
Q M 183,280 3,684 296
X M 172,430 1,175 127
Y M 163,545 7,031 579
U M 93,783 2,875 152
Q F 75,612 2,663 274
X F 54,170 1,466 150
U F 50,450 1,654 169



Not that many names that start with a letter at the end of the alphabet