I decided to look around to see if I can find some interesting data sets
I came across the Baby Names from Social Security Card Applications-National Level Data data set and decided it would be a fun exercise to import this data into SQL Server and run some queries
You can download the zip file from the link above, you will see 137 files with a txt extension in the zip file
Here is what the description is for these files
National Data on the relative frequency of given names in the population of U.S. births where the individual has a Social Security Number
(Tabulated based on Social Security records as of March 5, 2017)
For each year of birth YYYY after 1879, we created a comma-delimited file called yobYYYY.txt. Each record in the individual annual files has the format "name,sex,number," where name is 2 to 15 characters, sex is M (male) or F (female) and "number" is the number of occurrences of the name. Each file is sorted first on sex and then on number of occurrences in descending order. When there is a tie on the number of occurrences, names are listed in alphabetical order. This sorting makes it easy to determine a name's rank. The first record for each sex has rank 1, the second record for each sex has rank 2, and so forth.
To safeguard privacy, we restrict our list of names to those with at least 5 occurrences.
The format of the files looks like this
Mary,F,7065
Anna,F,2604
Emma,F,2003
Elizabeth,F,1939
Minnie,F,1746
Margaret,F,1578
So these files are simple comma separated values files
Download the zip file from this url: https://catalog.data.gov/dataset/baby-names-from-social-security-card-applications-national-level-data
Unzip the files in a folder on your hard drive, in my case I unzipped it to C:\Names
Let's get started with the fun part
First I created a new DB
CREATE DATABASE Scratch GO USE Scratch GO
Now that we have our database, let's create this very simple table, it will store all the data from the file, I also added a year column to this table
CREATE TABLE Names (FirstName varchar(500), Gender char(1), YearInFile date) GO
Because the file doesn't have a year column, I created this view so that I can bulk insert into the view instead of having to mess around with SSIS or format files
CREATE VIEW myView AS SELECT FirstName,Gender,NameCount FROM Names
I extracted all the files into a folder with the name Names, I placed this on the C drive
Now that you have the table and view created, try running the following
BULK INSERT myView FROM 'c:\Names\yob1932.txt' WITH (FIELDTERMINATOR =',')
It should insert 9382 rows
If that is working, it is time to import all 137 files
I am presenting you a solution here so that you can run this all in SSMS, of course you could have as easy used PowerShell, Python or any other language to import these files. Nothing prevents you from doing that :-)
Attention/warning!!
We will be using xp_cmdshell
It is not a best practice to have xp_cmdshell enabled. As a matter of fact beginning with SQL Server 2005, the product ships with xp_cmdshell disabled. If you try to run xp_cmdshell you will get the following message if it is not enabled
Server: Msg 15281, Level 16, State 1, Procedure xp_cmdshell, Line 1
SQL Server blocked access to procedure ‘sys.xp_cmdshell’ of component ‘xp_cmdshell’ because this component is turned off as part of the security configuration for this server. A system administrator can enable the use of ‘xp_cmdshell’ by using sp_configure. For more information about enabling ‘xp_cmdshell’, see “Surface Area Configuration” in SQL Server Books Online.
To enable xp_cmdshell execute the following code
EXECUTE SP_CONFIGURE 'show advanced options', 1 RECONFIGURE WITH OVERRIDE GO EXECUTE SP_CONFIGURE 'xp_cmdshell', '1' RECONFIGURE WITH OVERRIDE GO EXECUTE SP_CONFIGURE 'show advanced options', 0 RECONFIGURE WITH OVERRIDE GO
Now it is time to import all the files. First we want to store all the files that we want to import in a table
We are creating a temp table, we then call a DIR command with the B switch. we delete everything that doesn't end in txt. Finally we add an identity column, we will use this to loop over this table
IF OBJECT_ID('tempdb..#tempList') IS NOT NULL DROP TABLE #tempList CREATE TABLE #tempList ([FileName] VARCHAR(500)) --plain vanilla dos dir command with /B switch (bare format) INSERT INTO #tempList EXEC MASTER..XP_CMDSHELL 'dir c:\Names /B' --delete the null values DELETE #tempList WHERE [FileName] IS NULL -- Delete all the files that don't have txt extension DELETE #tempList WHERE [FileName] NOT LIKE '%.txt' --this will be used to loop over the table ALTER TABLE #tempList ADD id int IDENTITY go
Let's see what we have in the table
SELECT * FROM #tempList
You will see something like this
FileName id
yob1880.txt 1
yob1881.txt 2
yob1882.txt 3
yob1883.txt 4
yob1884.txt 5
yob1885.txt 6
....................
Now it is time to import all the files, here is the code that will import all 137 files, I put comments in the code so you know what it does
DECLARE @LoopID int, @MaxID int SELECT @LoopID = min(id),@MaxID = max(ID) FROM #tempList SET NOCOUNT ON -- if you want to run this multiple time, wipe out the data TRUNCATE TABLE Names -- Change this to your directory DECLARE @Directory varchar(50) = 'c:\Names\' DECLARE @FileName varchar(500),@FullFileName varchar(500) DECLARE @SQL NVARCHAR(1000) Declare @year varchar(10)
-- loop over all the file
WHILE @LoopID <= @MaxID BEGIN SELECT @FileNAme = filename FROM #tempList WHERE id = @LoopID -- concatenate the directory and filename SELECT @FullFileName = @Directory + @FileName BEGIN SELECT @SQL = N'BULK INSERT myView FROM ''' + @FullFileName +''' WITH (FIELDTERMINATOR ='','')' -- you can remove this..this basically prints what will be executed PRINT @SQL
-- run the command EXEC SP_EXECUTESQL @SQL -- this grabs the year part from the file name SELECT @year = right(parsename(@filename,2),4) -- This will update the rows we just inserted to have the year value UPDATE Names SET YearInFile = @year WHERE YearInFile is null END --Get the next id, instead of +1 we grab the next value in case of skipped id values SELECT @LoopID = min(id) FROM #tempList WHERE id > @LoopID END
And now that we have the data we can run some queries
Let's see if we have all the data
SELECT COUNT(*) FROM names --1,891,894
That should return 1,891,894 rows
let's see..did the baby boomers parents really get it on after WW II?
SELECT SUM(nameCount) as SumCount,YearInFile FROM names group by YearInFile ORDER BY SumCount DESC
SumCount YearInFile
-----------
4200097 1957
4156576 1959
4154809 1960
4140160 1961
4131802 1958
4121130 1956
4035499 1962
4012933 1955
3993206 2007
3980407 1954
Looks like it... but what happened in 2007, top of the housing bubble was 2006/2007 so people started buying houses and needed to populate those rooms with humans :-)
What are the top 10 most popular names over time?
SELECT TOP 10 FirstName, gender,SUM(Namecount) as Alltime FROM names GROUP BY FirstName,gender ORDER BY 3 DESC
Here are the results, looks like males don't have as many unique names
FirstName | gender | Alltime | |
James | M | 5136026 | |
John | M | 5105834 | |
Robert | M | 4809176 | |
Michael | M | 4338144 | |
Mary | F | 4120692 | |
William | M | 4087525 | |
David | M | 3601032 | |
Joseph | M | 2592856 | |
Richard | M | 2560771 | |
Charles | M | 2378806 |
What were the most popular names in 1880?
SELECT TOP 10 FirstName, gender,SUM(Namecount) as [1880] FROM names where YearInFile = 1880 GROUP BY FirstName,gender ORDER BY 3 DESC
Here are those results
FirstName | gender | 1880 | |
John | M | 9655 | |
William | M | 9532 | |
Mary | F | 7065 | |
James | M | 5927 | |
Charles | M | 5348 | |
George | M | 5126 | |
Frank | M | 3242 | |
Joseph | M | 2632 | |
Anna | F | 2604 | |
Thomas | M | 2534 |
For 2016
SELECT TOP 10 FirstName, gender,SUM(Namecount) as [2016] FROM names where YearInFile = 2016 GROUP BY FirstName,gender ORDER BY 3 DESC
Now we have some names which I recognize since my kids friends are named like hte ones on this list
FirstName | gender | 2016 | |
Emma | F | 19414 | |
Olivia | F | 19246 | |
Noah | M | 19015 | |
Liam | M | 18138 | |
Ava | F | 16237 | |
Sophia | F | 16070 | |
William | M | 15668 | |
Mason | M | 15192 | |
James | M | 14776 | |
Isabella | F | 14722 |
I guess nobody likes February
SELECT FirstName , count(*) FROM names WHERE FirstName in('January','February','March','April','May','June', 'July','August','September','October','November','December') GROUP BY FirstName
FirstName | Count | |
April | 135 | |
August | 206 | |
December | 56 | |
January | 59 | |
July | 73 | |
June | 273 | |
March | 41 | |
May | 213 | |
November | 37 | |
October | 34 | |
September | 62 |
I wonder how many of these names are the result of someone putting the month in the first name field :-)
How many names are both male and female?
;WITH cte as( SELECT FirstName FROM names GROUP BY FirstName HAVING (COUNT( DISTINCT Gender )>1)) SELECT FirstName, gender, count(*) FROM Names n WHERE EXISTS (SELECT * FROM cte c where c.FirstName = n.FirstName) GROUP BY FirstName, Gender ORDER BY 1
As you can see, there are a lot
Here are some where number of males and females doesn't differ that much
FirstName | gender | Count | |
Courtney | F | 123 | |
Courtney | M | 126 | |
Cruz | F | 121 | |
Cruz | M | 114 | |
Tom | F | 62 | |
Tom | M | 137 |
One more query and that is it for this post..
How many people are named after weekdays?
SELECT FirstName , count(*) FROM names WHERE FirstName in('Monday','Tuesday','Wednesday','Thursday','Friday','Saturday', 'Sunday') GROUP BY FirstNameHere are the results
FirstName Count
------------- -----------------
Friday 4
Monday 5
Sunday 89
Tuesday 59
Wednesday 51
Last Monday I saw Tuesday who told me that she was going to see Friday on Sunday....
Yep, crystal clear...
That is it for this post.... feel free to run your own queries and if you find some interesting things, feel free to leave me a comment with the results or the query
One more query for you, I added this at the bottom since there are 52 rows in the result
What is the most popular letter a name starts with?
Here is that query, there are a couple of counts here:, the total count, the occurance and the distinct count of a name starting with the letter across all years
SELECT LEFT(FirstName,1) AS Initial, gender, SUM(Namecount) AS SumCount, COUNT(NameCount) AS TotalCount , COUNT(DISTINCT firstname) AS DistinctCount FROM names GROUP BY left(FirstName,1),gender ORDER BY 3 DESC
Here are the resultsInitial | gender | SumCount | TotalCount | DistinctCount |
J | M | 29,396,710 | 63,543 | 3,995 |
M | F | 19,855,157 | 105,326 | 5,583 |
A | F | 16,744,520 | 117,270 | 7,501 |
R | M | 16,329,781 | 46,772 | 2,135 |
J | F | 14,887,120 | 79,019 | 4,622 |
D | M | 14,375,824 | 62,369 | 3,380 |
S | F | 13,587,753 | 90,382 | 6,059 |
M | M | 12,717,485 | 51,078 | 2,643 |
C | M | 12,652,140 | 51,820 | 2,255 |
C | F | 12,640,899 | 80,401 | 4,159 |
L | F | 12,099,095 | 86,872 | 4,473 |
A | M | 11,611,302 | 66,139 | 3,594 |
K | F | 11,065,934 | 74,797 | 4,967 |
D | F | 9,723,142 | 66,592 | 3,916 |
E | F | 9,644,797 | 49,929 | 2,477 |
B | F | 8,485,706 | 39,199 | 2,251 |
T | M | 8,331,252 | 42,525 | 2,592 |
B | M | 8,038,739 | 37,772 | 1,701 |
S | M | 7,586,177 | 42,524 | 2,524 |
R | F | 7,216,686 | 51,892 | 2,837 |
E | M | 7,122,968 | 36,781 | 1,542 |
W | M | 7,073,357 | 17,052 | 577 |
L | M | 6,594,743 | 42,492 | 1,799 |
G | M | 6,587,004 | 26,930 | 1,095 |
T | F | 5,913,655 | 61,915 | 4,277 |
K | M | 5,744,651 | 44,110 | 3,096 |
P | F | 5,024,379 | 19,759 | 1,090 |
N | F | 4,315,987 | 36,667 | 2,507 |
H | F | 4,203,255 | 19,717 | 1,159 |
P | M | 4,181,853 | 15,150 | 666 |
G | F | 4,083,933 | 28,566 | 1,398 |
H | M | 3,962,710 | 26,661 | 1,128 |
V | F | 3,546,832 | 23,367 | 1,083 |
N | M | 3,354,035 | 20,738 | 1,180 |
F | M | 3,048,688 | 15,174 | 592 |
F | F | 2,084,083 | 17,670 | 842 |
I | F | 1,966,942 | 17,719 | 1,066 |
I | M | 1,350,255 | 11,261 | 573 |
W | F | 1,235,983 | 9,462 | 569 |
V | M | 1,183,415 | 11,609 | 559 |
O | M | 1,112,184 | 15,521 | 724 |
O | F | 951,873 | 12,914 | 718 |
Z | M | 926,015 | 9,284 | 833 |
Y | F | 656,093 | 11,343 | 1,072 |
Z | F | 525,515 | 13,263 | 1,139 |
Q | M | 183,280 | 3,684 | 296 |
X | M | 172,430 | 1,175 | 127 |
Y | M | 163,545 | 7,031 | 579 |
U | M | 93,783 | 2,875 | 152 |
Q | F | 75,612 | 2,663 | 274 |
X | F | 54,170 | 1,466 | 150 |
U | F | 50,450 | 1,654 | 169 |
Not that many names that start with a letter at the end of the alphabet
No comments:
Post a Comment