Monday, February 20, 2017

Importing The Metropolitan Museum of Art's Open Access Initiative data into SQL Server

The Met just made some data available for download.

The Metropolitan Museum of Art presents over 5,000 years of art from around the world for everyone to experience and enjoy. The Museum lives in three iconic sites in New York City—The Met Fifth Avenue, The Met Breuer, and The Met Cloisters. Millions of people also take part in The Met experience online. 

Since it was founded in 1870, The Met has always aspired to be more than a treasury of rare and beautiful objects. Every day, art comes alive in the Museum's galleries and through its exhibitions and events, revealing both new ideas and unexpected connections across time and across cultures. The Metropolitan Museum of Art provides select datasets of information on more than 420,000 artworks in its Collection for unrestricted commercial and noncommercial use.


I decided to take a look. The first thing we will do is download the CSV file from their GitHub repository. You can find that here: https://github.com/metmuseum/openaccess

If you go to that link, you will see the following, grab the MetObjects.csv file

Be warned, this file is 214 MB.

One word of caution... if you try to import this file with a regular BULK INSERT command....good luck...let me know how many tries you need. No problem, I will just use a format file... and now you have 2 problems..  The real issue is that the file is somewhat problematic, there are quotes where there shouldn't be quotes,  there are no quotes where there should be quotes.  So what do you do?

Are you on SQL Server vNext 1.1 or higher? If you are, good news, you can use BULK INSERT and csv format, this is new in vNext 1.1


Ok let's get started, first create the following table

CREATE TABLE MetOpenData(
ObjectNumber nvarchar(4000),
IsHighlight nvarchar(4000),
IsPublicDomain nvarchar(4000),
ObjectID nvarchar(4000),
Department nvarchar(4000),
ObjectName nvarchar(4000),
Title nvarchar(4000),
Culture nvarchar(4000),
Period nvarchar(4000),
Dynasty nvarchar(4000),
Reign nvarchar(4000),
Portfolio nvarchar(4000),
ArtistRole nvarchar(4000),
ArtistPrefix nvarchar(4000),
ArtistDisplayName nvarchar(4000),
ArtistDisplayBio nvarchar(4000),
ArtistSuffix nvarchar(4000),
ArtistAlphaSort nvarchar(4000),
ArtistNationality nvarchar(4000),
ArtistBeginDate nvarchar(4000),
ArtistEndDate nvarchar(4000),
ObjectDate nvarchar(4000),
ObjectBeginDate nvarchar(4000),
ObjectEndDate nvarchar(4000),
Medium nvarchar(4000),
Dimensions nvarchar(4000),
CreditLine nvarchar(4000),
GeographyType nvarchar(4000),
City nvarchar(4000),
State nvarchar(4000),
County nvarchar(4000),
Country nvarchar(4000),
Region nvarchar(4000),
Subregion nvarchar(4000),
Locale nvarchar(4000),
Locus nvarchar(4000),
Excavation nvarchar(4000),
River nvarchar(4000),
Classification nvarchar(4000),
RightsandReproduction nvarchar(4000),
LinkResource nvarchar(4000),
MetadataDate nvarchar(4000),
Repository nvarchar(4000))

GO

Now it is time to import the file
Just to let you know, you will get a couple of errors, however all data except for these 4 rows will be imported

Msg 4864, Level 16, State 1, Line 62
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 213266, column 25 (Medium).
Msg 4864, Level 16, State 1, Line 62
Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 217661, column 25 (Medium).
Msg 4863, Level 16, State 1, Line 62
Bulk load data conversion error (truncation) for row 226222, column 16 (ArtistDisplayBio).
Msg 4863, Level 16, State 1, Line 62
Bulk load data conversion error (truncation) for row 258639, column 16 (ArtistDisplayBio).

Here is what the BULK INSERT with FORMAT= 'CSV' command looks like. Change the filepath to point to the location where you have the MetObjects.csv file saved

BULK INSERT MetOpenData
FROM 'c:\Data\MetObjects.csv'
WITH (FORMAT = 'CSV'); 

Let's do a quick count


SELECT COUNT(*) FROM MetOpenData

The file that I imported resulted in 446026 rows. I downloaded this file on 2/10/2017, your file might have more data if they updated the file after the date I downloaded it


Now that we have the data we need, we can run some queries.
Let's see what kind of objects are in the collection, we will grab the top 15 objects

 SELECT TOP 15  ObjectName,count(*)
 FROM MetOpenData
 GROUP BY ObjectName
 ORDER BY 2 DESC

Here is what the results looks like


ObjectNameCount
Print 88582
Photograph 28071
Drawing 24905
Book 13360
Fragment 9405
Piece 8638
Negative 6258
Painting 5862
Baseball card, print 4985
Bowl 3534
Figure 3081
Baseball card 3046
Polaroid 2706
Vase 2698
Dress 2473

I don't know why..but I somehow thought painting would be the most occuring object..but what do I know

You can also treat this table as you own museum catalog, let's say you want to look at van Gogh's Madame Roulin and Her Baby painting?  No problem, run this query

 SELECT * FROM MetOpenData
 WHERE ArtistDisplayName like'%van%gogh%'
 and title = 'Madame Roulin and Her Baby'

Scroll to the LinkResource column, you will see the following: http://www.metmuseum.org/art/collection/search/459123

Clicking on that link will give you the following


Now you can download this image and do something with it, it is after all in the public domain

Here are a couple of more queries you can play around with



 
 SELECT city, count(*)
 FROM MetOpenData
 GROUP BY city
 ORDER BY 2 DESC

 SELECT Dynasty, count(*)
 FROM MetOpenData
 GROUP BY Dynasty
 ORDER BY 2 DESC

 SELECT Period, count(*)
 FROM MetOpenData
 GROUP BY Period
 ORDER BY 2 DESC

 SELECT ArtistNationality, count(*)
 FROM MetOpenData
 GROUP BY ArtistNationality
 ORDER BY 2 DESC


 SELECT * FROM MetOpenData
 WHERE ArtistDisplayName like'%pablo picasso%'

 SELECT * FROM MetOpenData
 WHERE ArtistDisplayName like'%rembrandt van rijn%'
 
 SELECT * FROM MetOpenData
 WHERE ObjectName like'%Postage stamps%'

I am not a big art person, but if you are and you have some interesting queries that you ran against this data please let me know in the comments

Also if you manage to get this file to import with plain old BCP or BULK INSERT with or without a format file...let me know the magic you used.... :-)


Tuesday, January 17, 2017

Using bigint with FORMATMESSAGE



SQL Server 2016 added the FORMATMESSAGE function.  According to Books On Line, FORMATMESSAGE constructs a message from an existing message in sys.messages or from a provided string. The functionality of FORMATMESSAGE resembles that of the RAISERROR statement. However, RAISERROR prints the message immediately, while FORMATMESSAGE returns the formatted message for further processing.

So let's take a look at this new function, run the following


SELECT FORMATMESSAGE('Signed int %i, %i', 50, -50) 
SELECT FORMATMESSAGE('Unsigned int %u, %u', 50, -50); 

Here is the output if you run that

--------------------------------------------
Signed int 50, -50
Unsigned int 50, 4294967246

Here is what the type specifications that you can use are

Type specification Represents
d or i Signed integer
o Unsigned octal
s String
u Unsigned integer
x or X Unsigned hexadecimal

We used i to denote a signed integer, we also used u to denote a unsigned integer


Let's look at another example, this time we are using a variable. The variable will be an integer and we  are using i as the type specification


DECLARE @Val int = 1
SELECT FORMATMESSAGE('The value you supplied %i is incorrect!', @Val);

Here is the output
---------------------------------------
The value you supplied 1 is incorrect!


That worked without a problem. Now let's use a variable of the bigint data type, we are using the same type specification as before


DECLARE @Val bigint = 1
SELECT FORMATMESSAGE('The value you supplied %i is incorrect!', @Val);


Here is the output
---------------------------------------------------------------------------
Error: 50000, Severity: -1, State: 1. (Params:).
The error is printed in terse mode because there was error during formatting.
Tracing, ETW, notifications etc are skipped.


As you can see that did not work, so what can we do?

One thing we can do is converting the value to a varchar and then use s as the type specification


DECLARE @Val bigint = 1
SELECT FORMATMESSAGE('The value you supplied %s is incorrect!',
   CONVERT(VARCHAR(100),@Val));

You will again get this as output

---------------------------------------
The value you supplied 1 is incorrect!

So converting to varchar worked, but what if we want to use a bigint data type without converting to a varchar?

Another way is to use I64d as the type specification


DECLARE @Val bigint = 1
SELECT FORMATMESSAGE('The value you supplied %I64d is incorrect!', @Val);


You will get this

---------------------------------------
The value you supplied 1 is incorrect!

So there you have it, if you want to use bigint with FORMATMESSAGE use I64d as the type specification, or convert to varchar and use s as the type specification

Monday, January 16, 2017

Creating the Wide World Importers sample database v1.0




Before SQL Server 2005, we had the northwind and pubs sample database, these actually shipped with the product. If you did a SQL Server 2000 install, you would see these two database in addition to the system databases.

With SQL Server 2005 this changed, there were no more sample databases included. Instead new databases were made available for download, for a while now we have the AdventureWorks available. Today I wanted to see if there was a AdventureWorks 2016 database available. I couldn't find it. Instead I found the Wide World Importers sample database v1.0.

These sample databases live now on github


Here is the link  https://github.com/Microsoft/sql-server-samples


As a normal person you will probably navigate to the databases folder https://github.com/Microsoft/sql-server-samples/tree/master/samples/databases  and in there you will probably go to https://github.com/Microsoft/sql-server-samples/tree/master/samples/databases/wide-world-importers/wwi-database-scripts

Then you will see the following scripts

Now if your name is Denis or you like to waste time (or both), you think..cool I will download this repo and run these files.  That is not unreasonable to think......  but

To save you some time, here is the easiest way to put this database on your SQL Server instance

Instead of running those scripts, go to this page https://github.com/Microsoft/sql-server-samples/releases/tag/wide-world-importers-v1.0  scroll down to downloads, download the 121 MB file named WideWorldImporters-Full.bak

While that is downloading, create a folder on you C drive named SQL, in that folder create 2 folders one named Data and one named Log, it should look like in the image below




Now assuming the backup file is on the C drive, here is what the restore command should look like


USE [master]
RESTORE DATABASE [WideWorldImporters] 
FROM  DISK = N'C:\WideWorldImporters-Full.bak' WITH  FILE = 1,  
MOVE N'WWI_Primary' TO N'C:\SQL\DATA\WideWorldImporters.mdf',  
MOVE N'WWI_UserData' TO N'C:\SQL\DATA\WideWorldImporters_UserData.ndf',  
MOVE N'WWI_Log' TO N'C:\SQL\Log\WideWorldImporters.ldf',  
MOVE N'WWI_InMemory_Data_1' TO N'C:\SQL\DATA\WideWorldImporters_InMemory_Data_1',  
NOUNLOAD,  REPLACE,  STATS = 5

GO

On SQL Server vNext 1.1, the output looks like this from the command above

5 percent processed.
10 percent processed.
15 percent processed.
20 percent processed.
25 percent processed.
30 percent processed.
Processed 1464 pages for database 'WideWorldImporters', file 'WWI_Primary' on file 1.
Processed 53096 pages for database 'WideWorldImporters', file 'WWI_UserData' on file 1.
Processed 33 pages for database 'WideWorldImporters', file 'WWI_Log' on file 1.
Processed 3862 pages for database 'WideWorldImporters', file 'WWI_InMemory_Data_1' on file 1.
Converting database 'WideWorldImporters' from version 852 to the current version 860.
Database 'WideWorldImporters' running the upgrade step from version 852 to version 853.
Database 'WideWorldImporters' running the upgrade step from version 853 to version 854.
Database 'WideWorldImporters' running the upgrade step from version 854 to version 855.
Database 'WideWorldImporters' running the upgrade step from version 855 to version 856.
Database 'WideWorldImporters' running the upgrade step from version 856 to version 857.
Database 'WideWorldImporters' running the upgrade step from version 857 to version 858.
Database 'WideWorldImporters' running the upgrade step from version 858 to version 859.
Database 'WideWorldImporters' running the upgrade step from version 859 to version 860.
100 percent processed.
RESTORE DATABASE successfully processed 58455 pages in 6.105 seconds (74.803 MB/sec).

To make sure everything is cool after the restore is done, run the following

USE WideWorldImporters
GO

select [Website].[CalculateCustomerPrice](1,number,'20160101')
from master..spt_values
where type = 'p'


select [Website].[CalculateCustomerPrice](customerid,1,'20160101')
from Sales.customers


And now we can even check that the function is visible in the new sys.dm_exec_function_stats DMV

SELECT TOP 10 d.object_id, d.database_id, 
OBJECT_NAME(object_id, database_id) AS 'function name',   
    d.cached_time, d.last_execution_time, d.total_elapsed_time,  
    d.total_elapsed_time/d.execution_count AS [avg_elapsed_time],  
    d.last_elapsed_time, d.execution_count  
FROM sys.dm_exec_function_stats AS d  
ORDER BY [total_worker_time] DESC;  

And here is the output


As you can see, we can see this function was executed a bunch of time

So there you have it, this is how you create the DB.... I also now understand why Brent Ozar uses the Stackoverflow DB for his posts, at least it is easy to find  .., if you prefer that over this example, head over here: https://www.brentozar.com/archive/2015/10/how-to-download-the-stack-overflow-database-via-bittorrent/