It has been a while since I wrote some of my best practices posts. I decided to revisit them to see if anything has changed and whether I could add some additional info.
Today we are going to take a look at how data types can have an impact on your queries and also on the size of your database.
Char vs NChar
SQL Server has two families of data types to store character data[1], and both come in fixed and variable length flavors. The char and varchar data types use one byte of storage per character, while the nchar and nvarchar data types use two bytes of storage per character. The nchar and nvarchar data types are used to store unicode data.
Let's think about that for a second: what we are saying is that the char and varchar data types can store twice the number of characters in the same amount of storage as the nchar and nvarchar data types. Why does this matter, space is cheap right? True, space is getting cheaper, but we are also storing more and more data every year.
Now think about what happens when you have everything stored as unicode data:
- What happens to your backup and restore process, will it be faster or slower, will the files be bigger if not compressed?
- What about when transferring the results to and from your database server, are the packets able to carry the same number of characters?
- What about the amount of data on a page, what does this do to indexes and index lookups, how does it affect index maintenance?
If you don't need it, then don't use unicode data.
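If you want to see the per-character difference for yourself, here is a quick check, just a sketch you can run anywhere, using DATALENGTH to show the bytes used by the same value stored as char and as nchar.

SELECT DATALENGTH(CAST('1234567890' AS char(10)))  AS char_bytes,  -- 10 bytes
       DATALENGTH(CAST('1234567890' AS nchar(10))) AS nchar_bytes  -- 20 bytes
GO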
Some examples of data I have seen stored in nchar and nvarchar when it really shouldn't be:
- Zip codes for US addresses
- US addresses
- Social Security Numbers (which were stored in plain text, no less)
- Integer data (enforced by constraints or the app layer to make sure these were only digits)
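Here is a rough sketch of how that kind of data could be stored without unicode; the table and column names are made up just for illustration, and the CHECK constraint keeps the zip code to digits only.

CREATE TABLE CustomerAddress (
    AddressID int IDENTITY(1,1) NOT NULL PRIMARY KEY,
    Street    varchar(100) NOT NULL,
    City      varchar(60)  NOT NULL,
    ZipCode   char(5)      NOT NULL
        CONSTRAINT ck_ZipCode_Digits CHECK (ZipCode NOT LIKE '%[^0-9]%')  -- digits only
)
GO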
Let's take a quick look by running some T-SQL
First create these two tables
CREATE TABLE TestChar (SomeCol char(10))
GO
CREATE TABLE TestNChar (SomeCol nchar(10))
GO
CREATE index ix_test on TestChar(SomeCol)
GO
CREATE index ix_test on TestNChar(SomeCol)
GO
Now populate each with some data
INSERT TestChar
SELECT TOP 1000000 '1234567890'
FROM sys.sysobjects c1
CROSS JOIN sys.sysobjects c2
CROSS JOIN sys.sysobjects c3
CROSS JOIN sys.sysobjects c4
GO
INSERT TestNChar
SELECT TOP 1000000 '1234567890'
FROM sys.sysobjects c1
CROSS JOIN sys.sysobjects c2
CROSS JOIN sys.sysobjects c3
CROSS JOIN sys.sysobjects c4
GO
Let's see how much space is used by each table
EXEC sp_spaceused 'TestChar'
EXEC sp_spaceused 'TestNChar'
TestChar reserved: 42768 KB
TestNChar reserved: 62736 KB
If you look at the reserved column, you will see that the nchar data is using about 20 MB more than the char data.
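If you want to dig a little deeper and see the same difference at the page level (remember the earlier question about indexes and pages), here is a sketch using sys.dm_db_index_physical_stats; DETAILED mode scans everything, so only do this on a test box like the one we are using here.

SELECT OBJECT_NAME(object_id) AS table_name, index_type_desc, page_count, avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('TestChar'), NULL, NULL, 'DETAILED')
UNION ALL
SELECT OBJECT_NAME(object_id), index_type_desc, page_count, avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('TestNChar'), NULL, NULL, 'DETAILED')
GO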
Implicit conversions
Besides the storage increase, there is also a problem when querying for data that looks like varchar but is stored as unicode. Run the code below.
SET SHOWPLAN_TEXT ON
GO
DECLARE @v varchar(10) = '0123456789'
SELECT * FROM TestChar WHERE SomeCol LIKE @v +'%'
GO
SET SHOWPLAN_TEXT OFF
GO
Here is the plan for that query
|--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1008], [Expr1009], [Expr1010]))
|--Compute Scalar(DEFINE:([Expr1008]=LikeRangeStart([@v]+'%'), [Expr1009]=LikeRangeEnd([@v]+'%'), [Expr1010]=LikeRangeInfo([@v]+'%')))
| |--Constant Scan
|--Index Seek(OBJECT:([Performance].[dbo].[TestChar].[ix_test]), SEEK:([Performance].[dbo].[TestChar].[SomeCol] > [Expr1008] AND [Performance].[dbo].[TestChar].[SomeCol] < [Expr1009]), WHERE:([Performance].[dbo].[TestChar].[SomeCol] like [@v]+'%') ORDERED FORWARD)
If we look at the plan, we can see that it looks pretty good: we get an index seek against the ix_test index and there are no conversions.
Sometimes people will change the data type of a column but will not change any code that accesses that column. Let's now change the data type of the column to nchar.
DROP INDEX TestChar.ix_test
GO
ALTER TABLE TestChar ALTER COLUMN SomeCol nchar(10)
GO
CREATE INDEX ix_test on TestChar(SomeCol)
GO
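If you want to double check that the column really is nchar(10) now, a quick look at sys.columns will tell you (max_length is reported in bytes, so it will show 20):

SELECT c.name, t.name AS type_name, c.max_length
FROM sys.columns c
JOIN sys.types t ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID('TestChar')
GO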
Run the query that gives you the plan again
SET SHOWPLAN_TEXT ON
GO
DECLARE @v varchar(10) = '0123456789'
SELECT * FROM TestChar WHERE SomeCol LIKE @v +'%'
GO
SET SHOWPLAN_TEXT OFF
GO
Here is the plan
|--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1008], [Expr1009], [Expr1010]))
|--Compute Scalar(DEFINE:([Expr1008]=LikeRangeStart(CONVERT_IMPLICIT(nvarchar(11),[@v]+'%',0)), [Expr1009]=LikeRangeEnd(CONVERT_IMPLICIT(nvarchar(11),[@v]+'%',0)), [Expr1010]=LikeRangeInfo(CONVERT_IMPLICIT(nvarchar(11),[@v]+'%',0))))
| |--Constant Scan
|--Index Seek(OBJECT:([Performance].[dbo].[TestChar].[ix_test]), SEEK:([Performance].[dbo].[TestChar].[SomeCol] > [Expr1008] AND [Performance].[dbo].[TestChar].[SomeCol] < [Expr1009]), WHERE:([Performance].[dbo].[TestChar].[SomeCol] like CONVERT_IMPLICIT(nvarchar(11),[@v]+'%',0)) ORDERED FORWARD)
As you can see, there is now an implicit conversion (CONVERT_IMPLICIT) going on.
In order to get rid of the conversion, use the correct data type for the variable.
SET SHOWPLAN_TEXT ON
GO
DECLARE @v nvarchar(10) = '0123456789'
SELECT * FROM TestChar WHERE SomeCol LIKE @v +'%'
GO
SET SHOWPLAN_TEXT OFF
GO
|--Compute Scalar(DEFINE:([Expr1008]=LikeRangeStart([@v]+N'%'), [Expr1009]=LikeRangeEnd([@v]+N'%'), [Expr1010]=LikeRangeInfo([@v]+N'%')))
| |--Constant Scan
|--Index Seek(OBJECT:([Performance].[dbo].[TestChar].[ix_test]), SEEK:([Performance].[dbo].[TestChar].[SomeCol] > [Expr1008] AND [Performance].[dbo].[TestChar].[SomeCol] < [Expr1009]), WHERE:([Performance].[dbo].[TestChar].[SomeCol] like [@v]+N'%') ORDERED FORWARD)
Implicit conversions were also an issue when ORMs first burst onto the scene. If you used NHibernate or LINQ to SQL with .NET, since strings in .NET are unicode, all text would be sent over as nvarchar and you would see all kinds of conversions.
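If you suspect this is happening on one of your servers, one way to hunt for it, just a sketch and you will need VIEW SERVER STATE permission, is to look for CONVERT_IMPLICIT in the plan cache:

SELECT TOP (20) st.text AS query_text, qp.query_plan
FROM sys.dm_exec_cached_plans cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) st
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) qp
WHERE CAST(qp.query_plan AS nvarchar(max)) LIKE N'%CONVERT_IMPLICIT%'
GO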
Using larger data types when they are not needed
I see this problem mostly with the integer data types. Below is a list of the integer data types together with their storage size and range:
- tinyint: storage size is 1 byte; integer data from 0 through 255.
- smallint: storage size is 2 bytes; integer data from -2^15 (-32,768) through 2^15 - 1 (32,767).
- int: storage size is 4 bytes; integer data from -2^31 (-2,147,483,648) through 2^31 - 1 (2,147,483,647).
- bigint: storage size is 8 bytes; integer data from -2^63 (-9,223,372,036,854,775,808) through 2^63 - 1 (9,223,372,036,854,775,807).
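You can confirm those storage sizes yourself with DATALENGTH; here is a quick sketch:

SELECT DATALENGTH(CAST(1 AS tinyint))  AS tinyint_bytes,   -- 1
       DATALENGTH(CAST(1 AS smallint)) AS smallint_bytes,  -- 2
       DATALENGTH(CAST(1 AS int))      AS int_bytes,       -- 4
       DATALENGTH(CAST(1 AS bigint))   AS bigint_bytes     -- 8
GO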
Now imagine Facebook, with a billion users, decided to use bigint as the CountryID in their Country table, and this key is then used as a foreign key in the user demographics table. This is wasteful: either use a smallint, since we won't go beyond 32 thousand countries in the foreseeable future, or use the 2 or 3 character ISO code.
The problem is even worse if you have a compound 6 column key and it is used as a foreign key in tons of other tables...that was real fun to clean up...use a surrogate 1 column key in that case, but be sure to test...normalize till it hurts, then denormalize till it works...I will cover normalization in another post, just wanted to mention it here.
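Here is a minimal sketch of what I mean; the table and column names are made up just for illustration, and the smallint surrogate key (or even the char(2) ISO code) is more than enough:

CREATE TABLE Country (
    CountryID   smallint     NOT NULL PRIMARY KEY,
    IsoCode2    char(2)      NOT NULL UNIQUE,
    CountryName varchar(100) NOT NULL
)
GO
CREATE TABLE UserDemographics (
    UserID    bigint   NOT NULL PRIMARY KEY,
    CountryID smallint NOT NULL REFERENCES Country(CountryID)  -- 2 bytes per row instead of 8
)
GO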
[1] I know there are also text and ntext, but those are deprecated.