Showing posts with label Gotcha. Show all posts
Showing posts with label Gotcha. Show all posts

Thursday, February 14, 2019

Finding rows where the column starts or ends with a 'bad' character

A coworker came to me asking me for some help. He had some issues trying to convert some data from a staging table to numeric. I asked him to show me the data in SSMS and at first glance it looked good to me. Then I asked where the data came from, he said it came from Excel.

Aha... I have plenty of war stories with Excel so I said, it's probably some non printable character that is in the column.. either a tab (char(9)) or a non breaking space (char(160))..especially if the value was copied from the internet

He said isnumeric was returning 0 for rows that looked valid, I then told him to run this query on those rows


SELECT ASCII(LEFT(SomeColumn,1)),
 ASCII(RIGHT(SomeColumn,1)),* 
FROM StagingData s

That would give them the ascii numerical value. For example a tab is 9, linefeed =  10....

Here is a chart for the characters between 0 and 32
BinaryOctDecHexAbbreviation[b][c][d]Name (1967)
196319651967
000 0000000000NULLNUL^@\0Null
000 0001001101SOMSOH^AStart of Heading
000 0010002202EOASTX^BStart of Text
000 0011003303EOMETX^CEnd of Text
000 0100004404EOT^DEnd of Transmission
000 0101005505WRUENQ^EEnquiry
000 0110006606RUACK^FAcknowledgement
000 0111007707BELLBEL^G\aBell
000 1000010808FE0BS^H\bBackspace[e][f]
000 1001011909HT/SKHT^I\tHorizontal Tab[g]
000 1010012100ALF^J\nLine Feed
000 1011013110BVTABVT^K\vVertical Tab
000 1100014120CFF^L\fForm Feed
000 1101015130DCR^M\rCarriage Return[h]
000 1110016140ESO^NShift Out
000 1111017150FSI^OShift In
001 00000201610DC0DLE^PData Link Escape
001 00010211711DC1^QDevice Control 1 (often XON)
001 00100221812DC2^RDevice Control 2
001 00110231913DC3^SDevice Control 3 (often XOFF)
001 01000242014DC4^TDevice Control 4
001 01010252115ERRNAK^UNegative Acknowledgement
001 01100262216SYNCSYN^VSynchronous Idle
001 01110272317LEMETB^WEnd of Transmission Block
001 10000302418S0CAN^XCancel
001 10010312519S1EM^YEnd of Medium
001 1010032261AS2SSSUB^ZSubstitute
001 1011033271BS3ESC^[\e[i]Escape[j]
001 1100034281CS4FS^\File Separator
001 1101035291DS5GS^]Group Separator
001 1110036301ES6RS^^[k]Record Separator
001 1111037311FS7US^_Unit Separator

Source: https://en.wikipedia.org/wiki/ASCII



He then ran the following to grab all the rows that ended or started with tabs

SELECT * FROM StagingData s
WHERE LEFT(SomeColumn,1)  = char(9)
OR  RIGHT(SomeColumn,1)  = char(9)






So let's take another look at this to see how we can make this a little better

Let's create a table that will hold these bad characters that we don't want, in my case ACII values 1 untill 32

Here is what we will do to create and populate the table

CREATE TABLE BadCharacters(
  BadChar char(1) NOT NULL, 
  ASCIINumber int NOT NULL,
   CONSTRAINT pk_BadCharacters 
  PRIMARY KEY CLUSTERED( BadChar )
 )

GO

INSERT BadCharacters
SELECT char(number),number
FROM master..SPT_VALUES
WHERE type = 'P'
AND number BETWEEN 1 AND 32
OR number = 160


A quick look at the data looks like this

SELECT * FROM BadCharacters



 Now let's create our staging table and insert some data so that we can do some tests

CREATE TABLE StagingData (SomeColumn varchar(255) )

INSERT StagingData
SELECT CONVERT(VARCHAR(10),s1.number) + '.' + CONVERT(VARCHAR(10),s2.number)
FROM master..SPT_VALUES s1
CROSS JOIN master..SPT_VALUES s2
WHERE s1.type = 'P'
AND s2.type = 'P'


That inserted 4194304 rows on my machine

Time to insert some of that bad data

Here is what some of the data inserted will look like

2.1
2.2
2.3
2.4
2.5
2.6
2.7
2.8

And this is the query to generated and insert those bad rows, on my machine it generated 1089 such rows


;WITH cte as(SELECT CONVERT(VARCHAR(10),number) as num
FROM master..SPT_VALUES
WHERE type = 'P'
AND number BETWEEN 1 AND 1000)

--INSERT StagingData
SELECT  b.BadChar + c1.num + '.' + c2.num + b2.BadChar
FROM cte c1
CROSS JOIN cte c2
JOIN BadCharacters b on c1.num = b.ASCIINumber
JOIN BadCharacters b2 on c2.num = b2.ASCIINumber

The query create a value by using a bad value, a number a dot a number and a bad value, you can see those values above


Now it's time to find these bad rows, but before we do that, let's add an index

CREATE INDEX ix_StagingData on StagingData(SomeColumn)


OK, we are ready...

Of course I here you saying, why don't we just do this



SELECT * FROM StagingData
WHERE TRY_CONVERT(numeric(20,10),SomeColumn) IS NULL


Well, yes that gives me everything that can't be converted to numeric, but I want to see what those characters are


Before we start, let's set statistics io on so that we can look at some performance


SET STATISTICS IO ON
GO


Here are the queries to find the bad characters at the start

SELECT * FROM StagingData s
JOIN BadCharacters b on b.BadChar = LEFT(s.SomeColumn,1)



SELECT * FROM StagingData s
JOIN BadCharacters b on s.SomeColumn like b.BadChar +'%'


Here is what the reads look like

(1089 row(s) affected)
Table 'BadCharacters'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'StagingData'. Scan count 9, logical reads 10851, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1089 row(s) affected)
Table 'BadCharacters'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'StagingData'. Scan count 33, logical reads 135, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

As you can see from the stats, the top query is non-SARGable and generates a lot more reads, the bottom query can use the index. Always make sure to write your queries in a way so that SQL Server can you an index



What about the last character, how can we find those

SELECT * FROM StagingData s
JOIN BadCharacters b on b.BadChar = RIGHT(s.SomeColumn,1)


SELECT * FROM StagingData s
JOIN BadCharacters b on s.SomeColumn like +'%' + b.BadChar 


Here are the stats again

(1089 row(s) affected)
Table 'BadCharacters'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'StagingData'. Scan count 9, logical reads 10851, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1089 row(s) affected)
Table 'BadCharacters'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'StagingData'. Scan count 33, logical reads 445863, physical reads 0, read-ahead reads 13, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


So both of these queries suck the life out of your SQL Server instance, so what can be done?

One thing we can do is add a computed column to the table that will hold just the last character of the column, then we can index the computed column

Here are the commands to do that


ALTER TABLE StagingData ADD RightChar as RIGHT(SomeColumn,1)
GO


CREATE INDEX ix_RightChar on StagingData(RightChar)
GO

And now we can just run the same queries again

SELECT * FROM StagingData s
JOIN BadCharacters b on b.BadChar = RIGHT(s.SomeColumn,1)


SELECT * FROM StagingData s
JOIN BadCharacters b on s.SomeColumn like +'%' + b.BadChar 


Here are the stats

(1089 row(s) affected)
Table 'StagingData'. Scan count 33, logical reads 1223, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'BadCharacters'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(1089 row(s) affected)
Table 'StagingData'. Scan count 33, logical reads 1223, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'BadCharacters'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


Did you expect to get the same exact reads for both queries?

So what is going on?  Well lets take a look


In both cases, the optimizer was smart enough to use the index on the computed column

Hopefully this will make someone's life easier and you can expand the table to add other character you consider bad.  You can also add constraint to reject values or you can add triggers and then move those bad rows to a bad rows table

Finally if you need to worry about unicode you might want to change the table to be nvarchar

Enjoy.. importing that data..... we all know..it's only getting bigger and bigger






Monday, February 04, 2019

After 20+ years in IT .. I finally discovered this...

Originally I was not going to write this post, but after I found out that several other people didn't know this I figured what the heck, why not, maybe this will help someone else as well

Last week I was on a remote session with 2 clients, each run the Advent program . The team I am part of provides a script to run the advent (APX or Axys) executable. This will then generate the portfolios, composites, price, security master, splits and other files. We then zip it up and sftp it over for ingestion so that we can run analytics and attribution

During these calls I interact with system administrators because usually the need to give permissions so that the script runs correctly

None of these admins knew that what I will show you existed. All the co-workers I asked didn't know this either (This could be because they are developers and not admins)



Back in the day (win 98 or perhaps NT 4), there was a windows powertool that you could install and if you right clicked on a folder you would get an option to open a command window and it would be in the path that you right clicked on

Those power tools don't exist anymore and you could do the same by hacking the registry, it's like a 16 step process

But there is a faster way.....

So what I usually did before 2 months ago is that I would select the path


And then I would open a command prompt, type CD and then paste the path...not too complicated

But here is the faster way.... instead of copying the path...just type in cmd in the address bar and hit enter


Boom shakalaka... a command prompt is opened immediately and you are in the same path



Did you know this also works when you type Powershell in the address bar, Eric Darling left me a comment on twitter informing me that it works with powershell as well

Here is what you see after typing it




So there you have it... hopefully it will save you some minutes of valuable time in a year

Also if you knew about this or did not know..leave a comment and let me know





Tuesday, January 08, 2019

Print.. the disruptor of batch deletes in SQL



Someone had an issue where a batched delete script was not deleting anything. I looked over some code in our repository and noticed two patterns the way queries are written to handle batch deletes

One is a while loop that runs while @@rowcount is greater than 0

WHILE @@rowcount > 0
 BEGIN
  DELETE TOP (5000)
  FROM SomeTable
 END


The other way is to run a while loop which is always true and then check if @@rowcount is 0, if it is 0 then break out of the loop

 WHILE 1 = 1  
    BEGIN  
        DELETE TOP(5000)  
        FROM SomeTable

        IF @@ROWCOUNT = 0  
     BREAK  
     END


I have always used WHILE @@rowcount > 0 but you have to be careful because @@rowcount could be 0 when your while loop starts

Let's take a look at an example. This is a simplified example without a where clause..but let's say you have to delete several million rows from a table with many more millions of rows and the table is replicated... in that case you want to batch the deletes so that your log file doesn't fill up, replication has a chance to catch up and in general the deletes should run faster

SELECT TOP 20000 row_number() OVER(ORDER BY t1.id) AS SomeId,
  getutcdate() AS SomeDate, newid() AS SomeValue
INTO SomeTable
FROM sys.sysobjects t1
CROSS JOIN sys.sysobjects t2

SELECT COUNT(*) from SomeTable 

SELECT * FROM SomeTable WHERE 1= 0

WHILE @@rowcount > 0
 BEGIN
  DELETE TOP (5000)
  FROM SomeTable
 END

SELECT COUNT(*) from SomeTable 

DROP TABLE SomeTable -- Added here as cleanup in case people run the example

This is of course a silly example because why would you do a count like that against a different table before a delete

But what if you had this instead, you put a nice print statement there so that from the output you see when it started and you would also see the rowcounts?

SELECT TOP 20000 row_number() OVER(ORDER BY t1.id) AS SomeId,
  getutcdate() AS SomeDate, newid() AS SomeValue
INTO SomeTable
FROM sys.sysobjects t1
CROSS JOIN sys.sysobjects t2

SELECT COUNT(*) from SomeTable 

PRINT' Starting my update now....'

WHILE @@rowcount > 0
 BEGIN
  DELETE TOP (5000)
  FROM SomeTable
 END

SELECT COUNT(*) from SomeTable 

DROP TABLE SomeTable -- Added here as cleanup in case people run the example



The count is 20000 before and after the loop, nothing got delete, this is because a print statement will reset @@rowcount  to 0

Take a look by running this simple set of queries

SELECT 1 UNION ALL SELECT 2
SELECT @@rowcount as 'Rowcount'
PRINT '1'
SELECT @@rowcount as 'RowcountAfterPrint'

Here is what the output looks like


After the  PRINT line @@rowcount is reset back to 0

So if you want to use a while loop while checking @@rowcount, do this instead by running the delete first once outside the loop



SELECT TOP 20000 row_number() OVER(ORDER BY t1.id) AS SomeId,
  getutcdate() AS SomeDate, newid() AS SomeValue
INTO SomeTable
FROM sys.sysobjects t1
CROSS JOIN sys.sysobjects t2

SELECT COUNT(*) from SomeTable 

PRINT' Starting my update now....'

DELETE TOP (5000)
FROM SomeTable

WHILE @@rowcount > 0
 BEGIN
  DELETE TOP (5000)
  FROM SomeTable
 END

SELECT COUNT(*) from SomeTable 

DROP TABLE SomeTable -- Added here as cleanup in case people run the example

If you run the delete this way if there was something to delete, the while loop would be entered, if the table was empty then there would no need to enter the while loop

Also keep in mind that it is not just PRINT that will reset @@rowcount back to 0.

From Books On Line:

Statements such as USE, SET <option>, DEALLOCATE CURSOR, CLOSE CURSOR, BEGIN TRANSACTION, or COMMIT TRANSACTION reset the ROWCOUNT value to 0.

That's all... hopefully this helps someone out in the future if they notice nothing gets deleted

Wednesday, September 26, 2018

A little less hate for: String or binary data would be truncated in table


The error String or binary data would be truncated is one of the more annoying things in SQL Server. You would insert some data and you would get back the error String or binary data would be truncated. Then you would have to spend a good amount of time to see what caused the error.

I even posted about this as part of  T-SQL Tuesday #86: String or binary data would be truncated

I read the SQL Server 2019 CTP 2 whitepaper and on page 17 it has the following

Improve truncation message for ETL DW scenarios—the error message ID 8152 String or binary data would be truncated is familiar to many SQL Server developers and administrators who develop or maintain data movement workloads; the error is raised during data transfers between a source and a destination with different schemas when the source data is too large to fit into the destination data type. This error message can be time-consuming to troubleshoot because of its lack of specificity. SQL Server 2019 introduces a new, more specific error message for this scenario: ID 2628 String or binary data would be truncated in table '%.*ls', column '%.*ls'. Truncated value: '%.*ls'. The new error message provides more context for the problem, simplifying the troubleshooting process. So that it cannot break existing applications parsing message ID 8152, this new message ID 2628 is an opt-in replacement, which can be enabled with trace flag 460. 


Oh really... they fixed this? Let's take a look

First I downloaded SQL Server 2019 CTP 2 and installed it. Then I created a database with a simple table, I also inserted some data that wouldn't fit

CREATE DATABASE truncatetest
GO

USE truncatetest
GO

CREATE TABLE TruncateMe(somevalue varchar(5),somevalue2 varchar(5), somedecimal decimal(5,2))
GO

INSERT TruncateMe
SELECT '333333','444444',5.3

I then received the following error message, so this is the same as in SQL Server 2018 and earlier, notice message id 8152

Msg 8152, Level 16, State 30, Line 10
String or binary data would be truncated.
The statement has been terminated.

To enable the new functionality, we need to enable trace flag 460, you can do that by running the DBCC TRACEON command like this

DBCC TRACEON(460)

Now let's try that insert statement again

INSERT TruncateMe
SELECT '333333','444444',5.3

And there we go, you get the table name, the column name as well as the value, notice that the message id changed from 8152 to 2628 now

Msg 2628, Level 16, State 1, Line 20
String or binary data would be truncated in table 'truncatetest.dbo.TruncateMe', column 'somevalue'. Truncated value: '33333'.
The statement has been terminated.


So it looks it only returns the first value that generates the error, let's change the first value to fit into the column and execute the insert statement again

INSERT TruncateMe
SELECT '3','444444',5.3

Now you will see that the error is for the somevalue2 column


Msg 2628, Level 16, State 1, Line 27
String or binary data would be truncated in table 'truncatetest.dbo.TruncateMe', column 'somevalue2'. Truncated value: '44444'.
The statement has been terminated.



What will happen if you have more than one row that fails?

insert TruncateMe
select  '333333','444444',5.3 union all
select '3','444444',5.3

Here is the error

Msg 2628, Level 16, State 1, Line 37
String or binary data would be truncated in table 'truncatetest.dbo.TruncateMe', column 'somevalue'. Truncated value: '33333'.
The statement has been terminated.

As you can see the error is only for the first row, not the second one

What about Table Variables, will you also get an error with the column and value like for real tables?

declare  @table table (somevalue varchar(5),somevalue2 varchar(5), somedecimal decimal(5,2))


insert @table
select '333333','444444',5.3


Here is the error

Msg 2628, Level 16, State 1, Line 53
String or binary data would be truncated in table 'tempdb.dbo.#A6AD698B', column 'somevalue'. Truncated value: '33333'.

As you can see you also get the error, the table name is the internal table name for the table variable tied to your session


What about Table Valued Parameters?

CREATE TYPE TestTypeTrunc AS TABLE   
( somevalue varchar(5),somevalue2 varchar(5));  
GO

DECLARE @table TestTypeTrunc
INSERT @table
SELECT '555555555','444444'

Here is the error for that 

Msg 2628, Level 16, State 1, Line 58
String or binary data would be truncated in table 'tempdb.dbo.#AC6642E1', column 'somevalue'. Truncated value: '55555'.


Pretty much the same message as for the table variable


I do applaud Microsoft for finally fixing this.

Here it is also in beautiful technicolor