This came up the other day, someone wanted to know the percentage of NULL values in a column
Then I said "I bet you I can run that query without using a NULL in the WHERE clause, as a matter of fact, I can run that query without a WHERE clause at all!!"
The person then wanted to know more, so you know what that means.. it becomes a blog post :-)
BTW, the PostgreSQL version of this blog post can be found here: A quick and easy way to count the percentage of nulls without a where clause in PostgreSQL
To get started, first create this table and verify you have 9 rows
CREATE TABLE foo(bar int) INSERT foo values(1),(null),(2),(3),(4), (null),(5),(6),(7) SELECT * FROM foo
Here is what the output should be
bar
1
NULL
2
3
4
NULL
5
6
7
To get the NULL values and NON NULL values, you can do something like this
SELECT COUNT(*) as CountAll FROM foo WHERE bar IS NOT NULL SELECT COUNT(*) as CountAll FROM foo WHERE bar IS NULL
However, there is another way
Did you know that COUNT behaves differently if you use a column name compared to when you use *
Take a look
SELECT COUNT(*) as CountAll, COUNT(bar) as CountColumn FROM foo
If you ran that query, the result is the following
CountAll CountColumn
----------- -----------
9 7
Warning: Null value is eliminated by an aggregate or other SET operation.
And did you notice the warning? That came from the count against the column
Let's see what Books On Line has to say
COUNT(*) returns the number of items in a group. This includes NULL values and duplicates.
COUNT(ALL expression) evaluates expression for each row in a group, and returns the number of nonnull values.
COUNT(DISTINCT expression) evaluates expression for each row in a group, and returns the number of unique, nonnull values.
This is indeed documented behavior
So now, lets change our query to return the percentage of non null values in the column
SELECT COUNT(*) as CountAll, COUNT(bar) as CountColumn, (COUNT(bar)*1.0/COUNT(*))*100 as PercentageOfNonNullValues FROM foo
Here is the output
CountAll CountColumn percentageOfNonNullValues
----------- ----------- ---------------------------------------
9 7 77.777777777700
I just want to point out one thing, the reason I have this * 1.0 in the query
(COUNT(bar)*1.0/COUNT(*))*100
I am doing * 1.0 here because count returns an integer, so you will end up with integer math and the PercentageOfNonNullValues would be 0 instead of 77.7777...
That's it for this short post.. hopefully you knew this, if not, then you know it now :-)
No comments:
Post a Comment