Sunday 23 June 2013

Finding the Median in T-SQL

Median is a good measure of central tendency in many cases, especially when the distribution is skewed. This mode function is readily available in excel, but you will be surprised; T-SQL does not have a median function.
What is median?
Median is the middle value when sorted in order. If you have 7 values then the 3rd value is the median. If there are even, say 10 elements in the set then it is average of 5th and 6th value.
Example: Let’s say our ages are as follows (note that they are sorted by age) median is 5
SL NO
Age
1
2
2
3
3
5
4
5
5
6
6
10
7
15

If we had another row then our median age would have been (5+6)/2
SL NO
Age
1
2
2
3
3
5
4
5
5
6
6
10
7
15
8
15

So if you have a hundred million rows you cannot do a quick

SELECT MEDIAN ([AGE])  FROM TABLE1
Nor can you paste the data on to excel and get the median because the data you have exceeds the number of rows excel can handle.

Here is the solution I am offering: (although this solution is not very novel it works well) sort the table by age (or the variable that you wish to find the mean for) and a row number for every row.  Then get the average of [(n+1)/2]+.25 and [(n+1)/2]-.25 th elements rounded off to the nearest whole number. ( where n is the total count of rows.) . This way if the count of row number is odd, let's say is 5 then we average 3rd and 3rd elements. If it is even say 6 then we average 3rd  and 4th elements.

 Example query: (Here, say, we are required to get the median age per country)
With CTE_AGE_RNK as(
select [SLNo]
,[Country]
      ,[Age]
,ROW_NUMBER() OVER(PARTITION by Country ORDER by  [AgeAtInterest])[rnk]
      ,cast (COUNT(1) over(PARTITION by I. Country ) AS decimal)[cnt] /*this is needed*/
from Table1 I (NOLOCK)
)
SELECT 
Country, avg([Age]) from CTE_AGE_RNK C
where  rnk IN (  ROUND(((([cnt]+1)/2)-0.25),0),ROUND(((([cnt]+1)/2)-0.25),0))
GROUP BY Country


Was this useful Information?