Median is a good measure of central tendency in many cases,
especially when the distribution is skewed. This mode function is readily
available in excel, but you will be surprised; T-SQL does not have a median
function.
Example query: (Here, say, we are required to get the median age per country)
What is median?
Median is the middle value when sorted in order. If you have
7 values then the 3rd value is the median. If there are even, say 10
elements in the set then it is average of 5th and 6th
value.
Example: Let’s say our ages are as follows (note that they
are sorted by age) median is 5
SL
NO
|
Age
|
1
|
2
|
2
|
3
|
3
|
5
|
4
|
5
|
5
|
6
|
6
|
10
|
7
|
15
|
If we had another row then our median age would have been
(5+6)/2
SL
NO
|
Age
|
1
|
2
|
2
|
3
|
3
|
5
|
4
|
5
|
5
|
6
|
6
|
10
|
7
|
15
|
8
|
15
|
So if you have a hundred
million rows you cannot do a quick
SELECT MEDIAN ([AGE]) FROM TABLE1
Nor can you paste the data on to excel and get the median
because the data you have exceeds the number of rows excel can handle.
Here is the solution I am offering: (although this solution
is not very novel it works well) sort the table by age (or the variable that
you wish to find the mean for) and a row number for every row. Then get the average of [(n+1)/2]+.25 and [(n+1)/2]-.25
th elements rounded off to the nearest whole number. ( where n
is the total count of rows.) . This way if the count of row number is odd, let's
say is 5 then we average 3rd and 3rd elements. If it is
even say 6 then we average 3rd and 4th elements.
With
CTE_AGE_RNK as(
select [SLNo]
,[Country]
,[Age]
,ROW_NUMBER() OVER(PARTITION by Country ORDER by [AgeAtInterest])[rnk]
,cast (COUNT(1) over(PARTITION by I. Country ) AS decimal)[cnt] /*this is needed*/
from Table1 I (NOLOCK)
)
SELECT
Country, avg([Age]) from CTE_AGE_RNK C
where rnk IN ( ROUND(((([cnt]+1)/2)-0.25),0),ROUND(((([cnt]+1)/2)-0.25),0))
GROUP BY Country
No comments:
Post a Comment