APPLIED STATISTICS (FOR
HUMANITIES)
Muhammad Ghazi
Spring 2023
Lecture 3: Descriptive Statistics
WHAT ARE
DESCRIPTIVE
STATISTICS?
Descriptive statistics are methods to summarize data
They allow us to say something about the data without showing the full dataset
In practice, computing them is the first thing we do when we get a dataset
They help us better understand what we’re dealing with
Analysis when we
understand the data well
Analysis when we don’t
understand the data well
DESCRIPTIVE
STATISTICS WE
WILL STUDY
Counts and percentages
Central tendency
• Mean
• Median
• Mode
Measures of spread
• Variance
• Standard deviation
• Percentiles
DESCRIPTIVE STATISTICS (A ROUGH
FRAMEWORK)
Qualitative variables
• Counts
• Percentages
Quantitative variables
• Central tendency: Mean, Median, Mode
• Spread: Variance, Std Dev, Percentiles
COUNTS AND PERCENTAGES
• Most basic way to describe qualitative / categorical variables
• Counts are the number of observations in each category
• Percentages express these as a fraction of total observations
$$\text{Percentage of category } x = \frac{\text{No. of observations that are } x}{\text{Total no. of observations}} \times 100$$
COUNTS AND PERCENTAGES: EXAMPLE
• Consider the following dataset
• How can we describe gender? Political affiliation?
ID  Name    Age  Gender  Income    Political affiliation  Voted in the last election?
1   Alfred  67   Male    $28,500   Liberal                Yes
2   Peter   27   Male    $275,000  Conservative           Yes
3   George  18   Male    $31,000   Liberal                No
4   Jannet  38   Female  $39,000   Liberal                Yes
5   Meagan  19   Female  $52,000   Moderate               Yes
6   Ivan    35   Male    $27,000   Conservative           No
7   Jenny   78   Female  $38,000   Conservative           Yes
8   Sam     43   Male    $33,000   Conservative           Yes
9   Emily   39   Female  $37,000   Moderate               No
10  Hellen  57   Female  $43,000   Liberal                Yes
COUNTS AND PERCENTAGES: EXAMPLE
• There are 5 males and 5 females in the previous dataset
• The most common way to formally present counts and percentages is to
use tables:
Gender  No. of people  Percentage
Male    5              50%
Female  5              50%
TOTAL:  10             100%
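The counts-and-percentages table above can also be produced programmatically. The following is a minimal Python sketch, not the course's method (the course uses Excel); the helper name `counts_and_percentages` is made up for illustration, and the `gender` list is simply the Gender column of the example dataset typed in by hand.

```python
from collections import Counter

# Gender column of the ten-person example dataset, in ID order.
gender = ["Male", "Male", "Male", "Female", "Female",
          "Male", "Female", "Male", "Female", "Female"]

def counts_and_percentages(values):
    """Return {category: (count, percentage of all observations)}."""
    total = len(values)
    counts = Counter(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}

summary = counts_and_percentages(gender)
for category, (count, pct) in summary.items():
    print(f"{category:<8}{count:>3}{pct:>7.0f}%")
print(f"{'TOTAL:':<8}{len(gender):>3}{100:>7.0f}%")
```

Each percentage divides the category count by the total number of observations (10 here), so the percentages always add up to 100%.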
CROSS TABULATIONS
• Often we will need to summarize information from two categorical
variables
• For example, how many males are politically liberal?
• This type of table is called a cross tabulation (cross tab)
• One variable will be in rows, while the other in columns
• Consider the cross tab of political affiliation and gender
              Male  Female
Liberal       2     2
Moderate      0     2
Conservative  3     1
Cross tabulating counts
Cross tabulating percentages
CROSS TABULATIONS: PERCENTAGES
• Counts in cross tabulations are simple
• Percentages are sometimes not obvious
• What percentage we use depends on what our frame of reference is
• For example, are we asking what percentage of males are liberal leaning?
• In this case we will divide the no. of males who are liberal by total no. of
males
• Or what percentage of liberals are males?
• In this case we will divide the no. of males who are liberal by total no. of
liberal people
• In practice, both are correct and which one we use depends on the context
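To make the two frames of reference concrete, here is a small Python sketch that tallies the example dataset and computes both percentages for liberal males. It is only an illustration under the assumption that the (gender, affiliation) pairs below are copied correctly from the ten-person dataset; the variable names are invented for this example.

```python
from collections import Counter

# (gender, political affiliation) pairs from the example dataset, in ID order.
people = [
    ("Male", "Liberal"), ("Male", "Conservative"), ("Male", "Liberal"),
    ("Female", "Liberal"), ("Female", "Moderate"), ("Male", "Conservative"),
    ("Female", "Conservative"), ("Male", "Conservative"),
    ("Female", "Moderate"), ("Female", "Liberal"),
]

cell_counts = Counter(people)                        # one count per cross-tab cell
gender_totals = Counter(g for g, _ in people)        # total per gender
affiliation_totals = Counter(a for _, a in people)   # total per affiliation

liberal_males = cell_counts[("Male", "Liberal")]

# Frame of reference 1: males. What percentage of males are liberal?
pct_of_males = 100 * liberal_males / gender_totals["Male"]

# Frame of reference 2: liberals. What percentage of liberals are male?
pct_of_liberals = 100 * liberal_males / affiliation_totals["Liberal"]

print(f"{pct_of_males:.0f}% of males are liberal")
print(f"{pct_of_liberals:.0f}% of liberals are male")
```

The numerator (2 liberal males) is the same in both cases; only the denominator changes (5 males versus 4 liberals), which is why the two answers differ.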
COUNTS AND PERCENTAGES: IN EXCEL
Covered in the Excel tutorial
CENTRAL TENDENCY
• Often the most informative way to describe a numerical variable is to describe
where its ‘center’ is
• The most common ways of computing the center are:
• Mean
• Median
• Mode (less common)
MEANS
• A mean is a simple average of all numbers
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
1. Add up all the numbers in the variable
2. Divide by the number of observations in the variable
MEAN: EXAMPLE
• Calculate the mean of 10,27,12,9,18,21,92
$$\text{Mean } (\bar{x}) = \frac{10 + 12 + 9 + 18 + 21 + 27 + 92}{7} = \frac{189}{7} = 27$$
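The same calculation can be checked in a couple of lines of Python. This is just a sketch of the two steps (add up, then divide by the number of observations); the function name `mean` is chosen for this example.

```python
def mean(values):
    """Add up all the numbers, then divide by how many there are."""
    return sum(values) / len(values)

numbers = [10, 27, 12, 9, 18, 21, 92]
print(sum(numbers), len(numbers))  # 189 7
print(mean(numbers))               # 27.0
```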
MEDIAN
• The number in the middle
1. Check if there are an odd or even number of observations
2. Order the numbers from smallest to largest.
3. If the data set contains an odd number of numbers, the one exactly in
the middle is the median.
4. If the data set contains an even number of numbers, take the two
numbers that appear exactly in the middle and average them to find the
median.
MEDIAN: EXAMPLES
• Calculate the median of 10,27,12,9,18,21,92
1. There are 7 numbers (odd no.)
2. Order: 9,10,12,18,21,27,92
3. Middle number (median) is 18
• Calculate the median of 21,15,20,14
1. There are 4 numbers (even no.)
2. Order: 14,15,20,21
3. Middle numbers are 15 and 20
4. Take their average to get median: $\frac{15 + 20}{2} = 17.5$
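The median steps translate directly into code. The sketch below is one possible way to write them in Python (the function name `median` is an illustrative choice); it sorts the values, then picks the middle one for an odd count or averages the two middle ones for an even count.

```python
def median(values):
    """Return the middle value of the sorted data (average of the two middle values if the count is even)."""
    ordered = sorted(values)          # order from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:                    # odd count: one value sits exactly in the middle
        return ordered[middle]
    # even count: average the two values in the middle
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([10, 27, 12, 9, 18, 21, 92]))  # 18
print(median([21, 15, 20, 14]))             # 17.5
```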
MODE
• The value that occurs most often
• Calculate the modal value of: 3,3,3,3,3,4,5,6,3,2,1
• Normally used for categorical variables
Category Frequency
A 10
B 21
C 5
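As a rough illustration, the mode can be found by counting how often each value occurs and keeping the most frequent one. The Python sketch below assumes ties should all be reported (a choice made for this example, not something the slides specify); the helper name `modes` is hypothetical.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest number of occurrences."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, n in counts.items() if n == highest]

print(modes([3, 3, 3, 3, 3, 4, 5, 6, 3, 2, 1]))  # [3]

# For a frequency table, the modal category is the one with the largest frequency.
frequencies = {"A": 10, "B": 21, "C": 5}
modal_category = max(frequencies, key=frequencies.get)
print(modal_category)  # B
```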
PICKING BEST
MEASURE OF CENTER
• Calculating mean, median or mode is
easy
• Picking the right measure is the tricky
bit
Mean
Median
Mode
WHEN TO USE MEAN OR MEDIAN (OR
MODE)
Mean
The default method
+ Universal and intuitive
+ Mathematically sound
(we'll see later)
- Susceptible to outliers
Median
Report when data is very
skewed or has noticeable
outliers. How do we
know?
Incomes are usually
reported as median
Mode
Less common
Report for categorical data
OR when one value dominates
Usually with other
measures
KEEP CONTEXT IN MIND
BE FLEXIBLE!
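The effect of an outlier on the mean can be seen with the Income column of the example dataset, where one person earns $275,000 and everyone else earns between $27,000 and $52,000. The Python sketch below uses the standard `statistics` module; dropping the highest income is done here purely to illustrate sensitivity and is not a recommendation to delete data.

```python
import statistics

# Income column of the ten-person example dataset (in dollars).
incomes = [28_500, 275_000, 31_000, 39_000, 52_000,
           27_000, 38_000, 33_000, 37_000, 43_000]

mean_all = statistics.mean(incomes)
median_all = statistics.median(incomes)

# Leave out the single very high income to see how much each measure moves.
typical = [x for x in incomes if x != 275_000]
mean_typical = statistics.mean(typical)
median_typical = statistics.median(typical)

print(f"All ten people:  mean = {mean_all:,.0f}, median = {median_all:,.0f}")
print(f"Without outlier: mean = {mean_typical:,.0f}, median = {median_typical:,.0f}")
```

With all ten people the mean is 60,350 while the median is 37,500; without the outlier the mean drops to 36,500 but the median barely moves (37,000). This is the sense in which the mean is susceptible to outliers and the median is not.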
MEAN, MEDIAN OR MODE
• Let’s go back to our dataset
• What’s the best central tendency measure to report for:
• Income: Median
• Age: Mean or median
• Political affiliation: Mode
EXERCISE: BEST MEASURE OF CENTER
What is the best measure of central tendency for the following:
1. Length of Christopher Nolan movies in minutes
2. U.S. household income
3. Platform with most engagement
MEASURES OF SPREAD
• Consider two sets of numbers
Set A : { 1 4 6 7 12 }
Set B: { 5 6 6 6 7 }
• What is the mean of each set?
MEASURES OF SPREAD
Set A : { 1 4 6 7 12 }
Set B: { 5 6 6 6 7 }
• If the mean of both the sets is the same, what’s the difference between
the two?
• Set A is obviously more ‘spread out’ than Set B
• Is there some way we can quantify this?
MEASURES OF SPREAD
• We can try taking the difference of each number in the set from the mean of
the set
• And sum up these differences
• But positive and negative differences will cancel out
Set A  Mean  Deviation (difference from mean)
1      6     -5
4      6     -2
6      6     0
7      6     1
12     6     6
SUM:         0
MEASURES OF SPREAD
• Instead, we can take the ‘squared differences from mean’
• And sum them up
• Divide by the number of observations minus 1*: 66/4 = 16.5
• This is called variance
Set A  Mean  Difference from mean  Squared difference from mean
1      6     -5                    25
4      6     -2                    4
6      6     0                     0
7      6     1                     1
12     6     6                     36
SUM:         0                     66
*In practice we divide by the no. of observations minus 1 to account for the fact that this is a sample
(more on this later in the course)
MEASURES OF SPREAD
• Do the same for Set B
• Divide by the number of observations minus 1: 2/4 = 0.5
Set B  Mean  Difference from mean  Squared difference from mean
5      6     -1                    1
6      6     0                     0
6      6     0                     0
6      6     0                     0
7      6     1                     1
SUM:         0                     2
MEASURES OF SPREAD
Set A : { 1 4 6 7 12 }
Set B: { 5 6 6 6 7 }
• The variance for set A is 16.5 but for Set B is only 0.5
• This means Set A is more spread out than Set B
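The variance calculation for both sets can be reproduced with a short Python sketch. It mirrors the table columns (difference from mean, squared difference) and divides by the number of observations minus 1; the function name `sample_variance` is chosen for this example.

```python
def sample_variance(values):
    """Sum of squared differences from the mean, divided by (n - 1)."""
    mean = sum(values) / len(values)
    deviations = [x - mean for x in values]    # difference from mean
    squared = [d ** 2 for d in deviations]     # squared difference from mean
    return sum(squared) / (len(values) - 1)

set_a = [1, 4, 6, 7, 12]
set_b = [5, 6, 6, 6, 7]

# The raw differences cancel out, which is why they are squared first.
print(sum(x - 6 for x in set_a))   # 0

print(sample_variance(set_a))      # 66 / 4 = 16.5
print(sample_variance(set_b))      # 2 / 4 = 0.5
```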
MEASURES OF SPREAD
Variance is measured in squared units, so it is hard to interpret directly
Instead, if we take its square root (√), we get the standard deviation
Standard deviation is the most common way of quantifying spread
A high standard deviation means that observations are spread far from the mean
A low standard deviation indicates observations are close to the mean
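As a worked illustration, taking the square root of the variances found earlier for Set A (16.5) and Set B (0.5) gives the following standard deviations, rounded to two decimals. The symbols $s_A$ and $s_B$ are notation introduced here and do not appear in the slides.

$$s_A = \sqrt{16.5} \approx 4.06, \qquad s_B = \sqrt{0.5} \approx 0.71$$

Unlike the variance, each standard deviation is in the same units as the original numbers, so Set A's values sit roughly 4 units from the mean of 6, while Set B's sit less than 1 unit away.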
RECAP OF VARIANCE AND STANDARD
DEVIATION
1. Find the mean of the variable ($\bar{x}$)
2. Subtract the mean from each value: $(x_i - \bar{x})$
3. Square this difference: $(x_i - \bar{x})^2$
4. Sum this squared difference for all values: $\sum_{i=1}^{n} (x_i - \bar{x})^2$
5. Divide by the number of observations minus 1* to get the variance:
$$\text{Variance} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
6. Take the square root of variance to get the standard deviation
$$\text{Standard Deviation} = \sqrt{\text{Variance}}$$
*We divide by n-1 instead of n as a standard practice to correct for the fact that the data was
collected as a sample (more on this later)
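To see what difference the n-1 divisor makes, the sketch below compares the two conventions on Set A using Python's standard `statistics` module. `variance` and `stdev` divide by n-1 (the convention used in this course), while `pvariance` and `pstdev` divide by n; the comparison is offered only as an illustration of the footnote, not as part of the original slides.

```python
import statistics

set_a = [1, 4, 6, 7, 12]

sample_var = statistics.variance(set_a)       # divides by n - 1
population_var = statistics.pvariance(set_a)  # divides by n

sample_sd = statistics.stdev(set_a)           # square root of sample_var
population_sd = statistics.pstdev(set_a)      # square root of population_var

print(f"n - 1: variance = {sample_var:.2f}, std dev = {sample_sd:.2f}")
print(f"n:     variance = {population_var:.2f}, std dev = {population_sd:.2f}")
```

Dividing by n-1 gives 66/4 = 16.5, whereas dividing by n gives 66/5 = 13.2, so the n-1 version is always slightly larger. In Excel, the functions that divide by n-1 are `VAR.S` and `STDEV.S`.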
CAUTION: ALWAYS DIVIDE BY N-1 FOR STD DEV AND VARIANCE!
EXERCISE:
• Calculate the standard deviation of the following
variable:
{10, 12, 23, 45, 120}
• On paper
• In Excel
