15/02/2014
Biostatistics
Department of Community Health
Al Baha University
Dr Shaun Cochrane
cochrane.shaun@gmail.com
Introduction, Definitions and Sampling
Revision
1. Calculate x:
$x = \frac{6(3 + 7)}{60}$
2. Solve for x:
$x^2 + 5x + 1 = -3$
3. Calculate A:
$A = \sum_{x=2}^{5} x$
4. Calculate the average of the following four numbers:
5, 8, 12, 19
Introduction
• Why do we use statistics?
1. To organise and summarise data.
2. To reach decisions about a large body of data by examining only a subset
(a small part) of it.
E.g. measure the heights of everyone in the class and use them
to infer (tell us about) the heights of males in KSA.
Definitions
• Data – Numbers, raw material
• Statistics - A field of study concerned with (1) the collection,
organization, summarization, and analysis of data; and (2) the
drawing of inferences about a body of data when only a part of
the data is observed.
• Sources of Data – Records, surveys, experiments and external
sources
• Biostatistics – Statistics applied to data from the biological sciences and
medicine
Definitions
• Variable – a characteristic that takes on different values in different
persons, places or things.
• Quantitative Variable – measurable using numbers.
• Qualitative Variable – not measurable using numbers. Use categories
instead.
• Random Variable – a variable whose values arise by chance and
cannot be predicted exactly in advance.
• Discrete Random Variable – characterized by gaps or interruptions in
the values that it can assume.
Definitions
• Continuous Random Variable – Does not possess the gaps or
interruptions characteristic of a discrete random variable.
• Population - Collection of entities for which we have an interest at a
particular time.
• Sample - Part of a population.
Measurement and Measurement Scales
• Measurement – Assignment of numbers to an object or event according to
a set of rules.
There are different types of measurement/scales:
• Nominal Scale – Naming/Classifying observations into various mutually
exclusive categories.
• Ordinal Scale – Allows observations to be ranked: the categories differ
from one another and can be placed in order.
• Interval Scale – Allows for the ordering of observations and the measuring
of distance between observations (interval).
• Ratio Scale – Has a true zero, so both the equality of ratios and the
equality of intervals can be determined.
Examples (Data)
Definition | Example
Quantitative Variable | Height of adult male
Qualitative Variable | Country of birth
Random Variable | Height (many factors influence height)
Discrete Random Variable | Number of admissions to a hospital per day (1, 2, 3, …)
Continuous Random Variable | Height
Population | Weights of males in Saudi Arabia
Sample | Weights of males in Al Baha, Saudi Arabia
Examples (Measurement)
Definition | Example
Nominal Scale | Male – Female; Healthy – Sick
Ordinal Scale | Obese, Overweight, Healthy, Underweight
Interval Scale | Temperature (0°C – 37°C). The 0 is arbitrary.
Ratio Scale | Height (0 cm means no height). The 0 is real.
Statistical Inference
Statistical inference is the procedure by which we reach a conclusion
about a population on the basis of the information contained in a
sample that has been drawn from that population.
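We use statistical inference to judge whether the results from a sample support or contradict a claim about the population. It weighs evidence; it does not prove anything with certainty.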
Sampling
Simple random sample – If a sample of size n is drawn from a population of size N in such a way
that every possible sample of size n has the same chance of being selected,
the sample is called a simple random sample.
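Example: the population is the class (the 14 ages below); a simple random sample is 5 of these ages selected at random.
Subject | Age | Subject | Age
1 | 21 | 8 | 31
2 | 34 | 9 | 49
3 | 56 | 10 | 38
4 | 18 | 11 | 19
5 | 24 | 12 | 27
6 | 45 | 13 | 34
7 | 23 | 14 | 50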
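The idea of a simple random sample can be sketched in a few lines of Python. This is only an illustration: the function name `simple_random_sample` and the `seed` argument are assumptions made for the example, and the ages are the 14 values from the table above.

```python
import random

# Ages of the 14 subjects in the table (treated here as the whole population).
ages = [21, 34, 56, 18, 24, 45, 23, 31, 49, 38, 19, 27, 34, 50]

def simple_random_sample(population, n, seed=None):
    """Draw n items without replacement so every subset of size n is equally likely."""
    rng = random.Random(seed)  # the seed is optional; it only makes the draw repeatable
    return rng.sample(population, n)

sample = simple_random_sample(ages, n=5, seed=1)
print(sample)
```

Changing or omitting the seed gives a different sample each time, which is the point of random selection.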
Research and Experiments
• Research – A research study is a scientific study of a phenomenon of
interest. Research studies involve designing sampling protocols,
collecting and analysing data, and providing valid conclusions based
on the results of the analysis.
• Experiments – Experiments are a special type of research study in
which observations are made after specific manipulations of
conditions have been carried out. They provide the foundation for scientific
research.
Scientific Method
Arrays and Frequency Distribution
Revision
• Give examples of sources of data.
• Write down the steps of the scientific
process.
• Give an example of a qualitative variable.
• What do we mean by the ordinal scale?
Descriptive Statistics (Chapter 2)
• Arrays
• Frequency Distribution
• Stem and Leaf Displays/Diagrams
Arrays
• Organising of data into ordered arrays.
• Ordered Array: Listing of the values of a collection (population or
sample) in order of magnitude from the smallest to the largest.
• Allows us to:
• Determine the largest and smallest value.
• We can use Excel to order numbers from smallest to largest.
30 34 35 37 37 38 38 38 38 39 39 40 40 42 42
43 43 43 43 43 43 44 44 44 44 44 44 44 45 45
45 46 46 46 46 46 46 47 47 47 47 47 47 48 48
48 48 48 48 48 49 49 49 49 49 49 49 50 50 50
50 50 50 50 50 51 51 51 51 52 52 52 52 52 52
53 53 53 53 53 53 53 53 53 53 53 53 53 53 53
53 53 54 54 54 54 54 54 54 54 54 54 54 55 55
55 56 56 56 56 56 56 57 57 57 57 57 57 57 58
58 59 59 59 59 59 59 60 60 60 60 61 61 61 61
61 61 61 61 61 61 61 62 62 62 62 62 62 62 63
63 64 64 64 64 64 64 65 65 66 66 66 66 66 66
67 68 68 68 69 69 69 70 71 71 71 71 71 71 71
72 73 75 76 77 78 78 78 82
Ordered Array
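The slide mentions Excel for ordering values; the same step can be sketched in Python. The short list `raw_values` is a hypothetical unsorted set made up for the illustration, not the 189 values shown above.

```python
# Hypothetical unsorted observations (illustration only).
raw_values = [53, 30, 82, 47, 61, 38, 44, 50]

ordered_array = sorted(raw_values)  # smallest to largest
smallest = ordered_array[0]         # first element of the ordered array
largest = ordered_array[-1]         # last element of the ordered array

print(ordered_array, smallest, largest)
```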
Frequency Distribution
• An ordered array only gives us so much information.
• It is very useful to analyse/summarise the data further.
• We can group the data into class intervals.
• E.g. annual, monthly, 0 – 5, 6 – 10.
• The number of intervals is important. We must not have too few or too many.
• There should be no fewer than 5 and no more than 15.
• We can use the following equation:
$k = 1 + 3.322\,\log_{10} n$, where n = number of values and k = number of
intervals
Frequency distribution
•Calculate the number of intervals
that should be used if you have 275
values in your sample.
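As a worked sketch of the equation k = 1 + 3.322(log10 n), commonly known as Sturges' rule, the calculation for n = 275 can be written as follows. The function name is an assumption made for the example.

```python
import math

def number_of_class_intervals(n):
    """Suggested number of class intervals: k = 1 + 3.322 * log10(n)."""
    return 1 + 3.322 * math.log10(n)

k = number_of_class_intervals(275)
k_rounded = round(k)  # the rule is only a guide, so round to a whole number

print(f"k = {k:.2f}, use about {k_rounded} intervals")
```

The result is only a starting point; the final choice should still respect the 5 to 15 guideline and give convenient interval widths.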
Frequency Distribution
Class Interval | Frequency
30 – 39 | 11
40 – 49 | 46
50 – 59 | 70
60 – 69 | 45
70 – 79 | 16
80 – 89 | 1
Total | 189
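Grouping values into class intervals can be sketched as below. The function `frequency_distribution` and its `width` parameter are assumptions for the illustration; the input is just the first row of the ordered array (15 values), so the counts cover only that row, not the full table.

```python
from collections import Counter

def frequency_distribution(values, width=10):
    """Count how many values fall in each class interval of the given width."""
    counts = Counter((v // width) * width for v in values)  # lower limit of each interval
    return {f"{lo}-{lo + width - 1}": counts[lo] for lo in sorted(counts)}

# First row of the ordered array shown earlier.
first_row = [30, 34, 35, 37, 37, 38, 38, 38, 38, 39, 39, 40, 40, 42, 42]
table = frequency_distribution(first_row)

print(table)
```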
Relative Frequency
• Sometimes it is useful to know the proportion of values falling within
a class interval rather than just the number of values.
• This is known as the relative frequency of occurrence (the frequency divided by the total number of values).
• We can also calculate:
• Cumulative frequency (add the frequencies as you go down the column)
• Cumulative relative frequency (add the relative frequencies as you go down
the column).
Relative Frequency
Class Interval | Frequency | Cumulative Frequency | Relative Frequency | Cumulative Relative Frequency
30 – 39 | 11 | 11 | 0.0582 (11/189) | 0.0582
40 – 49 | 46 | 57 (11 + 46) | 0.2434 | 0.3016 (0.0582 + 0.2434)
50 – 59 | 70 | 127 | 0.3704 | 0.6720
60 – 69 | 45 | 172 | 0.2381 | 0.9101
70 – 79 | 16 | 188 | 0.0847 | 0.9948
80 – 89 | 1 | 189 | 0.0053 | 1.0001
Total | 189 | | 1.0001 |
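The columns of this table can be reproduced with a short sketch that starts from the six frequencies. Variable names are chosen for the example. Because the cumulative relative frequency is computed here from the cumulative counts rather than by adding rounded values, the last entry comes out as exactly 1 instead of the 1.0001 that rounding produces in the table.

```python
from itertools import accumulate

intervals = ["30-39", "40-49", "50-59", "60-69", "70-79", "80-89"]
frequencies = [11, 46, 70, 45, 16, 1]

total = sum(frequencies)                               # number of values
cumulative = list(accumulate(frequencies))             # running totals down the column
relative = [f / total for f in frequencies]            # proportion in each interval
cumulative_relative = [c / total for c in cumulative]  # running proportion

for row in zip(intervals, frequencies, cumulative, relative, cumulative_relative):
    print("{:<6} {:>3} {:>4} {:.4f} {:.4f}".format(*row))
```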
Histograms
• Frequency Distributions can be displayed as histograms.
• Charts and graphs are much easier to interpret and read than tables.
[Figure: Histogram of the frequency distribution. x-axis: class intervals 30 – 39 to 80 – 89; y-axis: frequency, 0 to 80.]
Stem and Leaf Display
11 3 04577888899
57 4 0022333333444444455566666677777788888889999999
(70) 5 00000000111122222233333333333333333444444444445556666667777777889+
62 6 000011111111111222222233444444556666667888999
17 7 0111111123567888
1 8 2
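Stem (second column): the tens digit, so 3 stands for the 30s, 4 for the 40s, 5 for the 50s, etc.
Leaf (third column): the units digit, so 30, 34, 35, 37 give the leaves 0, 4, 5, 7, etc.
First column: a cumulative count from the nearer end of the data; the value in brackets is the frequency of the row that contains the median.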
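A stem and leaf display for two-digit values can be sketched as follows. The function `stem_and_leaf` is a hypothetical helper written for this illustration; the input is the first row of the ordered array, so only the 30s and the start of the 40s appear.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digit) to a string of its leaves (units digits)."""
    rows = defaultdict(str)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 37 -> stem 3, leaf 7
        rows[stem] += str(leaf)
    return dict(rows)

values = [30, 34, 35, 37, 37, 38, 38, 38, 38, 39, 39, 40, 40, 42, 42]
display = stem_and_leaf(values)

for stem, leaves in display.items():
    print(stem, leaves)
```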
Assignment 1
In a study of the oral home care practice and reasons for seeking dental care among individuals on renal dialysis, Atassi (A-1) studied 90
subjects on renal dialysis. The oral hygiene status of all subjects was examined using a plaque index with a range of 0 to 3 (0 = no soft
plaque deposits, 3 = an abundance of soft plaque deposits). The following table shows the plaque index scores for all 90 subjects.
1.17 2.50 2.00 2.33 1.67 1.33
1.17 2.17 2.17 1.33 2.17 2.00
2.17 1.17 2.50 2.00 1.50 1.50
1.00 2.17 2.17 1.67 2.00 2.00
1.33 2.17 2.83 1.50 2.50 2.33
0.33 2.17 1.83 2.00 2.17 2.00
1.00 2.17 2.17 1.33 2.17 2.50
0.83 1.17 2.17 2.50 2.00 2.50
0.50 1.50 2.00 2.00 2.00 2.00
1.17 1.33 1.67 2.17 1.50 2.00
1.67 0.33 1.50 2.17 2.33 2.33
1.17 0.00 1.50 2.33 1.83 2.67
0.83 1.17 1.50 2.17 2.67 1.50
2.00 2.17 1.33 2.00 2.33 2.00
2.17 2.17 2.00 2.17 2.00 2.17
(a) Use these data to prepare:
• A frequency distribution
• A relative frequency distribution
• A cumulative frequency distribution
• A cumulative relative frequency distribution
• A histogram
Assignment 1
Work in groups of 5 or fewer.
Assignment due at the BEGINNING of the NEXT lecture! Late assignments will get 0.
Total Mark = 5
Descriptive Statistics
Descriptive Statistics Cont. (Chapter 2)
• Mean
• Median
• Mode
• Dispersion
• Standard Deviation
• Coefficient of Variation
• Percentiles
• Quartiles
• Box and Whisker Plot
Measures of Central Tendency
• Sometimes we just want a single number to describe the data. Such a number is
called a descriptive measure.
• Statistic: A descriptive measure computed from the data of a sample.
• Parameter: A descriptive measure computed from the data of a
population.
• Most common measures of central tendency are:
• Mean, Median and Mode
Arithmetic Mean
• Mean of 1, 2, 3, 4 = (1 + 2 + 3 + 4)/4 = 10/4 = 2.5
• Equation for the population mean:
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
• Equation for the sample mean:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Properties of the Mean
• Unique – only one mean for a set of data
• Simple – easy to calculate and easy to understand
• All values contribute to the calculation, so extreme values can
strongly influence the mean.
E.g. cost of a dentist visit in 5 areas of Al Baha:
SAR40 SAR45 SAR50 SAR50 SAR150 (mean = SAR67, pulled up by the single SAR150 value)
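The effect of an extreme value on the mean can be checked with the dentist costs from the slide. This is a small sketch using Python's standard `statistics` module; the variable names are chosen for the example.

```python
from statistics import mean

costs = [40, 45, 50, 50, 150]  # SAR, the five values from the slide

mean_all = mean(costs)                   # includes the extreme value of 150
mean_without_extreme = mean(costs[:-1])  # the same data without the 150

print(mean_all, mean_without_extreme)
```

Dropping the single extreme value moves the mean from 67 to 46.25, which shows how strongly one unusual observation can pull it.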
Median
• The middle value of the ranked data: it divides the data into two sets of equal size.
• E.g.
1 2 3 4 5 6 7: median = 4 (the middle value)
1 2 3 4 5 6 7 8: median = (4 + 5)/2 = 9/2 = 4.5 (the mean of the middle two
numbers)
IMPORTANT – Numbers must be ranked (smallest to largest)
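The two cases (odd and even number of values) can be sketched with the standard library. Note that `statistics.median` ranks the values itself, so the third call works even though its input is not in order.

```python
from statistics import median

odd_count = [1, 2, 3, 4, 5, 6, 7]
even_count = [1, 2, 3, 4, 5, 6, 7, 8]
unranked = [7, 1, 5, 3, 2, 6, 4]  # same values as odd_count, shuffled

median_odd = median(odd_count)     # the single middle value
median_even = median(even_count)   # the mean of the two middle values
median_unranked = median(unranked) # ranking is done internally

print(median_odd, median_even, median_unranked)
```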
Properties of the Median
• Unique and easy to understand
• Simple to calculate
• Not greatly affected by extreme values.
Mode
• The value in the dataset that occurs most frequently
• Eg.
1 1 2 2 2 3 4 4 5
The mode = 2 (occurs 3 times)
A dataset can have no mode or more than one mode.
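A sketch using `statistics.multimode` (available from Python 3.8) shows the one-mode and two-mode cases. When no value repeats, `multimode` returns every value because they all tie; under the slide's convention such a data set is described as having no mode. The second and third lists are hypothetical examples.

```python
from statistics import multimode

data = [1, 1, 2, 2, 2, 3, 4, 4, 5]   # the example from the slide
modes = multimode(data)              # most frequent value(s)

two_modes = multimode([1, 1, 2, 2, 3])  # two values tie for most frequent
all_tied = multimode([1, 2, 3])         # nothing repeats, so every value ties

print(modes, two_modes, all_tied)
```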
Measures of dispersion
• Dispersion = variability: how much the values differ from one another.
• Measures of dispersion include:
• Range
• Variance
• Standard deviation
• Coefficient of variation
• Percentiles
• Quartiles
• Interquartile range
Dispersion
Range
• The Range is the difference between the largest number and the
smallest number.
• Range = R
• Largest number = $x_L$
• Smallest number = $x_S$
$R = x_L - x_S$
Variance
• Variance measures dispersion as the scatter of the
values about their mean.
Variance
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
• $s^2$ = sample variance
• $x_i$ = the i-th value
• $\bar{x}$ = sample mean
• n = total number of values
Standard Deviation
• The variance is $s^2$.
• It is expressed in squared units, not the original units of the data.
• The standard deviation is its square root, $s = \sqrt{s^2}$, which is in the original units.
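The range, sample variance and standard deviation can be sketched together. This assumes the usual sample variance with an n − 1 divisor; the data set `[1, 2, 3, 4, 5]` and the function name are hypothetical, chosen so the arithmetic is easy to follow by hand.

```python
import math

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

data = [1, 2, 3, 4, 5]               # hypothetical sample, mean = 3

value_range = max(data) - min(data)  # largest minus smallest
s2 = sample_variance(data)           # (4 + 1 + 0 + 1 + 4) / 4
s = math.sqrt(s2)                    # back in the original units

print(value_range, s2, round(s, 4))
```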
Assignment 2
Work in groups of 5 or fewer and calculate the variance and standard deviation of the above data.
Assignment due at the BEGINNING of the NEXT lecture! Late assignments will get 0.
Total Mark = 5
Descriptive Statistics Continued
Coefficient of Variation
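• Remember the standard deviation.
• Sometimes we want to compare the variability of two samples that
are measured in different units.
• E.g. weight (kg) and cholesterol concentration (g/dl)
• We then use the coefficient of variation, which expresses the standard deviation as a percentage of the mean:
$CV = \frac{s}{\bar{x}} \times 100\%$
s = standard deviation
$\bar{x}$ = mean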
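A short sketch of the comparison, assuming the usual definition CV = (s / mean) × 100. The means and standard deviations below are hypothetical numbers invented for the illustration, not data from the slides.

```python
def coefficient_of_variation(s, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * s / mean

# Hypothetical summaries of two samples measured in different units.
cv_weight = coefficient_of_variation(s=10, mean=80)        # weight
cv_cholesterol = coefficient_of_variation(s=40, mean=200)  # cholesterol concentration

more_variable = "cholesterol" if cv_cholesterol > cv_weight else "weight"
print(cv_weight, cv_cholesterol, more_variable)
```

Because the units cancel, the two percentages can be compared directly even though the raw standard deviations cannot.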
Coefficient of Variation
Which sample has more variation?
Percentiles and Quartiles
• Percentiles:
Quartiles
The percentile equations give the position of a
percentile in the ordered data, not its value.
The most commonly used percentiles are the 25th, 50th and 75th. These are
known as the quartiles Q1, Q2 (the median) and Q3: 25%, 50% and 75% of the data
lie at or below them, respectively.
Quartiles
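i | x_i | Quartile
1 | 102 |
2 | 104 |
3 | 105 | Q1
4 | 107 |
5 | 108 |
6 | 109 | Q2 (median)
7 | 110 |
8 | 112 |
9 | 115 | Q3
10 | 116 |
11 | 118 |
Quartile positions for n = 11: Q1 = (n + 1)/4 = 3, Q2 = 2(n + 1)/4 = 6, Q3 = 3(n + 1)/4 = 9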
Interquartile Range
Q1 = 105
Q3 = 115
IQR = Q3 – Q1 = 115 – 105 = 10
• The bigger the IQR, the more variability in the
middle 50% of numbers.
• The smaller the IQR, the less variability in the
middle 50% of numbers.
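The quartile positions and the IQR for the eleven values in the table can be sketched as below. The position rule k(n + 1)/4 is an assumption inferred from the positions 3, 6 and 9 shown for n = 11; the function name is chosen for the example. The sketch only handles the case where the positions are whole numbers, as they are here.

```python
data = [102, 104, 105, 107, 108, 109, 110, 112, 115, 116, 118]  # already ranked

def quartile_position(k, n):
    """1-based position of the k-th quartile in n ranked values: k(n + 1)/4."""
    return k * (n + 1) / 4

n = len(data)
positions = [quartile_position(k, n) for k in (1, 2, 3)]

# The positions are whole numbers for n = 11, so each quartile is read off directly.
q1, q2, q3 = (data[int(p) - 1] for p in positions)
iqr = q3 - q1

print(positions, q1, q2, q3, iqr)
```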
Probability
Introduction
1. Given some process (or experiment) with n mutually exclusive
outcomes (called events), E1, E2, …, En, the probability of any event
Ei is a non-negative number.
P(Ei) ≥ 0
Mutually exclusive: the events cannot occur simultaneously.
Introduction
2. The sum of the probabilities of all the mutually exclusive outcomes is equal to
1.
P(E1) + P(E2) + … + P(En) = 1
3. The probability that either of two mutually exclusive events occurs is equal to the sum
of the individual probabilities.
P(Ei + Ej) = P(Ei) + P(Ej), where Ei + Ej means "Ei or Ej"
Example
• What is the probability that a randomly picked patient is 18
years or younger (event E)?
P(E) = 141/318 = 0.4434
• What is the probability that a randomly picked patient is over 18 years
old?
P(Ē) = 177/318 = 0.5566
Conditional Probability
• A conditional probability is calculated from a subset of the total group, so the denominator is the size of that subset
(e.g. only the subjects aged 18 or younger, rather than all 318 subjects surveyed for mood disorder).
• Example: What is the probability that a person 18 years old or younger has
no family history of mood disorder (event A)?
Total patients 18 years or younger = 141
Patients 18 years or younger with no family history of mood disorder = 28
Probability = 28/141
P(A|E) = 28/141 = 0.1986
Joint Probability
• Sometimes we want to find the probability that a subject picked at
random from a group of subjects possesses two characteristics at the
same time.
• Example: What is the probability that a subject picked at random will
be 18 years or younger and will have no family history of mood
disorder?
P(E∩A) = 28/318 = 0.0881
Multiplication Rule
• We can calculate probabilities from other probabilities. A joint
probability can be calculated as the product of a marginal probability
and a conditional probability.
• Example: What is the joint probability of early onset (18 or below) of
mood disorder and a negative family history of mood disorder?
P(E) = 141/318 = 0.4434
P(A|E) = 28/141 = 0.1986
We need to calculate P(E∩A):
P(E∩A) = P(E)P(A|E) = 0.4434 × 0.1986 = 0.0881
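The marginal, conditional and joint probabilities above can be checked with exact fractions. This sketch uses only the counts quoted in the example (318, 141 and 28); the variable names are chosen for the illustration.

```python
from fractions import Fraction

total = 318              # all subjects
early = 141              # E: onset at 18 years or younger
early_and_negative = 28  # E and A: early onset with no family history

p_e = Fraction(early, total)                       # marginal probability P(E)
p_a_given_e = Fraction(early_and_negative, early)  # conditional probability P(A|E)
p_joint = p_e * p_a_given_e                        # multiplication rule: P(E and A)

print(round(float(p_e), 4), round(float(p_a_given_e), 4), round(float(p_joint), 4))
```

The product equals 28/318 exactly, which is the joint probability read directly from the counts.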
Addition Rule
• If two events are mutually exclusive, the probability that one or the
other occurs is equal to the sum of their two
individual probabilities. In general the events do not have to be mutually
exclusive, and the probability that both occur is subtracted so that it is not counted twice:
P(A∪B) = P(A) + P(B) – P(A∩B)
Addition Rule
• Example: If we pick a person at random, what is the probability that this
person will have early onset of mood disorder, or will have no family
history of mood disorders, or both?
P(E∪A) = P(E) + P(A) – P(E∩A)
• P(E) = 141/318 = 0.4434
• P(E∩A) = 28/318 = 0.0881
• P(A) = 63/318 = 0.1981
• P(E∪A) = 0.4434 + 0.1981 – 0.0881 = 0.5534
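The addition rule can be checked with the same counts, again as a sketch with exact fractions. Working from the counts gives 176/318, about 0.5535; adding the three rounded probabilities by hand gives 0.5534, and the small difference is only rounding.

```python
from fractions import Fraction

total = 318
p_e = Fraction(141, total)       # P(E): early onset
p_a = Fraction(63, total)        # P(A): no family history
p_e_and_a = Fraction(28, total)  # P(E and A)

# Addition rule: subtract the overlap so it is not counted twice.
p_e_or_a = p_e + p_a - p_e_and_a

print(p_e_or_a, round(float(p_e_or_a), 4))
```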
Independent Events
• A and B are independent events if the probability of event A
happening is the same whether event B occurs or not.
• For independent events the multiplication rule simplifies to:
P(A∩B) = P(A)P(B)
Independent Events
• In a high school with 60 girls and 40 boys, 24 girls and 16 boys wear
glasses. What is the probability that a student picked at random is a boy
and wears eyeglasses?
• Being a boy (B) and wearing eyeglasses (E) are independent, because the proportion wearing glasses is the same among boys (16/40 = 0.4) as in the whole school (40/100 = 0.4).
P(B∩E) = P(B)P(E)
• P(B) = 40/100 = 0.4
• P(E) = (24 + 16)/100 = 40/100 = 0.4
• P(B∩E) = 0.4 × 0.4 = 0.16
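Independence can be verified from the counts in the example: the product P(B)P(E) should equal the joint probability obtained by counting boys who wear glasses. A minimal sketch with exact fractions (variable names are chosen for the illustration):

```python
from fractions import Fraction

girls, boys = 60, 40
girls_with_glasses, boys_with_glasses = 24, 16
total = girls + boys

p_boy = Fraction(boys, total)                                        # P(B)
p_glasses = Fraction(girls_with_glasses + boys_with_glasses, total)  # P(E)
p_boy_and_glasses = Fraction(boys_with_glasses, total)               # P(B and E), by counting

# For independent events the joint probability equals the product of the marginals.
independent = p_boy_and_glasses == p_boy * p_glasses

print(float(p_boy), float(p_glasses), float(p_boy_and_glasses), independent)
```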
Complementary Events
• Complementary events are mutually exclusive, and together they cover every possible outcome.
• Example: early onset and late onset are complementary: every patient is one
or the other, never both.
P(Ā) = 1 – P(A)
Complementary Events
• Example: If there are 1200 admissions to a hospital and 750
admissions are private then 450 patients must be state patients. So:
• P(A) = 750/1200 = 0.625
• Then P(Ā) = 1 – P(A) = 1 – 0.625 = 0.375
• Therefore the probability of a patient being a state patient is 0.375
Bayes Theorem
Bayes Theorem
• In the health sciences we often need to:
• Predict the presence or absence of a disease from test results (+ or -)
• Predict the outcome of a diagnostic test from previous test results
• Important to know what the following mean:
We must always be able to answer the following
questions to determine the accuracy of a diagnostic
test:
Sensitivity (Q1)
The above two way table allows us to calculate the sensitivity of a diagnostic test
Specificity (Q2)
Predictive Value Positive (Q3)
= P(D│T)
Predictive Value Negative
= P(D’│T’)
Bayes Theorem
• Predictive Positive Value
• Predictive Negative Value
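The formulas on this slide did not survive extraction. As a hedged reconstruction, the standard Bayes' theorem expressions for the two predictive values are given below, written with the symbols used elsewhere in these slides: D = disease present, D’ = disease absent, T = positive test, T’ = negative test. P(T│D) is the sensitivity, P(T’│D’) is the specificity, and P(D) is the rate of the disease in the population of interest.

```latex
P(D \mid T) = \frac{P(T \mid D)\,P(D)}{P(T \mid D)\,P(D) + P(T \mid D')\,P(D')}

P(D' \mid T') = \frac{P(T' \mid D')\,P(D')}{P(T' \mid D')\,P(D') + P(T' \mid D)\,P(D)}
```

Here P(T│D’) = 1 – specificity, P(T’│D) = 1 – sensitivity and P(D’) = 1 – P(D).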
Example
A medical research team wished to evaluate a proposed screening test for Alzheimer’s disease. The
test was given to a random sample of 450 patients with Alzheimer’s disease and an independent
random sample of 500 patients without symptoms of the disease. The two samples were drawn from
populations of subjects who were 65 years of age or older. The results are as follows:
Example
• Calculate the sensitivity of the test, P(T│D)
• Calculate the specificity of the test, P(T’│D’)
• Calculate the predictive value positive, P(D│T)
• Calculate the predictive value negative, P(D’│T’)
Example
(Sensitivity)
(Specificity)
Examples
(Predictive Value Positive)
(Predictive Value Negative)
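The four quantities can be sketched in code. The results table for this example did not survive in the slides, so the counts below (436 and 14 among the 450 patients with the disease, 5 and 495 among the 500 without) and the prevalence of 0.113 are illustrative assumptions, chosen only to match the stated sample sizes. The function name `diagnostic_summary` is also hypothetical.

```python
def diagnostic_summary(tp, fn, fp, tn, prevalence):
    """Sensitivity, specificity and predictive values (via Bayes' theorem)."""
    sensitivity = tp / (tp + fn)  # P(T | D): positive test among the diseased
    specificity = tn / (tn + fp)  # P(T' | D'): negative test among the non-diseased
    # Predictive values depend on the prevalence P(D) in the target population.
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    )
    npv = (specificity * (1 - prevalence)) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence
    )
    return {"sensitivity": sensitivity, "specificity": specificity,
            "ppv": ppv, "npv": npv}

# Assumed counts and prevalence, for illustration only.
result = diagnostic_summary(tp=436, fn=14, fp=5, tn=495, prevalence=0.113)

for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The prevalence is passed in separately because the two samples were drawn independently, so the proportion of diseased subjects in the study (450 of 950) says nothing about how common the disease is in the population.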
Estimation, z-Value and t-value
Remember:
Statistical inference needs to be made with confidence (certainty),
but most populations of interest are so large that we cannot look at 100% of the
population. We therefore need to ESTIMATE population values from a sample.
Estimations: Definitions
Sampling: Definitions
Confidence Intervals
Think about the mean and look at the distribution of the numbers around
the mean. This is the normal distribution. In all calculations we will assume
a normal distribution.
Confidence Intervals
• We know that about 95% of normally distributed data lies within two standard deviations of
the mean (just know this; the more exact figure is 1.96 standard deviations).
• This means that we can be 95% confident that a value from
our data set falls in this range.
• Equation for 95% confidence = μ ± 2s (s = standard deviation)
Confidence Intervals
Example
z-Value
We get the z-value from tables.
α – significance level (e.g. 1% = 0.01); the confidence level is 1 – α
z – reliability coefficient
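The slide's formula is missing from the extraction, so the sketch below assumes the usual form of a confidence interval for a mean when the population standard deviation σ is known: sample mean ± z × σ/√n, with z taken at 1 – α/2. The z-value is looked up with `statistics.NormalDist` from the standard library instead of a printed table. The function name and the numbers (mean 50, σ = 10, n = 100) are hypothetical.

```python
import math
from statistics import NormalDist

def z_confidence_interval(sample_mean, sigma, n, confidence=0.95):
    """Interval: sample_mean +/- z * sigma / sqrt(n), with z at 1 - alpha/2."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # reliability coefficient
    margin = z * sigma / math.sqrt(n)        # z times the standard error
    return sample_mean - margin, sample_mean + margin

z_95 = NormalDist().inv_cdf(0.975)  # the familiar value of about 1.96
low, high = z_confidence_interval(sample_mean=50, sigma=10, n=100)

print(round(z_95, 2), round(low, 2), round(high, 2))
```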
Example
t-Test
• The z-value is useful for large samples (n above 30), but what if we have
a small sample?
• Use a t-value.
t – confidence (reliability) coefficient. It will be given to you in all questions but can be
obtained from statistical tables.
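For a small sample the same interval can be sketched with a t-value in place of z and the sample standard deviation s in place of σ. Since the t-value is given in each question, it is passed in as an argument here. The function name and the numbers (mean 50, s = 10, n = 10) are hypothetical; 2.262 is the tabulated two-sided 95% t-value for n – 1 = 9 degrees of freedom.

```python
import math

def t_confidence_interval(sample_mean, s, n, t_value):
    """Interval: sample_mean +/- t * s / sqrt(n), with t supplied from a table."""
    margin = t_value * s / math.sqrt(n)
    return sample_mean - margin, sample_mean + margin

# t = 2.262 for 95% confidence with 9 degrees of freedom (from a t-table).
low, high = t_confidence_interval(sample_mean=50, s=10, n=10, t_value=2.262)

print(round(low, 2), round(high, 2))
```

The interval is wider than a z-based interval would be for the same numbers, reflecting the extra uncertainty of a small sample.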
t-Value Table

More Related Content

What's hot (20)

PPTX
1.3 collecting sample data
Long Beach City College
 
PPTX
Statstics in nursing
Monika Devi NR
 
PPTX
Machine learning pre requisite
Ram Singh
 
PPT
Aed1222 lesson 2
nurun2010
 
PDF
Schuster how to_avoid_mistakes_with_statistics_31052013
Thomas_Schuster
 
PPTX
Basic statistics
Seth Anandaram Jaipuria College
 
PPT
Measures of central tendency
Alex Chris
 
PPTX
Basic Descriptive statistics
Ajendra Sharma
 
PPTX
Intro to statistics
Ratheeshkrishnakripa
 
PPTX
Basics of Educational Statistics (Descriptive statistics)
HennaAnsari
 
PPTX
Biostatistics /certified fixed orthodontic courses by Indian dental academy
Indian dental academy
 
PPT
Statistics Vocabulary Chapter 1
Debra Wallace
 
PPTX
Statistics in research by dr. sudhir sahu
Sudhir INDIA
 
PPTX
Lab 1 intro
Erik D. Davenport
 
PPTX
2.7.21 sampling methods data analysis
Ashish965416
 
PPTX
Statistics in Physical Education
dryadav1300
 
PDF
Descriptive statistics
Venkata Reddy Konasani
 
PPTX
Introduction to Statistics in Nursing.
Johny Kutty Joseph
 
DOCX
Assigment 1
Shaf Fik
 
1.3 collecting sample data
Long Beach City College
 
Statstics in nursing
Monika Devi NR
 
Machine learning pre requisite
Ram Singh
 
Aed1222 lesson 2
nurun2010
 
Schuster how to_avoid_mistakes_with_statistics_31052013
Thomas_Schuster
 
Measures of central tendency
Alex Chris
 
Basic Descriptive statistics
Ajendra Sharma
 
Intro to statistics
Ratheeshkrishnakripa
 
Basics of Educational Statistics (Descriptive statistics)
HennaAnsari
 
Biostatistics /certified fixed orthodontic courses by Indian dental academy
Indian dental academy
 
Statistics Vocabulary Chapter 1
Debra Wallace
 
Statistics in research by dr. sudhir sahu
Sudhir INDIA
 
Lab 1 intro
Erik D. Davenport
 
2.7.21 sampling methods data analysis
Ashish965416
 
Statistics in Physical Education
dryadav1300
 
Descriptive statistics
Venkata Reddy Konasani
 
Introduction to Statistics in Nursing.
Johny Kutty Joseph
 
Assigment 1
Shaf Fik
 

Similar to Biostatistics CH Lecture Pack (20)

PPTX
PARAMETRIC TESTS.pptx
DrLasya
 
PPTX
Stat-Lesson.pptx
JennilynFeliciano2
 
PPT
Manpreet kay bhatia Business Statistics.ppt
Noorien3
 
PPTX
BIOSTATISTICS OVERALL JUNE 20241234567.pptx
anasabdulmajeed3sker
 
PPTX
Understanding statistics in research
Dr. Senthilvel Vasudevan
 
PPTX
1. Descriptive statistics.pptx engineering
levis1659
 
PPT
Introduction to statistics
Shaamma(Simi_ch) Fiverr
 
PPT
SFEPart1toolgraphs10 containing main things.ppt
onlyforstalking1122
 
PDF
1Basic biostatistics.pdf
YomifDeksisaHerpa
 
PPTX
Complete Biostatistics (Descriptive and Inferential analysis)
DrAbdiwaliMohamedAbd
 
PPTX
Introduction of biostatistics
khushbu
 
PPT
Statistical methods
venkataramanan Thiru
 
PPTX
Biostatistics_descriptive stats.pptx
MohammedAbdela7
 
PDF
Lessontwo - Measures of Tendency.pptx.pdf
hebaelkouly
 
PDF
Lesson2 - chapter 2 Measures of Tendency.pptx.pdf
hebaelkouly
 
PDF
Lesson2 - chapter two Measures of Tendency.pptx.pdf
hebaelkouly
 
PPT
Business statistics (Basics)
AhmedToheed3
 
PPTX
Probability in statistics
Sukirti Garg
 
PPT
businessstatistics-stat10022-200411201812.ppt
tejashreegurav243
 
PDF
Basic Statistics, Biostatistics, and Frequency Distribution
Gaurav Patil
 
PARAMETRIC TESTS.pptx
DrLasya
 
Stat-Lesson.pptx
JennilynFeliciano2
 
Manpreet kay bhatia Business Statistics.ppt
Noorien3
 
BIOSTATISTICS OVERALL JUNE 20241234567.pptx
anasabdulmajeed3sker
 
Understanding statistics in research
Dr. Senthilvel Vasudevan
 
1. Descriptive statistics.pptx engineering
levis1659
 
Introduction to statistics
Shaamma(Simi_ch) Fiverr
 
SFEPart1toolgraphs10 containing main things.ppt
onlyforstalking1122
 
1Basic biostatistics.pdf
YomifDeksisaHerpa
 
Complete Biostatistics (Descriptive and Inferential analysis)
DrAbdiwaliMohamedAbd
 
Introduction of biostatistics
khushbu
 
Statistical methods
venkataramanan Thiru
 
Biostatistics_descriptive stats.pptx
MohammedAbdela7
 
Lessontwo - Measures of Tendency.pptx.pdf
hebaelkouly
 
Lesson2 - chapter 2 Measures of Tendency.pptx.pdf
hebaelkouly
 
Lesson2 - chapter two Measures of Tendency.pptx.pdf
hebaelkouly
 
Business statistics (Basics)
AhmedToheed3
 
Probability in statistics
Sukirti Garg
 
businessstatistics-stat10022-200411201812.ppt
tejashreegurav243
 
Basic Statistics, Biostatistics, and Frequency Distribution
Gaurav Patil
 
Ad

Biostatistics CH Lecture Pack

  • 1. 15/02/2014 1 Biostatistics Department of Community Health Al Baha University Dr Shaun Cochrane cochrane.shaun@gmail.com Introduction, Definitions and Sampling Revision 1. Calculate x: 𝑥 = 6(3 + 7) 60 2. Solve for x: x2 + 5x + 1 = -3 3. Calculate A: 𝐴 = 𝑥=2 5 𝑥 4. Calculate the average of the following four numbers: 5, 8, 12, 19
  • 2. 15/02/2014 2 Introduction • Why do we use statistics: 1. Organise and summarise data 2. Reach decisions about the data using a subset (small part of) the data EG. Take the heights of everyone in the class and use it to infer (tell us about) the height of males in KSA Definitions • Data – Numbers, raw material • Statistics - A field of study concerned with (1) the collection, organization, summarization, and analysis of data; and (2) the drawing of inferences about a body of data when only a part of the data is observed. • Sources of Data – 1. Records, Surveys, Experiments, External Sources • Biostatistics – Using data from biological sciences and medicine
  • 3. 15/02/2014 3 Definitions • Variable – characteristic that takes on a different value in diferent persons, places, things. • Quantitative Variable – measurable using numbers. • Qualitative Variable – not measurable using numbers. Use categories instead. • Random Variable – A variable that occurs because of chance and cannot be predicted accurately up front. • Discrete Random Variable - Characterized by gaps or interruptions in the values that it can assume. Definitions • Continuous Random Variable – Does not possess the gaps or interruptions characteristic of a discrete random variable. • Population - Collection of entities for which we have an interest at a particular time. • Sample - Part of a population.
  • 4. 15/02/2014 4 Measurement and Measurement Scales • Measurement – Assignment of numbers to an object or event according to a set of rules. There are different types of measurement/scales: • Nominal Scale – Naming/Classifying observations into various mutually exclusive categories. • Ordinal Scale – Allows for ranking of observations that are different between categories. • Interval Scale – Allows for the ordering of observations and the measuring of distance between observations (interval). • Ratio Scale – Allows for the calculation of the equality of ratios and the equality of intervals. Examples (Data) Definition Example Quantitative Variable Height of Adult Male Qualitative Variable Country of birth Random Variable Height (many factors influence height) Discrete Random Variable Number of admissions to a hospital per day (1,2,3…..) Continuous Random Variable Height Population Weights of males in Saudi Arabia Sample Weights of males in Al Baha, Saudi Arabia
  • 5. 15/02/2014 5 Examples (Measurement) Definition Example Nominal Scale Male – Female; Healthy - Sick Ordinal Scale Obese, Overweight, healthy, Underweight, Interval Scale Temperature (0°C - 37°C). The 0 is arbitrary Ratio Scale Height (0cm means no height. The 0 is real. Statistical Inference Statistical inference is the procedure by which we reach a conclusion about a population on the basis of the information contained in a sample that has been drawn from that population. We use statistical inference to prove or disprove results.
  • 6. 15/02/2014 6 Sampling Sample – If a sample n is drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected, the sample is called a simple random sample. Population = class, Random samples = 5 numbers randomly selected Sample Age Sample Age 1 21 8 31 2 34 9 49 3 56 10 38 4 18 11 19 5 24 12 27 6 45 13 34 7 23 14 50 Research and Experiments • Research – A research study is a scientific study of a phenomenon of interest. Research studies involve designing sampling protocols, collecting and analysing data, and providing valid conclusions based on the results of the analysis. • Experiments – Experiments are a special type of research study in which observations are made after specific manipulations of conditions have been out. They provide the foundation for scientific research.
  • 8. 15/02/2014 8 Revision •Give examples of sources of data. •Write down the steps of the scientific process. •Give an example of a qualitative variable. •What do we mean by the ordinal scale. Descriptive Statistics. (Chapter 2) •Arrays •Frequency •Distribution •Stem and Leaf Displays/ Diagrams
  • 9. 15/02/2014 9 Arrays • Organising of data into ordered arrays. • Ordered Array: Listing of the values of a collection (population or sample) in order of magnitude from the smallest to the largest. • Allows us to: • Determine the largest and smallest value. • We can use Excel to order numbers from smallest to largest. 30 34 35 37 37 38 38 38 38 39 39 40 40 42 42 43 43 43 43 43 43 44 44 44 44 44 44 44 45 45 45 46 46 46 46 46 46 47 47 47 47 47 47 48 48 48 48 48 48 48 49 49 49 49 49 49 49 50 50 50 50 50 50 50 50 51 51 51 51 52 52 52 52 52 52 53 53 53 53 53 53 53 53 53 53 53 53 53 53 53 53 53 54 54 54 54 54 54 54 54 54 54 54 55 55 55 56 56 56 56 56 56 57 57 57 57 57 57 57 58 58 59 59 59 59 59 59 60 60 60 60 61 61 61 61 61 61 61 61 61 61 61 62 62 62 62 62 62 62 63 63 64 64 64 64 64 64 65 65 66 66 66 66 66 66 67 68 68 68 69 69 69 70 71 71 71 71 71 71 71 72 73 75 76 77 78 78 78 82 Ordered Array
  • 10. 15/02/2014 10 Frequency Distribution • An ordered array only gives us so much information. • Very useful to further analyse/summarise the data. • We can group the data into class intervals. • Eg. Annual, monthly, 0 – 5, 6 – 10. • Number of intervals is important. Must not have to few or too many. • Should not have more than 15 and not less than 5 • Can use the following equation: k = 1 + 3.322(log10n) where n = number of values and k = number of intervals Frequency distribution •Calculate the number of intervals that should be used if you have 275 values in your sample.
  • 11. 15/02/2014 11 Frequency Distribution Class Interval Frequency 30 – 39 11 40 – 49 46 50 – 59 70 60 – 69 45 70 – 79 16 80 - 89 1 Total 189 Frequency Distribution Relative Frequency • Sometimes it is useful to know the proportion of values falling with in a class interval rather than just the number of values. • This is known as the relative frequency of occurrence. • In order to determine the frequencies we need to calculate: • Cumulative Frequency (add the number of values as you go down the column) • We can then calculate relative frequency as well as cumulative relative frequency.
  • 12. 15/02/2014 12 Relative Frequency Class Interval Frequency Cumulative Frequency Relative Frequency Cumulative Relative Frequency 30 – 39 11 11 0.0582 (11/189) 0.0582 40 – 49 46 57 (11 + 46) 0.2434 0.3016 (0.0582 + 0.2434) 50 – 59 70 127 0.3704 0.6720 60 – 69 45 172 0.2381 0.9101 70 – 79 16 188 0.0847 0.9948 80 - 89 1 189 0.0053 1.0001 Total 189 1.0001 Histograms • Frequency Distributions can be displayed as histograms. • Charts and graphs are much easier to interpret and read than tables. 0 10 20 30 40 50 60 70 80 30 – 39 40 – 49 50 – 59 60 – 69 70 – 79 80 - 89 Histogram
  • 13. 15/02/2014 13 Stem and Leaf Display 11 3 04577888899 57 4 0022333333444444455566666677777788888889999999 (70) 5 00000000111122222233333333333333333444444444445556666667777777889+ 62 6 000011111111111222222233444444556666667888999 17 7 0111111123567888 1 8 2 Stem: 30, 40, 50 etc. Leaf: 30, 34, 35, 37 etc. Frequency Assignment 1 In a study of the oral home care practice and reasons for seeking dental care among individuals on renal dialysis, Atassi (A-1) studied 90 subjects on renal dialysis. The oral hygiene status of all subjects was examined using a plaque index with a range of 0 to 3 (0 = no soft plaque deposits, 3 = an abundance of soft plaque deposits). The following table shows the plaque index scores for all 90 subjects. 1.17 2.50 2.00 2.33 1.67 1.33 1.17 2.17 2.17 1.33 2.17 2.00 2.17 1.17 2.50 2.00 1.50 1.50 1.00 2.17 2.17 1.67 2.00 2.00 1.33 2.17 2.83 1.50 2.50 2.33 0.33 2.17 1.83 2.00 2.17 2.00 1.00 2.17 2.17 1.33 2.17 2.50 0.83 1.17 2.17 2.50 2.00 2.50 0.50 1.50 2.00 2.00 2.00 2.00 1.17 1.33 1.67 2.17 1.50 2.00 1.67 0.33 1.50 2.17 2.33 2.33 1.17 0.00 1.50 2.33 1.83 2.67 0.83 1.17 1.50 2.17 2.67 1.50 2.00 2.17 1.33 2.00 2.33 2.00 2.17 2.17 2.00 2.17 2.00 2.17
  • 14. 15/02/2014 14 (a) Use these data to prepare: • A frequency distribution • A relative frequency distribution • A cumulative frequency distribution • A cumulative relative frequency distribution • A histogram Assignment 1 Work in Groups of 5 or less. Assignment due at BEGINNING of NEXT Lecture! Late Assignments will get 0. Total Mark = 5 Descriptive Statistics
  • 15. 15/02/2014 15 Descriptive Statistics Cont. (Chapter 2) • Mean • Median • Mode • Dispersion • Standard Deviation • Coefficient of Variation • Percentiles • Quartiles • Box and Whisker Plot Measures of Central Tendency • Sometimes we just want a single number to describe the data. This is called a descriptive tendency. • Statistic: A descriptive measure computed from the data of a sample. • Parameter: A descriptive measure computed from the data of a population. • Most common measures of central tendency are: • Mean, Median and Mode
  • 16. 15/02/2014 16 Arithmetic Mean • Mean of 1+2+3+4 = 10/4 = 2.5 • Equation for Mean: 𝜇 = 𝑖=1 𝑁 𝑥𝑖 𝑁 • Equation for Sample Mean 𝑥 = 𝑖=1 𝑛 𝑥𝑖 𝑛 Properties of the Mean • Unique – only one mean for a set of data • Simple – easy to calculate and easy to understand • All value contribute to the calculation – but extreme values then influence the calculation of the mean. eg. Cost of dentist in 5 areas of Al Baha SAR40 SAR45 SAR50 SAR50 SAR150
  • 17. 15/02/2014 17 Median • Divides the data into two sets of equal size, split at the middle. • Eg. 1 2 3 4 5 6 7: median = 4 (the middle number). 1 2 3 4 5 6 7 8: median = (4 + 5)/2 = 9/2 = 4.5 (the average of the middle two numbers). IMPORTANT – Numbers must be ranked (smallest to largest) Properties of the Median • Unique and easy to understand • Simple to calculate • Not really affected by extreme values.
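The ranking step and the two cases (odd and even number of values) can be sketched as follows. The function name `median` is illustrative; Python's standard library offers `statistics.median` for the same purpose.

```python
def median(values):
    """Middle value of the ranked data (average of the middle two if n is even)."""
    ranked = sorted(values)          # the data must be ranked first
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]           # odd n: the single middle value
    return (ranked[mid - 1] + ranked[mid]) / 2   # even n: average the middle two

print(median([1, 2, 3, 4, 5, 6, 7]))
print(median([1, 2, 3, 4, 5, 6, 7, 8]))
print(median([40, 45, 50, 50, 150]))   # dentist costs: the extreme value has no effect
```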
  • 18. 15/02/2014 18 Mode • The value in the dataset that occurs most frequently • Eg. 1 1 2 2 2 3 4 4 5 The mode = 2 (occurs 3 times) A dataset can have no mode or more than one mode. Measures of dispersion • Dispersion = variety = differences. • Measures of dispersion include: • Range • Variance • Standard deviation • Coefficient of variation • Percentiles • Quartiles • Interquartile range
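A small sketch of finding the mode, including the cases of no mode and more than one mode. The helper `modes` is an illustrative name, and treating "every value occurs equally often" as "no mode" is an assumed convention.

```python
from collections import Counter

def modes(values):
    """Return every value with the highest count.

    Assumed convention: if several distinct values all occur equally
    often, the dataset is treated as having no mode and [] is returned.
    """
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 1, 2, 2, 2, 3, 4, 4, 5]))   # slide example
print(modes([1, 1, 2, 2, 3]))               # two modes
print(modes([1, 2, 3, 4]))                  # no mode
```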
  • 19. 15/02/2014 19 Dispersion Range • The Range is the difference between the largest number and the smallest number. • Range = $R$ • Largest number = $x_L$ • Smallest number = $x_S$ $R = x_L - x_S$
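  • 20. 15/02/2014 20 Variance • Variance measures the dispersion of the data in terms of the scatter of the values about their mean. Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ • $s^2$ = Sample Variance • $x_i$ = value • $\bar{x}$ = sample mean • $n$ = total number of values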
  • 21. 15/02/2014 21 Standard Deviation • Variance is $s^2$ • This is not in the original units of the data (it is in squared units) • Standard deviation is $s = \sqrt{s^2}$ (original units) Assignment 2 Work in groups of 5 or fewer and calculate the Variance and Standard Deviation of the above data. Assignment due at the BEGINNING of the NEXT lecture! Late assignments will get 0. Total Mark = 5
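The range, variance and standard deviation can be worked through together in a short sketch. It assumes the usual sample variance with an n - 1 denominator, reuses the dentist costs from the mean example, and uses illustrative function names.

```python
import math

def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1 (assumed form)."""
    n = len(values)
    xbar = sum(values) / n
    return sum((x - xbar) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Square root of the sample variance, back in the original units."""
    return math.sqrt(sample_variance(values))

costs = [40, 45, 50, 50, 150]   # SAR
print(value_range(costs))       # in SAR
print(sample_variance(costs))   # in SAR squared
print(sample_sd(costs))         # in SAR
```

The variance comes out in squared units (SAR squared), which is why the standard deviation is usually the value reported.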
  • 22. 15/02/2014 22 Descriptive Statistics Continued Coefficient of Variation • Remember Standard Deviation • Sometimes we want to compare the variability of two samples but they have different units. • Eg. Weight (kg) and Cholesterol Concentration (g/dl) • We then use the Coefficient of Variation: $CV = \frac{s}{\bar{x}} \times 100\%$, where $s$ = standard deviation and $\bar{x}$ = mean
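Because the coefficient of variation is a ratio of two quantities in the same units, the units cancel and samples measured on different scales can be compared. A minimal sketch, with two hypothetical samples invented purely for illustration:

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the sample mean."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical samples, for illustration only.
weights = [60, 65, 70, 75, 80]             # weight, kg
cholesterol = [180, 190, 200, 210, 220]    # cholesterol concentration

cv_weight = coefficient_of_variation(weights)
cv_chol = coefficient_of_variation(cholesterol)
print(f"weight CV = {cv_weight:.2f}%, cholesterol CV = {cv_chol:.2f}%")
```

In this made-up case the cholesterol sample has the larger standard deviation but the smaller coefficient of variation, so relative to its mean it varies less.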
  • 23. 15/02/2014 23 Coefficient of Variation Which sample has more variation? Percentiles and Quartiles • Percentiles:
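  • 24. 15/02/2014 24 Quartiles These equations give the position of the percentiles, not the values. The most commonly used percentiles are 25%, 50% and 75%. These are known as the quartiles. These calculations tell us how much data is above or below each percentage. Quartiles (i = position in the ranked data, xi = value): i = 1, xi = 102; i = 2, xi = 104; i = 3, xi = 105 (Q1); i = 4, xi = 107; i = 5, xi = 108; i = 6, xi = 109 (Q2, median); i = 7, xi = 110; i = 8, xi = 112; i = 9, xi = 115 (Q3); i = 10, xi = 116; i = 11, xi = 118. Quartile positions: Q1 = 3, Q2 = 6, Q3 = 9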
  • 25. 15/02/2014 25 Interquartile Range Q1 = 105 Q3 = 115 IQR = Q3 – Q1 = 115 – 105 = 10 • The bigger the IQR, the more variability in the middle 50% of numbers. • The smaller the IQR, the less variability in the middle 50% of numbers. Probability
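The quartile positions 3, 6 and 9 for the eleven ranked values are consistent with the position rule k(n + 1)/4, which is assumed here because the slide's own equations are not reproduced in this transcript. A sketch using that rule (the function name `quartile` is illustrative):

```python
def quartile(ranked, k):
    """k-th quartile (k = 1, 2, 3) at position k(n + 1)/4 in ranked data.

    Assumed rule; expects at least 3 ranked values. Fractional
    positions are handled by linear interpolation between neighbours.
    """
    pos = k * (len(ranked) + 1) / 4      # 1-based position in the ranked data
    lower = int(pos)                     # whole part of the position
    frac = pos - lower
    if frac == 0:
        return ranked[lower - 1]
    return ranked[lower - 1] + frac * (ranked[lower] - ranked[lower - 1])

data = [102, 104, 105, 107, 108, 109, 110, 112, 115, 116, 118]
q1, q2, q3 = (quartile(data, k) for k in (1, 2, 3))
iqr = q3 - q1
print(q1, q2, q3, iqr)
```

Other textbooks and software packages use slightly different position rules, so quartiles for small datasets may differ a little between sources.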
  • 26. 15/02/2014 26 Introduction 1. Given some process (or experiment) with n mutually exclusive outcomes (called events), E1, E2, …, En, the probability of any event Ei is assigned a non-negative number. P(Ei) ≥ 0 Mutually Exclusive: Events cannot occur simultaneously. Introduction
  • 27. 15/02/2014 27 Introduction 2. The sum of the probabilities of the mutually exclusive outcomes is equal to 1. P(E1) + P(E2) + … + P(En) = 1 3. The probability that either one of two mutually exclusive events occurs is equal to the sum of their individual probabilities. P(Ei + Ej) = P(Ei) + P(Ej) Example
  • 28. 15/02/2014 28 Example • What is the probability that we randomly pick a patient who is 18 years or younger: P(E) = 141/318 • What is the probability that we choose a patient who is over 18 years old: 177/318 Conditional Probability • When probabilities are calculated from a subset of the total denominator (eg. from only part of the total number of subjects/people surveyed for mood disorder) • Example: What is the probability that a person 18 years old or younger will have no family history of mood disorder? Total patients 18 years or younger = 141. Of these, subjects with no family history of mood disorder = 28. Probability = 28/141 P(A|E) = 28/141
  • 29. 15/02/2014 29 Joint Probability • Sometimes we want to find the probability that a subject picked at random from a group of subjects possesses two characteristics at the same time. • Example: What is the probability that a subject picked at random will be 18 years or younger and will have no family history of mood disorder? P(E∩A) = 28/318 Multiplication Rule • We can calculate probabilities from other probabilities. A joint probability can be calculated as the product of a marginal probability and a conditional probability. • Example: What is the joint probability of early (18 or below) onset of mood disorder and a negative family history of mood disorder? P(E) = 141/318 = 0.4434 P(A|E) = 28/141 = 0.1986 We need to calculate P(E∩A): P(E∩A) = P(E)P(A|E) = 0.4434 * 0.1986 = 0.0881
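The marginal, conditional and joint probabilities above, and the multiplication rule linking them, can be checked with a few lines. The counts are the ones quoted in the slides; the variable names are illustrative.

```python
total = 318           # all subjects
early = 141           # E: onset at 18 years or younger
early_and_neg = 28    # E and A: early onset with no family history

p_e = early / total                   # marginal probability P(E)
p_a_given_e = early_and_neg / early   # conditional probability P(A|E)
p_joint_direct = early_and_neg / total    # joint probability P(E and A) from the counts
p_joint_product = p_e * p_a_given_e       # multiplication rule: P(E) * P(A|E)

print(round(p_e, 4), round(p_a_given_e, 4))
print(round(p_joint_direct, 4), round(p_joint_product, 4))
```

Both routes to the joint probability agree because the 141 in the numerator of P(E) cancels the 141 in the denominator of P(A|E).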
  • 30. 15/02/2014 30 Multiplication Rule Addition Rule • The probability of the occurrence of either one or the other of two mutually exclusive events is equal to the sum of their two individual probabilities. • When the two events are not mutually exclusive, the probability that both occur is subtracted so that it is not counted twice. This general form works in both cases: P(A∪B) = P(A) + P(B) – P(A∩B)
  • 31. 15/02/2014 31 Addition Rule • Example: If we pick a person at random, what is the probability that this person will have early onset of mood disorder or will have no family history of mood disorders or both? P(E∪A) = P(E) + P(A) – P(E∩A) • P(E) = 141/318 = 0.4434 • P(E∩A) = 28/318 = 0.0881 • P(A) = 63/318 = 0.1981 • P(E∪A) = 0.4434 + 0.1981 – 0.0881 = 0.5534 Independent Events • A and B are independent events if the probability of event A happening is the same whether event B occurs or not. • You use the multiplication rule in this case: P(A∩B) = P(A)P(B)
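The addition rule can be sketched as a one-line function and applied to the counts from the mood disorder example. The name `p_union` is illustrative.

```python
def p_union(p_a, p_b, p_a_and_b=0.0):
    """P(A or B) = P(A) + P(B) - P(A and B).

    For mutually exclusive events the joint probability is 0.
    """
    return p_a + p_b - p_a_and_b

total = 318
p_e = 141 / total        # early onset
p_a = 63 / total         # no family history
p_e_and_a = 28 / total   # both

p_e_or_a = p_union(p_e, p_a, p_e_and_a)
print(round(p_e_or_a, 4))
```

Working from the unrounded fractions gives 176/318, which rounds to 0.5535; adding the three probabilities after rounding each to four decimals gives 0.5534. The small difference is rounding error only.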
  • 32. 15/02/2014 32 Independent Events • In a high school with 60 girls and 40 boys, 24 girls and 16 boys wear glasses. What is the probability that a student picked at random is a boy and wears eye glasses? • Being a boy and wearing eye glasses are independent. P(B∩E) = P(B)P(E) • P(B) = 40/100 = 0.4 • P(E) = 40/100 = 0.4 • P(B∩E) = 0.4 * 0.4 = 0.16 Complementary Events • Complementary events are mutually exclusive. • Example: Early onset and late onset are mutually exclusive. P(Ā) = 1 – P(A)
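Whether two events are independent can be checked directly from the counts by comparing the joint probability with the product of the two marginal probabilities. A sketch using exact fractions to avoid rounding (variable names are illustrative):

```python
from fractions import Fraction

boys, girls = 40, 60
boys_glasses, girls_glasses = 16, 24
total = boys + girls

p_b = Fraction(boys, total)                            # P(B): student is a boy
p_e = Fraction(boys_glasses + girls_glasses, total)    # P(E): student wears glasses
p_b_and_e = Fraction(boys_glasses, total)              # P(B and E) from the counts

independent = p_b_and_e == p_b * p_e
print(p_b, p_e, p_b_and_e, independent)
```

Here 40% of boys and 40% of girls wear glasses, so knowing that a student is a boy does not change the probability of wearing glasses.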
  • 33. 15/02/2014 33 Complementary Events • Example: If there are 1200 admissions to a hospital and 750 admissions are private then 450 patients must be state patients. So: • P(A) = 750/1200 = 0.625 • Then P(Ā) = 1 – P(A) = 1 – 0.625 = 0.375 • Therefore the probability of a patient being a state patient is 0.375 Bayes Theorem
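The complement calculation for the hospital admissions can be verified both ways: from the rule 1 – P(A) and from the count of state patients. The function name `complement` is illustrative.

```python
def complement(p):
    """Probability that an event does NOT occur: 1 - P(A)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("a probability must lie between 0 and 1")
    return 1.0 - p

admissions = 1200
private = 750

p_private = private / admissions
p_state = complement(p_private)
print(p_private, p_state)
```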
  • 34. 15/02/2014 34 Bayes Theorem • In the health sciences we often need to: • Predict the presence or absence of a disease from test results (+ or -) • Predict the outcome of a diagnostic test from previous test results • Important to know what the following mean: Must always be able to answer the following question to determine the accuracy of diagnostic tests:
  • 35. 15/02/2014 35 Sensitivity (Q1) The above two way table allows us to calculate the sensitivity of a diagnostic test Specificity (Q2)
  • 36. 15/02/2014 36 Predictive Value Positive (Q3) = P(D│T) Predictive Value Negative = P(D’│T’)
  • 37. 15/02/2014 37 Bayes Theorem • Predictive Positive Value • Predictive Negative Value Example A medical research team wished to evaluate a proposed screening test for Alzheimer’s disease. The test was given to a random sample of 450 patients with Alzheimer’s disease and an independent random sample of 500 patients without symptoms of the disease. The two samples were drawn from populations of subjects who were 65 years of age or older. The results are as follows:
  • 38. 15/02/2014 38 Example • Calculate the sensitivity of the test P(T│D) • Calculate the specificity of the test P(T’│D’) • Calculate the Predictive Positive Value P(D│T) • Calculate the Predictive Negative Value P(D’│T’) Example (Sensitivity) (Specificity)
  • 39. 15/02/2014 39 Examples (Positive Predictive Value) (Negative Predictive Value) Estimation, z-Value and t-value
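The four quantities in the example can be sketched in code. The 2×2 counts and the prevalence below are hypothetical (the slide's results table is not reproduced in this transcript), and the function names are illustrative. When the diseased and disease-free samples are drawn separately, as in the example, the table itself says nothing about how common the disease is, so the predictive values are obtained from Bayes' theorem with a separately supplied prevalence P(D).

```python
def sensitivity(true_pos, false_neg):
    """P(T|D): proportion of diseased subjects who test positive."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """P(T'|D'): proportion of disease-free subjects who test negative."""
    return true_neg / (true_neg + false_pos)

def predictive_value_positive(sens, spec, prevalence):
    """P(D|T) by Bayes' theorem."""
    return sens * prevalence / (sens * prevalence + (1 - spec) * (1 - prevalence))

def predictive_value_negative(sens, spec, prevalence):
    """P(D'|T') by Bayes' theorem."""
    return spec * (1 - prevalence) / (spec * (1 - prevalence) + (1 - sens) * prevalence)

# Hypothetical counts: 100 diseased subjects and 200 disease-free subjects.
sens = sensitivity(true_pos=90, false_neg=10)
spec = specificity(true_neg=180, false_pos=20)
prevalence = 0.10   # hypothetical P(D) in the population of interest

ppv = predictive_value_positive(sens, spec, prevalence)
npv = predictive_value_negative(sens, spec, prevalence)
print(sens, spec, round(ppv, 4), round(npv, 4))
```

In this made-up case a test that is 90% sensitive and 90% specific still gives a positive predictive value of only about 0.5, because the disease is rare in the assumed population.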
  • 40. 15/02/2014 40 Remember: Statistical inference needs to be made with confidence (certainty), but most populations of interest are so large that we need to ESTIMATE (we cannot look at 100% of the population). Estimations: Definitions
  • 41. 15/02/2014 41 Sampling: Definitions Confidence Intervals Think about the mean and look at the distribution of the numbers around the mean. This is the Normal Distribution. In all calculations we will assume a normal distribution.
  • 42. 15/02/2014 42 Confidence Intervals • We know that about 95% of normally distributed data lies within two standard deviations of the mean (just know this) • This means that we can be about 95% confident that a number from our data set lies in this range. • Equation for 95% Confidence = μ ± 2s (s = standard deviation) Confidence Intervals
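The "two standard deviations" rule can be checked against the standard normal distribution using Python's standard library. The mean and standard deviation passed to `two_sd_interval` (an illustrative helper) are hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

# Proportion of a normal distribution within 2 standard deviations of the mean.
within_2sd = std_normal.cdf(2) - std_normal.cdf(-2)

def two_sd_interval(mean, sd):
    """Range mean - 2*sd to mean + 2*sd, covering about 95% of normal data."""
    return (mean - 2 * sd, mean + 2 * sd)

print(round(within_2sd, 4))
print(two_sd_interval(100, 15))   # hypothetical mean and standard deviation
```

The exact proportion is a little over 95%, which is why tables give 1.96 rather than 2 as the multiplier for exactly 95%.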
  • 43. 15/02/2014 43 Example z-Value We get the z-Value from tables α – significance level (eg 1% = 0.01); the confidence level is 1 – α z – reliability coefficient
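Instead of reading z from a table, it can be computed from the standard normal distribution. The sketch below assumes the usual interval form, sample mean ± z × σ/√n, with a known population standard deviation σ; the numbers in the usage example are hypothetical and `z_interval` is an illustrative name.

```python
import math
from statistics import NormalDist

def z_interval(xbar, sigma, n, alpha=0.05):
    """Confidence interval for a mean: xbar +/- z * sigma / sqrt(n).

    z is the reliability coefficient for a confidence level of 1 - alpha.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin

# Hypothetical example: sample mean 50, sigma 10, sample size 100.
low, high = z_interval(xbar=50, sigma=10, n=100)
print(round(low, 2), round(high, 2))
```

A smaller α (for example 0.01 for 99% confidence) gives a larger z and therefore a wider interval.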
  • 44. 15/02/2014 44 Example t-Test • The z-Value is useful for large samples (above 30), but what if we have a small sample? • Use a t-Value. t – confidence coefficient. It will be given to you in all questions but can be obtained from statistical tables.
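For a small sample the calculation has the same shape as the z interval, with the t-Value in place of z and the sample standard deviation s in place of σ. In this sketch the t-Value is passed in, as it would be given in a question or read from a table; the sample is hypothetical and `t_interval` is an illustrative name.

```python
import math
import statistics

def t_interval(sample, t_value):
    """Confidence interval for a mean: xbar +/- t * s / sqrt(n).

    t_value must match the chosen confidence level and n - 1 degrees of freedom.
    """
    n = len(sample)
    xbar = statistics.mean(sample)
    s = statistics.stdev(sample)          # sample standard deviation (n - 1)
    margin = t_value * s / math.sqrt(n)
    return xbar - margin, xbar + margin

# Hypothetical sample of 9 values; t = 2.306 is the table value for
# 95% confidence with 8 degrees of freedom.
sample = [4, 5, 5, 6, 6, 6, 7, 7, 8]
low, high = t_interval(sample, t_value=2.306)
print(round(low, 4), round(high, 4))
```

The t-Value is larger than the corresponding z-Value (2.306 against 1.96 here), so the interval is wider to reflect the extra uncertainty in a small sample.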