Biostatistics in Research Methodoloyg Presentation.pptx

By
Dr Muhammad Safdar Baig
Associate Professor
Oral & Dental Surgery
BVH/QMC, Bahawalpur
For
Post Graduate Trainees
Bahawal Victoria Hospital &
Quaid-e-Azam Medical College, Bahawalpur 1

2
safdarbeg@gmail.com
safdar_b@yahoo.com
safdar_b@hotmail.com
03006821103
Kuzma & Rosner – Biostat
Leon Gordis – Fundamentals of
Epidemiology

4
RESEARCH is
A Process of Systematic,
Scientific Data
 Collection
 Analysis &
 Interpretation
So as to find Solutions to
a problem.

7
A Variable is a characteristic of a person, object or
phenomenon that can take on different values.
A simple example of a variable is a person’s age.
The variable age can take on different values
because a person can be 20 years old, 35 years
old, and so on.

8
Dependent and independent variables
Because in health system research you often look for
causal explanations, it is important to make a distinction
between dependent and independent variables.
The variable that is used to describe or measure the
problem under study (outcome) is called the
DEPENDENT variable.
The variables that are used to describe or measure the
factors that are assumed to cause or at least to
influence the problem are called the INDEPENDENT
(exposure) variables.

9
Data
Data are the values of the observations recorded
for variables (e.g. age, weight, gender).

10
TYPES of DATA
Qualitative or categorical data:-
Characteristics that cannot be expressed numerically, such as sex,
ethnicity, healing etc.
Quantitative or numerical data:-
Characteristics that can be expressed numerically, such as age,
temperature, no. of children in a family.
Categorical Data
There are two types of categorical data:
• Nominal
• Ordinal data.

11
NOMINAL DATA
 In NOMINAL DATA, the variables are divided into
named categories. These categories however, cannot be
ordered one above another (as they are not greater or less
than each other).
 Example:
NOMINAL DATA CATEGORIES
Sex/ Gender: male, female
Marital status: single, married, widowed,
separated, divorced

12
ORDINAL DATA
 In ORDINAL DATA, the variables are also divided into a
number of categories, but they can be ordered one above
another, from lowest to highest or vice versa.
 Example:
ORDINAL DATA CATEGORIES
Level of knowledge: good, average, poor
Level of blood pressure: high, moderate, low

13
Presentation of Data
 Data once collected should be presented in such a way
as to be easily understood. The style of presentation
depends, of course, on the type of data.
 Data can be presented as frequency tables, charts,
graphs, etc. Here we discuss some of the
important means of presentation.

14
FREQUENCY TABLES
 In a FREQUENCY TABLE data is
presented in a tabular form. It gives the
frequency with which (or the number of
times) a particular value appears in the
data.

15
Systolic Blood Pressure of patients coming to a
tertiary care hospital OPD
Distribution (mmHg)   Frequency   Relative    Cumulative relative
                                  frequency   frequency
Below 100             6           0.10        0.10
100 – 120             9           0.15        0.25
121 – 140             24          0.40        0.65
141 – 160             15          0.25        0.90
Above 160             6           0.10        1.00
n = 60
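
The relative and cumulative relative columns of the table above can be reproduced from the raw frequencies. A minimal Python sketch, using the class counts shown on the slide (the variable names are illustrative):

```python
from itertools import accumulate

# Class labels and frequencies copied from the slide's table (n = 60).
classes = ["Below 100", "100-120", "121-140", "141-160", "Above 160"]
freq = [6, 9, 24, 15, 6]

n = sum(freq)
relative = [f / n for f in freq]          # share of patients in each class
cumulative = list(accumulate(relative))   # running total of those shares

for label, f, r, c in zip(classes, freq, relative, cumulative):
    print(f"{label:<10} {f:>3} {r:.2f} {c:.2f}")
```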

16
Graphs
 Another way to summarize and display data is
through the use of graphs, or pictorial
representations of numerical data. Graphs should
be designed so that they convey at a single glance
the general patterns in a set of data.

17
Bar charts
 Bar charts are used for nominal or ordinal data.
[Bar chart: x-axis Years (1900 to 1990, by decade); y-axis No. of cigarettes (0 to 4500)]
Cigarette consumption of persons 18 years
of age or older, United States, 1900 - 1990

18
Pie chart
 Pie charts can also be used to display nominal or ordinal data.
[Pie chart: Gender distribution. Male 70%, Female 30%]

19
Histogram
A histogram depicts a frequency distribution for quantitative data
Histogram showing distribution of Age (years)

21
MEASURES OF CENTRAL TENDENCY
•Mean
•Median
•Mode

22
MEAN
 The MEAN (or arithmetic mean) is also
known as the AVERAGE. It is calculated by
totaling the results of all the observations
and dividing by the total number of
observations. Note that the mean can only
be calculated for numerical data.

23
MEDIAN
The MEDIAN is the value that divides a distribution
into two equal halves.
 The median is useful when some measurements are much
bigger or much smaller than the rest. The mean of such
data will be biased toward these extreme values.
 The median is not influenced by extreme values.

24
MODE
 The MODE is the most frequently occurring
value in a set of observations.
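
The three measures of central tendency can be compared on one small data set. The sketch below uses Python's standard `statistics` module; the ages are hypothetical and include one deliberately extreme value to show why the median is preferred when outliers are present.

```python
import statistics

# Hypothetical ages (years) of nine patients; 85 is a deliberate extreme value.
ages = [22, 25, 25, 27, 30, 31, 33, 35, 85]

mean = statistics.mean(ages)      # sum of observations / number of observations
median = statistics.median(ages)  # middle value of the sorted data
mode = statistics.mode(ages)      # most frequently occurring value

print(f"mean={mean:.2f} median={median} mode={mode}")
```

Here the single extreme value pulls the mean above most of the observations, while the median stays at the middle of the data.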

25
MEASURES OF VARIATION
Range: the difference in value between the
highest (maximum) and the lowest (minimum) observation.
Variance: quantifies the amount of variability or spread
about the mean of the sample.
Standard deviation: the square root of the variance.

26
Standard Deviation
 The STANDARD DEVIATION is a measure, which
describes how much individual measurements differ,
on the average, from the mean.
 A large standard deviation shows that there is a wide
scatter of measured values around the mean, while a
small standard deviation shows that the individual
values are concentrated around the mean with little
variation among them.
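
As an illustration of the measures of variation, the following sketch computes the range, sample variance and standard deviation with Python's `statistics` module. The blood pressure readings are hypothetical, and the sample (n - 1) form of the variance is assumed.

```python
import statistics

# Hypothetical systolic blood pressure readings (mmHg), for illustration only.
sbp = [110, 118, 124, 130, 138]

data_range = max(sbp) - min(sbp)      # highest minus lowest observation
variance = statistics.variance(sbp)   # sample variance (divides by n - 1)
sd = statistics.stdev(sbp)            # square root of the sample variance

print(f"range={data_range} variance={variance} sd={sd:.2f}")
```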

27
Standard error of the mean
When we draw a sample from a study population and compute its
sample mean, it is not likely to be identical to the population
mean. If we draw another sample from the same population and
compute its sample mean, this may also not be identical to the
first sample mean. It probably also differs from the true mean of
the total population from which the sample was drawn. This
phenomenon is called sampling variation.
Standard error:- The standard error gives an estimate of the
degree to which the sample mean varies from the population
mean. This measure is used to calculate the confidence interval (CI).

28
THE NORMAL DISTRIBUTION
 Many variables have a normal distribution. This is a
bell shaped curve with most of the values clustered
near the mean and a few values out near the tails.

29
 The normal distribution is symmetrical around
the mean. The mean, the median and the mode
of a normal distribution have the same value.
 An important characteristic of a normally
distributed variable is that approximately 95% of the
measurements have values that lie within
2 standard deviations (SD) of the mean.
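
The "within 2 SD" rule can be checked numerically. This short sketch uses `statistics.NormalDist` from the Python standard library; it shows that 2 SD covers slightly more than 95%, and that the exact multiplier for 95% is about 1.96.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal distribution: mean 0, SD 1

within_2sd = z.cdf(2) - z.cdf(-2)  # share of values within 2 SD of the mean
z_95 = z.inv_cdf(0.975)            # multiplier that covers exactly 95%

print(f"within 2 SD: {within_2sd:.4f}, 95% multiplier: {z_95:.2f}")
```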

31
Estimation
The process of using sample information to
draw conclusion about the value of a
population parameter is known as
estimation.

32
Point Estimate
 A point estimate is a specific numerical value
estimate of a parameter.
 The best point estimate of the population mean µ is
the sample mean x̄
 But how good is a point estimate?
 There is no way of knowing how close the point
estimate is to the population mean
 Statisticians prefer another type of estimate called
an interval estimate

33
Interval Estimate
 An interval estimate of a parameter is an interval or a range of
values used to estimate the parameter
Confidence Level
 The confidence level of an interval estimate of a parameter is
the probability that the interval estimate will contain the
parameter
 Three commonly used confidence levels are 90%, 95% and 99%
 If one desires to be more confident without widening the
interval, then the sample size must be larger
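
To connect the standard error with interval estimation, the sketch below computes a point estimate, its standard error (SD divided by the square root of n) and confidence intervals at the three commonly used levels. The sample values are hypothetical, and a normal approximation is assumed; with a sample this small a t multiplier would usually be preferred.

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical sample of hemoglobin values (g/dl); not data from the slides.
sample = [10.2, 11.5, 9.8, 12.1, 10.9, 11.3, 10.4, 11.8, 10.0, 11.0]

n = len(sample)
x_bar = mean(sample)               # point estimate of the population mean
se = stdev(sample) / math.sqrt(n)  # standard error of the mean


def confidence_interval(level):
    """Normal-approximation interval: x_bar +/- z * SE."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return x_bar - z * se, x_bar + z * se


for level in (0.90, 0.95, 0.99):
    low, high = confidence_interval(level)
    print(f"{level:.0%} CI: {low:.2f} to {high:.2f}")
```

For a fixed sample, a higher confidence level gives a wider interval; a larger sample shrinks the standard error and narrows it again.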

35
RATIO
 The most basic measure of distribution.
 Obtained by simply dividing one quantity by
another without implying any specific
relationship between the numerator and
denominator, such as the number of
stillbirths per thousand live births.
 In ratio, the numerator & denominator are
mutually exclusive.

36
PROPORTION
 A proportion is a type of ratio in which those who
are included in the numerator must also be
included in the denominator.
 For example: the proportion of women over the age
of 50 who have had a hysterectomy, or the number
of fetal deaths out of the total number of births (live
births plus fetal deaths).

37
RATE
 A rate is a proportion with a specification of time.
There is a distinct relationship between the
numerator and denominator with a measure of
time being an intrinsic part of the denominator.
 For example, the number of newly diagnosed
cases of breast cancer per 100,000 women during
a given year.

38
IMPORTANT POINT
 It is necessary to be very specific about what constitutes
both the numerator and the denominator. In some
circumstances, it is important to make clear whether the
measure represents the number of events or the number
of individuals.
 For example, the frequency of myopia among a
population of school children could represent the
number of affected eyes in relation to total eyes, or the
number of children affected in one or both eyes relative
to all students.

39
PREVALENCE
 Prevalence quantifies the proportion of individuals
in a population who have the disease at a specific
instant and provides an estimate of the probability
(risk) that an individual will be ill at a point in time
 The formula for calculating the prevalence is:
P = number of existing cases of a disease / total
population (at a given point in time)

40
POINT PREVALENCE
 Prevalence can be thought of as the status of the
disease in a population at a point in time and as such
is also referred to as point prevalence.
 This "point" can refer to a specific point in calendar
time or to a fixed point in the course of events that
varies in real time from person to person, such as the
onset of menopause or puberty or the third
postoperative day.

41
PERIOD PREVALENCE
 It represents the proportion of cases that exist within a
population at any point during a specified period of time.
 The numerator thus includes cases that were present at
the start of the period plus new cases that developed
during this time
E.g. frequency of patients receiving psychiatric treatment
between May 31 and Dec 01, 2008

42
INCIDENCE:
 Incidence quantifies the number of new
events or cases of disease that develop in a
population of individuals at risk during a
specified time interval.

43
Cumulative incidence (CI)
 Is the proportion of people who become
diseased during a specified period of time. It
provides an estimate of the probability, or
risk, that an individual will develop a disease
during a specified period of time
CI = No. of new cases of a disease / Total population at risk

44
Issues in the Calculation of Measures of Incidence
 For any measure of disease frequency,
precise definition of the denominator is
essential for both accuracy and clarity. This
is a particular concern in the calculation of
incidence. The denominator of a measure of
incidence should include only those who are
considered "at risk" of developing the
disease.

45
Contd.
 That is, the total population from which
the new cases could arise. Consequently,
those who currently have or have already
had the disease under study or persons
who cannot develop the disease for reasons
such as age, immunization, or prior
removal of the involved organ should be
excluded from the denominator.

46
Special Types of Incidence Rates
MORBIDITY RATE
Is the incidence rate of non fatal cases in the total population at risk
during a specified period of time.
For example, the morbidity rate of tuberculosis (TB) in the U.S. in 1982
can be calculated by dividing the number of nonfatal cases newly
reported during that year by the total U.S. midyear population.
Morbidity rate = Total no. of nonfatal cases of TB in population at risk / Midyear population
= 25,520 / 231,534,000
= 11.0 per 100,000 population
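
The TB calculation above can be reproduced in a few lines. The helper name `rate_per` is illustrative; the case count and midyear population are the figures given on the slide.

```python
def rate_per(events, population, base=100_000):
    """Events per `base` persons in the population."""
    return events / population * base


# Figures from the slide: nonfatal TB cases and U.S. midyear population, 1982.
tb_morbidity = rate_per(25_520, 231_534_000)
print(f"{tb_morbidity:.1f} per 100,000 population")
```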

47
MORTALITY RATE
 It expresses the incidence of deaths in a particular
population during a period of time.
 It is calculated by dividing the number of fatalities
during that period by the total population.
 This can be further divided into cause-specific or
all-cause mortality.

49
Measures of Association
 Relative risk (cohort study)
 Odds ratio (case control)

50
Cohort Studies
              Diseased   Non Diseased   Total
Exposed       a          b              a+b
Non Exposed   c          d              c+d

51
Relative Risk
 Incidence in exposed individuals = a/(a+b)
Or proportion of exposed people who developed the disease
 Incidence in non-exposed individuals = c/(c+d)
Or proportion of non-exposed people who developed the disease
Relative Risk = Incidence in exposed / Incidence in non-exposed
RR = [a/(a+b)] / [c/(c+d)]

52
Calculating the Relative Risk
Disease Status
             CHD +   CHD –   Total
Smoker       112     176     288
Non smoker   88      224     312
Incidence in exposed = a/(a+b) = 112/288 = 0.39
Incidence in non exposed = c/(c+d) = 88/312 = 0.28
RR = (112/288) / (88/312) = 1.38
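
The same calculation as a small function, computed from the unrounded incidences so that the result does not depend on intermediate rounding. The function name is illustrative; the counts are those in the smoking and CHD table.

```python
def relative_risk(a, b, c, d):
    """RR from a cohort 2x2 table.

    a, b = exposed with / without disease; c, d = non-exposed with / without.
    """
    incidence_exposed = a / (a + b)
    incidence_unexposed = c / (c + d)
    return incidence_exposed / incidence_unexposed


# Smoking and CHD counts from the slide.
rr = relative_risk(a=112, b=176, c=88, d=224)
print(f"RR = {rr:.2f}")
```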

53
Interpretation of RR
 Compared with non smokers, smokers
have 1.38 times the risk of
developing CHD

54
Odds Ratio
 Incidence cannot be measured in case control
studies because we start with the diseased
people (cases) and non diseased people
(controls), hence we calculate OR

55
Case Control
              Cases   Controls   Total
Exposed       a       b          a+b
Non Exposed   c       d          c+d
Total         a+c     b+d
OR = (a/c) / (b/d) = ad/bc

56
Passive Smoking & Breast Cancer
                            Breast cancer   No Breast cancer   Total
Exposed (Passive Smokers)   140 (a)         370 (b)            510
Not exposed                 40 (c)          234 (d)            274
Odds of exposure in cases = 140/40 = 3.5
Odds of exposure in controls = 370/234 = 1.6
OR = 3.5/1.6 = 2.2
Compared with controls, the odds of being a passive smoker are 2.2 times
as high in breast cancer cases.
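
A minimal sketch of the odds ratio calculation, using the cross-product form ad/bc and the counts from the passive smoking table (the function name is illustrative):

```python
def odds_ratio(a, b, c, d):
    """OR from a case-control 2x2 table: (a/c) / (b/d), which equals ad/bc."""
    return (a * d) / (b * c)


# a, c = exposed / unexposed cases; b, d = exposed / unexposed controls.
or_passive = odds_ratio(a=140, b=370, c=40, d=234)
print(f"OR = {or_passive:.1f}")
```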

59
Bias is:
Any systematic error that results in an incorrect
estimate of the association between the exposure and
outcome. Usually introduced by the experimenter or
the researcher himself due to non-standardized
measuring techniques.

60
Types of Bias:-
Selection Bias
Observation (Information/Misclassification) Bias
Recall Bias
Interviewer Bias
Loss to follow-up

61
Bias can be controlled:-
In study design through
Choice of study population
Data collection:-
Uniform Source of information
Efficient Questionnaire development
Standardization of measurement technique
Blinding

63
The concept of confounding is a central one in the
interpretation of any epidemiological study.
Confounding can be thought of as mixing of the effect
of the exposure under study on the disease with that
of an extraneous factor.
This external factor or variable must be associated
with the exposure and, independently of the exposure,
must be a risk factor for the disease.

64
Example of confounding
[Diagram: Smoking and MI, with age as the confounder]

65
Table 1. Relation of Myocardial infarction (MI) to
Recent Oral Contraceptive (OC) Use.
OC      MI +ve   MI -ve   Estimated relative risk
Yes     29       135      1.68
No      205      1607
Total   234      1742

66
Table 2. Age-specific Relation of Myocardial infarction (MI) to
Recent Oral Contraceptive (OC) Use.
Age (yrs)   Recent OC use   MI +ve   MI -ve   Estimated age-specific
                                              relative risk
25 – 29     Yes             4        62       7.2
            No              2        224
30 – 34     Yes             9        33       8.9
            No              12       390
35 – 39     Yes             4        26       1.5
            No              33       330
40 – 44     Yes             6        9        3.7
            No              65       362
45 – 49     Yes             6        5        3.9
            No              93       301
Total                       234      1742
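
The contrast between the crude and the age-specific estimates can be reproduced from the counts in the two tables. The figures are consistent with the relative risks having been estimated as odds ratios (ad/bc); that reading is an inference from the numbers, not something the slides state. A minimal sketch under that assumption:

```python
def odds_ratio(a, b, c, d):
    """Cross-product ratio ad/bc for one 2x2 table."""
    return (a * d) / (b * c)


# (age group, OC users MI+, OC users MI-, non-users MI+, non-users MI-)
strata = [
    ("25-29", 4, 62, 2, 224),
    ("30-34", 9, 33, 12, 390),
    ("35-39", 4, 26, 33, 330),
    ("40-44", 6, 9, 65, 362),
    ("45-49", 6, 5, 93, 301),
]

age_specific = {age: odds_ratio(a, b, c, d) for age, a, b, c, d in strata}

# Collapsing the strata gives back the crude 2x2 table of Table 1.
a, b, c, d = (sum(row[i] for row in strata) for i in range(1, 5))
crude = odds_ratio(a, b, c, d)

print(f"crude: {crude:.2f}")
for age, value in age_specific.items():
    print(f"{age}: {value:.1f}")
```

The crude estimate is lower than most of the age-specific estimates, which is the pattern expected when age confounds the association.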

67
Confounding can be controlled in study design
through:
 Restriction
 Matching
 Randomization
Confounding can be controlled in analysis
through:
 Stratification
 Multivariate analysis

68
The roles of confounding, chance and bias have to be
evaluated in studies through appropriate selection of the
population to be studied and proper study design, so
that the results can be applied to other populations, i.e.,
they are valid and generalizable.

69
Evaluation of the role of chance consists
of two components:-
1. Hypothesis testing
2. Estimation of the confidence interval

71
WHAT IS HYPOTHESIS?
Hypothesis: A testable theory, or
statement of belief used in evaluation
of a population parameter of interest
e.g. Mean or proportion

72
 Suppose a study is being conducted to answer
questions about differences between two regimens for the
management of diarrhea in children:
the sugar based modern ORS and the time-tested indigenous
herbal solution made from locally available herbs.
 One question that could be asked is:
"In the population is there a difference in overall
improvement (after three days of treatment) between the ORS
and the herbal solution?"

73
There could be only two
answers to this question:
Yes
No

74
Null Hypothesis
"There is no difference between the 2 regimens in
terms of improvement" (null hypothesis).
A null hypothesis is usually a statement that there
is no difference between groups or that one factor is
not dependent on another. It corresponds to the
"No" answer.

75
Alternative Hypothesis
 "There is a difference in terms of improvement achieved by a
three-day treatment with the ORS and with the herbal
solution" (alternative hypothesis).
 Associated with the null hypothesis there is always another
hypothesis or implied statement concerning the true relationship
among the variables or conditions under study if "No" is an
implausible answer. This statement is called the alternative
hypothesis and corresponds to the "Yes" answer.

76
TYPES OF ALTERNATE HYPOTHESIS
o Directional
o Non Directional

78
WHY TEST HYPOTHESIS
Hypothesis testing permits generalization
of an association or a difference obtained
from a sample to the population from which
it came.
Hypothesis testing involves conducting a
test of statistical significance and
quantifying the degree to which sampling
variability may account for the result
observed in a particular study. It entails the
following steps.

79
STEPS IN HYPOTHESIS TESTING
1. Statement of research question in terms of
statistical hypothesis (Null and alternate
hypothesis)
2. Selection of an appropriate level of
significance. The significance level is the
risk we are willing to take that a sample
which showed a difference was misleading.
A 5% significance level means that we are
ready to take a 5% chance of rejecting a null
hypothesis that is actually true.

80
STEPS IN HYPOTHESIS TESTING
3. Choosing an appropriate test statistic:
t test or z test for continuous data, chi square for
proportions etc.
The test statistic is computed from the sample
data and is used to determine whether the
null hypothesis should be rejected or
retained.
The test statistic generates the p value.

81
P value: Indicates the probability or likelihood of
obtaining a result at least as extreme as that observed in
a study by chance alone, assuming that there is truly no
association between exposure and outcome under
consideration.
By convention the significance level is set at 0.05. Thus any
value of p less than or equal to 0.05 indicates that there is
at most a 5% probability of observing an association as
large or larger than that found in the study due to chance
alone, given that there is no association between
exposure and outcome. If the p value > 0.05, do not reject the
null hypothesis.

82
4. Performing calculations and obtaining p value
5. Drawing conclusions, rejecting null
hypothesis if the p value is less than the set
significance level
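
The five steps can be walked through on a 2x2 table. The sketch below reuses the smoking and CHD counts from the relative risk slide purely as an illustration (the slides do not themselves apply a test to that table). It assumes a Pearson chi-square test without continuity correction; the p value uses the fact that, with 1 degree of freedom, the upper-tail probability equals erfc(sqrt(chi2 / 2)).

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) and p value for a 2x2 table."""
    n = a + b + c + d
    observed = [a, b, c, d]
    # Expected count in each cell = row total * column total / n.
    expected = [
        (a + b) * (a + c) / n, (a + b) * (b + d) / n,
        (c + d) * (a + c) / n, (c + d) * (b + d) / n,
    ]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # valid for 1 degree of freedom
    return chi2, p_value


# Step 1: H0 = CHD incidence is the same in smokers and non smokers.
alpha = 0.05                                 # Step 2: significance level
chi2, p = chi_square_2x2(112, 176, 88, 224)  # Steps 3 and 4: statistic, p value
reject_null = p < alpha                      # Step 5: conclusion
print(f"chi2={chi2:.2f} p={p:.4f} reject H0: {reject_null}")
```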

84
Sample size calculations depend on:
1. Type of study.
2. Magnitude of the outcome of interest derived
from previous studies.
3. Type of statistical analysis
required (comparing means or proportions).
4. Level of significance / Power.

85
Sample size for single proportion
depends on:
1. The prevalence of the
condition/attribute of interest.
2. Level of confidence.
3. Margin of error.

86
Example of Sample size calculation for single
proportion
 A local health department wishes to estimate the
prevalence of tuberculosis among children under 5
years of age in a locality. How many children should
be included in the sample so that the prevalence
may be estimated within 5 percentage points of the true value
with 95% confidence, if it is known that the true
rate is unlikely to exceed 20%?

87
Sample size calculation and formula for single
proportion
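
The formula on this slide did not survive extraction. As a sketch, the widely used expression n = z² p(1 - p) / d² is assumed here, with p the anticipated prevalence and d the margin of error; applied to the tuberculosis example it gives:

```python
import math
from statistics import NormalDist


def n_single_proportion(p, margin, confidence=0.95):
    """Sample size to estimate a proportion p to within +/- margin."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


# Anticipated prevalence 20%, margin 5 percentage points, 95% confidence.
n = n_single_proportion(p=0.20, margin=0.05)
print(n)
```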

88
Sample size for single group mean
depends on:
1. The Mean of the
condition of interest.
3. Margin of error.

89
Example of Sample size calculation for single group
mean
 A district medical officer seeks to estimate the mean
hemoglobin level among pregnant women in his
district. A previous study of pregnant women showed an
average hemoglobin level of 8.2 g/dl and a standard
deviation of 4.2 g/dl. Assuming a sample of pregnant
women is to be selected, how many pregnant women
must be studied if he wants the estimate to fall
within 1 g/dl of the true mean with 95% confidence?

90
Sample size calculation and formula for single
group mean
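
The slide's formula is not preserved in this transcript. Assuming the standard expression n = z² σ² / d², with σ the standard deviation from the previous study and d the margin of error, the hemoglobin example works out as follows:

```python
import math
from statistics import NormalDist


def n_single_mean(sd, margin, confidence=0.95):
    """Sample size to estimate a mean to within +/- margin."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil(z ** 2 * sd ** 2 / margin ** 2)


# SD 4.2 g/dl from the previous study, margin 1 g/dl, 95% confidence.
n = n_single_mean(sd=4.2, margin=1.0)
print(n)
```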

91
Sample size for two proportions
depends on:
1. The prevalence of the condition /
attribute of interest for both groups.
3. Power of the test.

92
Example of Sample size calculation for two
proportions
 It is believed that the proportion of patients who
develop complications after undergoing one type
of surgery is 5%, while the proportion of
patients who develop complications after a
second type of surgery is 15%. How large should
the sample size be in each of the two groups of
patients if an investigator wishes to detect, with a
power of 90%, whether the second procedure has
a complication rate significantly higher than the
first at the 5% level of significance?

93
Sample size calculation and formula for two
proportions
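
The original formula is missing from this transcript. The sketch below assumes a common textbook formula for comparing two proportions with a one-sided test (one-sided because the question asks whether the second rate is higher), with equal group sizes. The function and parameter names are illustrative.

```python
import math
from statistics import NormalDist


def n_two_proportions(p1, p2, alpha=0.05, power=0.90):
    """Per-group sample size for a one-sided test that p2 exceeds p1."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # one-sided significance level
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2                      # average of the two proportions
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p1 - p2) ** 2)


# Complication rates of 5% and 15%, power 90%, 5% level of significance.
n_per_group = n_two_proportions(0.05, 0.15)
print(n_per_group)
```

A two-sided test would replace `1 - alpha` with `1 - alpha / 2` and require a larger sample.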

94
Sample size for two group means
depends on:
1. The means/variance for both groups.
3. Power of the test.

95
Example of Sample size calculation for two
group means
Suppose the true mean systolic blood pressure
(SBP) of 35 to 39 year old OC users is 132.86
mmHg with a standard deviation of 15.34 mmHg.
Similarly, for non-OC users, the mean SBP is
127.44 mmHg with a standard deviation of 18.23
mmHg. If we wish to detect the difference
between 2 groups of equal size, what would be the
minimal sample size required with a power of 80%
at the 95% confidence level?
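
No formula accompanies this example in the transcript. Assuming the usual two-sided formula for equal groups, n = (σ1² + σ2²)(z₁₋α/₂ + z₁₋β)² / Δ², where Δ is the difference between the two means, the calculation can be sketched as:

```python
import math
from statistics import NormalDist


def n_two_means(mean1, sd1, mean2, sd2, alpha=0.05, power=0.80):
    """Per-group sample size to detect mean1 - mean2 with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    delta = mean1 - mean2
    return math.ceil((sd1 ** 2 + sd2 ** 2) * (z_alpha + z_beta) ** 2 / delta ** 2)


# OC users vs non-users, values from the slide.
n_per_group = n_two_means(132.86, 15.34, 127.44, 18.23)
print(n_per_group)
```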

98
Sample size for sensitivity and specificity
depends on:
1. The prevalence of the
condition/attribute of interest.
2. Estimated sensitivity.
3. Estimated specificity.
4. Level of significance.
5. Margin of error.

99
Example of Sample size calculation for sensitivity
and specificity
 Suppose we want to determine the sensitivity and
specificity of graded compression
ultrasonography (US) in the diagnosis of acute
appendicitis (AA), with histopathology as the
gold standard. How many patients should be
included in the sample if the prevalence of
AA is 77%, the estimated sensitivity of US is
96.5% and the estimated specificity is 94.1%,
with 95% confidence and a margin of
error of 10%?

100
Sample size calculation and formula for sensitivity
and specificity studies
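
The slide's formula is not preserved here. One commonly used approach, assumed in this sketch, sizes the diseased group from the sensitivity and the non-diseased group from the specificity, then divides by the prevalence (or its complement) to get the total number of patients to recruit. The function name is illustrative.

```python
import math
from statistics import NormalDist


def n_for_sens_spec(sensitivity, specificity, prevalence, margin, confidence=0.95):
    """Total patients needed for the sensitivity and for the specificity estimate."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Diseased patients needed, scaled up by the share of patients with disease.
    n_sens = z ** 2 * sensitivity * (1 - sensitivity) / margin ** 2 / prevalence
    # Non-diseased patients needed, scaled up by the share without disease.
    n_spec = z ** 2 * specificity * (1 - specificity) / margin ** 2 / (1 - prevalence)
    return math.ceil(n_sens), math.ceil(n_spec)


# Values from the appendicitis example on the slide.
n_sens, n_spec = n_for_sens_spec(0.965, 0.941, prevalence=0.77, margin=0.10)
print(n_sens, n_spec)
```

Under this approach the study would normally use the larger of the two totals so that both estimates meet the stated margin of error.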

101
Suggested websites for sample size calculators
1. http://www.raosoft.com/samplesize.html
2. http://www.quantitativeskills.com/sisa/calculations/samsize.htm
3. http://www.openepi.com/Menu/OpenEpiMenu.htm

103
Screening
 Screening for disease control can be defined as
the examination of asymptomatic people in
order to classify them as likely or unlikely to
have the disease that is the object of screening.
 If done in large groups, it is called mass screening or
population screening.

104
Characteristics of Disease to be Screened
 Disease must pass through preclinical phase
during which it is undiagnosed but detectable
 Early treatment must offer some advantage

105
Validity
 The ability of a test to distinguish between those who
have the disease and those who do not

106
Sensitivity
 The sensitivity of a test is its ability to detect people who do
have the disease.
 If a test is always positive for all diseased
persons then the sensitivity of the test will be
100%.

107
Specificity
 It is the ability of a test to correctly identify people who
do not have the disease.
 Thus a test which is always negative in non-diseased
individuals is said to have 100%
specificity.

108
Validity
           Diseased   Non Diseased   Total
Positive   a (TP)     b (FP)         a+b
Negative   c (FN)     d (TN)         c+d

109
Test (FNA)   CA Breast Positive   CA Breast Negative   Total
Positive     60 (a, + +)          50 (b, - +)          110 (a + b)
Negative     20 (c, + -)          70 (d, - -)          90 (c + d)
Total        80 (a + c)           120 (b + d)          200 (a + b + c + d)

110
 Sensitivity = a/(a + c) x 100 = 60/80 x 100 = 75%
I.e. the test (FNA) is 75% sensitive in detecting disease
 Specificity = d/(d + b) x 100 = 70/120 x 100 = 58%
I.e. the specificity of FNA is 58% in identifying non-diseased
persons

111
Positive Predictive Value i.e. PPV
 PPV = a/(a + b) x 100 = 60/110 x 100 = 55%
I.e. 55% of persons with a positive test are actually suffering from
disease.
PPV ∝ Prevalence (PPV rises as prevalence rises)
Negative Predictive Value i.e. NPV
 NPV = d/(c + d) x 100 = 70/90 x 100 = 78%
I.e. 78% of persons with a negative test are actually free from disease.
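
All four measures can be computed together from the FNA table. A minimal sketch; the function name and dictionary keys are illustrative, and the counts are those on the slide.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV and NPV (as fractions) from a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),  # a / (a + c)
        "specificity": tn / (tn + fp),  # d / (d + b)
        "ppv": tp / (tp + fp),          # a / (a + b)
        "npv": tn / (tn + fn),          # d / (c + d)
    }


# FNA vs breast cancer status: a = 60, b = 50, c = 20, d = 70.
fna = diagnostic_measures(tp=60, fp=50, fn=20, tn=70)
for name, value in fna.items():
    print(f"{name}: {value:.0%}")
```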

112
Test       Disease Present           Disease Not Present       Total
Positive   True Positive (TP) + +    False Positive (FP) - +   TP + FP
Negative   False Negative (FN) + -   True Negative (TN) - -    FN + TN
Total      TP + FN                   TN + FP                   TP+FP+TN+FN
•Sensitivity = TP / (TP + FN) x 100
•Specificity = TN / (TN + FP) x 100
•PPV = TP / (TP + FP) x 100
•NPV = TN / (TN + FN) x 100

113
Relationship of Disease Prevalence to PPV
Example: Sensitivity = 99%; Specificity = 95%, in a population of 10,000
with a disease prevalence of 1%
Dis. Prev   Test Results   Disease   Not Disease   Total
1%          Positive       99        495           594
            Negative       1         9405          9406
            Total          100       9900          10,000
PPV = 99/594 = 17%

114
Relationship of Disease Prevalence to PPV
Example: Sensitivity = 99%; Specificity = 95%, in a population of 10,000
with a disease prevalence of 5%
Dis. Prev   Test Results   Disease   Not Disease   Total
5%          Positive       495       475           970
            Negative       5         9025          9030
            Total          500       9500          10,000
PPV = 495/970 = 51%
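
The two worked examples differ only in prevalence, so the dependence of PPV on prevalence can be written as one function. This is a sketch; the function name is illustrative and the inputs are the sensitivity, specificity and prevalences used on the slides.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value for a given disease prevalence."""
    true_pos = sensitivity * prevalence               # diseased and test positive
    false_pos = (1 - specificity) * (1 - prevalence)  # healthy but test positive
    return true_pos / (true_pos + false_pos)


for prevalence in (0.01, 0.05):
    print(f"prevalence {prevalence:.0%}: PPV = {ppv(0.99, 0.95, prevalence):.0%}")
```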

115
Relationship between PPV & Prevalence
 A screening program is most effective and
beneficial if it is directed to a high-risk target
population
 Screening a total population for a relatively
infrequent disease can be very wasteful of
resources and may yield very few previously
undetected cases

117
POINTS OF IMPORT IN DESIGNING A QUESTIONNAIRE
 The format of the questionnaire should be
attractive and easy for the respondents to fill in. Overcrowding or
clutter should be avoided, and all questions and pages should be clearly
numbered
 The questionnaire should not be too long
 To maintain flow of the instrument, questions concerning
major areas should be grouped together
 Simple questions about age, birth date etc should be put at the
beginning to warm up the respondent

118
POINTS OF IMPORT IN DESIGNING A
QUESTIONNAIRE
 Questions should be closed-ended. Possible answers to closed-ended
questions should be lined up vertically, preceded by boxes,
brackets or numbers
Example
How many different medicines do you take daily (check one)
[ ] None
[ ] 1-2
[ ] 3-4
[ ] 5-6
[ ] 7 or more

119
POINTS OF IMPORT IN DESIGNING
A QUESTIONNAIRE
 If more details are required pertaining to a question, then the
filter/skip technique should be used to save time and allow
respondents to avoid irrelevant questions.
Example: Have you ever been told that you have hypertension?
Yes
No
If yes, proceed to the next question:
How long ago were you told that you have hypertension?

120
POINTS OF IMPORT IN DESIGNING A
QUESTIONNAIRE
 The wording of questions should be simple, free
from ambiguity and non-judgmental, and should solicit
only one response.
 For behaviors that may change over time, a specific
time span should be asked for in the question
Example: During the past 12 months, how many
doctor visits did you make?
 Always choose an appropriate means of measurement
e.g. scores/scales.

121
POINTS OF IMPORT IN DESIGNING
A QUESTIONNAIRE
 Sensitive topic questions should be left for the end
 If similar research instruments are available it may
be a good idea to review them and, if required, borrow
questions.
 If questions are to be asked in any language besides
English, always try to ensure that they are also
written in that language
