Statistical Analysis In
Machine Learning
By Shifa Noorulain
Overview of Statistics
▪ What is Statistics
▪ Terminologies in Statistics
▪ Understanding Variables in Statistics
▪ Categories in Statistics
▪ Descriptive Vs Inferential Statistics
Introduction
What is Statistical Analysis? Why is it required?
▪ It is done to understand or discover underlying patterns and trends in data.
▪ It involves collecting, exploring and presenting large amounts of data.
“Statistics is a Mathematical Science pertaining to data collection,
analysis, interpretation and presentation.”
Terminologies
▪ The Population is the full set of sources from which data is to be collected.
▪ A Sample is a subset of the Population.
▪ A Variable is any characteristic, number, or quantity that can be measured or counted. A
variable may also be called a data item.
▪ A statistical Parameter, or population parameter, is a quantity that indexes a family of
probability distributions, for example the mean or median of a population.
Data Source
Data comes from many sources:
▪ sensor measurements, events, text, images, and
videos. The Internet of Things (IoT) is spewing out
streams of information.
▪ Stock Market
▪ Retail
▪ Weather
Structured Data Format
Variable
▪ Numerical/Quantitative
  ▪ Continuous: measured characteristics. Ex: temperature, square feet, wind speed or time duration
  ▪ Discrete: counted items. Ex: defects per hour
▪ Categorical/Qualitative
  ▪ Ordinal: data is arranged in order or ranked and can be compared. Ex: grades, star reviews, position in a race, date
  ▪ Nominal: categorized using names, labels or qualities. Ex: brand name, ZIP code, gender
Note: Categorical data can be visualized with a bar plot or pie chart.
Numerical data can be visualized with a histogram, line plot or scatter plot.
Types of Statistical Analysis
Statistical analysis is divided into Descriptive Statistics and Inferential Statistics.
▪ Descriptive statistics describe a sample. Statistical measures used: central
tendency, dispersion, skewness.
▪ Inferential statistics take data from a sample and make inferences about the larger
population from which the sample was drawn, generalizing from the sample to the population.
Common methods used: hypothesis tests, confidence intervals, and regression analysis.
Diagram: select a sample from the population, then generalize conclusions from the
sample to the population.
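The sample-to-population flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population of simulated temperatures, the seed, the sample size of 200 and the normal-approximation confidence interval are all hypothetical choices, not taken from the slides.

```python
import random
import statistics

# Hypothetical population: 10,000 simulated temperature readings (assumed data).
rng = random.Random(42)  # fixed seed so the sketch is reproducible
population = [rng.gauss(25.0, 5.0) for _ in range(10_000)]

# Select a sample from the population (simple random sampling, no replacement).
sample = rng.sample(population, k=200)

# Descriptive step: summarize the sample itself.
sample_mean = statistics.mean(sample)
sample_sd = statistics.stdev(sample)

# Inferential step: estimate the population mean from the sample with an
# approximate 95% confidence interval (normal approximation).
margin = 1.96 * sample_sd / len(sample) ** 0.5
ci_low, ci_high = sample_mean - margin, sample_mean + margin

# In practice the population mean is unknown; it is computed here only
# because the population is simulated.
population_mean = statistics.mean(population)

print(f"sample mean = {sample_mean:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
print(f"population mean = {population_mean:.2f}")
```

The descriptive numbers (sample mean and standard deviation) describe only the 200 sampled values, while the interval is a statement about the population they were drawn from.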
Descriptive Statistics
Topics of Discussion
▪ Measures of Central Tendency
▪ Measures of the Spread
▪ Measures of Asymmetry(Skewness)
Descriptive Statistics uses the data to
provide descriptions of the collected sample, either
through numerical calculations, graphs or
tables.
Measures of Central Tendency
Mean: The average of all the values in a sample is called the Mean.
▪ It is susceptible to outliers: when unusual values are added it gets skewed, i.e. it
deviates from the typical central value.
▪ Suitable for continuous data with no outliers.
Median:
▪ The central value of the sample set is called the Median.
▪ It is the middle value of a dataset that has been arranged in order of magnitude.
▪ The median is a better alternative to the mean when the data is skewed or contains
outliers, as it is less affected by them and stays much closer to the typical central
value.
▪ Suitable for continuous data with outliers.
▪ If the total number of values n is odd, the median is the ((n + 1)/2)-th value of the sorted data.
▪ If the total number of values n is even, the median is the average of the (n/2)-th and
(n/2 + 1)-th values of the sorted data.
Mode:
▪ The value most recurrent in the sample set is known as Mode.
▪ Suitable for categorical data(both nominal and ordinal)
For example, in a dataset containing the values {13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96},
the mean is 60.09, the median is 56 and the mode is 54.
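The three measures can be checked with Python's standard `statistics` module. The sketch below reuses the dataset from the example; the extra value 1000 is a hypothetical outlier added only to show how differently the mean and the median react.

```python
import statistics

values = [13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96]

mean = statistics.mean(values)      # sum of the values divided by their count
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value

print(round(mean, 2), median, mode)  # 60.09 56 54

# Hypothetical outlier: the mean shifts strongly, the median barely moves.
with_outlier = values + [1000]
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)
print(round(mean_out, 2), median_out)  # 138.42 56.5
```

With an even number of values, `statistics.median` averages the two middle values, which is why the second median is 56.5.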
Measures of the Spread/Dispersion
Dispersion indicates how large the spread of the distribution is around the central tendency. It
answers the question “What is the magnitude of departure from the average value for different
groups having identical averages?”
Range:
▪ It is a measure of how spread apart the values in a data set are. The difference between the
largest and the smallest value in the data is termed the range of the distribution.
▪ Range does not consider all the values of a series: it takes only the extreme items, and the
middle items are ignored. E.g. for {13, 33, 45, 67, 70} the range is 57, i.e. 70 – 13.
▪ Note: If the maximum or minimum value is an extreme value (outlier), the range should
not be used.
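The range computation, and its sensitivity to a single extreme value, can be sketched as follows. The helper name `value_range` and the added value 500 are illustrative assumptions.

```python
def value_range(values):
    """Return the difference between the largest and smallest value."""
    return max(values) - min(values)

data = [13, 33, 45, 67, 70]
print(value_range(data))  # 57

# Hypothetical outlier: one extreme value dominates the range.
with_outlier = data + [500]
print(value_range(with_outlier))  # 487
```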
Variance: It describes how much a random variable differs from its expected value.
It entails computing squares of deviations.
▪ A deviation is the difference between an element and the mean.
▪ Population Variance is the average of the squared deviations from the population mean:
the sum of squared deviations divided by the number of values N.
▪ Sample Variance is the sum of squared deviations from the sample mean divided by n – 1,
where n is the sample size.
Note: Variance is expressed in squared units of the original values, so another
measure of variability is used alongside it.
Standard Deviation:
▪ It is a measure of the dispersion of a set of data from its mean.
▪ Because variance is expressed in squared units, the standard deviation is used instead.
▪ The standard deviation is the square root of the variance. It tells how
concentrated the data is around the mean of the data set.
For example, for the dataset {3, 5, 6, 9, 10} the mean is 6.6, the population variance is 6.64
and the population standard deviation is about 2.58 (sample variance 8.3, sample standard
deviation about 2.88).
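For the dataset {3, 5, 6, 9, 10}, variance and standard deviation can be worked through step by step. The sketch below computes both the population form (divide by n) and the sample form (divide by n – 1); the variable names are illustrative, and the slides do not say which form their worked example used.

```python
import math
import statistics

data = [3, 5, 6, 9, 10]
n = len(data)

mean = sum(data) / n                            # 6.6
squared_devs = [(x - mean) ** 2 for x in data]  # squared deviation of each value

pop_var = sum(squared_devs) / n           # population variance, about 6.64
sample_var = sum(squared_devs) / (n - 1)  # sample variance, about 8.3

pop_sd = math.sqrt(pop_var)        # about 2.58, in the same units as the data
sample_sd = math.sqrt(sample_var)  # about 2.88

# The standard library gives the same results.
assert math.isclose(pop_var, statistics.pvariance(data))
assert math.isclose(sample_var, statistics.variance(data))

print(round(pop_var, 2), round(pop_sd, 2), round(sample_var, 2), round(sample_sd, 2))
```

Dividing by n – 1 makes the sample variance slightly larger, which compensates for measuring deviations from the sample mean rather than the unknown population mean.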
Coefficient of Variation(CV):
▪ It is also called the relative standard deviation. It is the ratio of the
standard deviation to the mean of the dataset.
Note: The standard deviation measures the variability of a single dataset, whereas the
coefficient of variation can be used for comparing 2 datasets.
In this example the CV is the same for both datasets, so both are
equally precise relative to their means. Because the CV is unitless, it is
well suited to such comparisons.
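The original worked example is not reproduced in the text, so the sketch below uses hypothetical data instead: the same five heights recorded in centimetres and in metres. The standard deviations differ by a factor of 100, but the CV is identical, which is what makes it usable across datasets with different scales. The function name `coefficient_of_variation` is an assumption.

```python
import math
import statistics

def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical data: the same measurements in two different units.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]
heights_m = [h / 100 for h in heights_cm]

sd_cm = statistics.stdev(heights_cm)
sd_m = statistics.stdev(heights_m)
cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)

print(f"sd: {sd_cm:.3f} cm vs {sd_m:.5f} m")
print(f"cv: {cv_cm:.4f} vs {cv_m:.4f}")
```

The CV is only meaningful when the mean is not close to zero, since the mean sits in the denominator.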
Inter Quartile Range (IQR): It is a measure of variability based on dividing a
data set into quartiles. In a box plot:
• The line that divides the box into 2 parts represents the median of the data.
• The end of the box shows the upper quartile (75%), also called the 3rd quartile (Q3).
• The start of the box represents the lower quartile (25%), also called the 1st quartile (Q1).
• The region between the lower quartile and the upper quartile is called the Inter Quartile
Range (IQR); it covers the middle 50% of the data (75 – 25 = 50%).
• The maximum is the highest value in the data and the minimum is the lowest value in the
data; they are drawn as caps.
• The lines extending from the box to the minimum and maximum are called
whiskers; they show the range of values in the data.
• The extreme points are outliers. A commonly used rule is that a value is an
outlier if it is less than the lower quartile – 1.5 * IQR or higher than the upper quartile + 1.5 *
IQR, where IQR = Q3 – Q1.
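The quartiles, the IQR and the 1.5 * IQR outlier rule can be sketched with the standard library. The dataset is reused from the central-tendency example; the choice of `method="inclusive"` is an assumption, and since quartile conventions differ between tools, other software may report slightly different Q1 and Q3 values.

```python
import statistics

data = [13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96]

# Three cut points split the sorted data into four quarters.
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # spread of the middle 50% of the data

# Fences from the commonly used 1.5 * IQR rule.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q2, q3, iqr)           # 54.0 56.0 76.0 22.0
print(lower_fence, upper_fence)  # 21.0 109.0
print(outliers)                  # [13]
```

Under this convention the value 13 falls below the lower fence and would be drawn as an individual outlier point on a box plot.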
A box plot neatly illustrates what we can do with basic statistical features:
▪ When the box plot is short, it implies that many of your data points are similar, since there are
many values in a small range.
▪ When the box plot is tall, it implies that many of your data points are quite different, since the
values are spread over a wide range.
▪ If the median value is closer to the bottom, then we know that most of the data has lower
values. If the median value is closer to the top, then we know that most of the data has higher
values. Basically, if the median line is not in the middle of the box, it is an indication of
skewed data.
▪ Are the whiskers very long? That means your data has a high standard deviation and
variance, i.e. the values are spread out and highly varying. If you have long whiskers on one
side of the box but not the other, then your data may be highly varying only in one direction.
▪ All of that information comes from a few simple statistical features that are easy to calculate. Try these
out whenever you need a quick yet informative view of your data.
Measures of Asymmetry
Skewness: Skewness is the asymmetry in a statistical distribution, in which the curve
appears distorted or skewed to the left or to the right. Skewness indicates whether
the data is concentrated on one side.
Positive Skewness: Typically mean > median > mode.
The tail of the distribution, and with it the outliers,
extends to the right.
Negative Skewness: Typically mean < median < mode.
The tail of the distribution, and with it the outliers,
extends to the left.
Skewness is important because it tells us on which side of the center the data is concentrated.
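One common way to put a number on skewness is the moment coefficient: the third central moment divided by the cube of the standard deviation. The sketch below is a minimal illustration under stated assumptions: the function name `skewness` and both datasets are hypothetical, and the slides do not say which skewness formula they have in mind.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: third central moment / sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # population variance
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data with a long right tail, and its mirror image.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 4, 5, 20]
left_skewed = [-x for x in right_skewed]

print(skewness(right_skewed) > 0)  # True: positive skew, tail to the right
print(skewness(left_skewed) < 0)   # True: negative skew, tail to the left

# In the right-skewed data the mean is pulled above the median and the mode.
print(statistics.mean(right_skewed),
      statistics.median(right_skewed),
      statistics.mode(right_skewed))  # 4.7 3.0 3
```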
The Empirical Rule
The Empirical Rule states that almost all data lies within
3 standard deviations of the mean for a normal
distribution. Under this rule,
▪ About 68% of the data is within 1 standard deviation of
the mean.
▪ About 95% of the data is within 2 standard deviations
of the mean.
▪ About 99.7% of the data is within 3 standard
deviations of the mean.
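The three percentages can be checked for the normal distribution. For a normal variable, the probability of falling within k standard deviations of the mean is erf(k / √2), which the standard `math` module evaluates directly. The function name `fraction_within` is an illustrative assumption.

```python
import math

def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

These values hold only for normally distributed data; for skewed or heavy-tailed data the proportions can differ noticeably.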
Statistics for machine learning shifa noorulain

  • 1. Statistical Analysis In Machine Learning By Shifa Noorulain
  • 2. Overview to Statistics ▪ What is Statistics ▪ Terminologies in Statistics ▪ Understanding Variables in Statistics ▪ Categories in Statistics ▪ Descriptive Vs Inferential Statistics
  • 3. Introduction What is Statistical Analysis? Why is it required? ▪ It is done to understand/discover underlying patterns and trends ▪ Which involves collecting, exploring and presenting large amounts of data “Statistics is a Mathematical Science pertaining to data collection, analysis, interpretation and presentation.”
  • 4. Terminologies ▪ The population is the set of sources from which data has to be collected. ▪ A Sample is a subset of the Population ▪ A Variable is any characteristics, number, or quantity that can be measured or counted. A variable may also be called a data item. ▪ A statistical Parameter or population parameter is a quantity that indexes a family of probability distributions. For example, the mean, median, etc of a population.
  • 5. Data Source Data comes from many sources: ▪ sensor measurements, events, text, images, and videos. The Internet of Things (IoT) is spewing out streams of information. ▪ Stock Market ▪ Retail ▪ Weather
  • 6. Structured Data Format Variable Numerical/Quantitative Categorical/Qualitative Continuous Discrete Ordinal Nominal Measure Characteristics Ex: Temperature, Square feet, wind speed or time duration Counted Items Ex: Defects per hour Data is arranged in order or ranked and can be compared Ex: Grades, Star Reviews, Position in Race, Date Categorized using names, labels or qualities Ex: Brand Name, ZipCode, Gender Note: Categorical Data can be visualized by Bar Plot, Pie Chart. Numerical Data can be visualized by Histogram, Line Plot, Scatter Plot
  • 7. Types of Statistical Analysis Statistical Analysis Descriptive Statistics Inferential Statistics Descriptive statistics describe a sample. Statistical measures used: Central tendency, Dispersion, Skewness Inferential statistics takes data from a sample and makes inferences about the larger population from which the sample was drawn. And generalizes to population Common methods used: hypothesis tests, confidence intervals, and regression analysis. Population Sample Select a sample from the population Generalize conclusions from samples to Population
  • 8. Descriptive Statistics Topics of Discussion ▪ Measures of Central Tendency ▪ Measures of the Spread ▪ Measures of Asymmetry(Skewness) Descriptive Statistics uses the data to provide descriptions of the population, either through numerical calculations or graphs or tables.
  • 9. Measures of Central Tendency Mean: Measure of average of all the values in a sample is called Mean. ▪ It susceptible to outliers when unusual values are added it gets skewed i.e deviates from the typical central value. ▪ suitable to continuous data with no outliers
  • 10. Median: ▪ Measure of the central value of the sample set is called Median. ▪ It is the middle value for a dataset that has been arranged in order of magnitude. ▪ Median is a better alternative to mean as it is less affected by outliers and skewness of the data. The median value is much closer than the typical central value. ▪ Suitable for continuous data with outliers If the total number of values is odd If the total number of values is even
  • 11. Mode: ▪ The value most recurrent in the sample set is known as Mode. ▪ Suitable for categorical data(both nominal and ordinal) For Example, In a dataset containing {13,35,54,54,55,56,57,67,85,89,96} values. Mean is 60.09. Median is 56. Mode is 54.
  • 12. Measures of the Spread/Dispersion Range: ▪ It is the given measure of how spread apart the values in a data set are.The difference between the largest and the smallest value of a data, is termed as the range of the distribution. ▪ Range does not consider all the values of a series, i.e. it takes only the extreme items and middle items are not considered significant. eg: For {13,33,45,67,70} the range is 57 i.e(70–13). ▪ Note: If one of the components of range, maximum or minimum value becomes an extreme value(Outlier) then range should not be used This indicates how large the spread of the distribution is around the central tendency. It answers unambiguously the question “What is the magnitude of departure from the average value for different groups having identical averages?”
  • 13. Variance: It describes how much a random variable differs from its expected value. It entails computing squares of deviations. ▪ Deviation is the difference between each element from the mean. ▪ Population Variance is the average of squared deviations ▪ Sample Variance is the average of squared differences from the mean Variance is the average of all squared deviations. Note: The units of values and variance is not equal so we use another variability measure.
  • 14. Standard Deviation: ▪ It is the measure of the dispersion of a set of data from its mean. ▪ Because variance is expressed in squared units, the standard deviation is used instead. ▪ The square root of the variance is the standard deviation. It tells us how concentrated the data is around the mean of the data set. For example, take a dataset with the values {3,5,6,9,10}.
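The worked values for this dataset are not shown in the transcript, so the following is a minimal sketch of how variance and standard deviation could be computed step by step for {3,5,6,9,10}. It assumes the usual definitions (divide by N for the population, by n – 1 for a sample) and cross-checks them against Python's `statistics` module.

```python
import math
import statistics

values = [3, 5, 6, 9, 10]
n = len(values)

mean = sum(values) / n                      # 6.6
deviations = [x - mean for x in values]     # difference of each element from the mean
squared = [d ** 2 for d in deviations]      # squared deviations

population_variance = sum(squared) / n      # divide by N
sample_variance = sum(squared) / (n - 1)    # divide by n - 1

population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(f"Population variance: {population_variance:.2f}, SD: {population_sd:.3f}")
print(f"Sample variance:     {sample_variance:.2f}, SD: {sample_sd:.3f}")

# The standard library gives the same results
assert math.isclose(population_variance, statistics.pvariance(values))
assert math.isclose(sample_variance, statistics.variance(values))
```

The standard deviation is back in the units of the original values, which is why it is easier to interpret than the variance.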
  • 15. Coefficient of Variation (CV): ▪ It is also called the relative standard deviation. It is the ratio of the standard deviation to the mean of the dataset. Note: The standard deviation describes the variability of a single dataset, whereas the coefficient of variation can be used to compare two datasets, even when they are on different scales. From this example, we can see that the CV is the same. Both methods are precise. So it is well suited for comparisons.
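The example the slide refers to is not included in the transcript. As a stand-in, the sketch below uses two made-up datasets (hypothetical values, one ten times the scale of the other) to show how the standard deviations differ while the CV stays the same.

```python
import statistics

def coefficient_of_variation(values):
    """Ratio of the sample standard deviation to the mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical datasets, chosen only for illustration
dataset_a = [10, 20, 30]
dataset_b = [100, 200, 300]

sd_a = statistics.stdev(dataset_a)
sd_b = statistics.stdev(dataset_b)
cv_a = coefficient_of_variation(dataset_a)
cv_b = coefficient_of_variation(dataset_b)

print(f"SD: {sd_a:.1f} vs {sd_b:.1f}")
print(f"CV: {cv_a:.2f} vs {cv_b:.2f}")
```

Because the CV divides out the mean, it is unitless, which is what makes it usable for comparing datasets measured on different scales.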
  • 16. Inter Quartile Range (IQR): It is a measure of variability based on dividing a data set into quartiles. In a box plot: • The line that divides the box into 2 parts represents the median of the data. • The end of the box shows the upper quartile (75%), also called the 3rd quartile. • The start of the box represents the lower quartile (25%), also called the 1st quartile. • The region between the lower quartile and the upper quartile is called the Inter Quartile Range (IQR); it covers the middle 50% of the data (75 – 25 = 50%). • The maximum is the highest value in the data and the minimum is the lowest value in the data; these are also called caps. • The lines that extend from the box to the minimum and the maximum are called whiskers; they show the range of values in the data. • The extreme points are outliers to the data. A commonly used rule is that a value is an outlier if it is less than the lower quartile – 1.5 * IQR or higher than the upper quartile + 1.5 * IQR, where IQR = Q3 – Q1.
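The 1.5 * IQR rule above can be sketched with the standard library. The dataset is reused from the mode example earlier in the deck. Note that quartile conventions differ between tools; this sketch assumes the "inclusive" method of `statistics.quantiles`, so other software may report slightly different quartiles.

```python
import statistics

values = [13, 35, 54, 54, 55, 56, 57, 67, 85, 89, 96]

# Quartile cut points; method="inclusive" is an assumption, other conventions exist
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = [x for x in values if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"Fences: [{lower_fence}, {upper_fence}]")
print("Outliers:", outliers)
```

Here only 13 falls below the lower fence, so it is the single value the rule flags.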
  • 17. A box plot illustrates well what we can do with basic statistical features: ▪ When the box plot is short, it implies that many of your data points are similar, since there are many values in a small range. ▪ When the box plot is tall, it implies that many of your data points are quite different, since the values are spread over a wide range. ▪ If the median value is closer to the bottom, then we know that most of the data has lower values. If the median value is closer to the top, then we know that most of the data has higher values. Basically, if the median line is not in the middle of the box, then it is an indication of skewed data. ▪ Are the whiskers very long? That means your data has a high standard deviation and variance, i.e. the values are spread out and highly varying. If you have long whiskers on one side of the box but not the other, then your data may be highly varying only in one direction. ▪ All of that information comes from a few simple statistical features that are easy to calculate. Try these out whenever you need a quick yet informative view of your data.
  • 18. Measures of Asymmetry Skewness: Skewness is the asymmetry in a statistical distribution, in which the curve appears distorted or skewed to the left or to the right. Skewness indicates whether the data is concentrated on one side. Positive Skewness: Positive skewness typically shows up as mean > median > mode. The outliers lie to the right, i.e. the tail extends to the right. Negative Skewness: Negative skewness typically shows up as mean < median < mode. The outliers lie to the left, i.e. the tail extends to the left. Skewness is important as it tells us where the data is concentrated.
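The slides do not give a formula for skewness. One common choice, assumed here, is the moment coefficient of skewness (third central moment divided by the second central moment raised to the power 1.5). The sketch below applies it to two small made-up datasets, one with a long right tail and its mirror image.

```python
import statistics

def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5, using population moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical data: one large value creates a right tail
right_skewed = [1, 2, 2, 3, 3, 3, 4, 10]
# Mirror image of the data above, so the tail points left
left_skewed = [11 - x for x in right_skewed]

for name, data in [("right", right_skewed), ("left", left_skewed)]:
    print(
        f"{name}: skewness={skewness(data):+.2f}, "
        f"mean={statistics.fmean(data):.2f}, median={statistics.median(data)}"
    )
```

The sign of the coefficient matches the direction of the tail, and the mean is pulled toward the tail relative to the median.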
  • 19. The Empirical Rule The Empirical Rule states that almost all data lies within 3 standard deviations of the mean for a normal distribution. Under this rule, ▪ About 68% of the data is within 1 standard deviation of the mean. ▪ About 95% of the data is within 2 standard deviations of the mean. ▪ About 99.7% of the data is within 3 standard deviations of the mean.
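The percentages in the Empirical Rule can be recovered from the cumulative distribution function of a normal distribution. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `share_within` is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"Within {k} standard deviation(s): {share_within(k):.1%}")
```

The result does not depend on the particular mean or standard deviation, because distances are measured in units of standard deviations.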