Don't get confused by summary statistics. Learn the types of summary statistics in depth, from measures of central tendency and measures of dispersion to much more.
Let me know if anything is required. Ping me at Google: #bobrupakroy
2. Definition
Summary statistics are used to summarize a set of observations in order to communicate as much information about the data as possible. They are part of descriptive statistics and are used, basically, to summarize or describe a set of observations.
3. Example
The weights of a population are 45 kg, 57 kg, 72 kg, and 52 kg.
What we want here is a summary of the weight of the population: we can say the average weight of the population is 56.5 kg, and now we can describe the population in the simplest way possible.
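To make the example concrete, here is a minimal Python sketch (assuming nothing more than the four weights listed above) that reproduces the 56.5 kg average:

```python
# Summarizing a population's weights with a single number: the mean.
weights_kg = [45, 57, 72, 52]  # the observations from the example

mean_weight = sum(weights_kg) / len(weights_kg)
print(f"Average weight: {mean_weight} kg")  # 56.5 kg
```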
4. Types
Summary statistics fall into three groups:

Measures of Central Tendency
1. Mean
2. Median
3. Mode
4. Geometric Mean

Measures of Dispersion
1. Standard Deviation
2. Variance
3. Interquartile Range

Others
1. Coefficient
2. Skewness
3. Kurtosis
4. Probability Distributions
5. Distribution plot
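To preview several of these measures at once, here is a small sketch assuming pandas is available; the sample values are made up for illustration:

```python
# A quick battery of summary statistics with pandas.
import pandas as pd

data = pd.Series([45, 57, 72, 52, 61, 48])  # hypothetical observations

print(data.describe())           # count, mean, std, min, quartiles, max
print("Mode:", data.mode().tolist())
print("Variance:", data.var())   # sample variance
print("Skewness:", data.skew())
print("Kurtosis:", data.kurt())  # excess kurtosis
```

Each of these measures is explained in the slides that follow.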
5. Definition
Measures of central tendency: the value around which a group of data clusters. In simple words, it is a way to describe the center of a data set. And what is the center of the data? A single number that summarizes the entire dataset, obtained with techniques such as the mean/average or the median of the dataset.
Measures of dispersion: "dispersion (also called variability, scatter, or spread) is the extent to which a distribution of data is stretched or squeezed."
In the accompanying graph we can see the distribution of data (assume a population) is more stretched on the right side, ranging from 50 to 80.
6. Measures of Central Tendency
1. Mean: the average of the observations. Most effective when the data is not heavily skewed.
2. Median: the middle value of the dataset. Useful for skewed data. We will talk about skewed data in the upcoming slides.
3. Mode: the value that occurs the maximum number of times in the data.
4. Geometric mean: the nth root of the product of n numbers. It is used when we want the average rate of an event and the event rate is determined by multiplication. For example, the yearly growth of a bank account in bank ABC is calculated with the geometric mean, since the growth rate is determined by multiplying the amount in the account by the percentage of growth.
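A minimal sketch of the first three measures using Python's standard library (no external packages assumed); the data reuses the mode example from slide 8:

```python
# Mean, median, and mode from the standard library.
import statistics

data = [23, 45, 76, 33, 54, 33, 76, 33]

print("Mean:", statistics.mean(data))      # arithmetic average
print("Median:", statistics.median(data))  # middle value of the sorted data
print("Mode:", statistics.mode(data))      # most frequent value -> 33
```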
7. Formula for calculating the Geometric Mean
GM = (x1 * x2 * ... * xn)^(1/n), the nth root of the product of the n observations.
Example: the geometric mean of 23, 56, 66?
GM = (23 * 56 * 66)^(1/3) = 85008^(1/3) = 43.9696761, which means 43.9696761 cubed is 85008.
Note:
If one of the observations is zero, the geometric mean becomes zero, and it does not work with negative numbers like -1, -4, -5, and so on.
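A sketch reproducing the worked example, assuming Python 3.8+ where statistics.geometric_mean is available:

```python
# Geometric mean: the nth root of the product of n numbers.
import statistics

values = [23, 56, 66]
gm = statistics.geometric_mean(values)  # raises StatisticsError on zero or negative inputs
print(gm)       # ~43.9696761
print(gm ** 3)  # ~85008, confirming the cube-root relationship
```

The StatisticsError behavior mirrors the slide's caveat about zeros and negative numbers.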
8. Calculation of Mode
For ungrouped data, the mode is the item that occurs the maximum number of times.
Example: 23, 45, 76, 33, 54, 33, 76, 33. Therefore Mode = 33.
For grouped data: Mode = L + (Δ1 / (Δ1 + Δ2)) * i
where L is the lower boundary of the modal class,
Δ1 = f1 - f0 (the modal class frequency minus the frequency of the class before it),
Δ2 = f1 - f2 (the modal class frequency minus the frequency of the class after it), and
i is the class width.
Nowadays we don't have to worry about the calculation: statistical software such as R or Excel will handle the heavy computation for large amounts of data, but for more in-depth information you can visit this website:
https://www.mathsisfun.com/data/frequency-grouped-mean-median-mode.html
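Here is a sketch of the grouped-data formula; the frequency table is hypothetical, made up purely to illustrate the arithmetic:

```python
# Grouped-data mode: Mode = L + (d1 / (d1 + d2)) * i
lower_bounds = [0, 10, 20, 30, 40]   # hypothetical class lower boundaries
freqs        = [4,  7, 12,  6,  3]   # hypothetical class frequencies
i = 10                               # class width

m  = freqs.index(max(freqs))         # index of the modal class (f1 = 12)
L  = lower_bounds[m]                 # lower boundary of the modal class
d1 = freqs[m] - freqs[m - 1]         # f1 - f0 = 12 - 7 = 5
d2 = freqs[m] - freqs[m + 1]         # f1 - f2 = 12 - 6 = 6

mode = L + (d1 / (d1 + d2)) * i
print(mode)                          # 20 + (5/11) * 10 ≈ 24.55
```

(The sketch assumes the modal class is not the first or last class, so f0 and f2 both exist.)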
9. Measures of Dispersion
Standard deviation: basically a measure of how near or far the observations are from the mean.
Variance: the fact or quality of being different, divergent, or inconsistent. A value of zero means there is no variability: all the values in the data set are the same.
Interquartile range: a measure of variability based on dividing the rank-ordered data set into quartiles. Say
Q1 is the middle value in the first half of the rank-ordered data set,
Q2 is the median value, and
Q3 is the middle value in the second half of the rank-ordered data set.
Then the interquartile range = Q3 - Q1.
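A minimal sketch of these three measures, assuming numpy is available (np.percentile's default interpolation is just one of several common quartile conventions, so Q1 and Q3 may differ slightly from hand calculations):

```python
# Standard deviation, variance, and interquartile range.
import numpy as np

data = np.array([45, 57, 72, 52, 61, 48])  # hypothetical observations

print("Std deviation:", data.std(ddof=1))  # sample standard deviation
print("Variance:", data.var(ddof=1))       # sample variance; 0 would mean no variability
q1, q3 = np.percentile(data, [25, 75])
print("IQR:", q3 - q1)                     # Q3 - Q1
```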
10. Skewness refers to a lack of symmetry, or imbalance, in the data distribution.
In a symmetric (e.g., normal) distribution the mean, median, and mode sit at the same point. However, in real life data is never perfectly distributed, hence we call it skewed data.
If the left side has the longer tail, the mass of the distribution is concentrated on the right side, which is known as negatively skewed.
11. If the right side has the longer tail, the mass of the distribution is concentrated on the left side, which is known as positively skewed.
The original slide summarizes all the skewness types in a figure.
12. Example (skewed data)
Temp (°C): 10, 40, 35, 33, 35
Mean = 153/5 = 30.6. If we use the mean we get 30.6, which is misleading, since most of the values sit at 33 or above and the single low value (10) drags the mean down.
So we have to use the median instead. For ungrouped data the median is the ((n + 1)/2)th term of the rank-ordered data:
((5 + 1)/2)th = 6/2 = 3, i.e. the 3rd term of the sorted data (10, 33, 35, 35, 40), which is 35.
For grouped data: Median = L + ((n/2 - B) / G) * w
where L is the lower class boundary of the group containing the median,
B is the cumulative frequency of the groups before the median group,
G is the frequency of the median group, and
w is the width/range of the group.
Again, we don't have to worry about the calculation: statistical software such as R or Excel will handle the heavy computation for large amounts of data, but for more in-depth information you can visit this website:
https://www.mathsisfun.com/data/frequency-grouped-mean-median-mode.html
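A sketch verifying the ungrouped example and applying the grouped formula; the grouped frequency table is hypothetical:

```python
# Ungrouped median: the middle value of the rank-ordered data.
import statistics

temps = [10, 40, 35, 33, 35]
print(statistics.median(temps))  # 35, unlike the outlier-dragged mean of 30.6

# Grouped median: L + ((n/2 - B) / G) * w
lower_bounds = [0, 10, 20, 30]   # hypothetical class lower boundaries
freqs        = [3,  6,  9,  2]   # hypothetical class frequencies
w = 10                           # class width
n = sum(freqs)

cum = 0
for L, G in zip(lower_bounds, freqs):
    if cum + G >= n / 2:         # first class whose cumulative frequency reaches n/2
        B = cum                  # cumulative frequency before the median group
        print(L + ((n / 2 - B) / G) * w)  # 20 + ((10 - 9) / 9) * 10 ≈ 21.11
        break
    cum += G
```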
13. Kurtosis: a measure of whether the data are peaked or flat relative to the normal distribution.
(+) Leptokurtic: the distribution is more clustered near the mean and has a relatively smaller standard deviation.
(-) Platykurtic: the distribution is less clustered around the mean and has a larger standard deviation than a leptokurtic one.
(0) Mesokurtic: measured with respect to the normal distribution; its tails are similar to the normal distribution's, i.e. neither heavy nor light, so it is considered the baseline for the other two.
14. Now, how do we check whether the data is skewed or not?
In Excel:
=SKEW(select the range of values/numbers)
=SKEW(10.24, 9.48, …, -0.42, -0.95)
= -0.27, which means the data is negatively skewed.
And to check the kurtosis in Excel:
=KURT(select the range of values/numbers)
=KURT(10.24, 9.48, …, -0.42, -0.95)
= -1.6, which means the distribution is platykurtic.
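The same checks in Python, sketched with scipy; the sample values are made up, and scipy's defaults (excess kurtosis, biased estimators) differ slightly from Excel's bias-corrected SKEW/KURT, so expect small numeric differences:

```python
# Skewness and kurtosis checks, analogous to Excel's SKEW() and KURT().
from scipy.stats import kurtosis, skew

data = [10.24, 9.48, 7.1, 3.3, 0.2, -0.42, -0.95]  # hypothetical sample

s = skew(data)
k = kurtosis(data)  # Fisher definition: excess kurtosis, normal distribution -> 0

print("Skewness:", s, "(negative -> negatively skewed)")
print("Kurtosis:", k, "(negative -> platykurtic)")
```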
15. Recap
What have we learned?
Measures of central tendency,
measures of dispersion, and
measures of shape such as skewness and kurtosis.
Next we will see how to put this theory into practice and analyze data using everyday tools like Excel.
Rupak Roy