Association rule mining is an unsupervised learning technique used to discover relationships between variables in a large dataset. It analyzes how frequently items are purchased together and generates rules based on metrics like support, confidence and lift. For example, it can determine that customers who buy milk and diapers are likely to also purchase beer based on transaction histories. Association rule mining has applications in market basket analysis, medical diagnosis, catalog design and other domains.
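To make those metrics concrete, here is a minimal Python sketch, with made-up transactions and item names, that computes support, confidence, and lift for the rule {milk, diapers} → {beer}; it is an illustration of the definitions, not code from the document.

```python
# Minimal illustration of support, confidence, and lift for one rule.
# The transactions below are hypothetical.
transactions = [
    {"milk", "diapers", "beer"},
    {"milk", "bread"},
    {"milk", "diapers", "beer", "cola"},
    {"diapers", "beer"},
    {"milk", "diapers", "bread"},
]

antecedent = {"milk", "diapers"}
consequent = {"beer"}
n = len(transactions)

# Support: fraction of transactions containing antecedent and consequent together.
support_both = sum(antecedent | consequent <= t for t in transactions) / n
# Confidence: P(consequent | antecedent).
support_antecedent = sum(antecedent <= t for t in transactions) / n
confidence = support_both / support_antecedent
# Lift: confidence relative to the baseline frequency of the consequent.
support_consequent = sum(consequent <= t for t in transactions) / n
lift = confidence / support_consequent

print(f"support={support_both:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```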
This presentation walks through fitting a simple linear regression model in a Python program. The IDE used is Spyder, and screenshots from a working example are used for demonstration.
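The slides themselves are screenshot-based; as a rough textual counterpart, the sketch below fits a simple linear regression with scikit-learn on made-up data. The feature, values, and column meaning are hypothetical and not taken from the presentation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: years of experience vs. salary (not from the slides).
X = np.array([[1.1], [2.0], [3.2], [4.5], [5.1], [6.8]])  # feature matrix (n_samples, 1)
y = np.array([39.0, 46.0, 60.0, 65.0, 73.0, 91.0])        # response

model = LinearRegression().fit(X, y)
print("intercept:", model.intercept_)
print("slope:", model.coef_[0])
print("prediction for 4 years:", model.predict([[4.0]])[0])
```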
This document provides an overview of maximum likelihood estimation. It explains that maximum likelihood estimation finds the parameters of a probability distribution that make the observed data most probable. It gives the example of using maximum likelihood estimation to find the values of μ and σ that result in a normal distribution that best fits a data set. The goal of maximum likelihood is to find the parameter values that give the distribution with the highest probability of observing the actual data. It also discusses the concept of likelihood and compares it to probability, as well as considerations for removing constants and using the log-likelihood.
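As a small illustration of the idea (a sketch on a toy sample, not code from the document), the following evaluates the Gaussian log-likelihood numerically and compares the maximizer with the closed-form MLEs, which for the normal distribution are the sample mean and the uncorrected sample standard deviation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)  # toy sample

def neg_log_likelihood(params):
    mu, log_sigma = params          # optimize log(sigma) so sigma stays positive
    sigma = np.exp(log_sigma)
    return -np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                   - (data - mu) ** 2 / (2 * sigma**2))

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])

# Closed-form MLEs for the normal distribution.
print("numerical  :", mu_hat, sigma_hat)
print("closed form:", data.mean(), data.std())  # np.std with ddof=0 matches the MLE
```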
Introduction to principal component analysis (PCA) by Mohammed Musah
This document provides an introduction to principal component analysis (PCA), outlining its purpose for data reduction and structural detection. It defines PCA as a linear combination of weighted observed variables. The procedure section discusses assumptions like normality, homoscedasticity, and linearity that are evaluated prior to PCA. Requirements for performing PCA include the variables being at the metric or nominal level, sufficient sample size and variable ratios, and adequate correlations between variables.
Descriptive statistics is used to describe and summarize key characteristics of a data set. Commonly used measures include central tendency, such as the mean, median, and mode, and measures of dispersion like range, interquartile range, standard deviation, and variance. The mean is the average value calculated by summing all values and dividing by the number of values. The median is the middle value when data is arranged in order. The mode is the most frequently occurring value. Measures of dispersion describe how spread out the data is, such as the difference between highest and lowest values (range) or how close values are to the average (standard deviation).
A random variable X has a continuous uniform distribution if its probability density function f(x) is constant over the interval (α, β). The uniform distribution has a probability density function f(x) = k for α < x < β, where k is a constant (k = 1/(β − α), so that the total probability is 1), and is equal to 0 otherwise. All values from α to β are equally likely to occur, meaning the probability of X falling in any sub-interval of (α, β) of a given width is the same regardless of the interval's position within the range.
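A quick numerical check of that property (a hedged sketch with arbitrary endpoints, not part of the original document): with scipy.stats.uniform, two sub-intervals of equal width have equal probability wherever they sit inside (α, β).

```python
from scipy.stats import uniform

alpha, beta = 2.0, 10.0
X = uniform(loc=alpha, scale=beta - alpha)  # uniform on (alpha, beta)

width = 1.5
# Probability of landing in two different sub-intervals of the same width.
p1 = X.cdf(3.0 + width) - X.cdf(3.0)
p2 = X.cdf(7.0 + width) - X.cdf(7.0)
print(p1, p2)          # both equal width / (beta - alpha) = 0.1875
print(X.pdf(5.0))      # constant density 1 / (beta - alpha) = 0.125
```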
This document discusses multiple linear regression analysis. It begins by defining a multiple regression equation that describes the relationship between a response variable and two or more explanatory variables. It notes that multiple regression allows prediction of a response using more than one predictor variable. The document outlines key elements of multiple regression including visualization of relationships, statistical significance testing, and evaluating model fit. It provides examples of interpreting multiple regression output and using the technique to predict outcomes.
Linear regression is a linear approach for modelling a predictive relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables), which are measured without error. The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression. This term is distinct from multivariate linear regression, where multiple correlated dependent variables are predicted, rather than a single scalar variable. If the explanatory variables are measured with error then errors-in-variables models are required, also known as measurement error models.
In linear regression, the relationships are modeled using linear predictor functions whose unknown model parameters are estimated from the data. Such models are called linear models. Most commonly, the conditional mean of the response given the values of the explanatory variables (or predictors) is assumed to be an affine function of those values; less commonly, the conditional median or some other quantile is used. Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of the response given the values of the predictors, rather than on the joint probability distribution of all of these variables, which is the domain of multivariate analysis.
Linear regression was the first type of regression analysis to be studied rigorously, and to be used extensively in practical applications.[4] This is because models which depend linearly on their unknown parameters are easier to fit than models which are non-linearly related to their parameters and because the statistical properties of the resulting estimators are easier to determine.
Linear regression has many practical uses. Most applications fall into one of the following two broad categories:
If the goal is error reduction in prediction or forecasting, linear regression can be used to fit a predictive model to an observed data set of values of the response and explanatory variables. After developing such a model, if additional values of the explanatory variables are collected without an accompanying response value, the fitted model can be used to make a prediction of the response.
If the goal is to explain variation in the response variable that can be attributed to variation in the explanatory variables, linear regression analysis can be applied to quantify the strength of the relationship between the response and the explanatory variables, and in particular to determine whether some explanatory variables may have no linear relationship with the response at all, or to identify which subsets of explanatory variables may contain redundant information about the response.
Presentation on "Measure of central tendency"muhammad raza
This presentation introduces measures of central tendency including mean, median, and mode. It provides definitions and formulas for calculating each measure using both ungrouped and grouped data. The mean is the average and is used for less scattered data. It is calculated by summing all values and dividing by the number of values. The median is the middle value when values are arranged in order. For even numbers of values, the median is the average of the two middle values. The mode is the most frequently occurring value in a data set and there can be single or multiple modes. Formulas are provided for calculating the median and mode using grouped frequency data.
This document discusses evaluating hypotheses and estimating hypothesis accuracy. It provides the following key points:
- The accuracy of a hypothesis estimated from a training set may be different from its true accuracy due to bias and variance. Testing the hypothesis on an independent test set provides an unbiased estimate.
- Given a hypothesis h that makes r errors on a test set of n examples, the sample error r/n provides an unbiased estimate of the true error. The variance of this estimate depends on r and n based on the binomial distribution.
- For large n, the binomial distribution can be approximated by the normal distribution. Confidence intervals for the true error can then be determined from the sample error and its standard deviation, as in the sketch below.
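A minimal sketch of that normal-approximation interval; the error count and test-set size here are hypothetical, not taken from the document.

```python
import math

# Hypothetical test results: hypothesis h makes r errors on n test examples.
r, n = 12, 200
sample_error = r / n

# Standard deviation of the sample error under the binomial model.
std = math.sqrt(sample_error * (1 - sample_error) / n)

# Approximate 95% confidence interval for the true error (z = 1.96).
z = 1.96
lower, upper = sample_error - z * std, sample_error + z * std
print(f"sample error = {sample_error:.3f}, 95% CI ({lower:.3f}, {upper:.3f})")
```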
This document discusses statistical inference and its two main types: estimation of parameters and testing of hypotheses. Estimation of parameters has two forms: point estimation, which provides a single numerical value as an estimate, and interval estimation, which expresses the estimate as a range of values. Point estimation involves calculating estimators like the sample mean to estimate population parameters. Interval estimation provides an interval rather than a single point as the estimate. Statistical inference uses samples to draw conclusions about unknown population parameters.
Please Subscribe to this Channel for more solutions and lectures
http://www.youtube.com/onlineteaching
Chapter 2: Exploring Data with Tables and Graphs
2.4: Scatterplots, Correlation, and Regression
Please Subscribe to this Channel for more solutions and lectures
http://www.youtube.com/onlineteaching
Chapter 8: Hypothesis Testing
8.2: Testing a Claim About a Proportion
Linear Regression vs Logistic Regression | Edureka by Edureka!
YouTube: https://youtu.be/OCwZyYH14uw
** Data Science Certification using R: https://www.edureka.co/data-science **
This Edureka PPT on Linear Regression Vs Logistic Regression covers the basic concepts of linear and logistic models. The following topics are covered in this session:
Types of Machine Learning
Regression Vs Classification
What is Linear Regression?
What is Logistic Regression?
Linear Regression Use Case
Logistic Regression Use Case
Linear Regression Vs Logistic Regression
Blog Series: http://bit.ly/data-science-blogs
Data Science Training Playlist: http://bit.ly/data-science-playlist
Follow us to never miss an update in the future.
YouTube: https://www.youtube.com/user/edurekaIN
Instagram: https://www.instagram.com/edureka_learning/
Facebook: https://www.facebook.com/edurekaIN/
Twitter: https://twitter.com/edurekain
LinkedIn: https://www.linkedin.com/company/edureka
This document outlines basic probability concepts, including definitions of probability, views of probability (objective and subjective), and elementary properties. It discusses calculating probabilities of events from data in tables, including unconditional/marginal probabilities, conditional probabilities, and joint probabilities. Rules of probability are presented, including the multiplicative rule that the joint probability of two events is equal to the product of the marginal probability of one event and the conditional probability of the other event given the first event. Examples are provided to illustrate key concepts.
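A small numerical illustration of that multiplicative rule on a hypothetical 2x2 frequency table; the counts are invented for the example, not taken from the document.

```python
# Hypothetical 2x2 table: rows = event A (yes/no), columns = event B (yes/no).
counts = {("A", "B"): 30, ("A", "not B"): 20,
          ("not A", "B"): 10, ("not A", "not B"): 40}
total = sum(counts.values())

p_A = (counts[("A", "B")] + counts[("A", "not B")]) / total          # marginal P(A)
p_B_given_A = counts[("A", "B")] / (counts[("A", "B")] + counts[("A", "not B")])  # conditional P(B|A)
p_A_and_B = counts[("A", "B")] / total                               # joint P(A and B)

# Multiplicative rule: P(A and B) = P(A) * P(B | A)
print(p_A * p_B_given_A, p_A_and_B)   # both 0.3
```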
1. The document discusses linear correlation and regression between plasma amphetamine levels and amphetamine-induced psychosis scores using data from 10 patients.
2. A positive correlation was found between the two variables, and a linear regression equation was established to predict psychosis scores from amphetamine levels.
3. However, further statistical tests were needed to determine if the correlation and regression model could be generalized to the overall patient population.
The document discusses simple linear regression. It defines key terms like regression equation, regression line, slope, intercept, residuals, and residual plot. It provides examples of using sample data to generate a regression equation and evaluating that regression model. Specifically, it shows generating a regression equation from bivariate data, checking assumptions visually through scatter plots and residual plots, and interpreting the slope as the marginal change in the response variable from a one unit change in the explanatory variable.
This document discusses linear regression, which maps the linear relationship between two variables. Linear regression is useful for business applications to understand which independent variables are related to the dependent variable and estimate the strength of relationships. It provides an example dataset and explains that the method of least squares is used to calculate linear regression by adjusting parameters to minimize residuals and best fit the data to the model Y = b0 + b1X.
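The least-squares coefficients in Y = b0 + b1X can be computed directly from the usual closed-form formulas. The sketch below uses made-up numbers rather than the example dataset from the document.

```python
import numpy as np

# Hypothetical data (not the dataset from the document).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 3.7, 4.2, 5.3])

# Method of least squares: b1 = cov(x, y) / var(x), b0 = mean(y) - b1 * mean(x).
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

residuals = y - (b0 + b1 * x)
print(f"fit: Y = {b0:.3f} + {b1:.3f} X, residual sum of squares = {np.sum(residuals**2):.4f}")
```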
Discrete Random Variables and Probability Distributions by mathscontent
1. The document defines discrete random variables as random variables that can take on a finite or countable number of values. It provides an example of a discrete random variable being the number of heads from 4 coin tosses.
2. It introduces the probability mass function (PMF) as a function that gives the probability of a discrete random variable taking on a particular value. The PMF must be greater than or equal to 0 and sum to 1.
3. The cumulative distribution function (CDF) of a discrete random variable is defined as the sum of the PMF values up to that point. It ranges from 0 to 1 and increases monotonically. The coin-toss example is sketched below.
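A minimal sketch of the PMF and CDF for the number of heads in 4 tosses, assuming a fair coin (the fairness is an assumption for illustration):

```python
from math import comb

n, p = 4, 0.5  # 4 tosses of a fair coin (assumed fair for illustration)

# PMF of X = number of heads: P(X = x) = C(n, x) * p**x * (1-p)**(n-x)
pmf = {x: comb(n, x) * p**x * (1 - p)**(n - x) for x in range(n + 1)}

# CDF: cumulative sum of the PMF up to each value.
cdf, running = {}, 0.0
for x in range(n + 1):
    running += pmf[x]
    cdf[x] = running

print("PMF:", pmf)   # {0: 0.0625, 1: 0.25, 2: 0.375, 3: 0.25, 4: 0.0625}
print("CDF:", cdf)   # ends at 1.0
print("PMF sums to", sum(pmf.values()))
```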
Probability in Discrete Structure of Computer Science by Prankit Mishra
This document provides an overview of probability, including its basic definition, history, interpretations, theory, and applications. Probability is defined as a measure between 0 and 1 of the likelihood of an event occurring, where 0 is impossible and 1 is certain. It has been given a mathematical formalization and is used in many fields including statistics, science, and artificial intelligence. Historically, the scientific study of probability began in the 17th century and was further developed by thinkers like Bernoulli, Legendre, and Kolmogorov. Probability can be interpreted objectively based on frequencies or subjectively as degrees of belief. Important probability terms covered include experiments, outcomes, events, joint probability, independent events, mutually exclusive events, and conditional probability.
Ridge regression is a technique used for linear regression when the number of predictor variables is greater than the number of observations. It addresses the problem of overfitting by adding a regularization term to the loss function that shrinks large coefficients. This regularization term penalizes coefficients with large magnitudes, improving the model's generalization. Ridge regression finds a balance between minimizing training error and minimizing the size of coefficients by introducing a tuning parameter lambda. The document includes an experiment demonstrating how different lambda values affect the variance and mean squared error of the ridge regression model.
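A short sketch of the idea on synthetic data with hypothetical lambda values (not the experiment from the document), using the closed-form ridge solution beta = (XᵀX + λI)⁻¹Xᵀy; larger lambda shrinks the coefficient norm.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 20, 50                      # fewer observations than predictors
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [2.0, -1.0, 0.5]
y = X @ true_beta + rng.normal(scale=0.1, size=n)

def ridge_fit(X, y, lam):
    # Closed-form ridge solution: (X^T X + lam * I)^(-1) X^T y
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

for lam in [0.01, 1.0, 100.0]:     # hypothetical lambda values
    beta = ridge_fit(X, y, lam)
    print(f"lambda={lam:>6}: coefficient norm = {np.linalg.norm(beta):.3f}")
```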
Probability and Probability Distributions by Sahil Nagpal
This document provides an overview of key concepts in probability and probability distributions. It defines important terms like probability, sample space, events, mutually exclusive events, independent events, and conditional probability. It also covers rules of probability like addition rules, complement rules, and Bayes' theorem. Finally, it introduces discrete and continuous random variables and discusses properties of discrete probability distributions like expected value and standard deviation.
This document discusses evaluating point estimators. It defines mean square error as an indicator for determining the worth of an estimator. There is rarely a single estimator that minimizes mean square error for all possible parameter values. Unbiased estimators, where the expected value equals the parameter, are commonly used. Bias is defined as the expected value of the estimator minus the parameter. Combining independent unbiased estimators results in an estimator with variance equal to the weighted sum of the individual variances. The mean square error of any estimator is equal to its variance plus the square of its bias. Examples are provided to illustrate evaluating bias and finding mean and variance of estimators.
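The identity "mean square error equals variance plus squared bias" mentioned above can be checked numerically. The sketch below does this for a deliberately biased estimator of a normal mean; the true parameter, sample size, and shrinkage factor are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = 3.0                      # true parameter (hypothetical)
n_samples, n_reps = 25, 20000

# A deliberately biased estimator: 0.9 * sample mean.
estimates = np.array([0.9 * rng.normal(theta, 1.0, n_samples).mean()
                      for _ in range(n_reps)])

mse = np.mean((estimates - theta) ** 2)
variance = estimates.var()
bias = estimates.mean() - theta   # bias = E[estimator] - theta

print(f"MSE          = {mse:.5f}")
print(f"var + bias^2 = {variance + bias**2:.5f}")   # matches MSE up to simulation noise
```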
The document discusses random variables and vectors. It defines random variables as functions that assign outcomes of random experiments to real numbers. There are two types of random variables: discrete and continuous. Random variables are characterized by their expected value, variance/standard deviation, and other moments. Random vectors are multivariate random variables. Key concepts covered include probability mass functions, probability density functions, expected value, variance, and how these properties change when random variables are scaled or combined linearly.
Bernoulli's Random Variables and Binomial Distribution by mathscontent
Bernoulli and binomial random variables are used to model success/failure experiments. A Bernoulli variable represents a single trial with outcomes success (1) and failure (0). A binomial variable counts the number of successes in n independent Bernoulli trials. The probability of x successes in n trials is given by the binomial distribution. Its mean and variance can be derived. The moment generating function of the binomial distribution helps compute moments like variance.
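For reference, the standard results the summary alludes to can be written out; this is a sketch of the usual textbook derivation, not material copied from the slides.

```latex
% Binomial(n, p): X counts successes in n independent Bernoulli(p) trials.
P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}, \qquad x = 0, 1, \dots, n

% Moment generating function, from the binomial theorem:
M_X(t) = E[e^{tX}] = \sum_{x=0}^{n} \binom{n}{x} (p e^t)^x (1-p)^{n-x} = \left(1 - p + p e^t\right)^n

% Differentiating at t = 0 gives the moments:
E[X] = M_X'(0) = np, \qquad E[X^2] = M_X''(0) = np(1-p) + (np)^2,
\qquad \operatorname{Var}(X) = E[X^2] - (E[X])^2 = np(1-p)
```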
The document discusses statistical tests for analyzing relationships between variables, including chi-square, gamma, and other tests. It focuses on tests for associations between ordinal variables, such as the chi-square test of independence and gamma value. The chi-square test indicates whether variables are independent but not the strength of association, while gamma measures both existence and direction of association. Both tests compare a calculated statistic to critical values to determine statistical significance. While gamma is more powerful, chi-square is also important as it can detect dependence even when gamma equals zero.
This document provides an overview of topics related to data analysis, statistics, and probability that may be covered on the SAT. It includes brief explanations of different types of graphs used to display data, guidelines for interpreting data from graphs, tables, and charts, definitions and examples of common statistical concepts like mean, median, mode, and weighted average, and explanations of probability, independent and dependent events, and calculating probabilities using geometric models. Practice problems with solutions are provided as examples.
The document discusses concepts related to data analysis and probability for primary and intermediate grades. It covers four processes of statistics: formulating questions, data collection, data analysis, and interpreting results. Examples of data analysis in primary grades include collecting weather data and representing it in tally charts or bar graphs. Intermediate grades involve more complex graphs and measures of center. The document also discusses introducing probability terms and the two types of probability calculations. It provides an overview of lessons on weather data and M&M data that teach related concepts for different grade levels.
This document provides an overview of key concepts in data analysis and probability including classification, gathering and organizing data, analyzing data, probability as a scale from zero to one, fractions with probability, teaching strategies, and resources. It discusses categorizing data using attributes and Venn diagrams. It also outlines analyzing data through descriptive statistics, averages, and data displays. Probability is defined on a scale from impossible to certain. Teaching strategies encourage exploration, risk-taking, and using technology. Resources include textbooks, children's books, and online interactive tools.
The document discusses the five strands of the National Council of Teachers of Mathematics (NCTM), which are Number and Operations, Algebra, Geometry, Measurement, and Data Analysis & Probability. It focuses on the Data Analysis & Probability strand, outlining its four main components: formulating questions and collecting/organizing data; selecting and using statistical methods; developing and evaluating inferences and predictions; and understanding and applying probability concepts.
This document provides an overview of statistics concepts including descriptive and inferential statistics. Descriptive statistics are used to summarize and describe data through measures of central tendency (mean, median, mode), dispersion (range, standard deviation), and frequency/percentage. Inferential statistics allow inferences to be made about a population based on a sample through hypothesis testing and other statistical techniques. The document discusses preparing data in Excel and using formulas and functions to calculate descriptive statistics. It also introduces the concepts of normal distribution, kurtosis, and skewness in describing data distributions.
This document provides an overview of HTML5, CSS3, and client-side data storage. It discusses new HTML5 semantic tags, forms, and APIs for audio, video, and canvas. It also covers CSS3 features like rounded corners, shadows, gradients, transitions and animations. For data storage, it explains Web Storage APIs like localStorage and sessionStorage, as well as client-side databases in some browsers. Code examples are provided for HTML5, CSS3 effects, and using the Storage APIs.
This document discusses basic data storage in programming, including constants, variables, primitive data types, default values, possible values, variable names, and samples. It defines constants as data that never changes, variables as storage places for potentially changeable data, and primitive data types such as boolean, byte, char, short, int, long, float, and double, alongside the String reference type. It also covers initializing variables, default values for each data type, possible value ranges, naming conventions for variables, and examples of declaring different variable types.
The document discusses key details from India's 2011 Census:
- India's population increased by 181 million between 2001 and 2011, reaching 1.21 billion. The gap with China narrowed, while the gap with the US widened.
- Uttar Pradesh and Maharashtra have the highest populations, while Thane district has the most people of any district.
- India's sex ratio improved to 940 females per 1000 males, the highest since 1971.
- Literacy rates vary widely by state, from over 98% in Mizoram to 37% in one MP district.
- Mobile phone ownership exceeds access to toilets, with landlines at 10% and mobiles at 59% penetration.
The document calculates the variance and standard deviation of European auto sales over 6 years. It provides the annual sales data in millions of dollars. It walks through the steps of finding the mean, calculating the deviations from the mean, squaring the deviations, and summing the squared deviations to calculate the variance. The goal is to determine the variance and standard deviation for the 6 years of auto sales data provided.
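The document's actual sales figures are not reproduced here; the sketch below walks the same steps (mean, deviations, squared deviations, variance, standard deviation) on hypothetical numbers, dividing by n for the population variance (the document may instead divide by n − 1 for the sample variance).

```python
import math

# Hypothetical annual sales figures for 6 years (in millions); not the document's data.
sales = [11.2, 11.9, 12.0, 12.8, 13.4, 14.3]
n = len(sales)

mean = sum(sales) / n
deviations = [x - mean for x in sales]
squared = [d ** 2 for d in deviations]

variance = sum(squared) / n          # population variance, dividing by n
std_dev = math.sqrt(variance)

print(f"mean = {mean:.3f}, variance = {variance:.3f}, std dev = {std_dev:.3f}")
```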
Data Storage Tips for Optimal Spark Performance (Vida Ha, Databricks) by Spark Summit
Vida Ha presented best practices for storing and working with data in files for optimal Spark performance. Some key tips included choosing appropriate file sizes between 64 MB and 1 GB, choosing compression with splittability in mind (gzip, for example, is not splittable, while Snappy is commonly used inside splittable container formats), enforcing schemas for structured formats like Parquet and Avro, and reusing Hadoop libraries to read various file formats. General tips involved controlling output file size through methods like coalesce and repartition, using sc.wholeTextFiles for non-splittable formats, and processing files individually by filename.
This document covers basic storage technology and gives a complete understanding of DAS, NAS, and SAN, with their advantages and disadvantages. A quick understanding of storage will help you make the best decision in terms of cost and need.
This document provides an overview of key concepts in decision analysis, including problem formulation, decision making without and with probabilities, risk analysis, sensitivity analysis, and computing branch probabilities. It discusses techniques like influence diagrams, payoff tables, decision trees, and the expected value, conservative, optimistic, and minimax regret approaches. It also covers risk profiles, sensitivity analysis, Bayes' theorem, and the expected value of perfect and sample information.
This presentation summarizes key aspects of decision analysis and decision making under uncertainty. It discusses decision criteria like maximax, maximin, and Laplace that can be used when outcomes are uncertain. When probabilities are known, criteria for decision making under risk are used. Sequential decisions can be modeled with decision trees, which represent decisions, chances, and outcomes with nodes and paths. The presentation was given to MBA students on the topic of decision analysis.
Decision tree analysis as a statistical tool. The deck provides an understanding of decision analysis, offering practical applications with limited theory. It will be useful for MBA students.
Monte Carlo simulation is a statistical technique that uses random numbers and probability to simulate real-world processes. It was developed in the 1940s by scientists working on nuclear weapons research. Monte Carlo simulation provides approximate solutions to problems by running simulations many times. It allows for sensitivity analysis and scenario analysis. Some examples include estimating pi by randomly generating points within a circle, and approximating integrals by treating the area under a curve as a target for random darts. The technique provides probabilistic results and allows modeling of correlated inputs.
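The pi example mentioned above is easy to sketch (a minimal version for illustration, not the document's code): sample points uniformly in the unit square and count how many fall inside the quarter circle, whose area fraction is pi/4.

```python
import random

random.seed(42)
n_points = 100_000
inside = 0

for _ in range(n_points):
    x, y = random.random(), random.random()   # point in the unit square
    if x * x + y * y <= 1.0:                  # inside the quarter circle of radius 1
        inside += 1

pi_estimate = 4 * inside / n_points           # area ratio is pi/4
print(f"pi estimate: {pi_estimate}")
```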
This document provides an overview of a data analysis course that covers topics such as descriptive statistics, probability distributions, correlation, regression, hypothesis testing, clustering, and time series analysis. The course introduces descriptive statistics including measures of central tendency, dispersion, frequency distributions, and histograms. Notes are provided on calculating and interpreting mean, median, mode, range, variance, standard deviation, and other descriptive statistics.
The document provides information about the chi-square test, including its introduction by Karl Pearson, its applications and uses, assumptions, and examples. The chi-square test is used to determine whether an observed set of frequencies differs from the expected frequencies. It can be used to test differences between categorical data and expected values. Examples shown include a goodness-of-fit test comparing blood group frequencies to an expected equal distribution, and a one-dimensional coin-flipping example.
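A goodness-of-fit test of that kind takes only a few lines; the blood-group counts below are hypothetical, and equal expected frequencies are assumed as in the summary.

```python
from scipy.stats import chisquare

# Hypothetical observed counts for blood groups A, B, AB, O (not the document's data).
observed = [45, 30, 15, 50]
n = sum(observed)
expected = [n / 4] * 4            # null hypothesis: an equal distribution across groups

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p-value = {p_value:.4f}")
# A small p-value would lead us to reject the hypothesis of equal blood-group frequencies.
```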
This document discusses descriptive statistics and how to summarize data. It covers measures of central tendency like mode, mean, and median. It also discusses measures of variation such as range and standard deviation. Examples are given to illustrate the different types of central tendencies and measures of variation used. Readers are instructed to analyze graphs in magazines and newspapers to identify what statistics are used to influence readers.
This document discusses descriptive statistics and how they are used to summarize and describe data. Descriptive statistics allow researchers to analyze patterns in data but cannot be used to draw conclusions beyond the sample. Key aspects covered include measures of central tendency like mean, median, and mode to describe the central position in a data set. Measures of dispersion like range and standard deviation are also discussed to quantify how spread out the data values are. Frequency distributions are described as a way to summarize the frequencies of individual data values or ranges.
The document discusses various statistical concepts related to hypothesis testing, including:
- Type I and type II errors that can occur when testing hypotheses
- How the probability of committing errors depends on factors like the sample size and how far the population parameter is from the hypothesized value
- The concept of critical regions and how they are used to determine if a null hypothesis can be rejected
- The difference between discrete and continuous probability distributions and examples of each
- How an observed test statistic is calculated and compared to a critical value to determine whether to reject or not reject the null hypothesis.
This document defines hypothesis testing and describes the basic concepts and procedures involved. It explains that a hypothesis is a tentative explanation of the relationship between two variables. The null hypothesis is the initial assumption that is tested, while the alternative hypothesis is what would be accepted if the null hypothesis is rejected. Key steps in hypothesis testing are defining the null and alternative hypotheses, selecting a significance level, determining the appropriate statistical distribution, collecting sample data, calculating the probability of the results, and comparing this to the significance level to determine whether to accept or reject the null hypothesis. Type I and type II errors in hypothesis testing are also defined.
1) The document discusses hypothesis testing and statistical inference using examples related to coin tossing. It explains the concepts of type I and type II errors and how hypothesis tests are conducted.
2) An example is provided to test the hypothesis that the average American ideology is somewhat conservative (H0: μ = 5) using data from the National Election Study. The alternative hypothesis is that the average is less than 5 (HA: μ < 5).
3) The results of the hypothesis test show the observed test statistic is lower than the critical value, so the null hypothesis that the average is 5 is rejected in favor of the alternative that the average is less than 5.
1. The document discusses hypothesis testing using the z-test. It outlines the steps of hypothesis testing including stating hypotheses, setting the criterion, computing test statistics, comparing to the criterion, and making a decision.
2. Examples are provided to demonstrate a non-directional and directional z-test, including stating hypotheses, computing test statistics, comparing to criteria, and interpreting results.
3. Key concepts reviewed are the central limit theorem, type I and II errors, significance levels, rejection regions, p-values, and confidence intervals in hypothesis testing. A minimal one-sample z-test computation is sketched below.
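A compact sketch of a one-sample, two-tailed z-test; all numbers are hypothetical, and a known population standard deviation is assumed.

```python
import math
from scipy.stats import norm

# Hypothetical setup: H0: mu = 100 against a two-tailed alternative, sigma known.
mu_0, sigma = 100.0, 15.0
sample_mean, n = 106.0, 36
alpha = 0.05

# Test statistic: z = (x_bar - mu_0) / (sigma / sqrt(n))
z = (sample_mean - mu_0) / (sigma / math.sqrt(n))
critical = norm.ppf(1 - alpha / 2)        # two-tailed critical value, about 1.96
p_value = 2 * (1 - norm.cdf(abs(z)))

print(f"z = {z:.2f}, critical = {critical:.2f}, p-value = {p_value:.4f}")
print("reject H0" if abs(z) > critical else "fail to reject H0")
```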
This document discusses the process of testing hypotheses. It begins by defining hypothesis testing as a way to make decisions about population characteristics based on sample data, which involves some risk of error. The key steps are outlined as:
1) Formulating the null and alternative hypotheses, with the null hypothesis stating no difference or relationship.
2) Computing a test statistic based on the sample data and selecting a significance level, usually 5%.
3) Comparing the test statistic to critical values to either reject or fail to reject the null hypothesis.
Examples are provided to demonstrate hypothesis testing for a single mean, comparing two means, and testing a claim about population characteristics using sample data and statistics.
This document discusses hypothesis testing procedures. It begins by introducing hypothesis testing and defining key terms like the null hypothesis and alternative hypothesis. It then outlines the typical steps in hypothesis testing: 1) formulating the hypotheses, 2) setting the significance level, 3) choosing a test criterion, 4) performing computations, and 5) making a decision. It also discusses concepts like type I and type II errors, and one-tailed vs two-tailed tests. Tail tests refer to whether the rejection region is in one tail or both tails of the sampling distribution. The document provides examples and explanations of these statistical hypothesis testing concepts.
Paradigms and approaches to hypothesis tests
▪ Probability of existence
▪ Power
▪ Effect size (how much?)
• Tests
▪ Contingency table tests (categorical)
▪ One sample tests (sample vs value)
▪ Independent 2 sample tests
▪ Paired 2 sample tests
• Practice in R
Class from the Environmental Data Analysis course, Federal University of ABC (UFABC), November 2024.
The document discusses hypothesis testing and outlines the key steps in the hypothesis testing process:
1) Formulating the null and alternative hypotheses about a population parameter. The null hypothesis is tested while the alternative is accepted if the null is rejected.
2) Determining the significance level and critical value based on this level which establishes the boundary for rejecting the null hypothesis.
3) Selecting a sample, calculating the test statistic and comparing it to the critical value to determine whether to reject or fail to reject the null hypothesis.
4) Hypothesis tests can be one-tailed, focusing rejection in one tail, or two-tailed, splitting rejection between both tails. Steps are generally the same but null and alternatives differ.
This document discusses hypothesis testing, which involves drawing inferences about a population based on a sample from that population. It outlines the key elements of a hypothesis test, including the null and alternative hypotheses, test statistics, critical regions, significance levels, critical values, and p-values. Type I and Type II errors are explained, where a Type I error involves rejecting the null hypothesis when it is true, and a Type II error involves failing to reject the null when it is false. The power of a hypothesis test is defined as the probability of correctly rejecting the null hypothesis when it is false. Controlling type I and II errors involves considering the significance level, sample size, and population parameters in the null and alternative hypotheses.
Testing of Hypothesis, p-value, Gaussian distribution, null hypothesis by svmmcradonco1
This document provides an overview of key concepts in statistical hypothesis testing. It defines what a hypothesis is, the different types of hypotheses (null, alternative, one-tailed, two-tailed), and statistical terms used in hypothesis testing like test statistics, critical regions, significance levels, critical values, type I and type II errors. It also explains the decision making process in hypothesis testing, such as rejecting or failing to reject the null hypothesis based on whether the test statistic falls within the critical region or if the p-value is less than the significance level.
Hypothesis testing involves making an assumption about an unknown population parameter, called the null hypothesis (H0). A hypothesis is tested by collecting a sample from the population and comparing sample statistics to the hypothesized parameter value. If the sample value differs significantly from the hypothesized value based on a predetermined significance level, then the null hypothesis is rejected. There are two types of errors that can occur - type 1 errors occur when a true null hypothesis is rejected, and type 2 errors occur when a false null hypothesis is not rejected. Hypothesis tests can be one-tailed, testing if the sample value is greater than or less than the hypothesized value, or two-tailed, testing if the sample value is significantly different from the hypothesized value.
The document defines key concepts in probability and hypothesis testing. It discusses probability as a numerical quantity between 0 and 1 that expresses the likelihood of an event. Different probability distributions are covered, including binomial, normal, and Poisson distributions. Hypothesis testing is defined as a methodology to either accept or reject a null hypothesis based on sample data. Types of hypotheses, terms used in testing like test statistics and p-values, and types of errors are also summarized.
A hypothesis test examines two opposing hypotheses: the null hypothesis and the alternative hypothesis. The null hypothesis is the statement being tested, usually stating "no effect". The alternative hypothesis is what the researcher hopes to prove true. A hypothesis test uses a sample to determine whether to reject the null hypothesis based on a p-value and significance level. There are five steps: specify the null and alternative hypotheses, set the significance level, calculate the test statistic, obtain the p-value, and draw a conclusion. Type I and II errors are possible - a type I error rejects a true null hypothesis, while a type II error fails to reject a false null hypothesis.
This document discusses hypothesis testing and significance tests. It defines key terms like parameters, statistics, sampling distribution, standard error, null and alternative hypotheses, type I and type II errors. It explains how to set up a hypothesis test, including choosing a significance level and critical value. Both one-tailed and two-tailed tests are described. Finally, it provides an overview of different types of significance tests for both large and small sample sizes.
This document provides an overview of statistical inference and hypothesis testing. It discusses key concepts such as the null and alternative hypotheses, type I and type II errors, one-tailed and two-tailed tests, test statistics, p-values, confidence intervals, and parametric vs non-parametric tests. Specific statistical tests covered include the t-test, z-test, ANOVA, chi-square test, and correlation analyses. The document also addresses how sample size affects test power and significance.
The document discusses key concepts in statistical inference including estimation, confidence intervals, hypothesis testing, and types of errors. It provides examples and formulas for estimating population means from sample data, calculating confidence intervals, stating the null and alternative hypotheses, and making decisions to accept or reject the null hypothesis based on a significance level.
Hypothesis testing involves making an assumption about an unknown population parameter, called the null hypothesis (H0). A hypothesis is tested by collecting a sample from the population and comparing sample statistics to the null hypothesis. If the sample statistic is sufficiently different from the null hypothesis, the null hypothesis is rejected. There are two types of errors that can occur - type 1 errors occur when a true null hypothesis is rejected, and type 2 errors occur when a false null hypothesis is not rejected. Hypothesis tests can be one-tailed, testing if the sample statistic is greater than or less than the null hypothesis, or two-tailed, testing if it is significantly different in either direction.
Importance of the p-value and its uses in real-time significance by SukumarReddy43
This document discusses p-values and their significance in statistical hypothesis testing. It defines a p-value as the probability of obtaining a result equal to or more extreme than what was actually observed. A smaller p-value indicates stronger evidence against the null hypothesis. The document outlines the steps in significance testing: stating the research question, determining the probability of erroneous conclusions, choosing a statistical test to calculate a test statistic, obtaining the p-value, making an inference, and forming conclusions. It explains the concepts of type I and type II errors and how a p-value below 0.05 is typically considered statistically significant.
The document discusses challenges and opportunities for visualizing genomic data. It explores areas like comparative genomics, gene expression, protein analysis, pathways, and systems modeling that could benefit from improved visualization. New approaches and technologies are needed to integrate diverse data types and enable exploration, analysis, and interpretation. Emerging techniques from information visualization, like semantic zooming and progressive disclosure, could help make genomic data more accessible and understandable. Platforms and standards like GA4GH aim to facilitate discovery and sharing of genomic data.
SocialLearning: discovering educational content collaboratively by Alberto Labarga
This document lists various educational resources that can be used for social learning. It includes links to websites about poetry generated from Google searches, open street maps, illustrative street art, APIs, tools for analyzing social media data, services for natural language processing, trending bookmarks, and more. Each resource is listed along with its title, URL, description, relevant tags, and other metadata.
The document proposes a new way to monetize the talent of street performers during the San Fermín festivities through an application that uses QR codes and technology to let spectators support the performers financially in a voluntary and anonymous way, which would give the performers more visibility and improve the festival experience.
The document discusses a smart parking system that can predict vehicle traffic and estimate traffic peaks based on events, with the aim of helping drivers find parking spaces more easily. The system uses data on the availability, location, and rates of parking spaces, together with information about events, to predict demand and help users park more efficiently. The document also includes screenshots of a prototype.
This document describes the vidascontadas.org project, whose goal is to build an open, interactive, and comprehensive historical memory of the Spanish Civil War and the Franco era. The project seeks to unify existing databases on victims and events of the period through data mining and visualization techniques, making the information more accessible to the general public and to researchers. It also aims to involve citizens in contributing additional information and complementing the records.
Data journalism and open data visualization #siglibre9 by Alberto Labarga
This document presents an introduction to data journalism and open data visualization. It provides numerous links to tools and examples of data visualization that make it possible to understand and explain the world through the analysis of open data. It includes links to websites that use data to tell stories and explain complex topics in a visual and interactive way.
Using open technology, including smart phone apps, wearables (such as watches) and a dedicated website, myHealth will let patients and their carers manage health data integrated with other sources of open data to help them identify, monitor and track their own disease progression on a day-to-day basis, collaborating with clinicians and researchers to find new ways to fight diseases.
The document describes Big Data in the healthcare sector, including the potential health benefits and estimated economic savings of $450 billion. It explains that processing these large, unstructured data sets requires new tools. It also mentions examples of recent investments in digital-health startups and the potential of genomics to generate 50 terabytes of data per week from 18,000 human genomes analysed annually.
This document provides links to Arduino code for controlling different types of motors, including linear motors, servos and steppers. The links lead to web pages with Arduino code examples for driving these motors, as well as information on related electronic components such as stepper motor drivers. The goal is to help readers learn how to control and automate physical movement using motors with Arduino.
This document describes analog input and output on Arduino. It explains the analogRead() function for reading analog values from 0 to 1023 on pins 0-5, and analogWrite() for writing values from 0 to 255 to the PWM output pins. It also covers mapping analog readings to an output range and links to tutorials and example code for photodiode, temperature and ultrasonic sensors.
The document describes the process of building a "Simon says" game with Arduino. It explains how to handle push-button state changes, play back light sequences, and add sounds by importing libraries. It then provides the code to implement each of these functions incrementally, culminating in a complete version of the "Simon says" game with lights and sounds.
This document provides information on digital inputs and outputs on Arduino. It explains how to configure pins as digital input or output with pinMode(), how to read the digital state of an input pin with digitalRead(), and how to write HIGH and LOW values to an output pin with digitalWrite(). It also includes links to Arduino tutorials on blinking, buttons and tones.
Presentación Laboratorio de Fabricación Digital UPNA 2014Alberto Labarga
The document summarises the history of personal computers from 1953 to the present. It begins by describing IBM's first commercial computer in 1953 and goes on to cover the development of UNIX systems and the Homebrew Computer Club in the 1970s. It also describes the launch of the Apple I in 1976, the IBM PC in 1981, and the GNU project started by Richard Stallman in 1984 to create free software. It explains how Linus Torvalds's Linux in 1991 completed the GNU operating system to create GNU/Linux.
Conceptos de electrónica - Laboratorio de Fabricación Digital UPNA 2014Alberto Labarga
This document explains basic electronics concepts such as circuits, current, voltage and Ohm's law. It describes passive components such as resistors, capacitors and inductors, as well as batteries, diodes and their functions. It explains that a circuit is a combination of components that allows electrons to flow and perform useful work, and defines key terms such as current, voltage, resistance and capacitance.
Introducción a la plataforma Arduino - Laboratorio de Fabricación Digital UPN...Alberto Labarga
The document describes Arduino hardware and software, including the Arduino UNO board, the Arduino IDE, and how to program Arduino using setup() and loop(), variables, operators, control structures, and digital inputs and outputs. It also covers topics such as serial communication, random numbers, and example projects such as a number-guessing game.
The document covers 3D printing methods such as fused deposition modelling, stereolithography and selective laser sintering. It also mentions several types of 3D printers, including Darwin, Mendel, Prusa i1, Prusa i2, Prusa i3 and Rostock. Finally, it provides useful links on digital fabrication labs and assembly guides for 3D printers.
Happy May and Happy Weekend, My Guest Students.
Weekends seem more popular for Workshop Class Days lol.
These Presentations are timeless. Tune in anytime, any weekend.
<<I am Adult EDU Vocational, Ordained, Certified and Experienced. Course genres are personal development for holistic health, healing, and self care. I am also skilled in Health Sciences. However; I am not coaching at this time.>>
A 5th FREE WORKSHOP/ Daily Living.
Our Sponsor / Learning On Alison:
— We believe that empowering yourself shouldn’t just be rewarding, but also really simple (and free). That’s why your journey from clicking on a course you want to take to completing it and getting a certificate takes only 6 steps.
Hopefully Before Summer, We can add our courses to the teacher/creator section. It's all within project management and preps right now. So wish us luck.
Check our Website for more info: https://ldmchapels.weebly.com
Get started for Free.
Currency is the Euro. Courses can be taken free without limit; you only pay for your diploma. See the website for extra assistance.
Make sure to convert your cash; online wallets do vary. I keep my transactions as safe as possible, and I prefer PayPal Business. (See the site for more info.)
Understanding Vibrations
If you have not experienced it, understanding vibes may seem strange at first. We usually start small and by accident, learning about vibrations in social situations: the bad vibe you felt, or the good feeling you had. These are common experiences; we chat about them and then let them go. Those feelings are vibes picked up through your instincts, and the senses behind them are your intuition. We can all develop the gift of intuition and energy awareness.
Energy Healing
First, energy healing is universal. This is also true for Reiki as an art and a rehab resource. Within the health sciences, rehab has changed dramatically, and the term is now very flexible.
Reiki alone has expanded tremendously over the past three years, and distant healing is now almost more popular than one-on-one sessions. It is not a replacement by any means, but it is now easier to access online than local sessions, which breaks down barriers and provides more immediate comfort.
Practice Poses
You can stand within mountain pose Tadasana to get started.
Also, you can start within a lotus Sitting Position to begin a session.
There's no wrong or right way; perhaps only rushing is incorrect. The key is being comfortable, calm and at peace. That is how any session begins.
You can also use props like candles and incense, or even go outdoors for fresh air.
(See Presentation for all sections, THX)
Clearing Karma, Letting go.
Now, that you understand more about energies, vibrations, the practice fusions, let’s go deeper. I wanted to make sure you all were comfortable. These sessions are for all levels from beginner to review.
Again See the presentation slides, Thx.
Rock Art As a Source of Ancient Indian HistoryVirag Sontakke
This presentation is prepared for graduate students and provides basic information about the topic; students should seek further detail in the recommended books and articles. It is intended only for students and purely for academic purposes. The pictures/maps included in the presentation were taken from the internet, and the presenter is thankful to, and herewith credits, their sources.
pulse ppt.pptx Types of pulse, characteristics of pulse, Alteration of pulse sushreesangita003
What is pulse?
Purpose
Physiology and regulation of pulse
Characteristics of pulse
Factors affecting pulse
Sites of pulse
Alteration of pulse
For BSc Nursing 1st semester and GNM Nursing 1st year students.
Vital signs
In this concise presentation, Dr. G.S. Virdi (Former Chief Scientist, CSIR-CEERI, Pilani) introduces the Junction Field-Effect Transistor (JFET)—a cornerstone of modern analog electronics. You’ll discover:
Why JFETs? Learn how their high input impedance and low noise solve the drawbacks of bipolar transistors.
JFET vs. MOSFET: Understand the core differences between JFET and MOSFET devices.
Internal Structure: See how source, drain, gate, and the depletion region form a controllable semiconductor channel.
Real-World Applications: Explore where JFETs power amplifiers, sensors, and precision circuits.
Perfect for electronics students, hobbyists, and practicing engineers looking for a clear, practical guide to JFET technology.
Ancient Stone Sculptures of India: As a Source of Indian HistoryVirag Sontakke
This presentation is prepared for graduate students and provides basic information about the topic; students should seek further detail in the recommended books and articles. It is intended only for students and purely for academic purposes. The pictures/maps included in the presentation were taken from the internet, and the presenter is thankful to, and herewith credits, their sources.
How to Create A Todo List In Todo of Odoo 18Celine George
In this slide, we'll discuss how to create a Todo List in Todo of Odoo 18. Odoo 18's Todo module provides a simple yet powerful way to create and manage your to-do lists, ensuring that no task is overlooked.
How to Manage Purchase Alternatives in Odoo 18Celine George
Managing purchase alternatives is crucial for ensuring a smooth and cost-effective procurement process. Odoo 18 provides robust tools to handle alternative vendors and products, enabling businesses to maintain flexibility and mitigate supply chain disruptions.
SCI BIZ TECH QUIZ (OPEN) PRELIMS XTASY 2025.pptxRonisha Das
Probability and basic statistics with R
1. Quantitative Data Analysis: Probability and basic statistics
2. Probability
The most familiar way of thinking about probability is within a
framework of repeatable random experiments. In this view the
probability of an event is defined as the limiting proportion of times
the event would occur given many repetitions.
3. Probability
Instead of exclusively relying on knowledge of the proportion of times
an event occurs in repeated sampling, this approach allows the
incorporation of subjective knowledge, so-called prior probabilities,
that are then updated. The common name for this approach is
Bayesian statistics.
4. The Fundamental Rules of Probability
Rule 1: Probability is always positive
Rule 2: For a given sample space, the sum of probabilities is 1
Rule 3: For disjoint (mutually exclusive) events, P(A ∪ B) = P(A) + P(B)
7. Simple statistics
mean(x) arithmetic average of the values in x
median(x) median value in x
var(x) sample variance of x
cor(x,y) correlation between vectors x and y
quantile(x) vector containing the minimum, lower quartile, median,
upper quartile, and maximum of x
rowMeans(x) row means of dataframe or matrix x
colMeans(x) column means
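As a quick illustration (the numbers below are made up, not taken from the slides), these functions can be tried on a small vector:

x <- c(2, 4, 4, 5, 7, 9)      # hypothetical sample
y <- c(1, 3, 5, 4, 8, 10)     # second hypothetical sample
mean(x)        # arithmetic average
median(x)      # middle value
var(x)         # sample variance
cor(x, y)      # Pearson correlation between x and y
quantile(x)    # min, lower quartile, median, upper quartile, max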
8. Cumulative probability function
The cumulative probability function is, for any value of x, the
probability of obtaining a sample value that is less than or equal to
x.
curve(pnorm(x),-3,3)
11. Continuous Probability Distributions
R has a wide range of built-in probability distributions, for each of
which four functions are available: the probability density function
(which has a d prefix); the cumulative probability (p); the quantiles of
the distribution (q); and random numbers generated from the
distribution (r).
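For instance, with the normal distribution the four prefixes could be used as follows (the particular values are chosen only for illustration):

dnorm(0)                     # density of the standard normal at x = 0
pnorm(1.96)                  # cumulative probability up to 1.96 (about 0.975)
qnorm(0.975)                 # quantile: the x whose cumulative probability is 0.975
rnorm(5, mean = 0, sd = 1)   # five random draws from N(0, 1)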
14. Exercise
Suppose we have measured the heights of 100 people. The mean
height was 170 cm and the standard deviation was 8 cm. We can ask
three sorts of questions about data like these: what is the probability
that a randomly selected individual will be:
shorter than a particular height?
taller than a particular height?
between one specified height and another?
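One way to answer these questions is with pnorm; the cut-off heights below (160 cm, 185 cm, and the interval 165-180 cm) are assumed values for illustration, not part of the exercise:

pnorm(160, mean = 170, sd = 8)      # P(height < 160 cm)
1 - pnorm(185, mean = 170, sd = 8)  # P(height > 185 cm)
pnorm(180, mean = 170, sd = 8) - pnorm(165, mean = 170, sd = 8)  # P(165 < height < 180)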
16. The central limit theorem
If you take repeated samples from a population with finite variance and calculate their averages, then the averages will be approximately normally distributed, with the approximation improving as the sample size grows.
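A minimal simulation of this idea (sample size and number of repetitions chosen arbitrarily): draw many samples from a decidedly non-normal distribution and look at the distribution of their means.

set.seed(1)
sample.means <- replicate(10000, mean(runif(30)))   # 10,000 means of samples of size 30
hist(sample.means, main = "Means of uniform samples", xlab = "sample mean")
# the histogram is close to a bell curve even though runif() itself is flat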
19. The gamma distribution
The gamma distribution is useful for describing a wide range of
processes where the data are positively skew (i.e. non-normal, with a
long tail on the right).
21. The gamma distribution
α is the shape parameter and β is the scale parameter (so β⁻¹ is the rate). Special cases of the gamma distribution are the exponential (α = 1) and the chi-squared (α = ν/2, β = 2).
The mean of the distribution is αβ, the variance is αβ², the skewness is 2/√α and the kurtosis is 6/α.
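To see how the shape parameter controls the skew, one could overlay a few gamma densities (the parameter values here are illustrative only):

curve(dgamma(x, shape = 1, scale = 2), 0, 15, ylab = "density")     # exponential special case
curve(dgamma(x, shape = 2, scale = 2), 0, 15, add = TRUE, lty = 2)  # strongly skewed
curve(dgamma(x, shape = 5, scale = 1), 0, 15, add = TRUE, lty = 3)  # closer to symmetric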
31. Why Test?
Statistics is an experimental science, not really a branch of
mathematics.
It's a tool that can tell you whether two data sets are similar merely by accident or genuinely similar.
It does not give you certainty.
32. Steps in hypothesis testing!
1. Set the null hypothesis and the alternative hypothesis.
2. Calculate the p-value.
3. Decision rule: if the p-value is less than 5%, reject the null hypothesis; otherwise the null hypothesis is not rejected. In either case, you must give the p-value as a justification for your decision.
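A small worked example of these steps in R, using simulated data (so the particular numbers are illustrative only):

set.seed(42)
a <- rnorm(20, mean = 10, sd = 2)    # sample from group A
b <- rnorm(20, mean = 12, sd = 2)    # sample from group B
# H0: the two population means are equal; H1: they differ
result <- t.test(a, b)
result$p.value           # step 2: the p-value
result$p.value < 0.05    # step 3: TRUE means reject H0 at the 5% level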
33. Types of Errors…
A Type I error occurs when we reject a true null hypothesis (i.e.
Reject H0 when it is TRUE)
                    H0 is TRUE          H0 is FALSE
Reject H0           Type I error        correct decision
Do not reject H0    correct decision    Type II error
A Type II error occurs when we don’t reject a false null hypothesis
(i.e. Do NOT reject H0 when it is FALSE)
34. Critical regions and power
The table shows schematically the relation between the relevant probabilities under the null and alternative hypotheses.
                           do not reject H0        reject H0
Null hypothesis is true    1 − α                   α (Type I error)
Null hypothesis is false   β (Type II error)       1 − β (power)
35. Significance
It is common in hypothesis testing to set the probability of a Type I error, α, to some fixed value called the significance level. These levels are usually set to 0.1, 0.05 or 0.01. If the null hypothesis is true and the probability of observing a value as extreme as the current test statistic is lower than the significance level, then the hypothesis is rejected.
Sometimes, instead of setting a pre-defined significance level, the p-value is reported. It is also called the observed significance level.
37. P-value
We start from the basic assumption: The null hypothesis is true
P-value is the probability of getting a value equal to or more extreme
than the sample result, given that the null hypothesis is true
Decision rule: if the p-value is less than 5%, reject the null hypothesis; if the p-value is 5% or more, the null hypothesis is not rejected.
In any case, you must give the p-value as a justification for your
decision.
39. Power analysis
The power of a test is the probability of rejecting the null hypothesis
when it is false.
It has to do with Type II errors: β is the probability of accepting the null hypothesis when it is false. In an ideal world, we would obviously make β as small as possible.
The smaller we make the probability of committing a Type II error, the
greater we make the probability of committing a Type I error, and
rejecting the null hypothesis when, in fact, it is correct.
Most statisticians work with α = 0.05 and β = 0.2, so the power of a test is defined as 1 − β = 0.8.
40. Confidence
A confidence interval with a particular confidence level is intended to give this assurance: if the statistical model is correct, then over all the data sets that might have been obtained, the procedure for constructing the interval would deliver an interval containing the true value of the parameter in the proportion of cases set by the confidence level.
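In R, most classical tests report such an interval directly; a sketch with simulated data (values assumed for illustration):

set.seed(7)
x <- rnorm(50, mean = 170, sd = 8)       # simulated heights
t.test(x)$conf.int                       # 95% confidence interval for the mean
t.test(x, conf.level = 0.99)$conf.int    # the same at the 99% level (wider)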
41. Don't Complicate Things
Use the classical tests:
var.test to compare two variances (Fisher's F)
t.test to compare two means (Student's t)
wilcox.test to compare two means with non-normal errors (Wilcoxon's rank test)
prop.test (binomial test) to compare two proportions
cor.test (Pearson's or Spearman's rank correlation) to correlate two variables
chisq.test (chi-squared test) or fisher.test (Fisher's exact test) to test for independence in contingency tables
42. Comparing Two Variances
Before comparing means, verify that the variances are not significantly different.
var.test(set1, set2)
This performs Fisher's F test.
If the variances are significantly different, you can transform the output (y) variable to equalise the variances, or you can still use t.test (Welch's modified test).
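A short sketch of this workflow with simulated samples (sample sizes and spreads are assumptions):

set.seed(3)
set1 <- rnorm(30, mean = 5, sd = 1)
set2 <- rnorm(30, mean = 5, sd = 2)
var.test(set1, set2)   # Fisher's F test for equality of variances
# if the p-value is small, rely on t.test(set1, set2), whose default is the Welch correction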
43. Comparing Two Means
Student's t-test (t.test) assumes the samples are independent, the variances constant, and the errors normally distributed. It will use the Welch-Satterthwaite approximation (the default, with less power) if the variances are different. This test can also be used for paired data.
The Wilcoxon rank sum test (wilcox.test) is used for independent samples with errors that are not normally distributed. If you do a transform to get constant variance, you will probably have to use this test.
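For example, with two simulated samples (means and spread chosen only for illustration), each test is one line:

set.seed(11)
a <- rnorm(25, mean = 20, sd = 3.5)
b <- rnorm(25, mean = 22, sd = 3.5)
t.test(a, b)                      # Welch two-sample t-test (the default)
t.test(a, b, var.equal = TRUE)    # classical Student's t, if the variances are equal
wilcox.test(a, b)                 # rank-based alternative for non-normal errors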
44. Student’s t
The test statistic is the number of standard errors by which the two sample means are separated:
t = (ȳA − ȳB) / SEdiff,  where SEdiff = √(sA²/nA + sB²/nB)
45. Power analysis
So how many replicates do we need in each of two samples to detect a difference of 10% with power = 80% when the mean is 20 (i.e. delta = 2) and the standard deviation is about 3.5?
power.t.test(delta=2,sd=3.5,power=0.8)
You can work out what size of difference your sample of 30 would
allow you to detect, by specifying n and omitting delta:
power.t.test(n=30,sd=3.5,power=0.8)
46. Paired Observations
The measurements will not be independent.
Use the t.test with paired=T. Now you’re doing a single sample test
of the differences against 0.
When you can do a paired t.test, you should always do the paired
test. It’s more powerful.
Deals with blocking, spatial correlation, and temporal correlation.
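A sketch of the paired case with hypothetical before/after measurements on the same subjects:

before <- c(12, 15, 9, 14, 11, 16)        # hypothetical measurements
after  <- c(14, 17, 10, 15, 13, 18)       # same subjects, measured again
t.test(before, after, paired = TRUE)      # paired t-test
t.test(after - before)                    # equivalent: one-sample test of the differences against 0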
47. Sign Test
Used when you can't measure a difference but can see it.
Use the binomial test (binom.test) for this.
Binomial tests can also be used to compare proportions. prop.test
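As an illustration (the counts are hypothetical), suppose 9 of 12 paired comparisons favoured treatment A:

binom.test(9, 12, p = 0.5)   # sign test: is 9 successes out of 12 compatible with a 50:50 split?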
48. Chi-squared contingency tables
The contingencies are all the events that could possibly happen. A contingency table shows the counts of how many times each of the contingencies actually happened in a particular sample.
49. Chi-square Contingency Tables
Deals with count data.
Suppose there are two characteristics (hair colour and eye colour).
The null hypothesis is that they are uncorrelated.
Create a matrix that contains the data and apply
chisq.test(matrix).
This will give you a p-value for matrix values given the assumption of
independence.
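A sketch with hypothetical hair-colour and eye-colour counts (the numbers are invented for illustration):

counts <- matrix(c(38, 14, 11, 51), nrow = 2,
                 dimnames = list(hair = c("fair", "dark"),
                                 eyes = c("blue", "brown")))
chisq.test(counts)   # p-value under the null hypothesis of independence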
50. Fisher's Exact Test
Used for analysis of contingency tables when one or more of the
expected frequencies is less than 5.
Use fisher.test(x)
51. Compare two proportions
It turns out that 196 men were promoted out of 3270 candidates,
compared with 4 promotions out of only 40 candidates for the
women.
prop.test(c(4,196),c(40,3270))
52. Correlation and covariance
Covariance is a measure of how much two variables change together.
The Pearson product-moment correlation coefficient (sometimes referred to as the PMCC, and typically denoted by r) is a measure of the correlation (linear dependence) between two variables.
53. Correlation and Covariance
Are two parameters correlated significantly?
Create and attach the data.frame.
Apply cor(data.frame) to get the correlation matrix.
To determine the significance of a correlation between two of the variables, apply cor.test to that pair (e.g. cor.test(x, y) once the data frame is attached).
You have three options: Kendall's tau (method = "k"), Spearman's rank (method = "s"), or (default) Pearson's product-moment correlation (method = "p").
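A short sketch with a simulated data frame (the relationship between x and y is invented for illustration):

set.seed(5)
df <- data.frame(x = rnorm(40))
df$y <- 0.6 * df$x + rnorm(40)        # y loosely related to x
cor(df)                               # correlation matrix
cor.test(df$x, df$y)                  # Pearson's product-moment correlation (default)
cor.test(df$x, df$y, method = "s")    # Spearman's rank correlation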
54. Kolmogorov-Smirnov Test
Are two sample distributions significantly different?
or
Does a sample distribution arise from a specific distribution?
ks.test(A,B)
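A minimal sketch, using simulated samples (distributions chosen arbitrarily):

set.seed(9)
A <- rnorm(100)                      # sample from a standard normal
B <- runif(100, min = -2, max = 2)   # sample from a uniform distribution
ks.test(A, B)                        # two-sample test: do A and B come from the same distribution?
ks.test(A, "pnorm")                  # one-sample test of A against the standard normal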