89 terms

statistics

a way of reasoning, along with a collection of tools and methods, designed to help us understand the world.

business statistics

the science of empirical decision making in the face of uncertainty applied to business goals.

business analytics is used where?

in all areas of business such as financial analysis, econometrics, auditing, production and operations including services improvement, and marketing research.

Statistics is the scientific discipline which consists of:

- Formulating a question
- Collecting the relevant data
- Describing/analyzing data
- Drawing conclusions or generalizations from data
- Communicating the results/conclusions to the target audience

The ability to answer questions and draw conclusions from data depends largely on our ability to

understand variation

Descriptive Statistics

utilizes numerical and graphical methods to look for patterns in a data set, to summarize the information revealed in a data set, and to present the information in a convenient form.

Inferential Statistics

utilizes information from a smaller group of individuals to make estimates, decisions, predictions, or other generalizations about a larger group of individuals.

respondents

individuals who answer a survey

subjects or participants

people in an experiment

analysis units

animals, plants, websites, or other inanimate objects

variables

The characteristics recorded about each individual or case

identifier variable

a unique identifier assigned to each individual or item in a group; it acts as a label rather than providing information to be analyzed. Social Security numbers, student ID numbers, tracking numbers, and transaction numbers are examples of identifier variables.

categorical variable

places an individual into one of several groups or categories.

quantitative variable

has values that describe a measurable quantity as a number, often answering the question 'how many?' or 'how much?'.

time series

Variables that are measured at regular intervals over time

cross-sectional

When several variables are all measured at the same time

measurement unit

a quantity used as a standard of measurement (e.g., cents or dollars for currency, inches or centimeters for length).

The fundamental problem in statistics

we are interested in knowing something about a large group of individuals, but we cannot acquire the necessary information for every individual in the group.

What is the solution to the fundamental problem in statistics?

We take a subset of the group, collect and study the information from the subset, and then try to say something meaningful about the original larger group of interest based on what we learned from the subset.

population

The original larger group of interest

sample

The subset of the population

sampling

The act of collecting data from a sample

analysis

The act of studying the sample data

statistical inference

The act of "saying something meaningful about the population based on the sample data"

sampling

the act of selecting a subset of a population to study.

the three foundational ideas of sampling:

- examine a part of the whole
- randomization
- the sample size is what matters most

When a sample is biased....

it differs systematically from the population it is supposed to represent.

Random selection protects us against bias by

giving us a representative sample (on average) - EVEN if there are traits of the population that we are unaware of.

Randomization is 'fair' because it

does not systematically favor any particular characteristics.

The ratio of sample size to population size is not relevant (provided the population is much larger than the sample): what determines the size of the sampling error is the _______ ______.

sample size
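
The claim above can be checked with a small simulation. This is only a sketch under stated assumptions: the two populations are made-up uniform values, and the seed, population sizes, and number of repetitions are arbitrary choices, not anything from the course material.

```python
import random
import statistics

rng = random.Random(0)  # arbitrary seed so the run is reproducible

def spread_of_sample_means(population, n, reps, rng):
    """Standard deviation of the sample mean over `reps` simple random samples of size n."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(reps)]
    return statistics.stdev(means)

# Two hypothetical populations with the same kind of values but very different sizes.
small_pop = [rng.uniform(0, 100) for _ in range(10_000)]
large_pop = [rng.uniform(0, 100) for _ in range(200_000)]

# Same sample size, different population sizes: the spreads should be similar.
spread_small = spread_of_sample_means(small_pop, n=100, reps=300, rng=rng)
spread_large = spread_of_sample_means(large_pop, n=100, reps=300, rng=rng)

# Same population, larger sample: the spread should shrink noticeably.
spread_bigger_n = spread_of_sample_means(large_pop, n=400, reps=300, rng=rng)

print(spread_small, spread_large, spread_bigger_n)
```

In a run like this, the first two numbers should come out close to each other even though one population is 20 times larger, while the third should be roughly half as large because the sample size was quadrupled.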

A "sample" that includes the entire population

census

- It can be impractical to take a census.
- The population we're studying may change during the time the census is taken.
- Taking a census is usually also very expensive.
- If we don't succeed in collecting information on everybody, the people we fail to reach are typically more likely to have certain characteristics (e.g., the homeless). An incomplete census is therefore systematically biased by their absence, and its data will typically not be representative of the wider population.

what are some drawbacks of taking a census?

is the entire collection of individuals or instances about which information is sought.

population

is a subset of a population, examined in the hope of learning about the population.

sample

parameter

a quantity that describes an attribute of the population under study. It is typically unknown and unknowable.

statistic

a value calculated from sampled data, particularly one that corresponds to, and thus estimates, a population parameter.

When we use the sample data to make a conclusion about how the population may look we are making a __________ ________.

statistical inference

random sample

a sample in which every subject has some chance of being selected for the sample.

simple random sample

a sample in which every subject has an equal chance of being selected for the sample.

You can select a simple random sample by assigning each member of the population a number and either:

- Putting the numbers into a hat and drawing at random
- Using a table of random digits
- Using a computer random number generator to choose the numbers at random.
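
The random number generator option might look like the sketch below. The population of 500 numbered members, the sample size of 25, and the seed are all hypothetical values chosen for illustration.

```python
import random

# Hypothetical population: 500 members labeled 1..500 (illustrative only).
population_ids = list(range(1, 501))

# A seeded generator makes the draw reproducible; the seed value is arbitrary.
rng = random.Random(42)

# random.sample draws without replacement, so no member is chosen twice
# and every member has the same chance of ending up in the sample.
srs = rng.sample(population_ids, k=25)

print(sorted(srs))
```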

sampling frame

a list of individuals from which the sample is drawn.

Ideally the sampling frame should be the same as the population. If the sampling frame differs systematically from the population, the sample will be _______.

biased

stratified sampling

the population is divided into non-overlapping groups (called strata) and a simple random sample is then obtained from each group.

cluster sampling

the population is divided into non-overlapping groups and all individuals within a randomly selected group or groups are sampled.

what is the difference between stratified and cluster sampling?

Stratified sampling samples some individuals from all groups, whereas cluster sampling samples all individuals from some groups. Also, a cluster is usually NOT a defining attribute of the individual, whereas a stratum usually is.
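
The contrast between the two designs can be sketched in a few lines. The group names, group sizes, and the helper functions `stratified_sample` and `cluster_sample` are hypothetical, written only to illustrate the definitions above.

```python
import random

rng = random.Random(7)  # arbitrary seed for reproducibility

# Hypothetical population: 4 non-overlapping groups of 10 people each.
groups = {
    g: [f"{g}-{i}" for i in range(1, 11)]
    for g in ["North", "South", "East", "West"]
}

def stratified_sample(groups, per_group, rng):
    """Take a simple random sample of `per_group` individuals from EVERY group."""
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, per_group))
    return sample

def cluster_sample(groups, n_clusters, rng):
    """Randomly pick `n_clusters` groups and take ALL individuals in them."""
    chosen = rng.sample(sorted(groups), n_clusters)
    sample = []
    for g in chosen:
        sample.extend(groups[g])
    return sample

strat = stratified_sample(groups, per_group=3, rng=rng)  # 3 people from each of 4 groups
clust = cluster_sample(groups, n_clusters=2, rng=rng)    # everyone from 2 of the 4 groups

print(len(strat), len(clust))
```

The stratified sample contains members of all four groups, while the cluster sample contains every member of only two groups.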

systematic sampling

selecting every kth subject from the population.
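
A minimal sketch of systematic sampling follows. Starting at a random position among the first k subjects is a common convention and an assumption here; the definition above only says "every kth subject". The population and the value of k are made up.

```python
import random

def systematic_sample(items, k, rng):
    """Select every k-th item, starting from a random position among the first k."""
    start = rng.randrange(k)   # random start in 0..k-1 (an assumed convention)
    return items[start::k]

rng = random.Random(3)              # arbitrary seed
population = list(range(1, 101))    # hypothetical population of 100 numbered subjects
sample = systematic_sample(population, k=10, rng=rng)

print(sample)
```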

nonresponse bias

Bias that arises when people who refuse to respond differ systematically from those who do respond. For instance, if people with low or high incomes tend to refuse to take part, the survey results won't reflect the population.

undercoverage

Some portion of the population is not sampled at all, or has a smaller representation in the sample than in the population. For instance, surveys done by evening phone calls skip those who are seldom at home in the evening.

voluntary response sample

A sample where people volunteer to participate.

when is voluntary response bias a problem?

It's a problem when NO statistical sampling method is used, for instance, posting a survey online and accepting whoever decides to answer it. Those with the strongest feelings (typically on one side) are more likely to respond than those with moderate opinions. Open internet surveys of this kind are therefore virtually useless for drawing conclusions about a population.

convenience sample

a sample that chooses the individuals based on convenience. For instance, asking customers waiting in line, rather than randomly selected customers. Those standing in line might have different opinions than other customers.

histogram

a bar-chart-like display for a quantitative variable: the values are grouped into bins (intervals), and the height of each bar corresponds to that bin's frequency. It is the visualization of a grouped frequency distribution and allows us to visualize the shape of a quantitative distribution.

statistical inference

is basically the attempt to answer the question: What can we say about a population (whose parameters are essentially unknown) when all we know are the statistics of a sample?

sampling error

a fundamental concept in inferential statistics. It is simply a by-product of studying a sample and NOT the entire population: it originates from using only part of the population (i.e., the sample) instead of the whole population.

frequency table

organizes data by recording the category names and the total count in each category. It is appropriate for summarizing the distribution of a single categorical variable.

relative frequency table

displays the percentages that lie in each category rather than the counts.
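
As a rough illustration, both tables can be built from raw categorical data with a counter. The payment-method responses below are invented for the example.

```python
from collections import Counter

# Hypothetical survey answers (illustrative data only).
responses = ["Visa", "Cash", "Visa", "Debit", "Cash", "Visa", "Visa", "Debit"]

counts = Counter(responses)     # frequency table: category -> count
total = sum(counts.values())

# Relative frequency table: category -> share of all responses.
relative = {cat: n / total for cat, n in counts.items()}

for cat, n in counts.most_common():
    print(f"{cat:<6} {n:>3} {relative[cat]:>6.1%}")
```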

bar chart

displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison.

if the counts in a bar chart are replaced with percentages, the data can be displayed in a _______ ______ ______ ______.

relative frequency bar chart

pie charts

show the whole group of cases as a circle sliced into pieces with sizes proportional to the fraction of the whole in each category.

contingency tables

display one variable in the columns and a second variable in the rows. They are appropriate for summarizing the joint distribution of two categorical variables and can help us answer questions such as "Is there a relationship between survival rates and class?"

marginal distribution

The distribution of one variable while ignoring the other(s)
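
A small sketch of a contingency table and its marginal distributions is given below. The (class, outcome) records are made-up values for illustration, not real survival data.

```python
from collections import Counter

# Hypothetical (class, outcome) records; the values are invented.
records = [
    ("First", "Survived"), ("First", "Survived"), ("First", "Died"),
    ("Third", "Survived"), ("Third", "Died"), ("Third", "Died"), ("Third", "Died"),
]

# Joint counts: one cell of the contingency table per (class, outcome) pair.
cells = Counter(records)

# Marginal distributions: count one variable while ignoring the other.
row_totals = Counter(cls for cls, _ in records)          # by class
col_totals = Counter(outcome for _, outcome in records)  # by outcome

print(cells)
print(row_totals, col_totals)
```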

To describe the distribution of a quantitative variable, you will need to produce all of the following:

- A description of the shape of the distribution
- A measure of the center of the distribution
- A measure of the spread or variation of the distribution

Graphs used most often to represent quantitative data are

histograms and boxplots

Peaks or humps seen in a histogram are called the _____ of a distribution.

modes

A distribution whose histogram has one main peak is called (1) _______; one with two peaks is (2) ______; three or more peaks means that your distribution is (3) ______.

(1) unimodal
(2) bimodal
(3) multimodal

A distribution whose histogram doesn't appear to have any mode and in which all the bars are approximately the same height is called _________.

uniform

outlier

a data value that is much smaller or much larger than the rest of the data values, atypical or unusual. an extreme value.

mode

this is the value in the data set that appears most often. It is often not very helpful with quantitative data, but is meaningful with categorical data.

median

this is the central value of an ordered data set. With an even number of values, it is the average of the two central values.

mean

this is the sum of all the numbers divided by the total number of numbers in the set.
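
The three measures of center can be compared on a small data set. The numbers below are arbitrary illustrative values; the second list swaps the largest value for an extreme one to show how each measure reacts.

```python
import statistics

data = [2, 3, 3, 5, 7, 10]          # illustrative values

mode = statistics.mode(data)        # most frequent value
median = statistics.median(data)    # average of the two central values (3 and 5)
mean = statistics.mean(data)        # sum (30) divided by count (6)

# Replacing the largest value with an extreme one moves the mean but not the median.
with_outlier = [2, 3, 3, 5, 7, 100]
mean_out = statistics.mean(with_outlier)
median_out = statistics.median(with_outlier)

print(mode, median, mean)
print(median_out, mean_out)
```

Here the mean jumps from 5 to 20 when the outlier is introduced, while the median stays at 4.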

range

calculated as the difference between the largest and smallest values in the data - is very sensitive to unusual observations. Concentrating on the middle of the data avoids this problem.

quartiles

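are the values that enclose the middle 50% of the data: the first quartile (Q1) has 25% of the data below it, and the third quartile (Q3) has 75% of the data below it.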

interquartile range (IQR)

the difference between Q3 and Q1 (IQR = Q3 - Q1). It measures how spread out the middle 50% of the data is.
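
A sketch of computing the quartiles and the IQR is shown below. Textbooks and software use several slightly different quartile conventions, so a hand calculation may differ; this example assumes the default ("exclusive") method of Python's `statistics.quantiles`, and the data values are made up.

```python
import statistics

# Illustrative data, already sorted; 60 is a deliberately large value.
data = [12, 15, 17, 20, 22, 25, 28, 30, 35, 40, 60]

# n=4 splits the data into four parts and returns the three cut points.
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1                        # spread of the middle 50% of the data
data_range = max(data) - min(data)   # pulled upward by the large value 60

print(q1, q2, q3, iqr, data_range)
```

With this convention the quartiles are 17, 25, and 35, so the IQR is 18, while the range is 48 because it depends entirely on the two most extreme values.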

standard deviation

can be seen as the typical (or average) distance of data points from the mean (center). The farther the typical distance from the mean, the greater the standard deviation. The standard deviation is very sensitive to outliers and has the same units as the original variable.

variance

the average squared deviation from the mean: the sum of the squared deviations from the mean, divided by n - 1. The standard deviation is its square root.
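
The calculation can be followed step by step on a tiny data set. The values are arbitrary; the manual computation is checked against the standard library's sample variance, which also divides by n - 1.

```python
import math
import statistics

data = [1, 3, 5, 7, 9]                          # illustrative values
n = len(data)
mean = sum(data) / n                            # 25 / 5 = 5.0

squared_devs = [(x - mean) ** 2 for x in data]  # 16, 4, 0, 4, 16
variance = sum(squared_devs) / (n - 1)          # 40 / 4 = 10.0
std_dev = math.sqrt(variance)                   # same units as the original data

# The library functions use the same n - 1 divisor.
lib_variance = statistics.variance(data)
lib_std_dev = statistics.stdev(data)

print(variance, std_dev)
```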

the degrees of freedom

is the number of values in a system that are free to vary (i.e., the number of values that are not restricted or limited).

The degrees of freedom of an estimate is ........

the number of independent pieces of information (i.e., data values/observations) on which the estimate is based.

for what shape should the median and IQR be used?

if the shape is skewed or if there are outliers

for what shape should the mean and standard deviation be used?

if the shape is unimodal and symmetric and there are no outliers.

The five-number summary

consists of the smallest observation (the Minimum), the First Quartile, the Median, the Third Quartile, and the largest observation (the Maximum), written in order from smallest to largest.
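
One possible way to assemble the summary is sketched below. The helper `five_number_summary` is hypothetical, the data are invented, and the quartiles again rely on the default method of `statistics.quantiles`, which may differ slightly from a textbook's hand method.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    ordered = sorted(values)
    q1, med, q3 = statistics.quantiles(ordered, n=4)
    return ordered[0], q1, med, q3, ordered[-1]

# Illustrative data, given out of order on purpose.
summary = five_number_summary([23, 4, 42, 16, 8, 50, 15])

print(summary)
```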

boxplot

a graph of the five-number summary. Boxplots are most useful for revealing outliers and for side-by-side comparison of several distributions.

z-score

tells us how many standard deviations a data point is away from its mean, with the sign telling us whether the point falls above or below the mean.
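
A minimal sketch of the calculation: subtract the mean, then divide by the standard deviation. Using the sample standard deviation (n - 1 divisor) is an assumption here, and the scores are made-up values.

```python
import statistics

def z_scores(values):
    """Number of standard deviations each value lies above (+) or below (-) the mean."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)   # sample standard deviation (assumed convention)
    return [(v - mean) / sd for v in values]

scores = [60, 70, 80, 90, 100]      # illustrative values with mean 80
z = z_scores(scores)

print([round(value, 2) for value in z])
```

The value equal to the mean gets a z-score of 0, values below the mean get negative z-scores, and values above it get positive ones.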

ratios

one number expressed in relation to another by dividing the one number by the other.

proportions

measure the size of a part in comparative relation to a whole.

percentages

just a form of the proportion based on 100 units.
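
The three quantities can be compared on one set of counts. The group sizes below are hypothetical.

```python
# Hypothetical counts in a group of 50 people.
women = 30
men = 20
total = women + men

ratio_women_to_men = women / men     # one number relative to another
proportion_women = women / total     # a part relative to the whole
percentage_women = 100 * women / total   # the same proportion per 100 units

print(ratio_women_to_men, proportion_women, percentage_women)
```

A ratio can exceed 1 (here 1.5 women per man), whereas a proportion always lies between 0 and 1 (here 0.6, or 60%).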

probability

measures or quantifies the likelihood of a future event occurring. It can be estimated as the proportion of the number of occurrences of an event to the total number of trials.

independence

means that the outcome of one trial doesn't influence or change the outcome of another.

random variable

a variable for which the exact value cannot be predicted with certainty (it varies randomly).

binomial model

helps us calculate the probability of getting 𝑥 successes from a total of 𝑛 trials. It is used to calculate the probabilities for random variables with the following characteristics:

- There are only two possible outcomes (success/failure) for each trial.
- The probability of success, denoted 𝑝, is the same for each trial. The probability of failure is also the same for each trial and is computed as 𝑞=1−𝑝.
- The trials are independent.
- The number of trials, 𝑛, is fixed and known in advance.
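
The cards above do not state the formula itself. Under the four conditions listed, the standard binomial probability of exactly x successes is C(n, x) · p^x · q^(n − x), which the sketch below computes; the function name `binomial_pmf` and the example values are illustrative choices.

```python
from math import comb

def binomial_pmf(x, n, p):
    """P(X = x): probability of exactly x successes in n independent trials."""
    q = 1 - p                                   # probability of failure on each trial
    return comb(n, x) * p**x * q**(n - x)       # ways to place x successes, times their probability

# Example: probability of exactly 3 heads in 5 fair coin flips.
prob_three_heads = binomial_pmf(3, n=5, p=0.5)

# The probabilities over all possible counts 0..n add up to 1.
total_probability = sum(binomial_pmf(x, n=5, p=0.5) for x in range(6))

print(prob_three_heads, total_probability)
```

For 5 fair coin flips there are 10 ways to get exactly 3 heads, each with probability 1/32, so the result is 10/32 = 0.3125.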