Respondents

individuals who answer a survey

Subjects or Participants

people in an experiment

Experimental Units

animals, plants, websites, or other inanimate objects

Variables

Characteristics recorded about each individual

Identifier Variable

is a unique identifier assigned to each individual or item in a group and acts as a label

Categorical Variable

places an individual into one of several groups or categories

Quantitative Variable

takes true numerical values for which arithmetic operations such as adding and averaging make sense.

Time Series

variables that are measured at regular intervals over time

Cross-Sectional

when several variables are measured at the same point in time, the data are...

When data

Data that are decades old may mean something different from similar values recorded last year.

Where data

Data collected in Mexico may differ in meaning from data collected in the US.

How data

How the data are collected can make the difference between insight and nonsense.

Sample

is designed to obtain information about a subset of the population in order to learn something about the entire population.

Biased

Sampling methods that over- or underemphasize certain characteristics of the population are said to be...

Randomize

Randomizing protects us by giving us a representative sample on average. It is fair because it does not systematically favor any particular characteristic.

Sampling Error

sample-to-sample differences. These can be reduced, but not eliminated, by taking a larger sample.
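
A small simulation can make sampling error concrete. The sketch below assumes a made-up population (the integers 1 to 10000) and a hypothetical helper, spread_of_sample_means, that draws many simple random samples and reports how much the sample mean varies from sample to sample.

```python
import random
import statistics

def spread_of_sample_means(population, sample_size, repeats=500, seed=0):
    """Standard deviation of the sample mean across repeated random samples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(repeats)
    ]
    return statistics.stdev(means)

# Hypothetical population, chosen only for illustration
population = list(range(1, 10001))

small = spread_of_sample_means(population, 10)   # many samples of size 10
large = spread_of_sample_means(population, 400)  # many samples of size 400
print("spread with n=10: ", round(small, 1))
print("spread with n=400:", round(large, 1))
```

The spread for the larger sample size should come out clearly smaller, which is the sense in which a bigger sample reduces sampling error.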

Census

a "sample" that includes the entire population. A census does not always provide the best possible information about the population.

Population

is the entire collection of individuals or instances about which information is sought.

Sample

is a subset of a population, examined in hope of learning about the population.

Parameter

is a quantity that describes an attribute of the population under study. These are typically unknown and unknowable.

Statistic

is a value calculated from sampled data, particularly one that corresponds to, and thus estimates, a population parameter.

Statistical Inference

When we use that sample data to make a conclusion about how the population may look we are making a...

Random Sample

is a sample in which every subject has some chance of being selected for the sample. The best-known type of random sample is a Simple Random Sample.

Simple Random Sample

(abbreviated SRS) is a sample in which every subject has an equal chance of being selected for the sample. More precisely, every possible sample of the same size is equally likely.

Sampling Frame

is a list of individuals from which the sample is drawn.

Ideally the sampling frame should be the same as the population.

Stratified Sampling

the population is divided into non-overlapping groups (called strata) and a simple random sample is then obtained from each group

Cluster Sampling

the population is divided into non-overlapping groups and all individuals within a randomly selected group or groups are sampled.

Systematic Sampling

selecting every kth subject from the population.

Stratified vs Cluster

Stratified sampling samples some individuals from all groups, whereas cluster sampling samples all individuals from some groups.
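
The four sampling methods above can be sketched side by side. This is only an illustration under stated assumptions: the population (the integers 0 to 99), the three groups A, B, and C, and all function names are hypothetical, and the random starting point in the systematic sample is a common convention that the cards do not mention.

```python
import random

def simple_random_sample(population, n, rng):
    # Every individual has the same chance of selection
    return rng.sample(population, n)

def stratified_sample(strata, n_per_stratum, rng):
    # Some individuals from ALL groups: an SRS inside every stratum
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, rng):
    # ALL individuals from SOME groups: pick whole clusters at random
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: list(clusters[name]) for name in chosen}

def systematic_sample(population, k, rng):
    # Every kth individual, starting from a random offset (an assumption)
    start = rng.randrange(k)
    return population[start::k]

rng = random.Random(42)
population = list(range(100))  # hypothetical individuals
groups = {                     # hypothetical non-overlapping groups
    "A": list(range(0, 40)),
    "B": list(range(40, 70)),
    "C": list(range(70, 100)),
}

srs = simple_random_sample(population, 10, rng)
strat = stratified_sample(groups, 5, rng)
clus = cluster_sample(groups, 1, rng)
syst = systematic_sample(population, 10, rng)
print(srs, strat, clus, syst, sep="\n")
```

Note how the stratified result has an entry for every group, while the cluster result contains one complete group and nothing from the others.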

Nonresponse bias

People who refuse to respond differ systematically from those who do respond

Undercoverage

Some portion of the population is not sampled at all, or has a smaller representation than in the population.

For instance: surveys done by evening phone calls skip those who are seldom at home in the evening.

Voluntary response sample

A sample where people volunteer to participate. Those with the strongest feelings (typically on one side) are more likely to respond than those who have moderate opinions.

Convenience sample

is a sample that chooses the individuals based on convenience. For instance, asking customers waiting in line, rather than randomly selected customers. Those standing in line might have different opinions than other customers.

3 Rules of Data Analysis

(1) Make a Picture

(2) Make a Picture

(3) Make a Picture

Frequency Table

organizes data by recording the count (total) for each category alongside the category name.

Relative Frequency Table

displays the percentage of cases that lie in each category rather than the counts
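
A frequency table and a relative frequency table can be built from the same tally. The sketch below uses invented data (the ticket class of eight hypothetical passengers) and Python's collections.Counter.

```python
from collections import Counter

# Hypothetical categorical data, invented for illustration
classes = ["First", "Third", "Crew", "Third", "Second", "Crew", "Third", "Crew"]

counts = Counter(classes)            # frequency table: category -> count
total = sum(counts.values())
relative = {category: 100 * count / total   # relative frequency table, in %
            for category, count in counts.items()}

for category in counts:
    print(f"{category:<8}{counts[category]:>3}{relative[category]:>8.1f}%")
```

The percentages in the relative frequency column always add up to 100, which is a quick sanity check on the table.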

Bar Chart

displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison.

Relative Frequency Chart

When the counts are replaced with percentages within a bar chart

Pie Chart

shows the whole group of cases as a circle sliced into pieces.

Contingency Table

displays one variable in the columns and a second variable in the rows

Helps us answer questions such as, 'Is there a relationship between survival rates and class?'

Dependent Variables

The percentage of passengers who survived differs across the classes, so survival and class are dependent variables.

Independent Variables

If survival and class were independent variables, the percentage of survivors would be the same for every class.
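
Comparing row percentages in a contingency table is one way to check for dependence. The counts below are invented for illustration (they are not real passenger data), and the helper survival_rates is hypothetical.

```python
# Hypothetical counts, NOT real data: rows are classes, columns are outcomes
table = {
    "First":  {"Survived": 60, "Died": 40},
    "Second": {"Survived": 45, "Died": 55},
    "Third":  {"Survived": 25, "Died": 75},
}

def survival_rates(table):
    """Percentage of survivors within each row (class)."""
    return {cls: 100 * row["Survived"] / sum(row.values())
            for cls, row in table.items()}

rates = survival_rates(table)
for cls, rate in rates.items():
    print(f"{cls:<7}{rate:5.1f}% survived")

# Identical rates in every row would suggest independence
looks_independent = max(rates.values()) - min(rates.values()) < 1e-9
print("looks independent:", looks_independent)
```

The exact-equality check is a simplification: with real data, small differences between rows can arise by chance alone, so deciding whether a difference is meaningful takes more than this comparison.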

Histogram

is similar to a bar chart, with the height of each bar corresponding to the frequency of its bin.

(No gaps between bars, unless there is a gap in the data.) Used for quantitative data.

Relative Frequency Histogram

displaying the percentage of cases in each bin instead of the count
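
The binning behind a histogram and a relative frequency histogram can be sketched in a few lines. The data, the bin width of 10, and the helper histogram_counts are all assumptions made for illustration.

```python
def histogram_counts(values, low, width, n_bins):
    """Count how many values fall in each bin [low + i*width, low + (i+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        index = int((v - low) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts

# Hypothetical quantitative data
ages = [21, 22, 22, 25, 27, 31, 33, 34, 38, 52]

counts = histogram_counts(ages, low=20, width=10, n_bins=4)
relative = [100 * c / len(ages) for c in counts]  # percentage per bin

# Text version of the histogram: one '#' per case
for i, c in enumerate(counts):
    left = 20 + 10 * i
    print(f"{left}-{left + 10}: {'#' * c:<6}{relative[i]:5.1f}%")
```

The empty 40-50 bin shows up as a gap between bars, which reflects a real gap in the data rather than a drawing choice.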

Skewed Left

the distribution has a tail that is stretched out to the left

Skewed Right

the distribution has a tail that is stretched out to the right

Modes

Peaks or humps in a histogram

Unimodal

a distribution whose histogram has one main peak

Bimodal

Has two peaks

Multimodal

Three or more peaks

Uniform

distribution has bars all at approximately the same height

Outlier

is a data value that is much smaller or much larger than the rest of the data values, atypical, or unusual

Bar Graph

Chart used for categorical data?

Mode

the value in the data set that appears most often. It is often not very helpful for quantitative data, but it is meaningful for categorical data.

Median

is the central value of an ordered data set

Resistant to outliers

Mean

the sum of all the values divided by the number of values in the set.

Sensitive to outliers

Range

Calculated as the difference between the largest and smallest values in the data. It is very sensitive to unusual observations.
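
The difference between resistant and sensitive measures is easy to see with one unusual value. The sketch below uses a small invented data set, a hypothetical helper named summarize, and Python's standard statistics module.

```python
import statistics

def summarize(values):
    """Mean, median, mode, and range of a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
    }

# Hypothetical data, then the same data with one extreme value added
data = [2, 3, 3, 4, 5, 6, 7]
with_outlier = data + [70]

print(summarize(data))
print(summarize(with_outlier))
```

Adding the single value 70 moves the mean from about 4.3 to 12.5 and the range from 5 to 68, while the median only moves from 4 to 4.5 and the mode stays at 3.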

Quartiles

are the values that enclose the middle 50% of the data

Q1 & Q3

Q1: 25% of the data lie below the first (lower) quartile.

Q3: 25% of the data lie above the third (upper) quartile.

IQR

the difference between Q3 and Q1. This measures how spread out the middle 50% of the data is.

IQR = Q3 - Q1

Standard Deviation

can be seen as the typical distance of data points from the mean (center)

Can only be 0 if all values are the same.

Median and IQR are...

very stable and not sensitive to skewness and outliers

five-number summary

consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation.

Min Q1 Median Q3 Max

How to find outliers?

Less than: Q1 - 1.5(IQR)

Greater than: Q3 + 1.5(IQR)

These will form your lower and upper fences.
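
The five-number summary, the IQR, and the outlier fences can be computed together. This is a sketch under assumptions: the data are invented, the helper names are hypothetical, and quartiles are taken with statistics.quantiles using method="inclusive". Textbooks and software use slightly different quartile conventions, so other tools may give slightly different Q1 and Q3 values for the same data.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

def outlier_fences(q1, q3):
    """Lower and upper fences: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Hypothetical data with one suspiciously large value
data = [2, 4, 5, 7, 8, 9, 11, 12, 40]

low, q1, median, q3, high = five_number_summary(data)
lower_fence, upper_fence = outlier_fences(q1, q3)
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print("five-number summary:", low, q1, median, q3, high)
print("fences:", lower_fence, upper_fence)
print("outliers:", outliers)
```

Here Q1 = 5 and Q3 = 11, so IQR = 6 and the fences sit at -4 and 20; only the value 40 falls outside them.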

Boxplot

is a graph of the five-number summary.

Most useful for revealing outliers and side-by-side comparison of several distributions.

The data set is Right Skewed

If the mean is greater than the median, then...

The data set is Left Skewed

If the mean is less than the median, then...
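
The mean-versus-median comparison can be written as a tiny check. The function skew_direction and both data sets are hypothetical, and the comparison is a rule of thumb rather than a guarantee, so a picture of the data is still the safer guide.

```python
import statistics

def skew_direction(values):
    """Rule of thumb: the mean is pulled toward the long tail."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    if mean > median:
        return "right"
    if mean < median:
        return "left"
    return "none indicated"

right_tailed = [1, 2, 2, 3, 3, 4, 20]       # long tail of large values
left_tailed = [1, 17, 18, 18, 19, 19, 20]   # long tail of small values

print(skew_direction(right_tailed))
print(skew_direction(left_tailed))
```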

z-score

tells us how many standard deviations a data point lies from the mean.

Ex: z = -2.5 means the value is 2.5 SDs below the mean; z = 2.5 means it is 2.5 SDs above the mean.

A z-score greater than 3 or less than -3 is unusual.

68% of data values

Mean (+/-) 1 SD

95% of data values

Mean (+/-) 2 SD

99.7% of the data values

Mean (+/-) 3 SD
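
The 68%, 95%, and 99.7% figures describe bell-shaped (approximately normal) data, and they can be checked by simulation. The sketch below assumes simulated values with mean 100 and SD 15 (arbitrary choices), and the helpers z_score and share_within are hypothetical names.

```python
import random
import statistics

def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd

def share_within(values, k):
    """Percentage of values whose z-score is between -k and k."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    inside = sum(1 for v in values if abs(z_score(v, mean, sd)) <= k)
    return 100 * inside / len(values)

rng = random.Random(1)  # fixed seed for reproducibility
# Simulated bell-shaped data; mean 100 and SD 15 are assumptions
values = [rng.gauss(100, 15) for _ in range(10000)]

for k in (1, 2, 3):
    print(f"within {k} SD of the mean: {share_within(values, k):.1f}%")

print(z_score(62.5, 100, 15))  # a value 2.5 SDs below the mean
```

The printed shares should land close to 68, 95, and 99.7, with small differences because the data are random. For strongly skewed data the same percentages would not be expected to hold.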

Probability

The probability of an event is its long-run relative frequency.
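
Long-run relative frequency can be illustrated by simulating coin flips. The fair coin, the fixed seed, and the helper relative_frequency_of_heads are assumptions made for this sketch.

```python
import random

def relative_frequency_of_heads(n_flips, seed=0):
    """Fraction of heads in n_flips simulated flips of a fair coin."""
    rng = random.Random(seed)
    heads = sum(1 for _ in range(n_flips) if rng.random() < 0.5)
    return heads / n_flips

for n in (10, 100, 10_000, 100_000):
    print(f"{n:>7} flips: {relative_frequency_of_heads(n):.4f}")
```

With only a few flips the fraction of heads can sit well away from 0.5, but as the number of flips grows it settles close to 0.5, the probability of heads.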

Independence

means that the outcome of one trial doesn't influence or change the outcome of another.

Probability Rules

(1) Probability is a number between 0 and 1

(2) The probabilities of all possible outcomes sum to 1

(3) The Complement Rule

(4) The Multiplication Rule
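
The four rules can be checked on a simple model. The sketch assumes a fair six-sided die and two independent rolls; the Multiplication Rule is used here in its form for independent events, P(A and B) = P(A) x P(B), and exact fractions avoid rounding.

```python
from fractions import Fraction

# Assumed model: a fair six-sided die, each face with probability 1/6
outcomes = {face: Fraction(1, 6) for face in range(1, 7)}

# Rule 1: every probability is between 0 and 1
rule_1_holds = all(0 <= p <= 1 for p in outcomes.values())

# Rule 2: the probabilities of all possible outcomes sum to 1
rule_2_holds = sum(outcomes.values()) == 1

# Rule 3 (Complement Rule): P(not A) = 1 - P(A)
p_six = outcomes[6]
p_not_six = 1 - p_six

# Rule 4 (Multiplication Rule, independent events): P(A and B) = P(A) * P(B)
p_two_sixes = p_six * p_six  # a six on each of two independent rolls

print(rule_1_holds, rule_2_holds, p_not_six, p_two_sixes)
```

The multiplication step relies on the rolls being independent; for events that influence each other, simply multiplying the two probabilities would not be justified.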

Random Variable

a variable for which the exact value cannot be predicted with certainty