58 terms

data tables

provide context so we know what the values mean

usually organized with the who in the rows and the what in the columns

the who

individual cases about whom we collect multiple measurements (e.g., a customer named Sandy)

the what

the measurements recorded about each individual case (e.g., name, price)

respondents

individuals who answer a survey

subjects/participants

people in an experiment

experimental units

animals, plants, websites, or inanimate objects

variables

the measurements recorded about each individual or case

shown in the columns of the table and identify what has been measured for each case

first column identifies each case and isn't included in the variables

quantitative variable

tells us how much of something was measured

usually has specified units that identify the scale used for the measurement (height, weight, salary, etc.)

possible to double or halve the quantity

categorical variable

separate, distinct categories

not possible to identify how far apart two individuals are from each other (gender, race, nationality, hair color, phone number, zip code)

units

indicate how each value has been measured, scale of measurement, how much of something we have, how far apart two values are

identifier

a unique identification assigned to each item

listed in the first column of the data table (name or numeric code)

time series

variables that consist of the same quantity being measured over and over again, for a series of points in time

cross sectional

several variables all measured at the same time (# of customers, sales revenue)

sample

a subset of the population, examined in order to learn something about the entire population

bias

a flaw in a sampling method that over- or underrepresents certain characteristics of the population

randomization

choosing individuals by chance, so the selection doesn't systematically favor any particular characteristic

avoids unconscious bias

sampling error

sample-to-sample differences

purely due to randomness

can be reduced (but not eliminated) by taking a larger sample
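
A small simulation can make this concrete. The sketch below assumes a made-up population of GPAs (the numbers and the helper name `spread_of_sample_means` are illustrative, not from these notes) and compares how much the sample mean varies from sample to sample at two sample sizes.

```python
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical population: 10,000 made-up GPAs between 2.0 and 4.0
population = [random.uniform(2.0, 4.0) for _ in range(10_000)]


def spread_of_sample_means(n, repeats=500):
    """Standard deviation of the sample mean across many samples of size n."""
    means = [statistics.mean(random.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)


small = spread_of_sample_means(25)   # small samples: means vary more
large = spread_of_sample_means(400)  # large samples: means vary less
print(small, large)
```

The second number should come out noticeably smaller than the first: the sample-to-sample differences shrink as the sample grows, but they do not vanish.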

census

attempts to collect data on every member of the population

can be impractical or expensive

population parameter

the true, exact value in the population that we wish we knew (average gpa)

sample statistic

a number computed from the sample

want it to provide a good estimate of the population parameter

simple random sample

a sample drawn so that every individual has the same chance of being selected (and every possible group of that size is equally likely)

eliminates systematic bias
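
As a rough illustration, a simple random sample can be drawn with Python's standard `random.sample`. The sampling frame of 100 customer ID numbers is a made-up assumption.

```python
import random

random.seed(7)  # fixed seed so the run is repeatable

# Hypothetical sampling frame: ID numbers for 100 customers
sampling_frame = list(range(1, 101))

# random.sample draws without replacement; every individual is equally likely
srs = random.sample(sampling_frame, k=10)
print(srs)
```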

sampling frame

a list of individuals from which the sample is drawn

should be the same as the population

stratified sampling

dividing the population into homogeneous groups (strata) and using simple random sampling within each stratum

ex: Athens residents can be stratified by income
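
One way this might look in code, assuming three made-up income strata and a 10% sample from each (the stratum names, sizes, and sampling fraction are all illustrative):

```python
import random

random.seed(3)  # fixed seed so the run is repeatable

# Hypothetical residents grouped into income strata (labels are made up)
strata = {
    "low": [f"low_{i}" for i in range(50)],
    "middle": [f"mid_{i}" for i in range(30)],
    "high": [f"high_{i}" for i in range(20)],
}

# Take a simple random sample of 10% within each stratum
stratified_sample = {
    name: random.sample(members, k=len(members) // 10)
    for name, members in strata.items()
}
print(stratified_sample)
```

Every stratum is guaranteed to appear in the sample, which a single simple random sample of the whole population would not guarantee.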

cluster sampling

used when the population can be split into groups (clusters) such that each is like a mini population; whole clusters are then selected at random

ex: select a few streets and sample everyone on the streets instead of a few people from every street (stratified)
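
The street example could be sketched like this, assuming 12 made-up streets with 8 residents each (all names and sizes are hypothetical):

```python
import random

random.seed(5)  # fixed seed so the run is repeatable

# Hypothetical streets (clusters), each with its own list of residents
streets = {
    f"street_{s}": [f"s{s}_resident_{r}" for r in range(8)]
    for s in range(12)
}

# Randomly choose a few whole streets, then include everyone on them
chosen_streets = random.sample(list(streets), k=3)
cluster_sample = [person for street in chosen_streets for person in streets[street]]
print(chosen_streets, len(cluster_sample))
```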

systematic sample

selecting every kth individual (every 5th, every 10th, etc.) from the sampling frame

avoids unconscious bias
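
A minimal sketch of taking every 10th individual. The random starting point is an assumption added here (a common practice, not stated in these notes), and the sampling frame is made up.

```python
import random

random.seed(11)  # fixed seed so the run is repeatable

# Hypothetical sampling frame: ID numbers for 100 individuals
sampling_frame = list(range(1, 101))

k = 10                       # take every 10th individual
start = random.randrange(k)  # assumed: random starting position from 0 to k - 1
systematic_sample = sampling_frame[start::k]
print(systematic_sample)
```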

population

the large group which you would like to study

nonresponse bias

people who don't respond differ systematically from people who do respond

voluntary response bias

people volunteer to participate

those with stronger feelings are more likely to respond

undercoverage bias

some portion of the population is not sampled at all or has a smaller representation

convenience sampling

when we simply include the individuals who are most convenient

ex: customers waiting in line vs. randomly selected customers

pilot test

a recommended trial run of the survey on a small sample to see if there are any problems with the survey questions

frequency table

organizes data by recording counts for the data

used with a bar chart

relative frequency table

displays percentages in each category rather than counts

used with a pie chart
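
Both kinds of table can be built from the same raw data. The sketch below uses `collections.Counter` on a made-up list of hair colors to produce counts and then percentages.

```python
from collections import Counter

# Hypothetical categorical data: hair color for 10 people
hair = ["brown", "black", "brown", "blonde", "brown",
        "black", "red", "brown", "black", "blonde"]

frequency = Counter(hair)        # frequency table: count per category
total = sum(frequency.values())
relative = {cat: count / total for cat, count in frequency.items()}

# Print category, count, and percentage side by side
for cat, count in frequency.most_common():
    print(f"{cat:<8}{count:>3}{relative[cat]:>8.0%}")
```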

contingency table

displays one variable in the columns and another in the rows

dependent variables

variables that are associated; ex: since the % of people who survived differs across the classes, class and survival are dependent

independent variables

variables that are not associated; ex: if class and survival were independent, the percent of survivors would be the same for every class
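
A quick way to check this from a contingency table is to compare the survival percentage within each class. The counts below are made up for illustration (not real data), and the 0.05 cutoff for calling the rates "different" is an arbitrary assumption.

```python
# Hypothetical counts (made up): rows = class, columns = outcome
table = {
    "first":  {"survived": 60, "died": 40},
    "second": {"survived": 40, "died": 60},
    "third":  {"survived": 25, "died": 75},
}

# Proportion of survivors within each class (row percentages)
survival_rate = {
    cls: row["survived"] / (row["survived"] + row["died"])
    for cls, row in table.items()
}
print(survival_rate)

# If class and survival were independent, these rates would be about equal
rates_differ = max(survival_rate.values()) - min(survival_rate.values()) > 0.05
print(rates_differ)
```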

histogram

can only display quantitative data

like a bar chart but no gap between bars
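
A text-only sketch of the idea: quantitative values are grouped into equal-width bins and the count in each bin becomes the bar height. The ages and the bin width of 10 are made-up assumptions.

```python
from collections import Counter

# Hypothetical quantitative data: ages of 13 people
ages = [21, 22, 22, 25, 27, 31, 33, 34, 35, 38, 41, 45, 52]

bin_width = 10
# Map each age to the lower edge of its bin (e.g., 27 -> 20)
bins = Counter((age // bin_width) * bin_width for age in ages)

# One row per bin; the number of # marks is the bar height
for start in sorted(bins):
    print(f"{start}-{start + bin_width - 1}: {'#' * bins[start]}")
```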

modes

peaks or humps in histograms

unimodal

histogram with one main peak

bimodal

histogram with two peaks

multimodal

histogram with three or more peaks

uniform

has all the bars approximately the same height

symmetric distribution

the halves on either side of the center look approximately like mirror images

skewness

when one tail stretches out farther than the other; the distribution is skewed toward the longer tail

mean

typical value or average of the data set

a good measure of center for unimodal, symmetric distributions

median

the value that splits the histogram in two equal areas

use if histogram contains gaps, skewness, or outliers
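
To see why the median is preferred when there are outliers, compare how each measure reacts to one extreme value. The salaries below are made up (in thousands of dollars).

```python
import statistics

salaries = [40, 42, 45, 47, 50]   # hypothetical salaries, in $1000s
with_outlier = salaries + [400]   # add one extreme value

mean_before = statistics.mean(salaries)
median_before = statistics.median(salaries)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print(mean_before, median_before)  # mean and median are close
print(mean_after, median_after)    # the mean jumps, the median barely moves
```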

range

the difference between the max and min values

quartiles

the values that enclose the middle 50% of the data

25% of the data fall below the first quartile

25% of the data fall above the third quartile

IQR

difference between Q3 and Q1 (IQR = Q3 - Q1)

encloses the middle 50% of the data

standard deviation

measure of the average distance of points from the mean

the more spread out the points are, the larger the average distance from the mean

very sensitive to outliers and skewness
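
Strictly speaking, the standard deviation is the square root of the average squared distance from the mean, so "average distance" is a loose description. The sketch below computes the sample version (dividing by n - 1) by hand, checks it against `statistics.stdev`, and shows the effect of one outlier. The data values are made up.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

mean = statistics.mean(data)
squared_deviations = [(x - mean) ** 2 for x in data]

# Sample standard deviation: divide by n - 1, then take the square root
manual_sd = math.sqrt(sum(squared_deviations) / (len(data) - 1))
library_sd = statistics.stdev(data)

# One extreme value inflates the standard deviation noticeably
sd_with_outlier = statistics.stdev(data + [50])
print(manual_sd, library_sd, sd_with_outlier)
```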

five number summary

min, Q1, median, Q3, and max
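
A possible way to compute the summary with the standard library. Quartile conventions vary between textbooks and software, so the choice of `method="inclusive"` is an assumption and other tools may report slightly different Q1 and Q3. The data values are made up.

```python
import statistics

data = [1, 3, 5, 7, 9, 11, 13, 15, 40]  # hypothetical values

# n=4 asks for the three cut points that split the data into quarters
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")

five_number_summary = (min(data), q1, median, q3, max(data))
iqr = q3 - q1  # width of the middle 50% of the data
print(five_number_summary, iqr)
```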

boxplot

graph of the five number summary

short horizontal lines at Q1, Q3, and the median

length of the box=IQR

if the median is roughly centered between Q1 and Q3, the distribution is roughly symmetric

if one whisker is longer than the other, the distribution is skewed toward the side with the longer whisker

upper fence: Q3 + 1.5(IQR)

right skewed: mean > median

z-score

tells us how many standard deviations a data point is away from its mean

z = (x - mean) / standard deviation

z-scores greater than 3 or less than -3 are very unusual
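
A short sketch of the calculation. The data are made up, and the population standard deviation (`statistics.pstdev`) is used only because it gives round numbers here; the notes do not say which version to use.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

mean = statistics.mean(data)
sd = statistics.pstdev(data)  # assumed: population SD, chosen for round numbers


def z_score(x):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


print(z_score(9))  # above the mean, so positive
print(z_score(2))  # below the mean, so negative
```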

time series plot

a display of values against time

fences

Q3+1.5(IQR) upper

Q1-1.5(IQR) lower

any value outside of these fences is considered an outlier
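
The fence rule could be applied like this. The data are made up, and `method="inclusive"` is one of several quartile conventions, so other software may place the fences slightly differently.

```python
import statistics

data = [1, 3, 5, 7, 9, 11, 13, 15, 40]  # hypothetical values

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond either fence are flagged as outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)
```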

complement rule

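the probability that an event does not occur is 1 minus the probability that it does: P(not A) = 1 - P(A), where "not A" is the set of outcomes that are not in the event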

EX: if probability of rain is .2 then probability of no rain is .8

random variable

a variable whose value is determined by the outcome of a random event

expected value

the long-run average of a random variable: the value the sample average approaches as the sample gets very large
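
As an illustration, take a fair six-sided die (an example chosen here, not from these notes). The expected value is each outcome weighted by its probability, and the average of many simulated rolls should land close to it.

```python
import random
from fractions import Fraction

random.seed(42)  # fixed seed so the run is repeatable

outcomes = [1, 2, 3, 4, 5, 6]

# Expected value: each outcome times its probability (1/6 for a fair die)
expected_value = sum(Fraction(x, 6) for x in outcomes)

# Average of a large number of simulated rolls
rolls = [random.choice(outcomes) for _ in range(100_000)]
sample_average = sum(rolls) / len(rolls)

print(float(expected_value), sample_average)
```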