# MSIT 3000 Test 1

#### Terms in this set (...)

data tables
provide context so we know what the values mean
usually organized with the who in the rows and the what in the columns
the who
individual cases about whom we collect multiple measurements (name, price)
the what
measurements about these individual cases (Sandy, \$15)
respondents
individuals who answer a survey
subjects/participants
people in an experiment
experimental units
animals, plants, websites, or inanimate objects
variables
the measurements recorded about each individual or case
shown in the columns of the table and identify what has been measured for each case
first column identifies each case and isn't included in the variables
quantitative variable
tells us how much of something was measured
usually has specified units that identify the scale used for the measurement (height, weight, salary, etc.)
possible to double or halve the quantity
categorical variable
separate, distinct categories
not possible to identify how far apart two individuals are from each other (gender, race, nationality, hair color, phone number, zip code)
units
indicate how each value has been measured, scale of measurement, how much of something we have, how far apart two values are
identifier
a unique identification assigned to each item
listed in the first column of the data table (name or numeric code)
time series
variables that consist of the same quantity being measured over and over again, for a series of points in time
cross sectional
several variables all measured at the same time (# of customers, sales revenue)
sample
a subset of the population, examined in order to learn something about the entire population
bias
sampling methods that systematically over- or underrepresent certain characteristics of a population
randomization
doesn't systematically favor any particular characteristics
avoids unconscious bias
sampling error
sample-to-sample differences
purely due to randomness
can be reduced by taking a larger sample
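A minimal sketch of sampling error, assuming a made-up population of GPAs (the population, seed, and sample sizes are illustrative, not from the course). It draws many samples of two sizes and compares how much the sample mean varies from sample to sample.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 GPAs between 2.0 and 4.0
population = [rng.uniform(2.0, 4.0) for _ in range(10_000)]


def spread_of_sample_means(n, repeats=200):
    """Standard deviation of the sample mean across repeated samples of size n."""
    means = [statistics.mean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)


small = spread_of_sample_means(10)
large = spread_of_sample_means(1000)

# The second number should be noticeably smaller than the first
print(small, large)
```

The sample-to-sample differences never disappear, but they shrink as the sample size grows.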
census
samples the entire population
can be impractical or expensive
population parameter
the true, exact value in the population that we wish we knew (average gpa)
sample statistic
a number computed from the sample
want it to provide a good estimate of the population parameter
simple random sample
a sample drawn so that every individual has the same chance of being selected
eliminates systematic bias
sampling frame
a list of individuals from which the sample is drawn
should be the same as the population
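A simple random sample drawn from a sampling frame could look like the sketch below. The frame of 100 student names is hypothetical, and `random.sample` is used because it picks without replacement and gives every individual the same chance.

```python
import random

rng = random.Random(1)  # fixed seed, only so the output is repeatable

# Hypothetical sampling frame: a list of 100 students
sampling_frame = [f"student_{i}" for i in range(1, 101)]

# Simple random sample of 10: each student is equally likely, no repeats
srs = rng.sample(sampling_frame, k=10)
print(srs)
```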
stratified sampling
when we divide the population into homogeneous groups (strata) and use simple random sampling within each stratum
ex: Athens residents can be stratified by income
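A minimal sketch of stratified sampling, assuming made-up residents who are each tagged with an income level (the names, strata labels, and sample size of 5 are all hypothetical).

```python
import random
from collections import defaultdict

rng = random.Random(2)

# Hypothetical residents, each tagged with an income stratum
incomes = ["low", "middle", "high"]
residents = [(f"resident_{i}", incomes[i % 3]) for i in range(300)]

# Group the population into strata
strata = defaultdict(list)
for name, income in residents:
    strata[income].append(name)

# Take a simple random sample of 5 within every stratum
stratified_sample = {income: rng.sample(names, 5) for income, names in strata.items()}
print(stratified_sample)
```

Every stratum is guaranteed to appear in the sample, which a plain simple random sample does not promise.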
cluster sampling
when the population can be split into groups such that each is like a mini population
ex: select a few streets and sample everyone on the streets instead of a few people from every street (stratified)
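The street example can be sketched as follows, assuming a made-up town of 20 streets with 8 households each. Whole streets are chosen at random, then everyone on the chosen streets is sampled.

```python
import random

rng = random.Random(3)

# Hypothetical town: 20 streets (clusters) with 8 households each
streets = {
    f"street_{s}": [f"street_{s}_house_{h}" for h in range(8)]
    for s in range(20)
}

# Randomly choose 3 whole streets, then include every household on them
chosen_streets = rng.sample(sorted(streets), k=3)
cluster_sample = [house for street in chosen_streets for house in streets[street]]
print(chosen_streets, len(cluster_sample))
```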
systematic sample
selecting every kth individual (e.g., every 5th or 10th) from the sampling frame
avoids unconscious bias
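A minimal sketch of a systematic sample, assuming a hypothetical frame of IDs 1 to 100 and a step of 10. The random starting point is an assumption added here; the card itself only says to take every 5th or 10th individual.

```python
import random

rng = random.Random(4)

frame = list(range(1, 101))  # hypothetical IDs 1..100
step = 10                    # take every 10th individual

start = rng.randrange(step)  # assumed: random starting position 0..step-1
systematic_sample = frame[start::step]
print(systematic_sample)
```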
population
the large group which you would like to study
nonresponse bias
people who don't respond differ systematically from people who do respond
voluntary response bias
people volunteer to participate
those with stronger feelings are more likely to respond
undercoverage bias
some portion of the population is not sampled at all or has a smaller representation
convenience sampling
when we simply include the individuals who are most convenient
ex: customers waiting in line vs. randomly selected customers
pilot test
a recommended trial run of the survey on a small sample to see if there are any problems with the survey questions
frequency table
organizes data by recording counts for the data
used with a bar chart
relative frequency table
displays percentages in each category rather than counts
used with a pie chart
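A frequency table and a relative frequency table for one categorical variable might be built like this. The hair color answers are made up for illustration.

```python
from collections import Counter

# Hypothetical survey answers (a categorical variable)
hair_colors = ["brown", "black", "brown", "blonde", "brown", "black", "red", "brown"]

# Frequency table: counts per category
frequency = Counter(hair_colors)

# Relative frequency table: share of the total per category
total = sum(frequency.values())
relative_frequency = {category: count / total for category, count in frequency.items()}

for category, count in frequency.most_common():
    print(f"{category:<8}{count:>3}{relative_frequency[category]:>8.1%}")
```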
contingency table
displays one variable in the columns and another in the rows
dependent variables
ex: since the % of people who survived differs across the classes, class and survival are dependent
independent variables
percent of survivors would be the same for every class
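A rough way to check dependence in a contingency table is to compare the row percentages. The counts below are invented for illustration and are not real passenger data.

```python
# Hypothetical counts (not real data): rows are classes, columns are outcomes
table = {
    "first":  {"survived": 60, "died": 40},
    "second": {"survived": 40, "died": 60},
    "third":  {"survived": 25, "died": 75},
}

# Percent who survived within each class
survival_rate = {
    cls: row["survived"] / (row["survived"] + row["died"])
    for cls, row in table.items()
}

for cls, rate in survival_rate.items():
    print(f"{cls:<7}{rate:.0%}")
```

Here the rates differ by class, so class and survival look dependent. If every class showed about the same rate, the variables would look independent.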
histogram
can only display quantitative data
like a bar chart but no gap between bars
modes
peaks or humps in histograms
unimodal
histogram with one main peak
bimodal
histogram with two peaks
multimodal
histogram with three or more peaks
uniform
has all the bars approximately the same height
symmetric distribution
if the halves on either side of the center look approximately like mirror images
skewness
if one side stretches out further than the other
mean
typical value or average of the data set
a good measure for unimodal, symmetric distributions
median
the value that splits the histogram in two equal areas
use if histogram contains gaps, skewness, or outliers
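A small sketch of why the median is preferred when there are outliers, using made-up salaries in thousands of dollars. One extreme value pulls the mean a long way but barely moves the median.

```python
import statistics

salaries = [40, 42, 45, 47, 50]   # hypothetical salaries, in $1000s
with_outlier = salaries + [500]   # add one extreme value

print(statistics.mean(salaries), statistics.median(salaries))
print(statistics.mean(with_outlier), statistics.median(with_outlier))
```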
range
the difference between the max and min values
quartiles
the values that enclose the middle 50% of the data
25% of the data fall below the first quartile
25% of the data fall above the third quartile
IQR
difference between Q3 and Q1 (IQR = Q3 - Q1)
encloses the middle 50% of the data
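Range, quartiles, and IQR can be computed as in the sketch below, on a made-up data set. Textbooks and software use slightly different quartile conventions; the `method="inclusive"` choice here is an assumption and may not match the one used in class.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]  # hypothetical values

data_range = max(data) - min(data)

# Quartile convention is an assumption; other methods can give other values
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(data_range, q1, median, q3, iqr)  # 8 3.0 5.0 7.0 4.0
```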
standard deviation
measure of the average distance of points from the mean
the more spread out the points are, the larger the standard deviation
very sensitive to outliers and skewness
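A step-by-step sketch of the standard deviation on made-up data. It assumes the sample version, which divides by n - 1; the card only describes it loosely as an average distance from the mean.

```python
import math
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values

mean = sum(data) / len(data)                         # 5.0
squared_deviations = [(x - mean) ** 2 for x in data]

# Assumed convention: sample standard deviation, dividing by n - 1
sample_sd = math.sqrt(sum(squared_deviations) / (len(data) - 1))

# The standard library gives the same result
print(sample_sd, statistics.stdev(data))
```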
five number summary
min, Q1, median, Q3, and max
boxplot
graph of the five number summary
short horizontal lines at Q1, Q3, and the median
length of the box=IQR
if the median is roughly centered between Q1 and Q3, the distribution is roughly symmetric
if one whisker is longer than the other, the distribution is skewed toward the side of the longer whisker
upper fence: Q3 + 1.5(IQR)
right skewed: mean > median
z-score
tells us how many standard deviations a data point is away from its mean
Z = (X - mean) / standard deviation
z-scores greater than 3 or less than -3 are very unusual
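A minimal z-score sketch. The mean of 100 and standard deviation of 15 are made-up numbers for illustration.

```python
def z_score(x, mean, sd):
    """How many standard deviations x lies from the mean."""
    return (x - mean) / sd


# Hypothetical scores with mean 100 and standard deviation 15
print(z_score(130, 100, 15))  # 2.0, two standard deviations above the mean
print(z_score(55, 100, 15))   # -3.0, three standard deviations below the mean
```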
time series plot
a display of values against time
fences
Q3+1.5(IQR) upper
Q1-1.5(IQR) lower
any value outside these fences is considered an outlier
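The fence rule can be sketched as below on made-up data. The quartile convention (`method="inclusive"`) is an assumption, so the fences may differ slightly from hand calculations done with another method.

```python
import statistics

data = [1, 2, 3, 4, 5, 6, 7, 8, 30]  # hypothetical values with one extreme point

# Quartile convention is an assumption; see the note above
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values beyond either fence are flagged as outliers
outliers = [x for x in data if x < lower_fence or x > upper_fence]
print(lower_fence, upper_fence, outliers)  # -3.0 13.0 [30]
```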
complement rule
the complement of an event is the set of outcomes that are not in the event; its probability is 1 minus the probability of the event
EX: if probability of rain is .2 then probability of no rain is .8
random variable
a variable whose numeric value depends on the outcome of a random event
expected value
the long-run (large-sample) average value of a random variable
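A sketch of expected value, assuming a fair six-sided die as the random variable (the die is an illustration, not from the course notes). The expected value is the probability-weighted sum of the outcomes, and the average of many simulated rolls should land close to it.

```python
import random
from fractions import Fraction

# Assumed random variable: a fair die, each face with probability 1/6
outcomes = {face: Fraction(1, 6) for face in range(1, 7)}

# Expected value: sum of (outcome * probability)
expected_value = sum(x * p for x, p in outcomes.items())
print(expected_value)  # 7/2

# Large-sample average from simulated rolls
rng = random.Random(5)
rolls = [rng.randint(1, 6) for _ in range(100_000)]
sample_average = sum(rolls) / len(rolls)
print(sample_average)  # should be close to 3.5
```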