75 terms

# MSIT 3000

HOLMES

Respondents
individuals who answer a survey
Subjects or Participants
people in an experiment
Experimental Units
animals, plants, websites, or other inanimate objects
Variables
Characteristics recorded about each individual
Identifier Variable
is a unique identifier assigned to each individual or item in a group and acts as a label
Categorical Variable
places an individual into one of several groups or categories
Quantitative Variable
takes true numerical values for which arithmetic operations such as adding and averaging make sense.
Time Series
variables that are measured at regular intervals over time
Cross-Sectional
when several variables are measured at the same point in time, the data are...
When data
Data that are decades old may mean something different than similar values recorded last year.
Where data
Data collected in Mexico may differ in meaning from data collected in the US.
How data
How the data are collected can make the difference between insight and nonsense.
Sample Survey
is designed to obtain information about a subset of the population in order to learn something about the entire population.
Biased
Sampling methods that over- or underemphasize certain characteristics of the population are said to be...
Randomize
Protects us by giving us a representative sample on average. It is fair because it does not systematically favor any particular characteristics
Sampling Error
sample-to-sample differences. These can be reduced, but not eliminated, by taking a larger sample.
Census
a "sample" that includes the entire population. A census does not always provide the best possible information about the population.
Population
is the entire collection of individuals or instances about which information is sought.
Sample
is a subset of a population, examined in hope of learning about the population.
Parameter
is a quantity that describes an attribute of the population under study. These are typically unknown and unknowable.
Statistic
is a value calculated from sampled data, particularly one that corresponds to, and thus estimates, a population parameter.
Statistical Inference
When we use that sample data to make a conclusion about how the population may look we are making a...
Random Sample
is a sample in which every subject has some chance of being selected for the sample. The most well-known type of random sample is a Simple Random Sample
Simple Random Sample
(abbreviated SRS) is a sample in which every subject has an equal chance of being selected, and every possible sample of the same size is equally likely to be chosen.
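A minimal sketch of drawing a simple random sample in Python, assuming the sampling frame is just a list of ID numbers. The names `frame` and `n`, the population size, and the seed are illustrative assumptions, not part of the course material.

```python
import random

# Hypothetical sampling frame: ID numbers 1..100 for a population of 100 people.
frame = list(range(1, 101))
n = 10  # desired sample size (chosen only for illustration)

rng = random.Random(3000)  # fixed seed so the draw can be repeated

# sample() draws without replacement, so no ID can be picked twice
# and every subset of size n is equally likely.
srs = rng.sample(frame, n)
print(sorted(srs))
```

Changing the seed gives a different sample, which is exactly the sample-to-sample variation described under Sampling Error.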
Sampling Frame
is a list of individuals from which the sample is drawn.
Ideally the sampling frame should be the same as the population.
Stratified Sampling
the population is divided into non-overlapping groups (called strata) and a simple random sample is then obtained from each group
Cluster Sampling
the population is divided into non-overlapping groups and all individuals within a randomly selected group or groups are sampled.
Systematic Sampling
selecting every kth subject from the population.
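A short sketch of systematic sampling, assuming a list-based sampling frame. Starting from a randomly chosen position is a common convention and is an assumption added here; `frame`, `k`, and `start` are illustrative names.

```python
import random

frame = list(range(1, 101))  # hypothetical frame of 100 ID numbers
k = 10                       # sampling interval: take every 10th subject

rng = random.Random(1)
start = rng.randrange(k)      # random starting index between 0 and k-1
systematic = frame[start::k]  # every kth subject from that starting point
print(systematic)
```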
Stratified vs Cluster
Stratified sampling samples some individuals from all groups, whereas cluster sampling samples all individuals from some groups.
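The contrast between stratified and cluster sampling can be sketched as follows. The four groups and their members are made up for illustration, and the sample sizes (2 per stratum, 2 clusters) are arbitrary assumptions.

```python
import random

rng = random.Random(42)

# Hypothetical population split into 4 non-overlapping groups of 5 people each.
groups = {
    "North": ["N1", "N2", "N3", "N4", "N5"],
    "South": ["S1", "S2", "S3", "S4", "S5"],
    "East":  ["E1", "E2", "E3", "E4", "E5"],
    "West":  ["W1", "W2", "W3", "W4", "W5"],
}

# Stratified: a simple random sample of 2 people from EVERY group.
stratified = {name: rng.sample(members, 2) for name, members in groups.items()}

# Cluster: choose 2 groups at random, then take EVERYONE in those groups.
chosen = rng.sample(sorted(groups), 2)
cluster = {name: list(groups[name]) for name in chosen}

print("stratified:", stratified)
print("cluster:   ", cluster)
```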
Nonresponse bias
People who refuse to respond differ systematically from those who do respond
Undercoverage
Some portion of the population is not sampled at all, or has a smaller representation than in the population.

For instance, surveys done by evening phone calls skip those who are seldom at home in the evening.
Voluntary response sample
A sample where people volunteer to participate. Those with the strongest feelings (typically on one side) are more likely to respond than those who have moderate opinions.
Convenience sample
is a sample that chooses the individuals based on convenience. For instance, asking customers waiting in line, rather than randomly selected customers. Those standing in line might have different opinions than other customers.
3 Rules of Data Analysis
(1) Make a Picture
(2) Make a Picture
(3) Make a Picture
Frequency Table
organizes data by recording the category names and the total count for each category in a table.
Relative Frequency Table
Displays the % that lie in each category rather than the counts
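A minimal sketch of building a frequency table and a relative frequency table from raw categorical data. The ten `classes` values are invented for illustration.

```python
from collections import Counter

# Hypothetical categorical data: ticket class recorded for 10 people.
classes = ["First", "Third", "Third", "Second", "Third",
           "First", "Crew", "Crew", "Third", "Crew"]

counts = Counter(classes)      # frequency table: category -> count
total = sum(counts.values())   # number of cases
relative = {c: 100 * k / total for c, k in counts.items()}  # percentages

for category in sorted(counts):
    print(f"{category:<8}{counts[category]:>3}{relative[category]:>7.1f}%")
```

The percentages in a relative frequency table always add up to 100%.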
Bar Chart
displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison.
Relative Frequency Chart
When the counts are replaced with percentages within a bar chart
Pie Chart
shows the whole group of cases as a circle sliced into pieces.
Contingency Table
displays one variable in the columns and a second variable in the rows

Helps us answer questions such as "Is there a relationship between survival rates and class?"
Dependent Variables
The percentage of passengers who survived differs across the classes, so survival and class are dependent variables.
Independent Variables
If survival and class were independent variables, the percentage of survivors would be the same for every class.
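One way to check for dependence in a contingency table is to compare the survival percentage within each class. The sketch below uses made-up counts (not the real passenger figures); `table`, `survival_pct`, and `spread` are illustrative names.

```python
# Made-up counts for illustration only.
table = {
    "First":  {"Survived": 60, "Died": 40},
    "Second": {"Survived": 40, "Died": 60},
    "Third":  {"Survived": 25, "Died": 75},
}

# Percentage who survived WITHIN each class (row percentages).
survival_pct = {}
for cls, row in table.items():
    row_total = row["Survived"] + row["Died"]
    survival_pct[cls] = 100 * row["Survived"] / row_total

print(survival_pct)

# For independent variables these percentages would be (nearly) equal,
# so a large gap between them suggests dependence.
spread = max(survival_pct.values()) - min(survival_pct.values())
print("gap between highest and lowest:", spread)
```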
Histogram
is similar to a bar chart, with the height of each bar corresponding to the frequency of its bin.
(No gaps between bars unless there is a gap in the data.) Used for quantitative data.
Relative Frequency Histogram
displaying the percentage of cases in each bin instead of the count
Skewed Left
the distribution has a tail that is stretched out to the left
Skewed right
the distribution has a tail that is stretched out to the right
Modes
Peaks or humps in a histogram
Unimodal
a distribution whose histogram has one main peak
Bimodal
Has two peaks
Multimodal
Three or more peaks
Uniform
distribution has bars all at approximately the same height
Outlier
is a data value that is much smaller or much larger than the rest of the data values, atypical, or unusual
Bar Graph
Chart used for categorical data?
Mode
this is the value in the data set that appears most often. It is often not very helpful with quantitative data, but it is meaningful with categorical data.
Median
is the central value of an Ordered data set

Resistant to outliers
Mean
this is the sum of all the values divided by the number of values in the set.

Sensitive to outliers
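A small sketch of why the median is called resistant and the mean sensitive: adding one extreme value to a hypothetical data set moves the mean a lot and the median very little. The numbers are invented for illustration.

```python
import statistics

data = [2, 3, 3, 4, 5]        # hypothetical data
with_outlier = data + [100]   # same data plus one extreme value

mean_before = statistics.mean(data)
mean_after = statistics.mean(with_outlier)
median_before = statistics.median(data)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)      # 3.4 -> 19.5
print("median:", median_before, "->", median_after)  # 3 -> 3.5
```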
Range
Calculated as the difference between the largest and smallest values in the data. It is very sensitive to unusual observations.
Quartiles
are the values that enclose the middle 50% of the data
Q1 & Q3
Q1: 25% of the data lie below the first (lower) quartile.
Q3: 25% of the data lie above the third (upper) quartile.
IQR
the difference between Q3 and Q1. This measures how spread out the middle 50% of the data is.

IQR = Q3 - Q1
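A minimal sketch of computing the quartiles and the IQR with Python's standard library. The data are hypothetical, and the choice of `method="inclusive"` is an assumption: quartile conventions differ between textbooks and software, so hand calculations may give slightly different values.

```python
import statistics

# Hypothetical data set of 9 values.
data = [2, 4, 5, 7, 8, 10, 12, 15, 40]

# n=4 cuts the data into quarters and returns the three cut points.
# The "inclusive" method interpolates between the data's own minimum
# and maximum; other conventions can give slightly different quartiles.
q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, median, q3, iqr)  # 5.0 8.0 12.0 7.0
```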
Standard Deviation
can be seen as the typical distance of data points from the mean (center)

Can only be 0 if all values are the same.
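A short sketch of the standard deviation on hypothetical data. `statistics.stdev` divides by n - 1 (the sample version) and `statistics.pstdev` divides by n (the population version); which one the course uses is not stated on the card, so both are shown.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical data with mean 5

sample_sd = statistics.stdev(data)   # divides by n - 1
pop_sd = statistics.pstdev(data)     # divides by n

print(round(sample_sd, 3), pop_sd)

# When every value is identical there is no spread at all.
zero_sd = statistics.stdev([3, 3, 3, 3])
print(zero_sd)
```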
Median and IQR are...
very stable and not sensitive to skewness and outliers
five-number summary
consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation.

Min Q1 Median Q3 Max
How to find outliers?
Less than: Q1 - 1.5(IQR)
Greater than: Q3 + 1.5(IQR)

These will form your lower and upper fences.
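The fence rule can be sketched as follows on hypothetical data. The quartiles come from `statistics.quantiles` with `method="inclusive"`, which is one convention among several and is an assumption here.

```python
import statistics

data = [2, 4, 5, 7, 8, 10, 12, 15, 40]   # hypothetical data

q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Any value outside the fences is flagged as an outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lower_fence, upper_fence, outliers)  # -5.5 22.5 [40]
```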
Boxplot
is a graph of the five-number summary.

Most useful for revealing outliers and side-by-side comparison of several distributions.
The data set is Right Skewed
If the mean is greater than the median then...
The data set is Left Skewed
If the mean is less than the median then...
z-score
tells us how many standard deviations (SDs) a data point is from the mean.

Ex: z = -2.5 means the value is 2.5 SDs below the mean; z = 2.5 means it is 2.5 SDs above the mean.

Any z-score greater than 3 or less than -3 is unusual.
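A minimal sketch of the z-score calculation, z = (value - mean) / SD. The helper `z_score` and the exam numbers (mean 80, SD 10, score 55) are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

# Hypothetical exam: mean 80, SD 10, and one student's score of 55.
z = z_score(55, mean=80, sd=10)
unusual = abs(z) > 3   # the "beyond 3 SDs" rule of thumb

print(z, unusual)  # -2.5 False
```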
68% of data values
Mean (+/-) 1 SD
95% of data values
Mean (+/-) 2 SD
99.7% of the data values
Mean (+/-) 3 SD
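The 68-95-99.7 percentages describe a Normal (bell-shaped) model, which is an assumption worth stating since the cards do not mention it. A quick check with the standard library's `NormalDist`:

```python
from statistics import NormalDist

z = NormalDist()   # standard Normal model: mean 0, SD 1

shares = {}
for k in (1, 2, 3):
    # Share of the model lying within k SDs of the mean.
    shares[k] = z.cdf(k) - z.cdf(-k)
    print(f"within {k} SD: {shares[k]:.1%}")
```

For strongly skewed data these percentages can be far off.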
Probability
of an event is its long-run relative frequency
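The long-run idea can be illustrated with a simulated fair coin: the running proportion of heads settles near 0.5 as the number of flips grows. The seed and the number of flips are arbitrary assumptions.

```python
import random

rng = random.Random(2024)   # fixed seed so the run can be repeated
flips = 100_000
heads = 0

for i in range(1, flips + 1):
    if rng.random() < 0.5:   # simulate one fair coin flip
        heads += 1
    if i in (10, 100, 1_000, 10_000, 100_000):
        print(f"after {i:>7} flips: proportion of heads = {heads / i:.4f}")

final = heads / flips
```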
Independence
means that the outcome of one trial doesn't influence or change the outcome of another
Probability Rules
(1) Probability is a number between 0 and 1
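(2) The probabilities of all possible outcomes sum to 1
(3) The Complement Rule: P(not A) = 1 - P(A)
(4) The Multiplication Rule: for independent events, P(A and B) = P(A) × P(B)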
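A worked sketch of the four rules using a fair six-sided die, which is an example chosen here and not taken from the cards. Exact fractions avoid rounding noise.

```python
from fractions import Fraction

# Probability model for one roll of a fair six-sided die.
die = {face: Fraction(1, 6) for face in range(1, 7)}

# Rule 1: every probability is between 0 and 1.
rule1 = all(0 <= p <= 1 for p in die.values())

# Rule 2: the probabilities of all possible outcomes sum to 1.
total = sum(die.values())

# Rule 3 (complement): P(not six) = 1 - P(six).
p_six = die[6]
p_not_six = 1 - p_six

# Rule 4 (multiplication): two separate rolls are independent,
# so P(six on both rolls) = P(six) * P(six).
p_two_sixes = p_six * p_six

print(rule1, total, p_not_six, p_two_sixes)  # True 1 5/6 1/36
```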
Random Variable
a variable for which the exact value cannot be predicted with certainty