MSIT- Test 1



a way of reasoning, along with a collection of tools and methods, designed to help us understand the world.
business statistics
the science of empirical decision making in the face of uncertainty applied to business goals.
business analytics is used where?
in all areas of business such as financial analysis, econometrics, auditing, production and operations including services improvement, and marketing research.
Statistics is the scientific discipline which consists of:
- Formulating a question
- Collecting the relevant data
- Describing/analyzing data
- Drawing conclusions or generalizations from data
- Communicating the results/conclusions to the target audience
The ability to answer questions and draw conclusions from data depends largely on our ability to
understand variation
Descriptive Statistics
utilizes numerical and graphical methods to look for patterns in a data set, to summarize the information revealed in a data set, and to present the information in a convenient form.
Inferential Statistics
utilizes information from a smaller group of individuals to make estimates, decisions, predictions, or other generalizations about a larger group of individuals.
individuals who answer a survey
subjects or participants
people in an experiment
analysis units
animals, plants, websites, or other inanimate objects
The characteristics recorded about each individual or case
identifier variable
a unique identifier assigned to each individual or item in a group; it acts as a label rather than providing information to be analyzed. Social security numbers, student ID numbers, tracking numbers and transaction numbers are examples of identifier variables.
categorical variable
places an individual into one of several groups or categories.
quantitative variable
has values that describe a measurable quantity as a number, often answering the question 'how many?' or 'how much?'.
time series
Variables that are measured at regular intervals over time
When several variables are all measured at the same time
measurement unit
simply a quantity used as a standard of measurement (e.g., cents or dollars for currency, inches or centimeters for length).
The fundamental problem in statistics
we are interested in knowing something about a large group of individuals, but we cannot acquire the necessary information for every individual in the group.
What is the solution to the fundamental problem in statistics?
We take a subset of the group, collect and study the information from the subset, and then try and say something meaningful about the original larger group of interest based on what we learned from the subset.
The original larger group of interest
The subset of the population
The act of collecting data from a sample
The act of studying the sample data
statistical inference
The act of "saying something meaningful about the population based on the sample data"
the act of selecting a subset of a population to study.
the three foundational ideas of sampling:
- examine a part of the whole
- randomization
- the sample size is what matters most
When a sample is biased....
it differs systematically from the population it is supposed to represent.
Random selection protects us against bias by
giving us a representative sample (on average) - EVEN if there are traits of the population that we are unaware of.
Randomization is 'fair' because it
does not systematically favor any particular characteristics.
the ratio of sample size to population is not relevant: The only factor affecting the sampling error is the _______ ______.
sample size
A "sample" that includes the entire population
- It can be impractical to take a census.
- The population we're studying may change during the time the census is taken.
- Taking a census is usually also very expensive.
- If we don't succeed in collecting information on everybody, the people we fail to reach are typically more likely to have certain characteristics (e.g., the homeless). The sample will therefore be systematically biased by their absence, and data from an incomplete census will typically not be representative of the wider population.
what are some drawbacks of taking a census?
is the entire collection of individuals or instances about which information is sought.
is a subset of a population, examined in hope of learning about the population.
a quantity that describes an attribute of the population under study. typically unknown and unknowable.
a value calculated from sampled data, particularly one that corresponds to, and thus estimates, a population parameter.
When we use the sample data to make a conclusion about how the population may look we are making a __________ ________.
statistical inference
random sample
a sample in which every subject has some chance of being selected for the sample.
simple random sample
a sample in which every subject has an equally likely chance of being selected for the sample.
You can select a simple random sample by assigning each member of the population a number and either:
- Putting the numbers into a hat and drawing at random
- Using a table of random digits
- Using a computer random number generator to choose the numbers at random.
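The random-number-generator option can be sketched in Python. This is a minimal illustration, not a prescribed method: the population of members numbered 1 to 100, the sample size of 10, and the helper name `simple_random_sample` are all assumptions made for the example.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Return n distinct members; every member is equally likely to be picked."""
    rng = random.Random(seed)  # the seed only makes the draw reproducible
    return rng.sample(list(population), n)

# Hypothetical population: members numbered 1 to 100.
population = range(1, 101)
sample = simple_random_sample(population, n=10, seed=1)
print(sample)
```

`random.sample` draws without replacement, so no member can appear twice in the sample.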
sampling frame
a list of individuals from which the sample is drawn.
Ideally the sampling frame should be the same as the population. If the sampling frame differs systematically from the population, the sample will be _______.
stratified sampling
the population is divided into non-overlapping groups (called strata) and a simple random sample is then obtained from each group.
cluster sampling
the population is divided into non-overlapping groups and all individuals within a randomly selected group or groups are sampled.
what is the difference between stratified and cluster sampling?
Stratified sampling samples some individuals from all groups, whereas cluster sampling samples all individuals from some groups. Also, the cluster is usually NOT a defining attribute of the individual, whereas the stratum usually is.
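The contrast can be made concrete with a short Python sketch. The group names, member labels, and both helper functions are hypothetical and exist only for illustration.

```python
import random

def stratified_sample(groups, per_group, seed=None):
    """Some individuals from ALL groups: a simple random sample within each stratum."""
    rng = random.Random(seed)
    return {name: rng.sample(members, per_group) for name, members in groups.items()}

def cluster_sample(groups, n_groups, seed=None):
    """ALL individuals from SOME groups: whole groups are picked at random."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(groups), n_groups)
    return {name: list(groups[name]) for name in chosen}

# Hypothetical population split into three non-overlapping groups.
groups = {
    "A": ["a1", "a2", "a3", "a4"],
    "B": ["b1", "b2", "b3", "b4"],
    "C": ["c1", "c2", "c3", "c4"],
}
strat = stratified_sample(groups, per_group=2, seed=0)
clust = cluster_sample(groups, n_groups=1, seed=0)
print(strat)  # every group appears, two members each
print(clust)  # one group appears, with all of its members
```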
systematic sampling
selecting every kth subject from the population.
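A minimal Python sketch of taking every kth subject. The random starting point within the first k subjects is an assumption added here (the card only says "every kth"), and the population of 100 numbered subjects is made up for the example.

```python
import random

def systematic_sample(population, k, seed=None):
    """Take every kth subject, starting from a random position in the first k."""
    rng = random.Random(seed)
    start = rng.randrange(k)  # assumption: random start among the first k subjects
    return population[start::k]

# Hypothetical ordered population of 100 subjects.
population = list(range(100))
sample = systematic_sample(population, k=10, seed=3)
print(sample)
```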
nonresponse bias
People who refuse to respond differ systematically from those who do respond. For instance: if people with low or high incomes refuse to take part in the survey, the survey won't be able to reflect the population.
Some portion of the population is not sampled at all, or has a smaller representation in the sample than in the population. For instance: surveys done by evening phone calls skip those who are seldom at home in the evening.
voluntary response sample
A sample where people volunteer to participate.
when is voluntary response bias a problem?
it's a problem when NO statistical sampling method is used, for instance, posting the survey online and accepting whoever decides to answer it. Those with the strongest feelings (typically on one side) are more likely to respond than those who have moderate opinions. Such internet surveys are therefore virtually useless for drawing conclusions about the population.
convenience sample
a sample that chooses the individuals based on convenience. For instance, asking customers waiting in line, rather than randomly selected customers. Those standing in line might have different opinions than other customers.
a type of bar chart with the height of each bar corresponding to that category's frequency. It is the visualization of a grouped frequency distribution and allows us to visualize the shape of a quantitative distribution.
statistical inference
is basically the attempt to answer the question: What can we say about a population (whose parameters are essentially unknown) when all we know are the statistics of a sample?
sampling error
a fundamental concept in inferential statistics.
simply a by-product of studying a sample and NOT the entire population. originates from using only part of the population (i.e., the sample) instead of using the whole population.
frequency table
organizes data by recording totals and category names. appropriate for summarizing the distribution of a single categorical variable.
relative frequency table
displays the percentages that lie in each category rather than the counts.
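A small Python sketch of both tables for one categorical variable. The survey responses are made-up data used only to show the calculation.

```python
from collections import Counter

# Hypothetical responses to a single categorical survey question.
responses = ["yes", "no", "yes", "yes", "undecided", "no", "yes", "no"]

# Frequency table: count of cases in each category.
counts = Counter(responses)

# Relative frequency table: percentage of cases in each category.
total = sum(counts.values())
relative = {category: 100 * count / total for category, count in counts.items()}

for category in counts:
    print(f"{category:10s} {counts[category]:3d} {relative[category]:6.1f}%")
```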
bar chart
displays the distribution of a categorical variable, showing the counts for each category next to each other for easy comparison.
if the counts in a bar chart are replaced with percentages, the data can be displayed in a _______ ______ ______ ______.
relative frequency bar chart
pie charts
show the whole group of cases as a circle sliced into pieces with sizes proportional to the fraction of the whole in each category.
contingency tables
displays one variable in the columns and a 2nd variable in the rows. appropriate for summarizing the joint distribution of two categorical variables. can help us answer questions such as, "Is there a relationship between survival rates and class?"
marginal distribution
The distribution of one variable while ignoring the other(s)
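As a sketch, a contingency table and its marginal distributions can be built from paired categorical observations. The class and outcome records below are invented for illustration and are not real survival data.

```python
from collections import Counter

# Hypothetical (class, outcome) records; not real data.
records = (
    [("first", "survived")] * 3 + [("first", "died")] * 1
    + [("third", "survived")] * 2 + [("third", "died")] * 4
)

# Joint distribution: one count per (row, column) cell of the contingency table.
joint = Counter(records)

# Marginal distributions: count one variable while ignoring the other.
class_marginal = Counter(cls for cls, _ in records)
outcome_marginal = Counter(outcome for _, outcome in records)

print(joint)
print(class_marginal)
print(outcome_marginal)
```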
To describe the distribution of a quantitative variable, you will need to produce all of the following:
- A description of the shape of the distribution
- A measure of the center of the distribution
- A measure of the spread or variation of the distribution
Graphs used most often to represent quantitative data are
histograms and boxplots
Peaks or humps seen in a histogram are called the _____ of a distribution.
A distribution whose histogram has one main peak is called (1) _______, two peaks (2)______, three or more peaks means that your distribution is (3)______.
(1) unimodal
(2) bimodal
(3) multimodal
A distribution whose histogram doesn't appear to have any mode and in which all the bars are approximately the same height is called _________.
a data value that is much smaller or much larger than the rest of the data values, atypical or unusual. an extreme value.
this is the value in the data set that appears Most Often. This is often not very helpful with quantitative data, but is meaningful with categorical data.
this is the central value of an ordered data set.
this is the sum of all the numbers divided by the total number of numbers in the set.
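The three measures just described (most frequent value, central value of the ordered data, and sum divided by count) can be checked with Python's standard `statistics` module. The data values are made up for the example.

```python
import statistics

# Hypothetical small data set.
data = [2, 3, 3, 5, 7, 10]

most_frequent = statistics.mode(data)    # value that appears most often
central_value = statistics.median(data)  # middle of the ordered data
average = statistics.mean(data)          # sum of values / number of values

print(most_frequent, central_value, average)
```

With an even number of values, `statistics.median` averages the two middle values.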
calculated as the difference between the largest and smallest values in the data - is very sensitive to unusual observations. Concentrating on the middle of the data avoids this problem.
are the values that enclose the middle 50%.
interquartile range (IQR)
the difference between Q3 and Q1 (Q3 − Q1). measures how spread out the middle 50% of the data is.
standard deviation
can be seen as the typical (or average) distance of data points from the mean (center). The farther the typical distance from the mean, the greater the standard deviation. The standard deviation is very sensitive to outliers and has the same units as the original variable.
the average squared deviation from the mean: the sum of the squared deviations from the mean, divided by n - 1.
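A step-by-step Python sketch of the n - 1 calculation, with the standard deviation taken as the square root of the result. The data values are invented for the example.

```python
import math

# Hypothetical sample data.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n

# Squared deviation of each value from the mean.
squared_devs = [(x - mean) ** 2 for x in data]

# Sample variance: sum of squared deviations divided by n - 1.
variance = sum(squared_devs) / (n - 1)

# Standard deviation: square root of the variance, in the original units.
std_dev = math.sqrt(variance)

print(variance, std_dev)
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n - 1 divisor.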
the degrees of freedom
is the number of values in a system that are free to vary (i.e., the number of values that are not restricted or limited).
The degrees of freedom of an estimate is ........
the number of independent pieces of information (i.e., data values/observations) on which the estimate is based.
for what shape should the median and IQR be used?
if the shape is skewed or if there are outliers
for what shape should the mean and standard deviation be used?
if the shape is unimodal and symmetric and no outliers.
The five-number summary
consists of the smallest observation (the Minimum), the First Quartile, the Median, the Third Quartile, and the largest observation (the Maximum), written in order from smallest to largest.
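A minimal Python sketch of the five-number summary and the IQR derived from it. Quartile conventions differ between textbooks and software; this sketch assumes the default method of `statistics.quantiles`, and the data and the helper name `five_number_summary` are made up for illustration.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    # Assumption: quartiles follow the default ("exclusive") method of
    # statistics.quantiles; other conventions can give slightly different values.
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return (min(values), q1, q2, q3, max(values))

# Hypothetical data set.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
summary = five_number_summary(data)
iqr = summary[3] - summary[1]  # Q3 - Q1

print(summary, iqr)
```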
a graph of the five-number summary. They are most useful for revealing outliers and side-by-side comparison of several distributions.
tells us how many standard deviations a data point is away from its mean, with the sign telling us whether the point falls above or below the mean.
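In code, this standardized value is the distance from the mean divided by the standard deviation. The sketch below assumes the sample standard deviation (n - 1 divisor); the data and the helper name `z_score` are hypothetical.

```python
import statistics

def z_score(x, values):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    # Assumption: the sample standard deviation (n - 1 divisor) is used.
    return (x - statistics.mean(values)) / statistics.stdev(values)

# Hypothetical data with mean 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]

print(z_score(9, data))  # positive: above the mean
print(z_score(2, data))  # negative: below the mean
```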
one number expressed in relation to another by dividing the one number by the other.
measure the size of a part in comparative relation to a whole.
just a form of the proportion based on 100 units.
measures or quantifies the likelihood of a future event occurring. the proportion of the number of occurrences of an event to the total number of trials.
means that the outcome of one trial doesn't influence or change the outcome of another.
random variable
a variable for which the exact value cannot be predicted with certainty (it varies randomly).
binomial model
helps us calculate the probability of getting 𝑥 successes from a total of 𝑛 trials. used to easily calculate the probabilities for random variables with the following characteristics:
- There are only two possible outcomes (success/failure) for each trial.
- The probability of success, denoted 𝑝, is the same for each trial. The probability of failure is also the same for each trial and is computed as 1 − 𝑝.
- The trials are independent.
- The number of trials, 𝑛, is fixed and known in advance.
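Under these four conditions, the probability of exactly 𝑥 successes in 𝑛 trials is C(𝑛, 𝑥) · 𝑝^𝑥 · (1 − 𝑝)^(𝑛 − 𝑥). A minimal Python sketch follows; the function name `binomial_pmf` and the example values 𝑛 = 10 and 𝑝 = 0.3 are assumptions chosen for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    # comb(n, x) counts the ways to place x successes among n trials.
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Hypothetical example: 10 trials, success probability 0.3 on each trial.
n, p = 10, 0.3
prob_3 = binomial_pmf(3, n, p)

# The probabilities over all possible counts 0..n should sum to 1.
total = sum(binomial_pmf(x, n, p) for x in range(n + 1))

print(prob_3, total)
```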