C1 Density estimation



cases
objects described by a set of data;
examples: customers, companies, subjects in a study, units in an experiment, etc.
label
special variable used in some data sets to distinguish the different cases
variable
a characteristic of a case
Different cases can have different ______ of the variables.
values
What does the distribution of a variable tell us?
what values it takes and how often it takes these values
spreadsheet
very useful for doing simple computations;
-> you can type in a formula and have the same computation performed for each row
The key characteristics of a data set answer the questions: ______ , _______ and _______.
Who? What? Why?
two basic strategies of exploratory data analysis
1. examine each variable by itself; then move on to study the relationship among the variables
2. begin with a graph; then add numerical summaries of specific aspects of the data
distribution of a categorical variable
lists the categories and gives either the
- count
- or the percent
of cases that fall in each category
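A minimal sketch of the card above, assuming the data arrive as a plain list of category labels. The function name `categorical_distribution` and the eye-color data are made up for illustration.

```python
from collections import Counter

def categorical_distribution(values):
    """Return {category: (count, percent)} for a list of category labels."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100.0 * n / total) for cat, n in counts.items()}

# Hypothetical data: eye color recorded for five cases.
colors = ["brown", "blue", "brown", "green", "brown"]
dist = categorical_distribution(colors)
```

Each category maps to its count and its percent of all cases, so the percents add up to 100.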
tails
the extreme values of a distribution;
high values are in the upper, or right, tail
low values are in the lower, or left tail
How to examine a distribution:
1. look for the overall pattern and for striking deviations from that pattern
2. describe the overall pattern of a distribution by its shape, center and spread
3. look for outliers
time plot
plots each observation against the time at which it was measured
resistant measure
a measure that resists the influence of extreme observations;
-> its value does not respond strongly to changes in a few observations
Why is the mean weak in measuring center?
mean is sensitive to the influence of extreme observations;
it is not a resistant measure
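A small illustration of the point, using made-up numbers: replacing one observation with an extreme value moves the mean a lot and leaves the median where it was.

```python
from statistics import mean, median

# Hypothetical data; the second list replaces the largest value with an outlier.
data = [2, 3, 4, 5, 6]
with_outlier = [2, 3, 4, 5, 60]

# How far each measure of center moves when the outlier is introduced.
mean_shift = mean(with_outlier) - mean(data)
median_shift = median(with_outlier) - median(data)
```

Here the mean moves from 4 to 14.8, while the median stays at 4.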
robust measure
synonym for resistant measure
How to find the median:
1. arrange all observations in order of size;
-> from smallest to largest
2. if the number of observations n is odd,
the median M is the center observation in the
ordered list
-> find the location of the median by counting (n+1)/2
observations up from the bottom of the list
3. if the number of observations n is even, the median
M is the mean of the two center observations in the
ordered list
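The three steps above can be sketched as follows. The function name `find_median` is an assumption for this example; Python's standard library already provides `statistics.median` for real use.

```python
def find_median(observations):
    """Median M following the three steps: sort, then pick the center."""
    ordered = sorted(observations)          # step 1: smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one observation")
    if n % 2 == 1:
        location = (n + 1) // 2             # step 2: count (n+1)/2 up from the bottom
        return ordered[location - 1]        # lists are 0-based, the count is 1-based
    upper = n // 2                          # step 3: mean of the two center observations
    return (ordered[upper - 1] + ordered[upper]) / 2
```

For example, `find_median([5, 1, 3])` picks the second of three ordered values, and `find_median([4, 1, 3, 2])` averages the two center values 2 and 3.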
Name the two most common measures of the center of a distribution.
the mean and the median
Compare the mean and the median:
- of a symmetric distribution
- of a skewed distribution
symmetric: close together
skewed: the mean is farther out on the long tail than is the median
variance s^2
the average of the squares of the deviations of the observations from their mean, where the sum of squares is divided by n-1 rather than n
standard deviation s
the square root of the variance s^2
degrees of freedom
the number n-1, the divisor used when computing the variance s^2
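A sketch of how s^2 and s could be computed by hand, with the sum of squared deviations divided by the degrees of freedom n-1. The function names are assumptions for this example; `statistics.variance` and `statistics.stdev` use the same n-1 divisor.

```python
import math

def sample_variance(xs):
    """s^2: sum of squared deviations from the mean, divided by n-1."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

def sample_std(xs):
    """s: the square root of the variance s^2."""
    return math.sqrt(sample_variance(xs))
```

For the observations 1, 2, 3, 4, 5 the mean is 3, the squared deviations sum to 10, and s^2 = 10/4 = 2.5.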
properties of the standard deviation
- s measures spread about the mean and should be used only when the mean is chosen as the measure of center
- s=0 only when there is no spread
-> happens only when all observations have the same value
-> as the observations become more spread out about their mean, s gets larger
- s, like the mean, is not resistant
-> a few outliers can make s very large
linear transformation
changes the original variable x into the new variable x_new given by an equation of the form:
x_new = a + bx

-> adding the constant a shifts all values of x upward or downward by the same amount
-> such a shift changes the origin (zero point) of the variable
-> multiplying by the positive constant b changes the size of the unit of measurement
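A short sketch of a linear transformation, assuming the Celsius-to-Fahrenheit conversion (a = 32, b = 1.8) as an illustrative example that is not part of the original card. It also checks a standard consequence: the mean becomes a + b times the old mean, while the standard deviation is only multiplied by b, because the shift a does not change spread.

```python
from statistics import mean, stdev

def linear_transform(xs, a, b):
    """Apply x_new = a + b*x to every observation."""
    return [a + b * x for x in xs]

# Illustrative assumption: temperatures in Celsius converted to Fahrenheit.
celsius = [0.0, 10.0, 20.0, 30.0]
fahrenheit = linear_transform(celsius, a=32.0, b=1.8)

center_new = mean(fahrenheit)                      # equals 32 + 1.8 * mean(celsius)
spread_ratio = stdev(fahrenheit) / stdev(celsius)  # equals b
```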
density curve
a curve that:
- is always on or above the horizontal axis
- has area exactly 1 underneath it

-> describes the overall pattern of a distribution
-> the area under the curve and above any range of values is the proportion of all observations that fall in that range
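To make the area idea concrete, here is a rough numerical sketch. It assumes a deliberately simple density (height 1 between 0 and 1, zero elsewhere) and approximates areas with the midpoint rule; both `area_under` and `uniform_density` are hypothetical names for this example.

```python
def area_under(density, lo, hi, steps=10_000):
    """Approximate the area under `density` between lo and hi (midpoint rule)."""
    width = (hi - lo) / steps
    return sum(density(lo + (i + 0.5) * width) for i in range(steps)) * width

def uniform_density(x):
    """Hypothetical density curve: height 1 on [0, 1], 0 elsewhere."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

total_area = area_under(uniform_density, 0.0, 1.0)   # whole curve: about 1
proportion = area_under(uniform_density, 0.3, 0.7)   # share of observations in [0.3, 0.7]
```

The total area comes out as 1 (up to rounding), and the area above the range 0.3 to 0.7 is about 0.4, that is, 40% of the observations.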
Why are Normal distributions so important for statistics?
1. they are good descriptions for some distributions of real data
2. Normal distributions are good approximations to the results of many kinds of chance outcomes (such as tossing a coin many times)
3. many statistical inference procedures based on Normal distributions work well for other roughly symmetric distributions
68-95-99.7 rule
in the Normal distribution with mean µ and standard deviation σ:
- approx. 68% of the observations fall within one
standard deviation of µ
- approx. 95% within two standard deviations of µ
- approx. 99.7% within three standard deviations of µ
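The three percentages can be checked numerically. This sketch relies on the standard identity that the proportion of a Normal distribution within k standard deviations of the mean is erf(k / sqrt(2)); the function name `proportion_within` is an assumption for this example.

```python
import math

def proportion_within(k):
    """Proportion of a Normal distribution within k standard deviations of µ."""
    return math.erf(k / math.sqrt(2))

p1 = proportion_within(1)   # roughly 0.68
p2 = proportion_within(2)   # roughly 0.95
p3 = proportion_within(3)   # roughly 0.997
```

The result does not depend on µ or σ, which is why one rule covers every Normal distribution.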
z-score or standardized value
if x is an observation from a distribution that has mean µ and standard deviation σ, the standardized value of x is:
z = (x - µ) / σ
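The formula translates directly into code. The numbers in the example (an observation of 70 from a distribution with mean 64 and standard deviation 2.5) are made up for illustration.

```python
def z_score(x, mu, sigma):
    """Standardized value: how many standard deviations x lies from the mean."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return (x - mu) / sigma

# Hypothetical observation, 6 units above the mean with sigma = 2.5.
z = z_score(70, mu=64, sigma=2.5)
```

Here z is 2.4, so the observation lies 2.4 standard deviations above the mean; a negative z would place it below the mean.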
standard Normal distribution
Normal distribution N(0,1) with mean 0 and standard deviation 1