DESCRIPTIVE STATISTICS

GOAL OF DESCRIPTIVE STATISTICS
Provide succinct summary of data from study
Impractical to list results from all subjects in clinical trial
Such list also tends not to be informative
Better to summarize key features of data
Allows presentation in table or graph
Facilitates interpretation of results
Different strategies needed to summarize different types of data
Two features particularly important for summarizing set of data
Measure of central tendency ("average" value)
Measure of variability (dispersion) in data
Two additional features provide further information
Measure of asymmetry in distribution of data
Measure of relative prevalence of data in center & "tails" of distribution
Latter features can be important in selecting appropriate inferential statistical tests
Procedures listed for summarizing data assume whole population studied
Minor modifications needed when data are from sample (see later lecture)
Parameters (e.g., mean) defined on basis of whole population
Parameters conventionally assigned Greek symbols in statistics
Sample used to estimate parameters of population
MEASURES OF CENTRAL TENDENCY
mean, mode, median
Mean (arithmetic mean)
Often best central measure for interval data
Potentially misleading with asymmetric distribution
Median
Usually best central measure for ordinal data
Also best central measure for interval data with asymmetric distribution
Mode
Only central measure for nominal data
Can be useful with ordinal data (if large set of data)
Rarely used with interval data
2 additional measures used in specialized situations
Geometric mean
Harmonic mean
Both require interval data with absolute value for zero ("ratio" data)
More representative than arithmetic mean in specific circumstances
Mean (Arithmetic Mean)
Arithmetic average of all data values
Symbolized for population by μ (Greek letter 'mu')
Designated by X̄ ("X-bar") for sample
Sum (Σ) of all data values divided by number (N) of values
μ = ( Σ X ) / N
Appropriate measure of central tendency only when data values can be added
Appropriate for use with interval data
Inappropriate for use with ordinal data - unequally spaced scale
Cannot be used with nominal data
Relatively sensitive to extreme values
Especially with small set of data
Can be estimated from frequency table when individual values unavailable
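Illustrative sketch of the calculation in Python (function name and data values are hypothetical, invented for illustration only). It applies "sum of all data values divided by N" and shows how one extreme value shifts the mean of a small set of data.

```python
def population_mean(values):
    """Arithmetic mean: sum of all data values divided by number (N) of values."""
    return sum(values) / len(values)


# Hypothetical data values, invented for illustration only
typical = [4, 5, 5, 6, 5]
with_extreme = typical + [35]   # same set with one extreme value added

mean_typical = population_mean(typical)            # 25 / 5
mean_with_extreme = population_mean(with_extreme)  # 60 / 6
```

Here a single extreme value doubles the mean, although five of the six data values are unchanged.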
Median
Middle value when data values listed in increasing order
Half of data values lie below median & half lie above it
No conventional symbol ascribed
Arrange data values in increasing order & count to halfway point
With even number of data values, take average of two middle values
Median = ( ½ N + ½ ) th data value
Most appropriate measure of central tendency for ordinal data
Also invaluable for interval data with asymmetric distribution
Less sensitive than mean to extreme values
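Illustrative sketch of the counting rule in Python (function name is hypothetical; a non-empty set of numeric data is assumed). It locates the ( ½ N + ½ ) th position in the ordered data and averages the two middle values when N is even.

```python
def median_by_position(values):
    """Median as the (N/2 + 1/2)th value of the ordered data (1-based position)."""
    ordered = sorted(values)
    n = len(ordered)
    position = n / 2 + 0.5        # 1-based; ends in .5 when N is even
    lower = int(position) - 1     # 0-based index of value at or just below position
    # When N is even the position falls between two values, so average them
    upper = lower if position.is_integer() else lower + 1
    return (ordered[lower] + ordered[upper]) / 2


odd_case = median_by_position([3, 1, 2])             # 2nd of 3 ordered values
even_case = median_by_position([4, 1, 3, 2])         # average of 2nd & 3rd values
with_extreme = median_by_position([1, 2, 3, 4, 100]) # extreme value has no effect
```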
Mode
Data value that occurs most frequently
No conventional symbol ascribed
Determined simply by counting number of occurrences for each data value
Only measure of "central tendency" applicable with nominal data
Can be meaningful for ordinal data (when set of data is large)
Modal class can be defined with frequency table for interval data
Set of data may have two (or more) modes - "bimodal" = 2 modes present
Likely to reflect inhomogeneous nature of population
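Illustrative sketch of the counting procedure in Python (function name and data are hypothetical). It counts occurrences of each data value and returns every value that shares the highest count, so a bimodal set of data yields two modes.

```python
from collections import Counter


def modes(values):
    """Return all data values that occur most frequently (sorted)."""
    counts = Counter(values)            # number of occurrences for each data value
    highest = max(counts.values())
    return sorted(v for v, c in counts.items() if c == highest)


# Hypothetical nominal data (blood groups) and ordinal scores
blood_groups = ["O", "A", "O", "B", "A", "AB", "O"]
bimodal_scores = [1, 2, 2, 3, 4, 4, 5]

single_mode = modes(blood_groups)
two_modes = modes(bimodal_scores)
```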
Geometric Mean
Calculated as Nth root of product (Π) of all data values
No conventional symbol ascribed
Geometric Mean = ( Π X )^(1/N)
More readily calculated using logarithms
log ( Geometric Mean ) = { Σ ( log X ) } / N
Useful measure of central tendency for interval data usually shown on logarithmic scale
Never greater than arithmetic mean (equal only when all data values identical)
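Illustrative sketch of the logarithmic calculation in Python (function name and data are hypothetical; all data values are assumed to be positive, as required for logarithms). It averages the logarithms of the data values and converts the result back.

```python
import math


def geometric_mean(values):
    """Geometric mean via logarithms: exp of the arithmetic mean of log X."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires positive data values")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)


# Hypothetical data spanning several orders of magnitude
titres = [1, 10, 100]
geometric = geometric_mean(titres)          # cube root of 1000, approximately 10
arithmetic = sum(titres) / len(titres)      # 37.0
```

For data of this kind the geometric mean sits at the middle value on a logarithmic scale, whereas the arithmetic mean is dominated by the largest value.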
Harmonic Mean
Reciprocal of arithmetic mean of reciprocals of all data values
No conventional symbol ascribed
Harmonic Mean = 1 / [ { Σ ( 1 / X ) } / N ]
Useful measure of central tendency when data values represent rates or ratios
Never greater than arithmetic & geometric means (equal only when all data values identical)
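Illustrative sketch in Python (function name and example are hypothetical; positive, non-zero data values are assumed). It takes the reciprocal of the arithmetic mean of the reciprocals, using the familiar case of two speeds maintained over equal distances.

```python
import math


def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals of all data values."""
    mean_reciprocal = sum(1 / v for v in values) / len(values)
    return 1 / mean_reciprocal


# Hypothetical rates: 40 km/h over one distance, 60 km/h over the same distance
speeds = [40, 60]
average_speed = harmonic_mean(speeds)            # approximately 48 km/h
arithmetic = sum(speeds) / len(speeds)           # 50.0, overstates the true average
```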
MEASURES OF VARIABILITY (DISPERSION)
3 measures commonly used to describe variability in set of experimental data
Standard deviation
Often best measure of variability for interval data
Potentially misleading with asymmetric distribution
Lower & upper quartiles (or interquartile range)
Usually best measure of variability for ordinal data
Also best for interval data with asymmetric distribution
Range
Measure of extremes for variability in population
Cannot be estimated reliably from sample
4 additional measures of variability can provide useful information
Mean deviation
Helps to explain meaning of standard deviation
Variance
Square of standard deviation - used for inferential statistical tests
Coefficient of variation
Other percentile values
Mean Deviation
The term "deviation" denotes difference of data value from mean
Mean deviation is arithmetic average of absolute values of all deviations
No conventional symbol
Mean Deviation = { Σ | X - μ | } / N
Not used in practice
Not readily amenable to computational use
Invaluable for explaining meaning of standard deviation
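Illustrative sketch in Python (function name and data are hypothetical). It averages the absolute values of the deviations from the mean, which is the plain "average deviation" that the standard deviation approximates in a different way.

```python
def mean_deviation(values):
    """Arithmetic average of absolute deviations of data values from their mean."""
    n = len(values)
    mu = sum(values) / n
    return sum(abs(x - mu) for x in values) / n


# Hypothetical data with mean 5; absolute deviations are 3, 1, 1, 1, 0, 0, 2, 4
data = [2, 4, 4, 4, 5, 5, 7, 9]
result = mean_deviation(data)   # 12 / 8
```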
Standard Deviation
Defined as "root mean square deviation"
Symbolized for population by σ (lowercase Greek letter 'sigma')
Designated by lowercase s for sample
Square root of arithmetic average of squares of all deviations
σ = √ [ { Σ ( X - μ )² } / N ]
Important to recognize as a form of "average" of deviations
Appropriate measure of variability (dispersion, spread) only for interval data
Invaluable measure of variability for interval data with symmetric distribution
May be misleading for interval data with asymmetric distribution
Inappropriate for use with ordinal data
Essential role in inferential statistical tests for interval data with symmetric distribution
Parametric statistical tests
Very sensitive to extreme values
Especially with small set of data
Can be estimated from frequency table when individual values unavailable
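Illustrative sketch of the "root mean square deviation" in Python (function name and data are hypothetical). It treats the data as a whole population and so divides by N, in keeping with the definition given here; the modification for a sample is not shown.

```python
import math


def population_sd(values):
    """Square root of the arithmetic average of squared deviations from the mean."""
    n = len(values)
    mu = sum(values) / n
    mean_square_deviation = sum((x - mu) ** 2 for x in values) / n
    return math.sqrt(mean_square_deviation)


# Hypothetical data with mean 5; squared deviations sum to 32
data = [2, 4, 4, 4, 5, 5, 7, 9]
sigma = population_sd(data)                     # sqrt(32 / 8)

# One extreme value added to the same data inflates the result markedly
sigma_with_extreme = population_sd(data + [50])
```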
Variance
Square of standard deviation
Symbolized for population by σ²
Designated by lowercase s² for sample
Can be referred to as "second moment" of data values (about the mean)
Calculated in same manner as standard deviation - without taking square root
Very sensitive to extreme values
Especially with small set of data
Care needed in listing units for variance
Square of units that apply to data values (also to standard deviation)
e.g., variance in m² for heights of patients (in metric units)
Coefficient of Variation
Standard deviation expressed as a percentage of the (arithmetic) mean
No conventional symbol ascribed
Coefficient of Variation = ( σ / μ ) × 100 %
Meaningful only for interval data with absolute value for zero ("ratio" data)
Dimensionless quantity (no units)
Useful for comparing variability when different units apply
Can compare variability for completely different observations
Useful for assessing precision of laboratory assay
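Illustrative sketch in Python using the standard `statistics` module (function name and data are hypothetical; a non-zero mean is assumed). It expresses the population standard deviation as a percentage of the mean and shows that the result does not depend on the units chosen.

```python
import statistics


def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the arithmetic mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)   # population form (divides by N)
    return sigma / mu * 100


# Hypothetical heights of the same three patients in two different units
heights_cm = [160, 170, 180]
heights_m = [1.60, 1.70, 1.80]

cv_cm = coefficient_of_variation(heights_cm)
cv_m = coefficient_of_variation(heights_m)   # same value apart from rounding error
```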
Lower & Upper Quartiles
Lower quartile, median & upper quartile segregate data into quarters when listed in order
One quarter of data values lie below lower quartile
One quarter of data values lie between lower quartile & median
One quarter of data values lie between median & upper quartile
One quarter of data values lie above upper quartile
Arrange data values in increasing order & count to quarter-way point for lower quartile
Lower quartile also known as 25th percentile
Count three-quarters of way through data values to determine upper quartile
Upper quartile also known as 75th percentile
Lower Quartile = ( ¼ N + ½ ) th data value
Upper Quartile = ( ¾ N + ½ ) th data value
Most appropriate measure of variability for ordinal data
Also invaluable for interval data with asymmetric distribution
Much less sensitive than standard deviation to extreme values
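Illustrative sketch of the counting rules in Python (function names and data are hypothetical). The notes do not say how to handle a position that falls between two data values; linear interpolation between the neighbouring values is assumed here, which reduces to averaging them when the position ends in ½.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in ordered data."""
    position = min(max(position, 1), len(ordered))   # keep within the data
    below = int(position)                            # 1-based index at or below position
    fraction = position - below
    if fraction == 0:
        return ordered[below - 1]
    # Assumption: interpolate linearly between the two neighbouring values
    return ordered[below - 1] + fraction * (ordered[below] - ordered[below - 1])


def quartiles(values):
    """Lower & upper quartiles as the (N/4 + 1/2)th & (3N/4 + 1/2)th data values."""
    ordered = sorted(values)
    n = len(ordered)
    lower = value_at_position(ordered, n / 4 + 0.5)
    upper = value_at_position(ordered, 3 * n / 4 + 0.5)
    return lower, upper


# Hypothetical data: N = 10 gives whole positions (3rd & 8th values)
q_whole = quartiles([10, 20, 30, 40, 50, 60, 70, 80, 90, 100])
# N = 8 gives positions 2.5 & 6.5, so neighbouring values are averaged
q_between = quartiles([1, 2, 3, 4, 5, 6, 7, 8])
```

Statistical software packages use a variety of conventions for quartiles, so values from other tools may differ slightly from this rule.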
Interquartile Range
Range between lower quartile & upper quartile
Typically listed as values for lower & upper quartiles
Characteristically shown in graphical format (box plot or box & whisker plot)
Semi-Interquartile Range
Half of the interquartile range
Typically listed as a number (half of difference between lower & upper quartiles)
Range
Extreme values (lowest & highest) in population
Best listed as two extreme values (lowest & highest)
Sometimes listed as difference between lowest & highest values
Estimate from sample may be highly inaccurate
Other Percentile Values
Yield additional information about distribution of data values
Calculated in similar manner to lower & upper quartiles
10th Percentile = ( 1/10 N + ½ ) th data value
Provide more extensive description of distribution for ordinal data
Also useful for interval data with asymmetric distribution
10th percentile & 90th percentile can be displayed graphically on box & whisker plot
Alternatively, 5th percentile & 95th percentile shown
Range from 2½th percentile to 97½th percentile designated as "normal range"
Used extensively with laboratory values for patient diagnosis
Values outside normal range taken to suggest possible presence of disease
By definition, 5% of values in reference population lie outside normal range
Equally applicable for interval & ordinal data
MEASURES OF ASYMMETRY
One parameter characteristically used to assess asymmetry in distribution of data values
Skewness
Best measure of asymmetry for interval data
Not applicable with other data types
Quartiles & other percentiles also yield data on asymmetry of data distribution
Invaluable measures of asymmetry for ordinal data
May provide useful information with interval data
Can aid in assessment of whether data are normally distributed
Skewness
Characteristically defined as third moment of data values (about the mean)
Scaled in relation to standard deviation so as to yield dimensionless quantity
Skewness = [ { Σ ( X - μ )³ } / N ] / σ³
Zero value for skewness corresponds to perfect symmetry in distribution
Values between -½ and +½ for skewness imply approximate symmetry
Negative values for skewness imply lower tail of distribution is longer ("skewed to left")
Values between -1 and -½ imply distribution moderately skewed to left
Values below -1 imply distribution highly skewed to left
Positive values for skewness imply upper tail of distribution is longer ("skewed to right")
Values between +½ and +1 imply distribution moderately skewed to right
Values above +1 imply distribution highly skewed to right
Distributions more often skewed to right than to left in practice
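Illustrative sketch in Python (function names and data are hypothetical; population formulas dividing by N are used throughout). It computes the third moment about the mean, scales it by the cube of the standard deviation, and applies the interpretation bands listed above. Assigning the boundary values ±½ and ±1 to the lower band is an assumption.

```python
import math


def skewness(values):
    """Third moment about the mean, divided by the cube of the standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    if sigma == 0:
        raise ValueError("skewness undefined when all data values are identical")
    third_moment = sum((x - mu) ** 3 for x in values) / n
    return third_moment / sigma ** 3


def describe_skewness(g):
    """Verbal interpretation of a skewness value using the bands in these notes."""
    if abs(g) <= 0.5:
        return "approximately symmetric"
    degree = "moderately" if abs(g) <= 1 else "highly"
    direction = "right" if g > 0 else "left"
    return f"{degree} skewed to {direction}"


# Hypothetical data: one symmetric set, one with a long upper tail
symmetric = skewness([1, 2, 3, 4, 5])
long_upper_tail = skewness([1, 1, 1, 2, 10])
```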
MEASURES OF RELATIVE PREVALENCE OF CENTER & TAILS OF DISTRIBUTION
2 related parameters used to assess relative prevalence of center & tails of distribution
Kurtosis
Excess kurtosis
Both measures applicable only to interval data
Percentiles of distribution also indicative of pattern of prevalence in center & tails
Kurtosis
Characteristically defined as fourth moment of data values (about the mean)
Scaled in relation to standard deviation so as to yield dimensionless quantity
Kurtosis = [ { Σ ( X - μ )⁴ } / N ] / σ⁴
Minimum possible value of 1 for kurtosis
Applies to discrete distribution with 2 equally likely outcomes (e.g., coin toss)
Distribution without center or tails - all data in shoulders
Normal distribution has value of 3 for kurtosis
More extensive use thus made of "excess kurtosis"
Excess Kurtosis
Calculated value for kurtosis reduced by 3
Adjusted to yield value of zero for normally distributed data
Excess Kurtosis = [ { Σ ( X - μ )⁴ } / N ] / σ⁴ - 3
Excess kurtosis (calculated as above) may be reported merely as "kurtosis"
Data distributed with excess kurtosis ≈ 0 designated as "mesokurtic"
Similar pattern to normal distribution
Data distributed with excess kurtosis < 0 designated as "platykurtic"
Distribution has broad & flat central peak & weak tails
Pronounced shoulders in distribution
e.g., uniform distribution of data
Data distributed with excess kurtosis > 0 designated as "leptokurtic"
Distribution has sharp central peak & prominent tails
e.g., t-distribution with few degrees of freedom (see later lecture)
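Illustrative sketch in Python (function name and data are hypothetical; population formulas dividing by N are used). It computes the fourth moment about the mean, scales it by σ⁴ (the square of the variance), and subtracts 3. The three data sets are small invented examples, not samples from the named distributions.

```python
def excess_kurtosis(values):
    """Fourth moment about the mean divided by sigma^4, reduced by 3."""
    n = len(values)
    mu = sum(values) / n
    variance = sum((x - mu) ** 2 for x in values) / n       # sigma squared
    if variance == 0:
        raise ValueError("kurtosis undefined when all data values are identical")
    fourth_moment = sum((x - mu) ** 4 for x in values) / n
    return fourth_moment / variance ** 2 - 3                # sigma^4 = variance^2


# Two equally likely outcomes (coin toss): kurtosis 1, so excess kurtosis -2
coin_toss = excess_kurtosis([0, 1])
# Evenly spread data values: platykurtic (excess kurtosis below zero)
evenly_spread = excess_kurtosis([1, 2, 3, 4, 5])
# Sharp central peak with two far-out values: leptokurtic (above zero)
peaked_with_tails = excess_kurtosis([0, 0, 0, 0, 0, 0, 0, 0, -10, 10])
```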
DESCRIPTIVE STATISTICS FOR NOMINAL DATA
Parameters listed above generally have no value for summarizing nominal data
Mode may have limited value as measure of central tendency
Nominal data best summarized by stating proportion of data values in each category
Data conveniently summarized in tabular form or through a pie chart
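Illustrative sketch in Python (function name and data are hypothetical). It summarizes nominal data by the proportion of data values in each category, which is the form suitable for a table or pie chart.

```python
from collections import Counter


def category_proportions(values):
    """Proportion of data values falling in each category."""
    counts = Counter(values)
    total = len(values)
    return {category: count / total for category, count in counts.items()}


# Hypothetical nominal data: blood groups of eight patients
blood_groups = ["O", "A", "O", "B", "A", "AB", "O", "O"]
proportions = category_proportions(blood_groups)
```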