GOAL OF DESCRIPTIVE STATISTICS

Provide succinct summary of data from study

Impractical to list results from all subjects in clinical trial

Such list also tends not to be informative

Better to summarize key features of data

Allows presentation in table or graph

Facilitates interpretation of results

Different strategies needed to summarize different types of data

Two features particularly important for summarizing set of data

Measure of central tendency ("average" value)

Measure of variability (dispersion) in data

Two additional features provide further information

Measure of asymmetry in distribution of data

Measure of relative prevalence of data in center & "tails" of distribution

Latter features can be important in selecting appropriate inferential statistical tests

Procedures listed for summarizing data assume whole population studied

Minor modifications needed when data are from sample (see later lecture)

Parameters (e.g., mean) defined on basis of whole population

Parameters conventionally assigned Greek symbols in statistics

Sample used to estimate parameters of population

MEASURES OF CENTRAL TENDENCY

3 measures commonly used to describe central tendency: mean, median & mode

Mean (arithmetic mean)

Often best central measure for interval data

Potentially misleading with asymmetric distribution

Median

Usually best central measure for ordinal data

Also best central measure for interval data with asymmetric distribution

Mode

Only central measure for nominal data

Can be useful with ordinal data (if large set of data)

Rarely used with interval data

2 additional measures used in specialized situations

Geometric mean

Harmonic mean

Both require interval data with absolute value for zero ("ratio" data)

More representative than arithmetic mean in specific circumstances

Mean (Arithmetic Mean)

Arithmetic average of all data values

Symbolized for population by μ (Greek letter 'mu')

Designated by X̄ ("X-bar") for sample

Sum (Σ) of all data values divided by number (N) of values

μ = ( Σ X ) / N

Appropriate measure of central tendency only when data values can be added

Appropriate for use with interval data

Inappropriate for use with ordinal data - unequally spaced scale

Cannot be used with nominal data

Relatively sensitive to extreme values

Especially with small set of data

Can be estimated from frequency table when individual values unavailable
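
The calculation above can be sketched in Python as follows. The function names, the example readings and the use of class midpoints for the frequency-table estimate are illustrative assumptions, not part of the lecture.

```python
def arithmetic_mean(values):
    """Population mean: sum of all data values divided by N."""
    if not values:
        raise ValueError("mean is undefined for an empty data set")
    return sum(values) / len(values)


def mean_from_frequency_table(midpoints, counts):
    """Estimate the mean when only class counts are available.

    Assumption: every value in a class is represented by the class midpoint.
    """
    total = sum(counts)
    if total == 0:
        raise ValueError("frequency table contains no observations")
    return sum(m * c for m, c in zip(midpoints, counts)) / total


# Hypothetical data: five systolic blood pressure readings (mmHg).
readings = [118, 120, 122, 124, 126]
print(arithmetic_mean(readings))          # 122.0
# One extreme value shifts the mean of a small data set markedly.
print(arithmetic_mean(readings + [200]))  # 135.0
# Hypothetical frequency table: classes centered on 115, 125 and 135 mmHg.
print(mean_from_frequency_table([115, 125, 135], [2, 5, 3]))  # 126.0
```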

Median

Middle value when data values listed in increasing order

Half of data values lie below median & half lie above it

No conventional symbol ascribed

Arrange data values in increasing order & count to halfway point

With even number of data values, take average of two middle values

Median = ( ½ N + ½ ) th data value

Most appropriate measure of central tendency for ordinal data

Also invaluable for interval data with asymmetric distribution

Less sensitive than mean to extreme values
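
A minimal Python sketch of the ( ½ N + ½ ) th rule is shown below. The function name and example values are assumptions chosen for illustration.

```python
def median(values):
    """Median via the (N/2 + 1/2)th value of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median is undefined for an empty data set")
    position = 0.5 * n + 0.5      # 1-based position, may end in .5
    lower = int(position)         # whole-number part of the position
    if position == lower:         # odd N: a single middle value
        return ordered[lower - 1]
    # even N: average of the two middle values
    return (ordered[lower - 1] + ordered[lower]) / 2


print(median([3, 1, 4, 1, 5]))   # 3
# The extreme value 100 barely moves the median (the mean would be 26.5).
print(median([1, 2, 3, 100]))    # 2.5
```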

Mode

Data value that occurs most frequently

No conventional symbol ascribed

Determined simply by counting number of occurrences for each data value

Only measure of "central tendency" applicable with nominal data

Can be meaningful for ordinal data (when set of data is large)

Modal class can be defined with frequency table for interval data

Set of data may have two (or more) modes - "bimodal" = 2 modes present

Likely to reflect inhomogeneous nature of population
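
Counting occurrences can be sketched in Python as below. The function name and the example data (blood groups, a small bimodal set) are hypothetical; returning a list is a design choice so that two or more modes can be reported.

```python
from collections import Counter


def modes(values):
    """Return every data value that occurs with the highest frequency."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    # More than one entry in the result means the data are bimodal or multimodal.
    return [value for value, count in counts.items() if count == top]


# Nominal data: the mode is the most common category.
print(modes(["A", "O", "B", "O", "AB", "O", "A"]))  # ['O']
# Two values tie for the highest count, so the set is bimodal.
print(modes([1, 2, 2, 3, 4, 4, 5]))                 # [2, 4]
```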

Geometric Mean

Calculated as Nth root of product (Π) of all data values

No conventional symbol ascribed

Geometric Mean = ᴺ√ ( Π X )

More readily calculated using logarithms

log ( Geometric Mean ) = { Σ ( log X ) } / N

Useful measure of central tendency for interval data usually shown on logarithmic scale

Never greater than arithmetic mean (equal only when all data values are identical)
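
The logarithmic route to the geometric mean can be sketched in Python as follows. The function name and the example values (which could stand for antibody titers) are assumptions. The sketch rejects zero or negative values because their logarithms are undefined.

```python
import math


def geometric_mean(values):
    """Geometric mean computed as exp of the mean of the logs."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty set of positive values")
    mean_log = sum(math.log(v) for v in values) / len(values)
    return math.exp(mean_log)


# Hypothetical values spanning two orders of magnitude.
titers = [1, 10, 100]
print(geometric_mean(titers))      # approximately 10
print(sum(titers) / len(titers))   # arithmetic mean is 37.0, pulled up by 100
```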

Harmonic Mean

Reciprocal of arithmetic mean of reciprocals of all data values

No conventional symbol ascribed

Harmonic Mean = 1 / [ { Σ ( 1 / X ) } / N ]

Useful measure of central tendency when data values represent rates or ratios

Never greater than arithmetic & geometric means (equal only when all data values are identical)
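
A short Python sketch of the harmonic mean follows. The function name and the speed example are illustrative assumptions; N divided by the sum of reciprocals is algebraically the same as the formula above.

```python
def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("harmonic mean needs a non-empty set of positive values")
    return len(values) / sum(1 / v for v in values)


# Hypothetical rates: equal distances covered at 40 km/h and at 60 km/h.
# The overall average speed is the harmonic mean, not the arithmetic mean of 50.
print(harmonic_mean([40, 60]))   # approximately 48
```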

MEASURES OF VARIABILITY (DISPERSION)

3 measures commonly used to describe variability in set of experimental data

Standard deviation

Often best measure of variability for interval data

Potentially misleading with asymmetric distribution

Lower & upper quartiles (or interquartile range)

Usually best measure of variability for ordinal data

Also best for interval data with asymmetric distribution

Range

Measure of extremes for variability in population

Cannot be estimated reliably from sample

4 additional measures of variability can provide useful information

Mean deviation

Helps to explain meaning of standard deviation

Variance

Square of standard deviation - used for inferential statistical tests

Coefficient of variation

Other percentile values

Mean Deviation

The term "deviation" denotes difference of data value from mean

Mean deviation is arithmetic average of absolute values of all deviations

No conventional symbol

Mean Deviation = { Σ | X - μ | } / N

Not used in practice

Not readily amenable to computational use

Invaluable for explaining meaning of standard deviation
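
As an illustration, the mean deviation can be computed in Python as below. The function name and the eight example values are assumptions.

```python
def mean_deviation(values):
    """Arithmetic average of the absolute deviations from the mean."""
    n = len(values)
    if n == 0:
        raise ValueError("mean deviation is undefined for an empty data set")
    mu = sum(values) / n
    return sum(abs(x - mu) for x in values) / n


# Hypothetical data with mean 5; absolute deviations sum to 12.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean_deviation(data))   # 1.5
```

For the same data the standard deviation is 2.0: a similar kind of "average" deviation, but one that gives more weight to the larger deviations.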

Standard Deviation

Defined as "root mean square deviation"

Symbolized for population by σ (lowercase Greek letter 'sigma')

Designated by lowercase s for sample

Square root of arithmetic average of squares of all deviations

σ = √ [ { Σ ( X - μ )² } / N ]

Important to recognize as a form of "average" of deviations

Appropriate measure of variability (dispersion, spread) only for interval data

Invaluable measure of variability for interval data with symmetric distribution

May be misleading for interval data with asymmetric distribution

Inappropriate for use with ordinal data

Essential role in inferential statistical tests for interval data with symmetric distribution

Parametric statistical tests

Very sensitive to extreme values

Especially with small set of data

Can be estimated from frequency table when individual values unavailable
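
The "root mean square deviation" definition translates directly into Python, as sketched below. The function name and data are assumptions; `statistics.pstdev` from the standard library is shown only as a cross-check of the whole-population formula (divisor N).

```python
import math
import statistics


def population_sd(values):
    """Square root of the average squared deviation from the mean."""
    n = len(values)
    if n == 0:
        raise ValueError("standard deviation is undefined for an empty data set")
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)


# Hypothetical data with mean 5; squared deviations sum to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_sd(data))        # 2.0
print(statistics.pstdev(data))    # standard-library population SD, for comparison
# A single extreme value inflates the result sharply in a small data set.
print(population_sd(data + [40]))
```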

Variance

Square of standard deviation

Symbolized for population by σ²

Designated by lowercase s² for sample

Can be referred to as "second moment" of data values

Calculated in same manner as standard deviation - without taking square root

Very sensitive to extreme values

Especially with small set of data

Care needed in listing units for variance

Square of units that apply to data values (also to standard deviation)

e.g., variance in m² for heights of patients (in metric units)

Coefficient of Variation

Standard deviation expressed as a percentage of the (arithmetic) mean

No conventional symbol ascribed

Coefficient of Variation = ( σ / μ ) × 100 %

Meaningful only for interval data with absolute value for zero ("ratio" data)

Dimensionless quantity (no units)

Useful for comparing variability when different units apply

Can compare variability for completely different observations

Useful for assessing precision of laboratory assay
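
A hedged Python sketch of the coefficient of variation follows. The function name and the height values are assumptions; the second example illustrates that a change of units leaves the result unchanged apart from floating-point rounding.

```python
import math


def coefficient_of_variation(values):
    """Population standard deviation as a percentage of the mean."""
    n = len(values)
    if n == 0:
        raise ValueError("coefficient of variation is undefined for an empty data set")
    mu = sum(values) / n
    if mu <= 0:
        raise ValueError("meaningful only for ratio data with a positive mean")
    sd = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    return 100 * sd / mu


# Hypothetical data: mean 5, standard deviation 2.
print(coefficient_of_variation([2, 4, 4, 4, 5, 5, 7, 9]))   # 40.0

# Same hypothetical heights in metres and in centimetres: the two results agree.
print(coefficient_of_variation([1.60, 1.70, 1.80]))
print(coefficient_of_variation([160, 170, 180]))
```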

Lower & Upper Quartiles

Lower quartile, median & upper quartile segregate data into quarters when listed in order

One quarter of data values lie below lower quartile

One quarter of data values lie between lower quartile & median

One quarter of data values lie between median & upper quartile

One quarter of data values lie above upper quartile

Arrange data values in increasing order & count to quarter-way point for lower quartile

Lower quartile also known as 25th percentile

Count three-quarters of way through data values to determine upper quartile

Upper quartile also known as 75th percentile

Lower Quartile = ( ¼ N + ½ ) th data value

Upper Quartile = ( ¾ N + ½ ) th data value

Most appropriate measure of variability for ordinal data

Also invaluable for interval data with asymmetric distribution

Much less sensitive than standard deviation to extreme values
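
The counting rules above can be sketched in Python as follows. The notes do not say what to do when ( ¼ N + ½ ) is not a whole number; this sketch assumes linear interpolation between the two neighboring values, by analogy with averaging the two middle values for the median. Function names and data are hypothetical.

```python
def value_at_position(ordered, position):
    """Value at a 1-based, possibly fractional, position in ordered data.

    Assumption: fractional positions are resolved by linear interpolation.
    """
    position = min(max(position, 1), len(ordered))  # keep within the data
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


def quartiles(values):
    """Lower and upper quartiles via the (N/4 + 1/2) and (3N/4 + 1/2) rules."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("quartiles are undefined for an empty data set")
    lower_q = value_at_position(ordered, 0.25 * n + 0.5)
    upper_q = value_at_position(ordered, 0.75 * n + 0.5)
    return lower_q, upper_q


# N = 8, so the positions are 2.5 and 6.5.
print(quartiles([2, 4, 4, 4, 5, 5, 7, 9]))   # (4.0, 6.0)
```

Statistical packages use several slightly different quartile conventions, so results for small data sets may differ from this rule.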

Interquartile Range

Range between lower quartile & upper quartile

Typically listed as values for lower & upper quartiles

Characteristically shown in graphical format (box plot or box & whisker plot)

Semi-Interquartile Range

Half of the interquartile range

Typically listed as a number (half of difference between lower & upper quartiles)

Range

Extreme values (lowest & highest) in population

Best listed as two extreme values (lowest & highest)

Sometimes listed as difference between lowest & highest values

Estimate from sample may be highly inaccurate

Other Percentile Values

Yield additional information about distribution of data values

Calculated in similar manner to lower & upper quartiles

10th Percentile = ( 1/10 N + ½ ) th data value

Provide more extensive description of distribution for ordinal data

Also useful for interval data with asymmetric distribution

10th percentile & 90th percentile can be displayed graphically on box & whisker plot

Alternatively, 5th percentile & 95th percentile shown

Range from 2½th percentile to 97½th percentile designated as "normal range"

Used extensively with laboratory values for patient diagnosis

Values outside normal range taken to indicate presence of disease

Equally applicable for interval & ordinal data
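
Generalizing the 10th-percentile rule, the pth percentile can be taken as the ( p/100 × N + ½ ) th data value. The Python sketch below assumes linear interpolation for fractional positions and uses hypothetical names and data (40 evenly spaced values).

```python
def percentile(values, p):
    """pth percentile via the (p/100 * N + 1/2)th value of the ordered data.

    Assumption: fractional positions are resolved by linear interpolation.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("percentiles are undefined for an empty data set")
    if not 0 <= p <= 100:
        raise ValueError("p must lie between 0 and 100")
    position = min(max(p * n / 100 + 0.5, 1), n)   # 1-based, kept within the data
    lower = int(position)
    fraction = position - lower
    if fraction == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


values = list(range(1, 41))   # hypothetical data: 1, 2, ..., 40
print(percentile(values, 10), percentile(values, 90))      # 4.5 36.5
# "Normal range" limits: 2.5th and 97.5th percentiles.
print(percentile(values, 2.5), percentile(values, 97.5))   # 1.5 39.5
```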

MEASURES OF ASYMMETRY

One parameter characteristically used to assess asymmetry in distribution of data values

Skewness

Best measure of asymmetry for interval data

Not applicable with other data types

Quartiles & other percentiles also yield data on asymmetry of data distribution

Invaluable measures of asymmetry for ordinal data

May provide useful information with interval data

Can aid in assessment of whether data are normally distributed

Skewness

Characteristically defined as third moment of data values

Scaled in relation to standard deviation so as to yield dimensionless quantity

Skewness = [ { Σ ( X - μ )³ } / N ] / σ³

Zero value for skewness corresponds to perfect symmetry in distribution

Values between -½ and +½ for skewness imply approximate symmetry

Negative values for skewness imply lower tail of distribution is longer ("skewed to left")

Values between -1 and -½ imply distribution moderately skewed to left

Values below -1 imply distribution highly skewed to left

Positive values for skewness imply upper tail of distribution is longer ("skewed to right")

Values between +½ and +1 imply distribution moderately skewed to right

Values above +1 imply distribution highly skewed to right

Distributions more often skewed to right than to left in practice
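
The skewness formula and the interpretation bands above can be sketched in Python as follows. Function names and data are assumptions, as is the treatment of values falling exactly on a band boundary (±½ counted as approximately symmetric, ±1 as moderate).

```python
import math


def skewness(values):
    """Third moment about the mean divided by the cube of the population SD."""
    n = len(values)
    if n == 0:
        raise ValueError("skewness is undefined for an empty data set")
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    if sigma == 0:
        raise ValueError("skewness is undefined when all values are identical")
    return (sum((x - mu) ** 3 for x in values) / n) / sigma ** 3


def describe_skewness(g):
    """Map a skewness value onto the bands listed in the notes."""
    if g < -1:
        return "highly skewed to left"
    if g < -0.5:
        return "moderately skewed to left"
    if g <= 0.5:
        return "approximately symmetric"
    if g <= 1:
        return "moderately skewed to right"
    return "highly skewed to right"


print(skewness([1, 2, 3, 4, 5]))              # 0.0 for a symmetric set
right_tailed = [1, 1, 2, 2, 3, 10]            # hypothetical data with a long upper tail
print(describe_skewness(skewness(right_tailed)))   # highly skewed to right
```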

MEASURES OF RELATIVE PREVALENCE OF CENTER & TAILS OF DISTRIBUTION

2 related parameters used to assess relative prevalence of center & tails of distribution

Kurtosis

Excess kurtosis

Both measures applicable only to interval data

Percentiles of distribution also indicative of pattern of prevalence in center & tails

Kurtosis

Characteristically defined as fourth moment of data values

Scaled in relation to standard deviation so as to yield dimensionless quantity

Kurtosis = [ { Σ ( X - μ )⁴ } / N ] / σ⁴

Minimum possible value of 1 for kurtosis

Applies to discrete distribution with 2 equally likely outcomes (e.g., coin toss)

Distribution without center or tails - all data in shoulders

Normal distribution has value of 3 for kurtosis

More extensive use thus made of "excess kurtosis"
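
A minimal Python sketch of the kurtosis formula is given below, with names and data chosen for illustration. It uses the identity σ⁴ = ( σ² )² to avoid taking a square root, and reproduces the minimum value of 1 for two equally likely outcomes.

```python
def kurtosis(values):
    """Fourth moment about the mean divided by the squared population variance."""
    n = len(values)
    if n == 0:
        raise ValueError("kurtosis is undefined for an empty data set")
    mu = sum(values) / n
    variance = sum((x - mu) ** 2 for x in values) / n   # sigma squared
    if variance == 0:
        raise ValueError("kurtosis is undefined when all values are identical")
    fourth_moment = sum((x - mu) ** 4 for x in values) / n
    return fourth_moment / variance ** 2                # sigma**4 == variance**2


# Coin toss coded as 0 and 1 with equal frequency: every deviation is +/- 0.5.
print(kurtosis([0, 1, 0, 1, 1, 0]))   # 1.0
```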

Excess Kurtosis

Calculated value for kurtosis reduced by 3

Adjusted to yield value of zero for normally distributed data

Excess Kurtosis = [ { Σ ( X - μ )⁴ } / N ] / σ⁴ - 3

Excess kurtosis (calculated as above) may be reported merely as "kurtosis"

Data distributed with excess kurtosis ≈ 0 designated as "mesokurtic"

Similar pattern to normal distribution

Data distributed with excess kurtosis < 0 designated as "platykurtic"

Distribution has broad & flat central peak & weak tails

Pronounced shoulders in distribution

e.g., uniform distribution of data

Data distributed with excess kurtosis > 0 designated as "leptokurtic"

Distribution has sharp central peak & prominent tails

e.g., t-distribution with few degrees of freedom (see later lecture)
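
The three designations can be illustrated with a short Python sketch. The function names, the example data and the `tolerance` used to decide what counts as "approximately zero" are all assumptions; the notes give no numerical cut-off.

```python
def excess_kurtosis(values):
    """Kurtosis (fourth moment / variance squared) reduced by 3."""
    n = len(values)
    if n == 0:
        raise ValueError("kurtosis is undefined for an empty data set")
    mu = sum(values) / n
    variance = sum((x - mu) ** 2 for x in values) / n
    if variance == 0:
        raise ValueError("kurtosis is undefined when all values are identical")
    fourth_moment = sum((x - mu) ** 4 for x in values) / n
    return fourth_moment / variance ** 2 - 3


def classify_kurtosis(excess, tolerance=0.5):
    """Label a distribution; `tolerance` is a hypothetical cut-off."""
    if excess < -tolerance:
        return "platykurtic"
    if excess > tolerance:
        return "leptokurtic"
    return "mesokurtic"


uniform_like = list(range(1, 11))                     # evenly spread values
heavy_tailed = [-10, -1, 0, 0, 0, 0, 0, 0, 1, 10]     # sharp peak, prominent tails
print(classify_kurtosis(excess_kurtosis(uniform_like)))   # platykurtic
print(classify_kurtosis(excess_kurtosis(heavy_tailed)))   # leptokurtic
```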

DESCRIPTIVE STATISTICS FOR NOMINAL DATA

Parameters listed above generally have no value for summarizing nominal data

Mode may have limited value as measure of central tendency

Nominal data best summarized by stating proportion of data values in each category

Data conveniently summarized in tabular form or through a pie chart
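
Summarizing nominal data by category proportions can be sketched in Python as below. The function name and the blood-group data are hypothetical.

```python
from collections import Counter


def category_proportions(values):
    """Proportion of data values falling in each category."""
    n = len(values)
    if n == 0:
        raise ValueError("no data values to summarize")
    return {category: count / n for category, count in Counter(values).items()}


# Hypothetical blood groups for eight patients.
blood_groups = ["O", "A", "O", "B", "A", "O", "AB", "O"]
for category, proportion in category_proportions(blood_groups).items():
    print(f"{category}: {proportion:.1%}")
# O: 50.0%
# A: 25.0%
# B: 12.5%
# AB: 12.5%
```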
