• incomplete: lacking attribute values, lacking certain attributes of interest
- e.g., occupation=" "
• noisy: containing errors or outliers
- e.g., Salary="‐10"
• inconsistent: containing discrepancies in codes or names
- e.g., Age="42" Birthday="03/07/1997"
- e.g., was rating "1,2,3", now rating "A, B, C"
- e.g., discrepancy between duplicate records
• No quality data, no quality mining results!
• Quality decisions must be based on quality data
• Data extraction, integration, cleaning, transformation, and reduction comprise the majority of the work of building target data
• We can turn a numeric attribute into a nominal/categorical one by using some sort of discretization.
• This involves dividing the range of possible values into subranges called buckets or bins.
- example: an age attribute could be divided into these bins:
child: 0‐12, teen: 12‐17, young: 18‐35, middle: 36‐59, senior: 60+
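Manual binning like the age example above is just a chain of threshold checks. A minimal sketch (the function name `age_bin` is ours, and we resolve the ambiguous child/teen boundary at 12 by assigning 12 to "child"):

```python
def age_bin(age):
    """Map a numeric age to one of the named bins from the example above."""
    if age <= 12:          # 0-12 -> child (12 itself assigned here by our choice)
        return "child"
    elif age <= 17:        # 13-17 -> teen
        return "teen"
    elif age <= 35:        # 18-35 -> young
        return "young"
    elif age <= 59:        # 36-59 -> middle
        return "middle"
    else:                  # 60 and up -> senior
        return "senior"

print(age_bin(42))  # → middle
```

The drawback of this hand-picked scheme is that it requires domain knowledge about which subranges are meaningful.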
• What if we don't know which subranges make sense?
• Equal‐width binning divides the range of possible values into N subranges of the same size.
- bin width = (max value - min value) / N
- example: if the observed values are all between 0 and 100, we could create 5 bins as follows:
width = (100 - 0)/5 = 20
bins: [0‐20], (20‐40], (40‐60], (60‐80], (80‐100]
- problems with this equal‐width approach?
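The equal-width rule above can be sketched in a few lines (the function name `equal_width_bins` is ours):

```python
def equal_width_bins(min_value, max_value, n):
    """Return the n+1 boundaries of n equal-width bins over [min_value, max_value]."""
    width = (max_value - min_value) / n   # bin width = (max - min) / N
    return [min_value + i * width for i in range(n + 1)]

print(equal_width_bins(0, 100, 5))  # → [0.0, 20.0, 40.0, 60.0, 80.0, 100.0]
```

One answer to the "problems?" question: because the boundaries depend only on the min and max, a single outlier (e.g., one value of 10,000) stretches the range so that almost all instances land in one bin.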
• Equal‐frequency or equal‐height binning divides the range of possible values into N bins, each of which holds the same number of training instances.
- example: let's say we have 10 training examples with the following values for the attribute that we're discretizing:
5, 7, 12, 35, 65, 82, 84, 88, 90, 95
To select the boundary values for the bins, this method typically chooses a value halfway between the training examples on either side of the boundary.
final bins: (‐inf, 9.5], (9.5, 50], (50, 83], (83, 89], (89, inf)
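The halfway-boundary rule can be sketched as follows (the function name `equal_frequency_boundaries` is ours; this sketch assumes the number of values divides evenly by N):

```python
def equal_frequency_boundaries(values, n):
    """Return the n-1 interior bin boundaries, each chosen halfway between
    the sorted values on either side of a bin edge."""
    vals = sorted(values)
    per_bin = len(vals) // n                       # instances per bin
    return [(vals[i * per_bin - 1] + vals[i * per_bin]) / 2
            for i in range(1, n)]

vals = [5, 7, 12, 35, 65, 82, 84, 88, 90, 95]
print(equal_frequency_boundaries(vals, 5))  # → [9.5, 50.0, 83.0, 89.0]
```

These four boundaries reproduce the interior edges of the final bins above; the outermost bins are left open at ‐inf and inf so that unseen values outside the training range still fall into some bin.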