Terms in this set (28)
What function calculates the Euclidean distance between 2 sets of data?
dist(). It compares the rows (observations) of your data pairwise, so every observation needs a value in the same set of columns
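A minimal sketch of a Euclidean distance calculation with base R's dist(). The players data frame and its values are made up for illustration.

```r
# Hypothetical data: two numeric features for three players.
players <- data.frame(x = c(0, 3, 6), y = c(0, 4, 8))

# dist() returns the distance between every pair of rows.
d <- dist(players, method = "euclidean")
print(d)

# Check one pair by hand: sqrt((3 - 0)^2 + (4 - 0)^2) = 5
first_pair <- as.matrix(d)[1, 2]
first_pair
```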
Why is it important to have the same scales for your data?
Because the feature with the largest numeric range dominates the distance calculation. For instance, if you are comparing the heights and weights of people, both need to be on the same scale so that neither one outweighs the other
What function do you use to get values on the same scale?
scale(), which by default subtracts each column's mean and divides by that column's standard deviation
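As a quick illustration of scale() with its default settings, the sketch below standardizes a made-up people data frame and reproduces one column by hand. The data values are assumptions for the example only.

```r
# Hypothetical heights (cm) and weights (kg) on different scales.
people <- data.frame(height = c(150, 160, 170, 180),
                     weight = c(50, 65, 70, 95))

# Default arguments: center = TRUE, scale = TRUE.
scaled <- scale(people)

# Same result by hand for one column: (x - mean(x)) / sd(x)
by_hand <- (people$height - mean(people$height)) / sd(people$height)

colMeans(scaled)      # approximately 0 for every column
apply(scaled, 2, sd)  # 1 for every column
```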
What's the approach to clustering categorical data?
Dummify the data, or in other words create 1 or 0 'flag' columns, one per category
What function dummifies a dataframe?
What function and argument measure the distance between observations in dummified data?
dist() with the argument method = 'binary' (this is a binary, Jaccard-style distance rather than a Euclidean one)
e.g. dist(dummydata, method = "binary")
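A sketch of the binary distance on dummified data. The set does not say which function does the dummifying, so this example builds the flag columns with plain base R as a stand-in; the survey data is made up.

```r
# Hypothetical survey with two categorical columns.
survey <- data.frame(color = factor(c("red", "blue", "red")),
                     size  = factor(c("S", "S", "L")))

# Base R stand-in for dummifying: one 0/1 flag column per factor level.
flags <- lapply(survey, function(f) {
  sapply(levels(f), function(lv) as.integer(f == lv))
})
dummy_survey <- do.call(cbind, flags)
dummy_survey

# Binary distance: of the flags that are on in at least one of the two
# rows, the share that are on in only one of them.
d_binary <- dist(dummy_survey, method = "binary")
d_binary
```

Rows 1 and 2 share the size flag but differ on color, so 2 of their 3 active flags disagree and the distance is 2/3. Rows 2 and 3 share nothing, so their distance is 1.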
What is hierarchical clustering?
Grouping observations by distance at multiple levels
What are linkage criteria?
They are the rules used to measure the distance between groups of observations
What linkage criterion uses the max distance between two sets?
Complete linkage
e.g. max(c(distance_1, distance_2))
What linkage criterion uses the min distance between two sets?
Single linkage
e.g. min(c(distance_1, distance_2))
What linkage criterion uses the average distance between two sets?
Average linkage
e.g. mean(c(distance_1, distance_2))
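The three one-line examples above can be put side by side. In this sketch, distance_1 and distance_2 are made-up distances from a new observation to the two members of an existing group; the names "complete", "single", and "average" are the matching method values accepted by hclust().

```r
# Hypothetical distances from a new observation to each of the two
# members of an existing group.
distance_1 <- 2.5
distance_2 <- 4.0

complete_link <- max(c(distance_1, distance_2))   # farthest member
single_link   <- min(c(distance_1, distance_2))   # nearest member
average_link  <- mean(c(distance_1, distance_2))  # mean over members

c(complete = complete_link, single = single_link, average = average_link)
```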
What function breaks up a dist object into different hierarchical clusters? Name its 2 arguments
hclust()
the first input is a dist object, the result of the dist() function
the second is method, the linkage criterion to use
e.g. hclust(dist(lineup, method = 'euclidean'), method = 'complete')
What function takes an hclust object and cuts it down into a specified number of groups? What are its arguments?
cutree()
the first input is an hclust object
the second is the number of groups, k
If you wanted to take the results of cutree() and assign them back to the original data, what approach would you use?
dplyr mutate(), adding a new column that is equal to the results of cutree()
e.g. mutate(lineup, cluster = clusters_k2)
From end to end, explain the 4 steps of hierarchical clustering
1. Measure the Euclidean distance between the objects with dist()
2. Break the dist result up into hierarchical clusters with hclust()
3. Cut the clusters down to the number of groups you want with cutree()
4. Assign the results back to the original data using dplyr mutate()
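The four steps could look like this in practice. This is a sketch on made-up data: the lineup values and the names dist_players, hc_players, and lineup_k2 are assumptions, and complete linkage is picked only as an example. The last step uses dplyr when it is installed and falls back to base R otherwise.

```r
# Hypothetical positions of six players in two well-separated groups.
lineup <- data.frame(x = c(1, 2, 1, 10, 11, 10),
                     y = c(1, 1, 2, 10, 10, 11))

# Step 1: pairwise Euclidean distances.
dist_players <- dist(lineup, method = "euclidean")

# Step 2: hierarchical clustering (complete linkage chosen as an example).
hc_players <- hclust(dist_players, method = "complete")

# Step 3: cut the tree into k = 2 groups.
clusters_k2 <- cutree(hc_players, k = 2)

# Step 4: attach the cluster labels to the original data.
if (requireNamespace("dplyr", quietly = TRUE)) {
  lineup_k2 <- dplyr::mutate(lineup, cluster = clusters_k2)
} else {
  lineup_k2 <- transform(lineup, cluster = clusters_k2)  # base R fallback
}
lineup_k2
```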
What are the 3 steps to visualize a colored dendrogram, and what library is needed for the color?
1. Convert your hclust object to a dendrogram with as.dendrogram()
2. With the library dendextend, use the function color_branches() with the h argument to set the height at which branches are split into colored groups
3. Plot the dendrogram
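A sketch of those three steps on made-up data. The lineup values and the cut height of 5 are assumptions for illustration, and the coloring step only runs when the dendextend package is installed.

```r
# Hypothetical data: two well-separated groups of three points.
lineup <- data.frame(x = c(1, 2, 1, 10, 11, 10),
                     y = c(1, 1, 2, 10, 10, 11))
hc_players <- hclust(dist(lineup), method = "complete")

# Step 1: convert the hclust object to a dendrogram.
dend_players <- as.dendrogram(hc_players)

# Step 2: color the branches that separate below height 5
# (skipped when dendextend is not available).
if (requireNamespace("dendextend", quietly = TRUE)) {
  dend_players <- dendextend::color_branches(dend_players, h = 5)
}

# Step 3: plot it.
plot(dend_players)

# Cutting the tree at the same height gives the matching group labels.
groups_h5 <- cutree(hc_players, h = 5)
groups_h5
```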
What does the h argument do in the color_branches() function specifically, and in dendrograms in general?
In a dendrogram, h is a height on the vertical axis. For an hclust object, the height at which two groups merge is the distance between them, as measured by the linkage criterion you chose when you created the hclust object. In color_branches(), h is the height at which the tree is cut, and each resulting group gets its own color
What function splits a hierarchical cluster into smaller chunks by height?
cutree(), using the h argument instead of k
What is another name for a tree diagram?
A dendrogram
Name 2 types of clustering methods and the functions that perform them
hclust() - hierarchical clustering, essentially grouping observations by distance at each level
kmeans() - k-means clustering: given a predefined number of groups, k, the function finds k 'centroids' and assigns each observation to its nearest centroid
From a kmeans object, how do you get the cluster assignments for the observations?
$cluster, as in bob$cluster where bob is the kmeans object
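For instance, a small sketch on made-up data (the lineup values and the object name km are assumptions; nstart = 10 just makes the result less dependent on the random starting centroids):

```r
set.seed(42)  # k-means starts from random centroids

# Hypothetical data: two well-separated groups of three points.
lineup <- data.frame(x = c(1, 2, 1, 10, 11, 10),
                     y = c(1, 1, 2, 10, 10, 11))

km <- kmeans(lineup, centers = 2, nstart = 10)

km$cluster         # one integer cluster label per observation
table(km$cluster)  # how many observations landed in each cluster
```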
What are 3 methods for deciding what k is for kmeans()?
1. Intuitively...you know already
2. Elbow method
3. Silhouette method
Explain the elbow method
The elbow method calculates the total sum of squared distances between each observation and its cluster centroid. This is done for k = 1, 2, and so on. Plot the totals against k; the point where the curve bends and starts to level out (the 'elbow') is the k you want to use
Where would you find the sum of squared distances for the elbow method?
It is stored in the kmeans object as tot.withinss
In calculating the elbow method, what function from what library is recommended to calculate the sum of squared distances for each k?
map_dbl() from the purrr library, which applies the same function to each k and returns a numeric vector
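A sketch of the elbow calculation on made-up data. The lineup values, the helper total_withinss(), and the range 1:5 are assumptions for the example. It uses purrr::map_dbl() when purrr is installed and base R's vapply() otherwise.

```r
set.seed(42)  # k-means starts from random centroids

# Hypothetical data: two well-separated groups of four points.
lineup <- data.frame(x = c(1, 2, 1, 2, 10, 11, 10, 11),
                     y = c(1, 1, 2, 2, 10, 10, 11, 11))

# Hypothetical helper: total within-cluster sum of squares for one k.
total_withinss <- function(k) {
  kmeans(lineup, centers = k, nstart = 10)$tot.withinss
}

ks <- 1:5
if (requireNamespace("purrr", quietly = TRUE)) {
  tot_withinss <- purrr::map_dbl(ks, total_withinss)
} else {
  tot_withinss <- vapply(ks, total_withinss, numeric(1))  # base R fallback
}

elbow_df <- data.frame(k = ks, tot_withinss = tot_withinss)
elbow_df

# The bend in this curve is the elbow.
plot(elbow_df$k, elbow_df$tot_withinss, type = "b",
     xlab = "k", ylab = "Total within-cluster sum of squares")
```

With two clearly separated groups, the total drops sharply from k = 1 to k = 2 and changes little afterwards, which points to k = 2.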
What is another name for an elbow plot?
In silhouette analysis, what two distances are calculated for each observation?
C(i), the within-cluster distance - the average distance from the observation to every other observation in its own cluster
N(i), the closest-neighbor distance - the average distance from the observation to the observations in each other cluster, taking the smallest of those averages (the closest neighboring cluster)
A silhouette analysis produces a value between 1 and -1 for each observation. Explain what 1, 0, and -1 mean
1 means the observation is well matched to its cluster
0 means it sits on the border between two clusters and could belong to either
-1 means it probably belongs to a neighboring cluster instead
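A sketch of a silhouette calculation, assuming the cluster package (normally installed alongside R) and made-up data. It also recomputes the width for the first observation by hand as (closest-neighbor distance - within-cluster distance) / the larger of the two. The variable names are assumptions for the example.

```r
library(cluster)  # provides silhouette()

set.seed(42)  # k-means starts from random centroids

# Hypothetical data: two well-separated groups of four points.
lineup <- data.frame(x = c(1, 2, 1, 2, 10, 11, 10, 11),
                     y = c(1, 1, 2, 2, 10, 10, 11, 11))

km  <- kmeans(lineup, centers = 2, nstart = 10)
sil <- silhouette(km$cluster, dist(lineup))

sil[, "sil_width"]        # one silhouette width per observation
mean(sil[, "sil_width"])  # average width for this choice of k

# By hand for observation 1.
d     <- as.matrix(dist(lineup))
own   <- setdiff(which(km$cluster == km$cluster[1]), 1)
other <- which(km$cluster != km$cluster[1])
within_1   <- mean(d[1, own])    # within-cluster distance
neighbor_1 <- mean(d[1, other])  # only one other cluster exists here
s_1 <- (neighbor_1 - within_1) / max(within_1, neighbor_1)
s_1
```

All widths here are close to 1 because the two groups are far apart relative to their own spread.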