R Programming
Key Concepts:
Terms in this set (213)
Vector
A one-dimensional ordered collection of values of the same type.
Array
A multi-dimensional generalization of vectors.
List
An ordered collection of variables of possibly different types.
List Signature
An ordered list of variable types in the list.
Dataframe
An ordered collection of lists having the same list signature.
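The four structures defined above can be sketched in base R as follows. The variable names are illustrative only and do not come from the card set.

```r
# Vector: one-dimensional, all elements share a type
v <- c(1, 2, 3)

# Array: multi-dimensional generalization of a vector (here 2 rows x 3 columns)
arr <- array(1:6, dim = c(2, 3))

# List: ordered collection whose elements may differ in type
lst <- list(name = "Mike", age = 30)

# Data frame: a table; every row has the same sequence of column types
df <- data.frame(name = c("Mike", "Mark"), age = c(30, 40))

# Compact display of the structure of the data frame
str(df)
```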
Declaring a Vector
A = c(1, 2, 3)
Simple Variable Assignment
a = 1
Variable assignment with arrow syntax
a <- 1
Multiple Variable Assignment on a Single Line
a = 1; b = 2; c = 3
Concatenation Function
c()
List all variables in memory
ls()
Remove a specific variable from memory
rm(a)
Remove all variables from memory
rm(list=ls())
Set working directory to '/'
setwd('/')
Show example of a function X
example(X)
Start the web based help page
help.start()
Search the help pages for a specific term X
help.search('X')
All items in the array except the third.
A[-3]
Length of array A
length(A)
Element wise arithmetic
Applying a function to every element of an array
Vector function example
y = vector(mode="logical", length=4)
typeof(a)
Returns the type of variable a
length(A)
Returns the length of array A
as.integer(b)
Casts variable b to an integer
ls.str()
Shows variables in memory and their types
Factor variables
Represent values from an ordered or unordered finite set
rep(3.2, times=10) function
Repeats the value 3.2 10 times.
seq(0, 100, by=1)
Creates a vector of numerics from 0 to 100 that increment by 1.
seq(0, 1, length.out = 11)
11 evenly spaced numbers between 0 and 1
y = list(name="Mike", title="badass")
creates a list with two attributes
y$name
the name variable in the list y
x <- matrix(c(1,2,3,4), nrow=2, ncol=2)
creates a matrix (or array) that is 2x2
is.array(w)
returns TRUE if w is an array, otherwise FALSE
any(w < 0.5)
returns TRUE if any values of w are less than 0.5, otherwise FALSE
all(w < 0.5)
returns TRUE if all values of w are less than 0.5, otherwise FALSE
which(w<0.5)
returns the indices of w whose values are less than 0.5, or an empty integer vector if there are none
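A small sketch of any, all, and which on a made-up vector w (the values are illustrative). Note that which returns an empty integer vector when no element matches.

```r
w <- c(0.2, 0.7, 0.4, 0.9)

any(w < 0.5)     # TRUE: at least one value is below 0.5
all(w < 0.5)     # FALSE: not every value is below 0.5
which(w < 0.5)   # 1 3: positions of the matching values
which(w > 1)     # integer(0): nothing matches
```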
w[w>50] = 0
set all values of w that are greater than 50 to 0
x = array(data = z, dim = c(4, 5))
creates a two dimensional array with 4 rows and 5 columns
x[2,3]
access the 2nd row, 3rd column of a 2d array
x[2,]
accesses the entire 2nd row
x[-1,]
all rows but the first
y = x[c(1,2),c(1,2)]
extracts the top left 2x2 of an array
Matrix transpose
A reflection of the matrix across its main diagonal, which swaps rows and columns. Computed with t(x).
outer(y[1,], y[1,])
The outer product of two vectors (results in a matrix)
system('ls')
Runs the ls command on the system
dir()
Lists the files in the current directory
dir('dir')
Lists the files in the directory specified
y %*% y
Matrix or Inner Product of y and y
rbind(x[1,], x[1,])
Vertical concatenation of vectors.
cbind(x[1,], x[1,])
Horizontal concatenation of vectors.
L = list(name='mike', age=100, no.children=2, children.ages=c(60,50))
Creates a new list with a number of properties.
L$name
Prints the name variable in the list L.
L[1]
Returns a List with the first variable from the List L in it. The returned type is a List.
L[[1]]
Returns the first variable from the List L. If the first variable were a character, the returned type would be a character.
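The difference between single and double brackets can be checked with class. This is a minimal sketch using a shortened version of the list L from the cards above.

```r
L <- list(name = "mike", age = 100)

class(L[1])        # "list": single brackets keep the list wrapper
class(L[[1]])      # "character": double brackets extract the element itself
class(L["name"])   # "list": the same rule applies when indexing by name
L[["name"]]        # "mike"
```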
names(L)
Returns the names (keys) of the elements in the list L.
names(L)=c('a','b','c','d')
Overrides the variable names in the list L with a, b, c, d.
L['name']
Returns a list containing the element 'name' from the list L. Use L[['name']] to get the element itself.
a = c(1,2)
b=c(1,2,3,4)
c = a+b
c
2 4 4 6
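The result above comes from R's recycling rule: when two vectors differ in length, the shorter one is repeated until it matches the longer one. A small sketch that makes the repetition explicit with rep:

```r
a <- c(1, 2)
b <- c(1, 2, 3, 4)

a + b                                  # 2 4 4 6

# The same computation with the recycling written out by hand
a_recycled <- rep(a, length.out = length(b))   # 1 2 1 2
a_recycled + b                         # 2 4 4 6
```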
is.vector(a)
Returns TRUE if a is a vector, FALSE otherwise.
b = c(1,2,3)
b[5] = 5
b
1 2 3 NA 5
n = c('Mike', 'Mark')
a = c(30, 40)
s = c(1000,2000)
R = data.frame(name=n, age=a, salary=s)
names(R) = c('Name', 'Age', 'Salary')
R
Creates a DataFrame, essentially a table.
Name Age Salary
1 Mike 30 1000
2 Mark 40 2000
iris dataset
A core package dataset that includes flower measurements of different flower species.
head(iris, 4)
the first 4 rows of the iris dataset
tail(iris, 4)
the last 4 rows of the iris dataset
iris$Sepal.Length
All Sepal Length values in the iris dataset. A vector.
iris$Sepal.Length[1:10]
The first 10 Sepal Length values. A vector.
attach(iris, warn.conflicts=FALSE)
Attaches the iris dataset to the R search path. This allows us to refer to its columns directly, i.e. we can just say Sepal.Length instead of iris$Sepal.Length.
mean(Sepal.Length)
The average of Sepal Lengths.
colMeans(iris[,1:4])
The means of the first 4 columns of iris.
dim(iris)
Returns the number of rows and columns of the iris dataset.
150 5
subset(iris, Sepal.Length < 5 & Species != 'setosa')
Returns a subset of the iris dataframe which includes rows where the Sepal Length is less than 5 and the Species is not setosa.
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
58 4.9 2.4 3.3 1.0 versicolor
107 4.9 2.5 4.5 1.7 virginica
dim(subset(iris, Sepal.Length < 5 & Species != 'setosa'))[1]
Returns the number of rows of the resulting subset.
2
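The same filter can be written with logical indexing instead of subset. A minimal sketch on the built-in iris dataset; s1 and s2 are illustrative names.

```r
# Using subset(): column names are looked up inside the data frame
s1 <- subset(iris, Sepal.Length < 5 & Species != "setosa")

# Using logical indexing: the row condition is spelled out with iris$
s2 <- iris[iris$Sepal.Length < 5 & iris$Species != "setosa", ]

nrow(s1)            # 2
all.equal(s1, s2)   # both approaches select the same rows
```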
summary(iris)
A useful statistical summary of the dataset iris.
edit(iris)
Opens the dataset in a spreadsheet editor.
Iris = read.table('someFile.txt', header=TRUE)
Reads someFile.txt into Iris as a dataframe and expects a header to be used as the column names.
save.image(file='fname')
Saves an image of the current working memory (i.e. all variables) to a file named fname.
load('fname')
Loads working data into memory from filename fname.
history(10)
Displays the 10 most recently executed commands.
iris = edit(iris)
Edits the iris dataset in a spreadsheet and sets the resulting dataset to the iris variable.
log(10)
natural log of 10
log10(100)
log base 10 of 100, i.e. 2
exp(0)
e raised to the power of 0, i.e. 1
exp(1)
e raised to the power of 1, i.e. 2.718282
pi
A global variable representing pi.
unname(L)
Strips out the names of attributes in the list or dataframe.
sink('outFile')
Sends console output to the 'outFile' file, not to the console.
sink('outFile', split=TRUE)
Sends console output to the 'outFile' file AND to the console.
cat()
Prints arguments one after the other. e.g. cat("Hello", "World")
for (num in seq(1, 100, by=1)) { print("hello world") }
A for loop that prints "hello world" 100 times.
repeat { }
Repeats code block until break is called.
while (b > a) { }
Repeats the code block as long as b > a.
foo(1,2,3)
Calls function foo, passing in argument 1, 2, 3.
If argument(s) are omitted, the default value for the variable is used.
foo(name='mike', age=36)
Calls function foo, passing in arguments name and age. Note: order does not matter when arguments are named.
mypower = function(bas = 10, pow = 2) {
return(bas ^ pow)
}
Defines a function, mypower, that takes 2 arguments, bas and pow and returns bas ^ pow. bas has a default value of 10, pow has a default value of 2.
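A short usage sketch for mypower, showing default values, positional arguments, and named arguments in any order:

```r
mypower <- function(bas = 10, pow = 2) {
  return(bas ^ pow)
}

mypower()                   # 100: both defaults are used (10 ^ 2)
mypower(2, 3)               # 8: positional arguments, bas = 2 and pow = 3
mypower(pow = 3)            # 1000: bas falls back to its default of 10
mypower(pow = 2, bas = 3)   # 9: named arguments may appear in any order
```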
Variable passing
R passes variables to functions by value.
Vectorized code
Code, like element wise operations in arrays, that avoids computations in loops in the interpreter. Vectorized code is executed natively, like in C.
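To illustrate, the sketch below squares a vector twice, once with an explicit loop and once with a vectorized expression. The vector size is arbitrary, and timings will vary by machine, so no timing output is shown.

```r
x <- 1:100000

# Loop version: every iteration is evaluated by the interpreter
squares_loop <- numeric(length(x))
for (i in seq_along(x)) {
  squares_loop[i] <- x[i]^2
}

# Vectorized version: one call, with the element-wise loop done in compiled code
squares_vec <- x^2

# Both approaches give the same values
all.equal(squares_loop, squares_vec)

# Compare run times of the two approaches
system.time(for (i in seq_along(x)) squares_loop[i] <- x[i]^2)
system.time(x^2)
```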
a = 1:10
Creates a vector of 10 integers, 1 through 10.
system.time(expr)
Displays the time it took to evaluate the expression expr.
sapply(data, function)
Applies the supplied function to the data set provided.
R CMD SHLIB foo.c
Compiles foo.c into a foo.so shared object file whose functions can be called from R with the .C or .Call functions. e.g. dyn.load('foo.so'); .C('foo',...)
library(ggplot2)
Loads the ggplot2 library into memory.
dyn.load("fooC.so")
Loads the fooC.so shared object file into memory.
.C("fooC", A, B, C, D)
Calls the C function fooC, passing inputs A, B, C and result D.
microbenchmark(function)
Displays the execution time of function in microseconds.
.Rprofile
Placed in the user's home directory, can be used to define .First and .Last functions.
.First
Function in .Rprofile that gets executed when R starts.
.Last
Function in .Rprofile that executes when R stops.
#pragma omp parallel for
A directive placed before a for loop in C that multi-threads the for loop. From the OpenMP extension.
options(expressions=500000)
Sets the maximum number of nested recursive calls to 500,000.
source('h1.R')
Loads the R code in h1.R into the interpreter and runs it.
stopifnot(boolean)
Similar to an assert statement. Stops the program if the boolean is not TRUE.
graphics package
The default visualization package in R. Harder to use than ggplot2 but may run faster.
ggplot2
A visualization package that may be simpler to use than graphics. It may also run slower, however. It is based on the Grammar of Graphics by Wilkinson (2005).
install.packages('ggplot2')
library(ggplot2)
Install the ggplot2 package and bring into scope.
faithful
A dataset of eruption times of Old Faithful in Yellowstone National Park, Wyoming, USA. Part of the datasets package.
datasets package
A package of datasets that come installed by default in R.
mtcars
A dataset of car model data from the Motor Trend Magazine in 1974. Part of the datasets package.
mpg
A dataset of car model data from fueleconomy.gov. Part of the ggplot2 package.
high-level functions
In the graphics package, functions that produce graphs, e.g. plot, hist, or curve.
low-level functions
In the graphics package, functions that edit graphs.
title
A low-level function in the graphics package that modifies the title of a graph.
grid
A low-level function in the graphics package that adds grid lines to a graph.
legend
A low-level function in the graphics package that connects symbols, colors, and line types to descriptive strings.
lines
A low-level function in the graphics package that adds a line plot to an existing graph.
qplot
A function from the ggplot2 package that produces a scatter plot by default.
ggplot
A function from the ggplot2 package that returns a graphics object that may be modified by adding layers to it. Provides automatic axes labeling and legends.
print
A function that will print a ggplot graph.
plot(x=data_frame$x, y=data_frame$y)
Using the high-level plot function from the graphics package that produces a simple scatter plot.
title(main="Some title")
Adds a title to the current plot.
qplot(x=x, y=y, data=data_frame, main="title feature", geom="point")
Creates a basic scatter plot with ggplot's qplot function.
aes
A function that accepts data variables as arguments and is passed to ggplot.
ggplot(data_frame, aes(x=x, y=y)) + geom_point()
Creates a scatter plot with ggplot and adds point geometry to it.
diamonds
A dataset from the ggplot2 package that lists the details of 50,000 round cut diamonds.
xlab
Defines the x label attribute in the plot function.
ylab
Defines the y label attribute in the plot function.
main
Defines the title string in the plot function.
Strip Plot
A plot that maps the ordered index of a data frame against a single attribute. The strip plot can expose strips or lines of similar data.
Histogram
A one dimensional plot that groups data into bins and shows counts of those bins.
bin width
Good bin width balances information loss with good aggregation.
Histograms vs Strip Plots
Histograms discard the ordering of the data.
breaks
In a histogram (hist) plot, a single number passed as breaks suggests the number of bins. R treats it as a suggestion only and chooses "pretty" break points, so the actual number of bins may differ.
hist(data_frame$x, xlab="x label", ylab="y label", main="title", breaks=20)
Uses the graphics package to generate a histogram of the x dimension of data_frame with roughly 20 bins.
qplot(x = x, data=data_frame, binwidth=3, main="title")
Uses the ggplot2 package to generate a histogram from dimension x of data_frame.
ggplot(data_frame, aes(x=x)) + geom_histogram(binwidth=1)
Creates a plot of the x attribute of data_frame and displays it as a histogram.
ggplot(data_frame, aes(x=x, y=..density..)) + geom_histogram(binwidth=4)
Creates a plot of the x attribute of data_frame and shows the probability distribution of the bin values on the y axis and displays as a histogram of binwidth 4.
curve(sinc, -7, 7)
Creates a line plot using the graphics package that applies values from -7 to 7 to the function sinc.
S = sort.int(mpg$cty, index.return=T)
Sorts the values of mpg$cty and returns a list with two components: S$x, the sorted values, and S$ix, the index of each sorted value in the original data set.
plot(S$x, type="l", lty=2, xlab="x label", ylab="y label")
Creates a line plot of the S$x dimension with line type of 2.
lines(mpg$hwy[S$ix], lty=1)
Adds a line representing highway mileage to an existing plot using line type 1.
legend("topleft", c("highway mpg", "city mpg"), lty=c(1,2))
Creates a legend in the top left corner of an existing plot with the labels and line types given.
qplot(x, y, geom="line")
Creates a line plot of x and y.
qplot(x, y, geom=c("line", "point"))
Creates a line plot of x and y with points.
geom_line()
Adds a line to a ggplot plot.
geom_point()
Adds points to a ggplot plot.
ggplot(data_frame, aes(x=x, y=y)) + geom_line() + geom_point()
Creates a line plot with points from the x and y dimensions of data_frame.
Faceting
Displaying multiple panels in the same graph.
qplot(x, y, geom=c("line"))
Creates a line chart of the variables x and y.
Smoothed Histogram
$f_h : \mathbb{R} \to \mathbb{R}_+$
$f_h(x) = \frac{1}{n} \sum_{i=1}^{n} K(x - x_i)$
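The formula above can be sketched directly in R. This is a minimal illustration that assumes K is a Gaussian kernel scaled by a bandwidth h; the kernel choice, the bandwidth value of 4, the helper names kernel_h and f_h, and the evaluation grid are all assumptions made for this example.

```r
# Assumed kernel: a Gaussian density scaled by the bandwidth h
kernel_h <- function(u, h) dnorm(u / h) / h

# Smoothed histogram estimate at one point x0: the average kernel value
# over all data points, i.e. (1/n) * sum_i K(x0 - x_i)
f_h <- function(x0, data, h) mean(kernel_h(x0 - data, h))

waiting <- faithful$waiting            # built-in Old Faithful data
grid <- seq(40, 100, by = 1)           # points at which to evaluate f_h
est <- sapply(grid, f_h, data = waiting, h = 4)

plot(grid, est, type = "l", xlab = "waiting time (min)", ylab = "estimated density")
```

Smaller values of h follow the data more closely, while larger values give a smoother curve.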
Scatter Plot
Plots two variables against each other as points.
plot(faithful$waiting, faithful$eruptions, xlab="waiting time (min)", ylab="eruption time (min)")
Creates a scatter plot with the graphics package, with waiting time on the x axis and eruption time on the y axis.
kernel function
smooths a distribution
pch, col, cex
Shape, Color, Size of points in a scatter plot
qplot(waiting, eruptions, data=faithful)
Scatterplot of faithful waiting vs eruptions
IQR
Interquartile range: the difference between the 75th and 25th percentiles.
Box plot whiskers
Extend no more than 1.5 times the IQR away from the edges of the box
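The IQR and the whisker limits can be computed by hand. A rough sketch on the built-in faithful data, using quantile for the box edges; note that boxplot itself computes the box edges as hinges, which can differ slightly from these quantiles.

```r
x <- faithful$waiting

q <- quantile(x, c(0.25, 0.75))   # edges of the box
iqr <- IQR(x)                     # equals q[2] - q[1]

# Whiskers may not extend past these limits
lower_limit <- q[1] - 1.5 * iqr
upper_limit <- q[2] + 1.5 * iqr

# Each whisker ends at the most extreme data point inside the limits
whiskers <- range(x[x >= lower_limit & x <= upper_limit])
whiskers
```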
qqplot
quantile-quantile plots
ggsave("filename.pdf")
Saves the current graph as a pdf.
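Raster graphics
Store the image as a grid of pixels. Quality degrades when the image is scaled up; file size depends on the resolution.
Vector graphics
Store the image as geometric shapes. The image scales to any size without loss of quality; file size grows with the number of plotted elements.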
Problems with Data sets
1.) Missing Data
2.) Outliers
3.) Highly skewed data
MCAR
Missing Completely At Random
Missingness does not depend on observed or unobserved variables.
MAR
Missing At Random
Missingness depends on observed variables only.
Ways to Deal with Missing Data
1.) Remove data with missing values.
2.) Replace missing values with a substitute (e.g. mean).
3.) Estimate a probability model and replace with values from that model.
Replacing missing data for MCAR
Any of the 3 missing data techniques are OK.
Replacing missing data for MAR
Methods may introduce systematic bias into the data.
is.na(dataframe)
Returns TRUE where dataframe value is NA and FALSE otherwise.
complete.cases(dataframe)
Returns a vector whose components are FALSE for all data rows missing data and TRUE for all data rows with no missing data.
na.omit(dataframe)
Returns a new dataframe that omits all rows with missing data.
na.rm
An argument accepted by many functions (e.g. mean, sum) that, when set to TRUE, removes missing values before the computation.
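The missing-data helpers above can be tried on a small made-up data frame. The data and the variable names df and cc are illustrative; the last step shows mean substitution, one of the replacement strategies listed earlier.

```r
df <- data.frame(x = c(1, NA, 3), y = c("a", "b", NA))

is.na(df)                   # matrix of TRUE/FALSE, TRUE where a value is missing
cc <- complete.cases(df)    # TRUE FALSE FALSE: only row 1 has no missing values
na.omit(df)                 # keeps only the complete rows

mean(df$x)                  # NA: the missing value propagates
mean(df$x, na.rm = TRUE)    # 2: missing values are dropped first

# Mean substitution: replace missing x values with the mean of the observed ones
df$x[is.na(df$x)] <- mean(df$x, na.rm = TRUE)
df$x                        # 1 2 3
```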
Outliers
Two sources:
1.) errors in entering data (like with a human typing data in)
2.) data that is not corrupt but highly unlikely
Robustness to Outliers
Means a model is not sensitive to outliers.
Mean is not robust.
Median is robust.
Dealing with Outliers
1.) Truncate - drop them
2.) Winsorization - reset outliers to the most extreme value remaining among the non-outliers
3.) Robustness - analyze data with a robust procedure
Outlier
A data item is an outlier if it is below the alpha percentile or above the (100 - alpha) percentile.
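A rough sketch of winsorization using the percentile definition of an outlier. The helper winsorize and the default alpha of 0.05 are hypothetical, and this variant clamps values to the percentile cutoffs themselves, which is one common way to do it.

```r
# Hypothetical helper: clamp values outside the [alpha, 1 - alpha] quantiles
winsorize <- function(x, alpha = 0.05) {
  lo <- quantile(x, alpha)
  hi <- quantile(x, 1 - alpha)
  pmin(pmax(x, lo), hi)
}

x <- c(1:19, 1000)     # one extreme value
w <- winsorize(x)

mean(x); mean(w)       # the mean changes a lot once the outlier is clamped
median(x); median(w)   # the median is robust and stays the same
```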
Standard Deviation with Outliers
First remove the most extreme values, then calculate the standard deviation.
Power Transformation
A data transformation for dealing with skewed data.
Lambda in Power Transformation
Lambda < 1 removes right skewness; smaller values of lambda are more aggressive.
Lambda > 1 removes left skewness; larger values of lambda are more aggressive.
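A minimal sketch of a power transformation reducing right skewness. The cards do not give a formula, so the Box-Cox form used here is an assumption, as are the helper names power_transform and skewness and the simulated data.

```r
# Assumed Box-Cox style form: (x^lambda - 1) / lambda, with log(x) at lambda = 0
power_transform <- function(x, lambda) {
  if (lambda == 0) log(x) else (x^lambda - 1) / lambda
}

# Simple moment-based skewness: positive values indicate right skew
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(1)
x <- rexp(1000)                        # simulated right-skewed, positive data

skewness(x)                            # clearly positive
skewness(power_transform(x, 0.5))      # closer to zero after lambda = 0.5
```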
log-log Relationship
log log plot can reveal linear relationships between variables that are otherwise hard to see.
e.g. qplot(brain, body, log = "xy", data = Animals)
numeric variables
variables that are real valued. The dissimilarity between two numeric values a and b is expected to be the Euclidean distance, abs(b-a).
ordinal variables
variables representing measurements in a certain range R with a well defined order relation. e.g. the seasons.
categorical variables
variables that do not satisfy the ordinal or numeric assumption. e.g. items on a restaurant menu.
Binning (or discretization)
Taking a numeric variable (real number), dividing its range into several bins, and replacing it with a number representing the corresponding bin.
Binarization
Replacing a value with 0 or 1 based on threshold.
Indicator Variables
Breaking out categories into separate binary variables. For example, if we had 6 height categories (A-F), we would create 5 height category binary indicators (all 0 would mean category was A). Category variable would be set to 1 for appropriate category.
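Binning, binarization, and indicator variables can be sketched in base R as follows. The height values, bin boundaries, and labels are made up for illustration.

```r
height <- c(150, 162, 171, 183, 195)

# Binning: replace each value with the bin it falls into
bins <- cut(height, breaks = c(140, 160, 180, 200),
            labels = c("short", "medium", "tall"))

# Binarization: 1 if above a threshold, 0 otherwise
is_tall <- as.integer(height > 180)

# Indicator variables: 3 categories become 2 binary columns.
# The first level ("short") is the baseline, encoded as all zeros.
indicators <- model.matrix(~ bins)[, -1]
indicators
```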
Sampling
Selecting a random subset of rows.
Sampling with Replacement
When you sample, you can sample the same rows multiple times.
Sampling without Replacement
Rows can only be sampled once.
sample function
sample(x, size, replace=TRUE/FALSE), where x is the data to sample from and size is the number of items to draw
data partitioning
splitting data into two sets, like 75% in one and 25% in another
data shuffling
randomly mixing up data frame rows
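Sampling, partitioning, and shuffling all come down to calling sample on row indices. A small sketch on the built-in iris dataset; the seed, the 75/25 split, and the variable names are illustrative choices.

```r
set.seed(42)           # makes the random draws reproducible
n <- nrow(iris)

# Shuffling: a random permutation of all row indices
shuffled <- iris[sample(n), ]

# Partitioning: 75% of rows in one set, the remaining 25% in the other
train_idx <- sample(n, size = floor(0.75 * n))   # without replacement
train <- iris[train_idx, ]
held_out <- iris[-train_idx, ]

# Sampling with replacement: the same row may be drawn more than once
resampled <- iris[sample(n, size = n, replace = TRUE), ]
```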
dataframe concatenation
taking two dataframes with identical columns and combining them together
dataframe joining
when you have two dataframes with not identical columns, you can join them together
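Concatenation and joining can be sketched with rbind and merge. The two small data frames and the key column id are made up for this example.

```r
df1 <- data.frame(id = c(1, 2, 3), name = c("a", "b", "c"))
df2 <- data.frame(id = c(2, 3, 4), score = c(10, 20, 30))

# Concatenation: stack data frames that have identical columns
stacked <- rbind(df1, df1)

# Joining: combine data frames with different columns on a shared key
inner <- merge(df1, df2, by = "id")               # only ids present in both
outer <- merge(df1, df2, by = "id", all = TRUE)   # all ids, NA where missing
```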
tall data
many rows, fewer columns. one or more columns act as an id, one other column acts as a value. e.g.
date | product | sales
1/1/11 | apples | 100
1/1/11 | oranges | 200
1/2/11 | apples | 50
1/2/11 | oranges | 150
wide data
fewer rows, more columns. Basically a table: categories are columns, the first column of each row is the index, and each cell holds the corresponding value.
e.g.
date | apples | oranges
1/1/11 | 100 | 200
1/2/11 | 50 | 150
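The tall-to-wide conversion can be sketched in base R with tapply, as an alternative to a reshaping package. The data frame below reproduces the sales example, and the names tall and wide are illustrative.

```r
tall <- data.frame(
  date = c("1/1/11", "1/1/11", "1/2/11", "1/2/11"),
  product = c("apples", "oranges", "apples", "oranges"),
  sales = c(100, 200, 50, 150)
)

# One row per date, one column per product, cells hold the sales values
wide <- tapply(tall$sales, list(tall$date, tall$product), sum)
wide
```

The result is a matrix whose row names are the dates and whose column names are the products.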
reshape2
A package for converting data between tall and wide formats.
melt
converts from wide to tall
acast / dcast
converts from tall to wide
e.g.
dcast(tipsm, sex+time~variable, fun.aggregate = mean, margins = TRUE)
melt function
converts wide to tall
melt(data, id = id columns)
e.g. melt(smiths, id=c(1,2,3))
Split-apply-combine
Split a dataframe into segments, apply some operation to each segment, recombine segments into one array or dataframe.
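A minimal split-apply-combine sketch in base R on the built-in iris dataset: split the sepal lengths by species, take the mean of each piece, and combine the results into one named vector.

```r
# Split: one numeric vector per species
pieces <- split(iris$Sepal.Length, iris$Species)

# Apply and combine: mean of each piece, returned as a named vector
means <- sapply(pieces, mean)
means

# tapply performs all three steps in a single call
tapply(iris$Sepal.Length, iris$Species, mean)
```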
unlist()
turns a list into a vector
strsplit("a,b,c", ",")
splits the string at each comma and returns a list containing the character vector "a" "b" "c"
gsub(" ", "", x)
replaces all spaces with nothing in string x
nrow()
the number of rows in a dataframe
merge(df1, df2, by=0)
join dataframes on row names