The Data Scientist's Toolbox
Terms in this set (467)
What is the broad definition of Data Science?
At its core, data science is using data to answer questions.
List the 6 things Data Science can involve.
1) Statistics
2) Computer Science
3) Mathematics
4) Data Cleaning
5) Formatting
6) Data Visualization
Give the Economist's definition of a Data Scientist.
Someone who combines the skills of software programmer, statistician and storyteller slash artist to extract the nuggets of gold hidden under mountains of data.
What 2 factors explain the rise of data science?
1) The vast amount of data currently available and being generated.
2) The rise of inexpensive computing.
Describe how vast amounts of computer data are being generated.
Massive amounts of data are being collected about many aspects of the world and our lives. We now have rich data and the tools to analyze it.
Describe the rise of inexpensive computing.
Rising computer memory capabilities, better processors and more software.
Name the 3 major characteristics of big data.
1) Volume
2) Velocity
3) Variety
What is meant by the volume of big data? How is this a problem?
More and more data is becoming increasingly available. Big data involves large datasets - and these large datasets are becoming more and more routine. You will have a lot of data available to you to analyze and it can be a difficult problem to wrangle all of that data!
What is meant by the velocity of big data?
Data is being generated at an astonishing rate. Data is being generated and collected faster than ever before.
How can you use velocity in big data?
You could analyze data in real time if you have the tools and skills to do so!
What is meant by the variety of big data?
The data we can analyze comes in many forms. You have different types of data available to you.
Name 2 types of datasets that could be analyzed. Give an example of each.
1) Unstructured datasets - Video or audio
2) Structured datasets - Database of video lengths, views or comments.
A ___________ is somebody who uses data to answer questions
data scientist
Name the 3 major skill sets that a data scientist embodies.
1) Hacking Skills
2) Math and Statistics Knowledge
3) Substantive Expertise
Explain how the 3 major skills sets that data scientists embody come together in their work.
A data scientist needs enough expertise in the area they want to ask about to formulate their questions and to know what sorts of data are appropriate for answering them. Once they have a question and appropriate data, the data often needs significant cleaning and formatting, which takes computer programming or "hacking" skills. Finally, once the data is ready, they need to analyze it, which often takes math and stats knowledge.
In this course, list the 2 components of hacking skills that will be presented.
1) Computer programming or at least computer programming with R.
2) Learning how to go out and get answers to your programming questions.
List 3 reasons why programming with R is useful to a data scientist.
1) Accessing data
2) Playing around with data
3) Plotting data
Name one reason why data scientists are in such high demand.
One reason data scientists are in such demand is that most of the answers aren't already outlined in textbooks - a data scientist needs to be somebody who knows how to find answers to novel problems.
True or False. There is a huge need for individuals with data science skills.
True. Top emerging jobs. Demand exceeds supply.
Name the 3 major job titles of data science.
1) Machine learning engineer
2) Data scientist
3) Big data engineer
True or False. Data Scientist is THE top job in the US in 2017, based on job satisfaction, salary, and demand.
True
List 3 reasons why it is a good time to be getting into data science.
1) More and more data available
2) More and more tools for collecting, storing, and analyzing it.
3) Data scientists are increasingly in demand, and the field is recognized as important in many diverse sectors, not just business and academia.
What is RStudio?
RStudio is a very nice graphical interface to the R programming language.
Give the Cambridge English Dictionary definition of data.
Information, especially facts or numbers, collected to be examined and considered and used to help decision-making.
Give the Wikipedia definition of data.
A set of values of qualitative or quantitative variables.
Describe how the Cambridge English Dictionary and Wikipedia definition of data are the same.
Both agree that data is values or numbers or facts.
What are the 3 areas of focus surrounding data in the Cambridge English Dictionary definition of data? Which is most important?
1) Data is collected.
2) Data is examined.
3) Data is used to inform decisions. (most important)
What is the most important part of data science?
The most important part of data science is the question: everything we do is using data to answer it.
How is the Wikipedia definition of data different than the Cambridge English dictionary?
The Wikipedia definition focuses more on what data entails rather than the question.
Define a set of values.
A set of values is the set of items you measure from. The set as a whole, often called the population, is what you are trying to discover something about.
In statistics, the set of items or set of values is often called the ___________.
population
Define variables.
Variables are measurements or characteristics of an item.
Define quantitative variables.
Quantitative variables are measurements or information about quantities. Quantitative measurements are usually described by numbers and are measured on a continuous, ordered scale; they're things like height, weight and blood pressure.
Define qualitative variables.
Qualitative variables are measurements or information about qualities or non-numerical items. They are things like country of origin, sex, or treatment group. They are usually described by words, not numbers, and they are not necessarily ordered.
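A minimal R sketch of the distinction between the two kinds of variables. The variable names and values are made up for illustration and are not from the course. Quantitative variables are typically stored as numbers, while qualitative ones are commonly stored as factors.

```r
# Quantitative: numbers measured on a continuous, ordered scale (invented values)
height_cm <- c(162.5, 175.0, 180.3)

# Qualitative: labels describing a quality, stored here as a factor
country <- factor(c("Canada", "Kenya", "Canada"))

is.numeric(height_cm)   # TRUE
is.factor(country)      # TRUE
levels(country)         # the distinct categories: "Canada" "Kenya"
mean(height_cm)         # arithmetic is meaningful for quantitative data
```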
Describe an example of a structured dataset.
An example of a structured dataset is a spreadsheet of individuals (first initial, last name) and their country of origin, sex, height, and weight.
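As a rough illustration, a structured dataset like this could be represented in R as a data frame. The names and measurements below are invented placeholders, and the column names are an assumption about how one might label them.

```r
# One row per individual, one column per variable (all values are invented)
people <- data.frame(
  name      = c("A. Smith", "B. Jones", "C. Lee"),
  country   = c("Canada", "Kenya", "Japan"),
  sex       = c("F", "M", "F"),
  height_cm = c(165, 178, 160),
  weight_kg = c(60, 80, 55)
)

str(people)    # shows each column and its type
nrow(people)   # number of individuals
ncol(people)   # number of variables
```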
True or False. Data is often presented as a structured dataset.
False. Data is rarely presented as a structured dataset. Data sets commonly encountered are much messier.
Describe the 4 steps a data scientist has to follow to handle messy data.
1) Extract the information you want.
2) Corral the information into something tidy like the structured dataset table.
3) Analyze the structured dataset table appropriately.
4) Visualize the results.
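The four steps above can be sketched in base R on a tiny, made-up example. The messy input strings, the column names, and the choice of a bar plot are all assumptions for illustration, and real messy data usually needs far more work.

```r
# Messy input: free-text records (invented for this sketch)
raw <- c("Ana, height=160cm", "Ben, height=175cm", "Cy, height=168cm")

# 1) Extract the information you want
person    <- sub(",.*$", "", raw)
height_cm <- as.numeric(sub(".*height=([0-9]+)cm.*", "\\1", raw))

# 2) Corral it into a tidy, structured table
tidy <- data.frame(person = person, height_cm = height_cm)

# 3) Analyze the structured table
mean_height <- mean(tidy$height_cm)

# 4) Visualize the results
barplot(tidy$height_cm, names.arg = tidy$person, ylab = "Height (cm)")
```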
List 8 examples or sources of messy data where you have to work to extract the information you need to answer your question.
1) Sequencing data
2) Population census data
3) Electronic medical records (EMR), other large databases
4) Geographic information system (GIS) data (mapping)
5) Image analysis and image extrapolation
6) Language and translations
7) Website traffic
8) Personal/Ad data (eg: Facebook, Netflix predictions, etc)
True or False. Data is important, but it is secondary to your question.
True
A good data scientist asks questions ________ and seeks out relevant data _______.
first, second
True or False. Data can drive the question being asked.
False. Often the data available will limit, or perhaps even enable, certain questions you are trying to ask. In these cases, you may have to reframe your question or answer a related question, but the data itself does not drive the question asking.
True or False. You could have all the data you could ever hope for, but if you don't have a question to start, the data is useless.
True
One of the main skills called upon for a data scientist is the ability to do what?
solve problems
In order to solve problems, a data scientist has to sometimes do what?
Sometimes, to solve a problem, a data scientist has to seek help.
True or False. The ability to solve problems is at the root of data science; so the importance of being able to do so is paramount.
True. Being able to solve problems is often one of the core skills of a data scientist.
Give 3 reasons why knowing how to get help is important.
1) This course is not like a standard class you have taken before where there may be 30 to 100 people and you have access to your professor for immediate help. In this class, at any one time there can be thousands of students taking the class; no one person could provide help to all of these people, all of the time.
2) Data science is new; you may be the first person to come across a specific problem and you need to be equipped with skills that allow you to tackle problems that are both new to you and to the community!
3) Troubleshooting and figuring out solutions to problems is a great, transferable skill! Being able to think about problems and get help effectively is of benefit to you in whatever career path you find yourself in!
True or False. So much of what any job often entails is problem solving.
True
True or False. Oftentimes, the fastest answer to a problem is one you find for yourself.
True
Describe 2 different ways you could find an answer to a problem for yourself.
1) Reading the manuals or help files. If you post a question on a forum that is easily answered by the manual, you will often get a reply of "Read the manual" ... which is not the easiest way to get at the answer you were going for!
2) Next steps are searching on Google and searching relevant forums.
How do you get to the help files in the R programming language?
For R problems, try typing ?command.
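A short sketch of what this looks like in the R console, using mean() and sd() as stand-in functions. Any function name can be substituted.

```r
?mean            # open the help page for mean()
help("mean")     # the equivalent long form
args(sd)         # show a function's arguments without leaving the console
example(mean)    # run the examples from the bottom of the help page
```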
Name 3 common forums for searching for answers to data science problems.
1) StackOverflow
2) CrossValidated
3) This course forum
For a forum, what is advisable before you post a question?
Before posting a question to any forum, try and double check that it hasn't been asked before, using the forums' search functions.
Describe how to effectively use Google to search for answers to problems in data science.
While you are searching using Google, things to pay attention to and look for are: tutorials, FAQs, or vignettes of whatever command or program is giving you trouble. These are great resources to get you started - either in telling you the language/words to use in your next searches, or outright showing you how to do something.
Describe the 2 categories that coding problems generally fall into.
1) Your command produces no data and spits out an error message.
2) Your command produces an output, but it is not at all what you wanted.
Describe 3 strategies for coding problems involving your command producing no data and spitting out an error message.
1) Check for typos.
2) Read the error message and make sure you understand it.
3) Google the error message, exactly.
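A minimal sketch of the first category of problem. The misspelled function name maen() is a deliberate typo made up for this example. tryCatch() is used only so the error text can be captured and inspected instead of stopping the script.

```r
values <- c(2, 4, 6)

# A typo in the function name produces an error instead of a result
error_text <- tryCatch(
  maen(values),                              # typo: should be mean()
  error = function(e) conditionMessage(e)    # keep the message for reading
)
print(error_text)   # read this carefully, then search for it if needed

# After fixing the typo, the command works
fixed <- mean(values)
print(fixed)
```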
Describe 2 strategies for coding problems in which your command produces an output, but it is not at all what you wanted.
1) Consider how the output was different from what you expected.
2) Think about what the command actually did, and why it did that instead of what you wanted.
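One common instance of this second category, sketched with invented data: a command runs without error but returns NA because of a missing value. Comparing the output with what was expected points to the cause.

```r
heights <- c(150, 160, NA, 170)   # invented values with one missing entry

unexpected <- mean(heights)       # runs fine, but the result is NA
print(unexpected)

# What the command actually did: NA propagates through arithmetic.
# Telling mean() to drop missing values gives the intended result.
intended <- mean(heights, na.rm = TRUE)
print(intended)
```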
After trying to solve a problem on your own, what are the benefits of seeking out someone or a peer for help or direction?
The easiest approach is to find a peer with some experience with what you are working on and ask them for help/direction. This is often great because the person explaining gets to solidify their understanding while teaching it to you, and you get hands-on experience seeing how they would solve the problem. In this class, your peers can be your classmates and you can interact with them through the course forum.
Describe a problem solving strategy if there are no available data science savvy peers.
Rubber duck debugging. In this approach, a stumped programmer explains their problem to a rubber duck and, in the process of explaining it, identifies the solution. Many programmers have had the experience of explaining a programming problem to someone else, possibly even to someone who knows nothing about programming, and then hitting upon the solution in the process of explaining the problem. In describing what the code is supposed to do and observing what it actually does, any incongruity between these two becomes apparent.
List 8 details to include when posting questions to a forum.
Details to include:
1) The question you are trying to answer
2) How you approached the problem, what steps you have already taken to answer the question
3) What steps will reproduce the problem (including sample data for troubleshooters to work from!)
4) What was the expected output
5) What you saw instead (including any error messages you received!)
6) What troubleshooting steps you have already tried
7) Details about your set-up, eg: what operating system you are using, what version of the product you have installed (eg: R, R packages)
8) Be specific in the title of your questions!
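Two of these details, sample data and set-up information, can be gathered directly from R. This is a sketch with a made-up data frame; dput() and sessionInfo() are base R functions often used for this purpose.

```r
# A small sample dataset that reproduces the problem (invented values)
sample_data <- data.frame(id = 1:3, score = c(90, 85, NA))

# Prints R code that recreates the data exactly, ready to paste into a post
dput(sample_data)

# Set-up details: R version, operating system, and attached packages
R.version.string
sessionInfo()
```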
What can be an issue with titles for a question posted on a forum?
A title should be specific. Without being specific, you don't give your potential helpers a lot to go off of - they don't really know what the problem is and if they are able to help you. Provide some details about what you are having problems with.
What 2 key pieces of information should be included in a title posting on a forum?
The title to your question posted to the forum should include:
1) Answering what you were doing.
2) Answering what the problem is.
Why is a forum title that answers what you were doing and what the problem is a good idea?
This way somebody who is on the forum will know exactly what is happening and that they might be able to help!
Why would you want to focus on a very specific core problem that you are trying to get help with for a title of a forum posting?
It signals to people that you are looking for a very specific answer; the more specific the question, often, the faster the answer.
Give 3 examples of common forum etiquette.
1) Asking specific questions.
2) Doing some troubleshooting of your own.
3) Giving potential problem solvers easy access to all the information they need to help you.
List 7 things you should do in regard to forum etiquette.
1) Read the forum posting guidelines
2) Make sure you are asking your question on an appropriate forum!
3) Describe the goal
4) Be explicit and detailed in your explanation
5) Provide the minimum information required to describe (and replicate) the problem
6) Be courteous! (Please and thank you!)
7) Follow up on the post OR post the solution
Regarding forum etiquette, why is it important to follow up on the post?
You have asked your question and you have received several answers and lo and behold one of them works! You are all set, get back to work! No! Go back to your posting, reply to the solution that worked for you, explaining that they fixed your problem and thanking them for their solution! Not only do the people helping you deserve thanks, but this is helpful to anybody else who has the same problem as you, later on. They are going to do their due diligence, search the forum and find your post - it is so helpful for you to have flagged the answer that solved your problem.
Regarding forum etiquette, why is it important to post the solution?
While you are waiting for a reply, perhaps you stumble upon the solution (go you!) - don't just close the posting or never check back on it. One, people who are trying to help you may be replying and you are functionally ignoring them, or two, if you close it with no solution, somebody with the same problem won't ever learn what your solution was! Make sure to post the solution and thank everybody for their help!
List 4 things you should avoid in regard to forum etiquette.
1) Immediately assume you have found a bug
2) Post homework questions
3) Cross post on multiple forums
4) Repost if you don't immediately get a response
Name the 4 steps in a data science project.
1) Forming the question.
2) Finding or generating the data.
3) Analyzing the data.
4) Communicating the data science project to others.
Every Data Science Project starts with a ___________ that is to be ___________ with data.
question, answered
Describe the 2 steps in analyzing data.
1) Exploring the data.
2) Modeling the data.
When analyzing data, what is meant by modeling the data?
Modeling the data means using some statistical or machine learning techniques to analyze the data and answer your question.
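A brief sketch of exploring and then modeling, using the cars dataset that ships with R. The choice of dataset and of a simple linear model is an assumption for illustration. The right technique depends on the question being asked.

```r
# 1) Explore the data
data(cars)                      # built-in dataset: speed and stopping distance
head(cars)                      # first few rows
summary(cars)                   # basic summaries of each column
plot(cars$speed, cars$dist)     # quick look at the relationship

# 2) Model the data
model <- lm(dist ~ speed, data = cars)   # simple linear regression
coef(model)                              # estimated intercept and slope
```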
Describe 4 different ways a data science project is communicated to others.
1) Report sent to your boss.
2) Report sent to your team at work.
3) A blog post.
4) A presentation to a group of colleagues.
True or False. A data science project almost always involves some form of communication of the projects' findings.
True
True or False. When setting out on a data science project, it is always great to have your question well-defined. Additional questions may pop up as you do the analysis, but knowing what you want to answer with your analysis is a really important first step.
True
Explain how GitHub can be helpful in a data analysis project.
GitHub is a place where you can make all of your code available (written, for example, in the R programming language) so that others can see what you did and repeat your steps if they want.
True or False. Data science projects often involve writing a lot of code and generating a lot of figures that are always included in your final results.
False. Data science projects often involve writing a lot of code and generating a lot of figures that are NOT included in your final results. This is part of the data science process too.
Why do data science projects often involve writing a lot of code and generating a lot of figures that are NOT included in your final results?
Figuring out how to do what you want to do to answer your question of interest is part of the process, doesn't always show up in your final project, and can be very time-consuming.
In a data science project, what is an important part of communicating the project to others?
It is important to note that most projects build off someone else's work. It is really important to give those people credit.
For a data science project communicated via blog post, what is a good way to give people credit for projects built off other people's work? Give 3 examples.
Provide links in the blog post to:
1) A blog post where someone had asked a similar question previously.
2) The Social Security website where the author got the data.
3) The resource where the author learned about web scraping.
What is the Shiny App?
A Shiny app is an interactive web application for presenting the results of a data science project.
What programming language is used with the Shiny App to present the results of a data science project?
The R Programming Language
For a data science project, why should you include links or citations to others' work?
It helps others quickly find information you have referenced in your data science project report.
R is both a _______________ and an ___________, focused mainly on _______________ and ___________.
programming language, environment, statistical analysis, graphics
What is CRAN? What is it used for?
CRAN is an acronym for Comprehensive R Archive Network. It is where you download R itself and its packages.
List 4 reasons you should use the R programming language.
1) Its popularity
2) Its cost
3) Its extensive functionality
4) Its community
True or False. R is quickly becoming the standard language for statistical analysis.
True
True or False. R is one of the top five languages asked for in data scientist job postings.
True
List 3 reasons why the popularity of the R programming language is important.
The more popular a software is:
1) The quicker new functionality is developed.
2) The more powerful it becomes
3) The better the support there is.
What is one key advantage of R over SPSS and SAS?
Every aspect of R is free to use, unlike some other stats packages you may have heard of (eg: SAS, SPSS), so there is no cost barrier to using R.
R is a very versatile programming language. List 4 functions outside of statistics and graphing for which you can use R.
1) Making websites
2) Making maps using GIS data
3) Analyzing language
4) Making lectures and videos
True or False. For whatever task you have in mind using R, there is often a package available for download that does exactly that.
True
What are 2 key benefits of the R programming community?
1) Individuals have come together to make "packages" that add to the functionality of R - and more are being developed every day!
2) Due to its popularity, there are multiple forums that have pages and pages dedicated to solving R problems. These forums are great both for finding other people who have had the same problem as you, and posting your own new problems.
Base R focuses on ____________.
statistical analysis
Why is the programming language R favored for this course?
Packages are easy to install and "play nicely" together.
What is an alternative to the R interface to input code?
An alternative way to input code is to use RStudio.
___________ is a graphical user interface for R.
RStudio
List 10 different things you can do using RStudio.
1) Write code
2) Edit code
3) Store code
4) Generate plots
5) View plots
6) Store plots
7) Manage files
8) Manage objects
9) Manage dataframes
10) Integrate with version control systems
What is the one huge benefit of using RStudio?
The visual nature of the RStudio program as an interface for R is a huge benefit.
Describe the general layout of RStudio.
RStudio can be roughly divided into four quadrants, each with specific and varied functions, plus a main menu bar.
How do you add the Source quadrant if it is missing in your view of RStudio?
If the upper left quadrant is missing and the left side of the screen shows just one region, "Console", go to File > New File > R Script and you will then see 4 quadrants.
True or False. The size of the 4 quadrants or panels in RStudio is fixed.
False. You can change the sizes of each of the various quadrants by hovering your mouse over the spaces between quadrants and click-dragging the divider to resize the sections.
Outside of the Main Menu bar, name the 4 quadrants or panels that compose the RStudio desktop and their location.
1) The source (Upper Left)
2) The environment (Upper Right)
3) The console (Lower Left)
4) Files, plots, packages, help (Lower Right)
Describe the Main Menu bar on the RStudio desktop.
The menu bar runs across the top of your screen and should have two rows. The first row should be a fairly standard menu, starting with "File" and "Edit." Below that, there is a row of icons that are shortcuts for functions that you'll frequently use.
List 7 different short-cut functions you can find on the lower level of the Main Menu bar in RStudio.
1) New File
2) New Project
3) Open File
4) Change Panel Organization
5) Save File
6) Save All Open Files
7) Print File
List 6 things you can do in the File menu of the Main Menu bar of RStudio?
1) Open new files
2) Open saved files
3) Open new projects
4) Open saved projects
5) Save the current document
6) Close the RStudio program
Explain what happens when you hover with the mouse cursor on New File under the File menu option of the Main Menu bar of RStudio.
If you mouse over "New File", a new menu will appear that suggests the various file formats available to you.
List the 6 different types of file formats available to you in RStudio.
1) R Script
2) R Markdown
3) R notebooks
4) Web apps
5) Websites
6) Slide presentations
What are the 2 most common file types used in RStudio?
1) R Script
2) R Markdown
What happens when you select New File and a specific file format in RStudio?
If you click on any one of the file formats, a new tab in the "Source" quadrant will open.
What does the Session menu do on the Main Menu bar of RStudio?
The Session menu has R specific functions.
List 3 R specific functions in the Session menu of the Main Menu bar of RStudio.
1) Restart R
2) Terminate R
3) Interrupt R
Why would you want to Restart/Terminate/Interrupt R?
These can be helpful if R isn't behaving or is stuck and you want to stop what it is doing and start from scratch.
What are the 3 main things you can do in the Tools menu of the Main Menu Bar of RStudio?
1) Install new packages.
2) Set up your version control software.
3) Set your options and preferences for how RStudio looks and functions.
Describe the console quadrant or panel of RStudio. How is this similar to R?
The console of RStudio is where you type and execute commands, and where the output of said command is displayed. This is similar to the console window of R. The Console panel is where R code is input and run.
Any dataframe or matrix that you create in R can be viewed in a ________________________ in RStudio.
new tab of the source panel
What information is provided in the Environment panel of RStudio?
The Environment panel of RStudio tells you some information about the object or data, like whether it is a list or a dataframe or if it contains numbers, integers or characters. The Environment panel lists all of the objects that have been created within an R session.
Why would you want to know information about an object or data, like if it is a list or a dataframe or if it contains numbers, integers or characters?
This is very helpful information to have as some functions in R only work with certain classes of data. And knowing what kind of data you have is the first step to that.
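A small sketch of why the class matters, with invented values. Numbers stored as text look fine when printed, but numeric functions will not accept them until they are converted.

```r
ages <- c("21", "35", "40")   # numbers stored as text (invented values)
class(ages)                   # "character"

# sum(ages) would stop with an error, because sum() needs numeric input.
ages_num <- as.numeric(ages)  # convert to the class the function expects
class(ages_num)               # "numeric"

total <- sum(ages_num)
print(total)
```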
What does the History tab in the Environment panel of RStudio do? How can this be useful?
Here you will see the commands that we have run in this session of R. The History tab keeps a record of all commands that have been run. It also presents the option to either rerun the command in the Console panel, or send the command to Source panel, to be saved. If you click on any one of the commands, you can click "To Console" or "To Source" and this will either rerun the command in the Console panel, or will move the command to the Source panel, respectively.
The Source panel in RStudio is also known as the _______________.
Script Editor Panel
As you work with RStudio, in which of the 4 panels or quadrants will you be spending most of your time?
The Source panel
Describe the purpose of the Source panel in RStudio.
This is where you store the R commands that you want to save for later, either as a record of what you did or as a way to rerun code.
True or False. You cannot save an R script in the Source panel.
False. You can save a script written in R using the Source panel. There is a save icon along the top of the Source panel.
What 5 tabs run along the top of the bottom right quadrant/panel of RStudio?
1) Files
2) Plots
3) Packages
4) Help
5) Viewer
What can you do in the Files tab of the bottom right quadrant/panel of RStudio?
In Files, you can see a listing of all of the files in your current working directory.
Describe how to use the Files tab of the bottom right quadrant/panel of RStudio to find a desired folder and then setting the new folder as the working directory.
If this isn't where you want to save or retrieve files from, you can also change the current working directory in the Files tab using the ellipsis at the far right, finding the desired folder, and then under the "More" cogwheel, setting this new folder as the working directory.
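The same thing can be done from the console with base R functions. This sketch switches to R's temporary directory purely as a safe stand-in for a folder of your own, then switches back.

```r
original_dir <- getwd()   # where R currently reads and saves files
print(original_dir)

setwd(tempdir())          # change the working directory (stand-in folder)
list.files()              # list the files in the new working directory

setwd(original_dir)       # return to where we started
```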
What can you do in the Plots tab of the bottom right quadrant/panel of RStudio?
In the Plots tab, if you generate a plot with your code, it will appear here.
What are the arrows for in the Plots tab of the bottom right quadrant/panel of RStudio?
You can use the arrows to navigate back and forth to previously and recently generated plots.
What is Zoom for in the Plots tab of the bottom right quadrant/panel of RStudio?
The Zoom function will open the plot in a new window, that is much larger than the quadrant.
What is Export for in the Plots tab of the bottom right quadrant/panel of RStudio?
Export is how you save the plot.
Name the 2 ways to save a plot in RStudio.
You can either save it as an image or as a PDF.
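Plots can also be saved from code rather than through the Export button. This is a sketch that writes a PDF into R's temporary directory; the file name is a made-up placeholder, and png() can be used in the same way for an image file.

```r
# Hypothetical output location for this sketch
plot_file <- file.path(tempdir(), "cars_plot.pdf")

pdf(plot_file)                            # open a PDF graphics device
plot(cars$speed, cars$dist,
     xlab = "Speed", ylab = "Stopping distance")
invisible(dev.off())                      # close the device to finish writing

file.exists(plot_file)                    # confirm the file was created
```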
What does the Broom icon do in the Plots tab of the bottom right quadrant/panel of RStudio?
The broom icon clears all plots from memory.
What 4 things can you do in the Packages tab of the bottom right quadrant/panel of RStudio?
In the Packages tab you can see a list of all of the packages you have installed, load and unload these packages, and update them.
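Roughly equivalent console commands, sketched with the splines package because it ships with R. Any installed package could be used instead. The update step is left commented out because it needs an internet connection.

```r
# See which packages are installed
installed <- rownames(installed.packages())
head(installed)

# Load a package (like ticking its checkbox in the Packages tab)
library(splines)
"package:splines" %in% search()   # TRUE while the package is attached

# Unload it again (like unticking the checkbox)
detach("package:splines")

# Update installed packages (requires internet, so not run here)
# update.packages()
```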
What can you do in the Help tab of the bottom right quadrant/panel of RStudio?
The Help tab is where you find the documentation for your R packages and various functions. Here you can find help files for when you need some assistance.
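The same documentation can be reached from the console. A short sketch, using the stats and grid packages only because they ship with R:

```r
help(package = "stats")          # overview and index of a package's help pages

matches <- apropos("^read\\.")   # find functions whose names start with "read."
print(matches)

vignette(package = "grid")       # list the long-form guides a package provides
```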
What can you do if you have a question about specific function or package in RStudio?
In the upper right corner of the bottom right quadrant/panel of RStudio there is a search function for when you have a specific function or package in question.
___________ make R so special and unique.
R Packages
What is Base R? What is a downside of the Base R system?
Base R, or everything included in R when you download it, has rather basic functionality for statistics and plotting but it can sometimes be limiting.
To expand upon R's basic functionality, people have developed ___________.
packages
In the context of discussing the R programming language, what is a package?
A package is a collection of functions, data, and code conveniently provided in a nice, complete format for you.
True or False. At the time of writing, there are just over 14,300 packages available to download - each with their own specialized functions and code, all for some different purpose.
True
What is a great resource for learning about R packages?
For a really in depth look at R Packages (what they are, how to develop them), check out Hadley Wickham's book from O'Reilly, "R Packages."
True or False. A package in R can be compared to a library.
False. A package is not to be confused with a library (these two terms are often conflated in colloquial speech about R). A library is the place where the package is located on your computer. To think of an analogy, a library is, well, a library... and a package is a book within the library. The library is where the books/packages are located.
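To see the distinction on your own machine, a minimal sketch: .libPaths() lists the library folders, and find.package() shows where one particular package sits. The stats package is used here only because it is always installed.

```r
# The libraries: folders on this computer where packages are stored
lib_dirs <- .libPaths()
print(lib_dirs)

# The "books": packages found in the first library folder
head(list.files(lib_dirs[1]))

# Where one specific package lives
stats_location <- find.package("stats")
print(stats_location)
```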
Compare and contrast the roles of base R and packages.
Base R has some great functionality but the packages greatly expand its functionality.
Who has written packages for R and where are the packages located?
Each package is developed and published by the R community at large and deposited in repositories.
What are repositories?
A repository is a central location where many developed packages for R are located and available for download.
List the 3 big repositories for R.
1) CRAN - Comprehensive R Archive Network
2) BioConductor
3) GitHub
What is R's main repository?
CRAN - Comprehensive R Archive Network. Over 12,100 packages available.
What is BioConductor?
A repository mainly for bioinformatic-focused packages.
What is GitHub?
A very popular repository for open source projects.
How is GitHub different from the other 2 main repositories of R?
It is not an R programming language specific repository.
Name 3 different avenues for exploring packages for R so that you can find a package that will do what you are trying to do in R.
1) Use Task View for CRAN or Comprehensive R Archive Network.
2) Use the RDocumentation website.
3) Use Google search.
Describe the Task View for CRAN or the Comprehensive R Archive Network and how it can help you find a package that will do what you are trying to do in R.
CRAN groups all of its packages by their functionality/topic into 35 "themes." It calls this its "Task view." This at least allows you to narrow the packages you can look through to a topic relevant to your interests.
Describe RDocumentation and how it can help you find a package that will do what you are trying to do in R.
RDocumentation is a search engine for packages and functions from CRAN, BioConductor, and GitHub (i.e., the big three repositories). If you have a task in mind, this is a great way to search for specific packages to help you accomplish that task! It also has a "task" view like CRAN, that allows you to browse themes.
How is RDocumentation similar to CRAN when searching for a package that will do what you are trying to do in R?
RDocumentation has a "task" view like CRAN, that allows you to browse themes.
Describe how you can use a Google search to help you find a package that will do what you are trying to do in R.
A Google search of a task followed by "R package" is a great place to start! From there, looking at tutorials, vignettes, and forums for people already doing what you want to do is a great way to find relevant packages.
Name the 2 ways to install a package from the CRAN repository into RStudio on your computer.
1) Use the install.packages( ) function in the R Console panel or quadrant of RStudio.
2) Use RStudio's graphical interface to install packages.
Describe how to install a package from the CRAN repository into RStudio on your computer using the install.packages( ) function in the R Console panel.
If you are installing from the CRAN repository, use the install.packages( ) function, with the name of the package you want to install in quotes between the parentheses (note: you can use either single or double quotes).
Give an example of installing a package from the CRAN repository into RStudio on your computer using the install.packages( ) function in the R Console panel.
For example, if you want to install the package "ggplot2", you would use: install.packages("ggplot2"). This command downloads the "ggplot2" package from CRAN and installs it onto your computer.
Give an example of installing multiple packages at once from the CRAN repository into RStudio on your computer using the install.packages( ) function in the R Console panel.
If you want to install multiple packages at once, you can do so by using a character vector, like: install.packages(c("ggplot2", "devtools", "lme4"))
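A common variation is to install only the packages that are not already present. The sketch below is one way to do that; the helper name missing_packages is hypothetical (not part of R), and the actual install call is left commented out because it needs internet access.

```r
# Packages we would like to have available (taken from the example above).
wanted <- c("ggplot2", "devtools", "lme4")

# Hypothetical helper: return the subset of pkgs that is not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(wanted)
print(to_install)

# Uncomment to download and install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

Passing the result as a character vector means a single install.packages( ) call handles every missing package at once.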
Describe how to install a package from the CRAN repository into RStudio on your computer using RStudio's graphical interface to install packages.
If you want to use RStudio's graphical interface to install packages, go to the Tools menu, and the first option should be "Install packages..." An Install Packages box will appear. If installing from CRAN, select it as the repository, type the desired package names in the appropriate box and click the install button.
Describe how to install a package from the Bioconductor repository into RStudio on your computer.
Installing from the BioConductor repository involves 2 steps using the Console panel of RStudio:
1) Get the basic functions required to install through BioConductor. In current releases, use: install.packages("BiocManager")
This makes the main install function of BioConductor, BiocManager::install(), available to you. (Older releases instead used source("https://bioconductor.org/biocLite.R") and the biocLite() function, which has since been replaced.)
2) Call the package you want to install in quotes, between the parentheses of the install command, like so: BiocManager::install("GenomicFeatures")
What is a great resource for installing a package from the Github repository into RStudio on your computer?
Consult the "Putting your R Package on GitHub" PDF guide with the course materials.
List the 4 general steps of installing a package from the Github repository into RStudio on your computer using the Console panel of RStudio.
1) Find the package you want on GitHub and take note of both the package name AND the author of the package.
2) install.packages("devtools") - only run this if you don't already have devtools installed.
3) library(devtools)
4) install_github("author/package") replacing "author" and "package" with their GitHub username and the name of the package.
True or False. Installing a package makes its functions immediately available to you in R.
False. Installing a package does not make its functions immediately available to you. First you must load the package into R, like any other software you install on your computer. Just because you have installed a program, doesn't mean it's automatically running - you have to open the program. Same with R. You have installed it, but now you have to "open" it.
Describe how to load an installed package so its functions are available to you in R.
Use the library( ) function.
Give an example of using the library( ) function to load a package so its functions are available to you in R.
For example, to "open" the "ggplot2" package, you would run:
library(ggplot2)
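The same pattern can be tried without installing anything. This sketch assumes the splines package, which ships with R but is not attached when a session starts, as a stand-in for any installed package.

```r
# Load (attach) the package so its functions can be called by name.
library(splines)

# search() lists what is currently attached; the package now appears there.
splines_attached <- "package:splines" %in% search()
print(splines_attached)

# bs() comes from splines and is usable only because the package is loaded.
basis <- bs(1:10, df = 4)
print(dim(basis))
```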
Contrast the installation versus the load process of a package so its functions are immediately available to you in R. What must you be careful about?
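Installing a package downloads it onto your computer and only needs to be done once; loading it with library( ) must be done in every new R session. Be careful with the quotes: install.packages( ) requires the package name in quotes, while library( ) is usually given the bare package name (it will also accept the name in quotes).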
True or False. Step one of getting a package is installing it, but to use it, you must load it using library( ); similar to installing R and then loading it by opening the .exe file.
True
Describe what care must be taken in loading packages.
There is an order to loading packages - some packages require other packages to be loaded first (dependencies).
There is an order to loading packages - some packages require other packages to be loaded first, also known as ___________.
dependencies
Where can you find information about the order of loading packages or dependencies?
That package's manual/help pages will help you out in finding that order, if they are picky.
Describe how to load a package using the RStudio interface.
If you want to load a package using the RStudio interface, in the lower right quadrant there is a tab called "Packages" that lists all of the packages you have installed, with a brief description and the version number of each. To load a package just click on the checkbox beside the package name.
Describe how to use the Console panel of RStudio to check what packages you have installed.
If you aren't sure if you've already installed a package, or want to check what packages are installed, you can use either of: installed.packages( ) or library( ) with nothing between the parentheses to check.
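As a rough sketch of how the output of installed.packages( ) can be used: it returns a matrix with one row per package, so a name can be looked up in its row names. The helper is_installed is a hypothetical name introduced here for illustration.

```r
# One row per installed package; keep just the name and version columns.
installed <- installed.packages()
print(head(installed[, c("Package", "Version"), drop = FALSE]))

# Hypothetical helper: TRUE if the named package is installed.
is_installed <- function(pkg) pkg %in% rownames(installed)

print(is_installed("stats"))    # stats ships with R
print(is_installed("ggplot2"))  # depends on your machine
```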
Describe how to use RStudio to check what packages you have installed.
In the lower right quadrant of RStudio there is a tab called "Packages" that lists all of the packages you have installed, with a brief description and the version number of each.
Describe how to use the Console panel of RStudio to check what packages you have installed that need an update.
You can check what packages need an update with a call to the function old.packages( ). This will identify all packages that have been updated since you installed them/last updated them.
Describe how you can use the Console panel of RStudio to update a specific package or all packages.
To update all packages, use update.packages( ). If you only want to update a specific package, reinstall it with install.packages("packagename").
Describe how to use the RStudio interface to check what packages you have installed that need updates and update specific or all packages.
In the lower right quadrant of RStudio there is a tab called "Packages". In that Packages tab, you can click "Update," which will list all of the packages that are not up to date. It gives you the option to update all of your packages, or allows you to select specific packages.
What is a good practice when it comes to updating packages?
Periodically check in on your packages and check if you have fallen out of date.
Describe what caution must be taken with regard to updating packages.
Sometimes an update can change the functionality of certain functions, so if you re-run some old code, the command may be changed or perhaps even outright gone and you will need to update your code too!
Sometimes you want to unload a package in R. Why would you want to do this?
Sometimes you want to unload a package in the middle of a script - the package you have loaded may not play nicely with another package you want to use.
Describe how to unload a package in R using the Console panel.
To unload a given package you can use the detach() function.
Give an example of how to unload a package in R using the Console panel.
For example, detach("package:ggplot2", unload=TRUE) would unload the ggplot2 package (that has been loaded previously).
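The effect of detach( ) can be checked on the search path. This sketch again assumes the splines package (shipped with R) as a stand-in for ggplot2, so it runs without installing anything.

```r
# Load a package, then unload it again.
library(splines)
detach("package:splines", unload = TRUE)

# After detaching, the package no longer appears among the attached packages.
still_attached <- "package:splines" %in% search()
print(still_attached)
```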
Describe how to unload a package in R using the RStudio interface.
In the lower right quadrant of RStudio there is a tab called "Packages". Within the RStudio interface, in the Packages tab, you can simply unload a package by unchecking the box beside the package name.
Unloading a package in R is also known as ____________ a package.
detaching
Describe how to uninstall a package in R using the Console panel.
If you no longer want to have a package installed, you can simply uninstall it using the function remove.packages( ).
Give an example of a function you can enter into a Console panel in order to uninstall a package in R.
For example to remove the package ggplot2, enter the following command into the Console panel of RStudio: remove.packages("ggplot2")
Describe how to uninstall a package in R using the RStudio interface.
In the lower right quadrant of RStudio there is a tab called "Packages". In the Packages tab, clicking on the "X" at the end of a package's row will uninstall that package.
Why is it important to know what version of R you are running when it comes to installing packages?
Sometimes, when you are looking at a package that you might want to install, you will see that it requires a certain version of R to run. To know if you can use that package, you need to know what version of R you are running!
Describe the 1st of 3 different ways to know what version of R you are running.
One way to know your R version is to check when you first open R/RStudio - the first thing it outputs in the console tells you what version of R is currently running.
Describe the 2nd of 3 different ways to know what version of R you are running.
You can type the command word: version into the console and it will output information on the R version you are running.
Describe the 3rd of 3 different ways to know what version of R you are running.
You can type the command into the Console window: sessionInfo( ) - it will tell you what version of R you are running along with a listing of all of the packages you have loaded.
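When a package states a minimum R version, the comparison can also be done in code. A minimal sketch, where the required version "3.5.0" is a made-up requirement used only for illustration:

```r
# Human-readable version of the R that is currently running.
print(R.version.string)

# Hypothetical minimum version demanded by some package.
required_version <- "3.5.0"

# getRversion() returns a version object that compares correctly with a string.
meets_requirement <- getRversion() >= required_version
print(meets_requirement)
```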
How is the information gathered from the command sessionInfo( ) helpful when posting questions to forums?
The output of this command is a great detail to include when posting a question to forums - it tells potential helpers a lot of information about your OS, R, and the packages (plus their version numbers!) that you are using.
Describe the 1st of 2 ways to find the functions that are included within a package.
To do this, you can look at the manual/help pages included in all (well-made) packages. In the console, you can use the help( ) function to access a package's help files.
Give an example of using the help( ) function to access a package's help files.
Try help(package = "ggplot2") and you will see all of the many functions that ggplot2 provides.
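Besides the help pages, the functions of an attached package can be listed directly in the console. This sketch uses the stats package, which is attached by default in a standard R session, in place of ggplot2.

```r
# Names of everything the attached package makes available.
exported <- ls("package:stats")

print(length(exported))
print(head(exported))

# Check whether a particular function belongs to the package.
print("median" %in% exported)
```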
Describe the 2nd of 2 ways to find the functions that are included within a package.
In the lower right quadrant of RStudio there is a tab called "Packages". You can access the help files through the Packages tab - clicking on any package name should open up the associated help files in the "Help" tab, found in that same quadrant, beside the Packages tab. Clicking on any one of these help pages will take you to that function's help page, which tells you what that function is for and how to use it.
True or False. Once you know what function within a package you want to use, you simply call it in the console like any other function you may have used before. Once a package has been loaded, it is as if it were a part of the base R functionality.
True
What are vignettes?
Many packages include vignettes that help address questions about what functions within a package are right for you or how to use them. Vignettes are extended help files, that include an overview of the package and its functions, but often they go the extra mile and include detailed examples of how to use the functions in plain words that you can follow along with to see how to use the package.
Describe how you can see the vignettes included with a package.
To see the vignettes included in a package, you can use the browseVignettes( ) function.
Provide an example of using the browseVignettes( ) function.
For example, let's look at the vignettes included in ggplot2:
browseVignettes("ggplot2")
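browseVignettes( ) opens a browser page, so in a plain console it can be handier to list vignettes as text. The sketch below uses the related vignette( ) function on the grid package, which ships with R; whether any vignettes are listed depends on how your copy of R was built.

```r
# vignette(package = ...) returns an object whose $results component is a
# matrix with one row per vignette.
grid_vignettes <- vignette(package = "grid")$results

# Show the short name and title of each vignette that was found.
print(grid_vignettes[, c("Item", "Title"), drop = FALSE])
```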
True or False. Vignettes can be helpful by providing clear instructions on how to use the included functions of a package.
True
If you still want to learn more about R packages, name two great resources.
1) R Packages: A Beginner's Guide from Adolfo Álvarez on DataCamp.
2) A lesson from the University of Washington, on an Introduction to R Packages from Ken Rice and Timothy Thornton.
Explain R Projects.
One of the ways people organize their work in R is through the use of R Projects, a built in functionality of RStudio that helps to keep all your related files together.
Describe how R Projects works in RStudio.
When you make a Project, it creates a folder where all files will be kept, which is helpful for organizing yourself and keeping multiple projects separate from each other. When you re-open a project, RStudio remembers what files were open and will restore the work environment as if you had never left - which is very helpful when you are starting back up on a project after some time off!
True or False. Functionally, creating a Project in R will create a new folder and assign that as the working directory so that all files generated will be assigned to the same directory.
True
Describe the 1st of 3 benefits of using Projects in RStudio.
It starts the organization process off right! It creates a folder for you and now you have a place to store all of your input data, your code, and the output of your code. Everything you are working on within a Project is self-contained; which often means finding things is much easier - there's only one place to look!
Describe the 2nd of 3 benefits of using Projects in RStudio.
Since everything related to one project is all in the same place, it is much easier to share your work with others - either by directly sharing the folder/files, or by associating it with version control software.
Describe the 3rd of 3 benefits of using Projects in RStudio.
Since RStudio remembers what documents you had open when you closed the session, it is easier to pick a project up after a break - everything is set-up just as you left it.
Describe the 3 ways to make a project in RStudio.
1) From scratch - this will create a new directory for all your files to go in.
2) From an existing folder - this will link an existing directory with RStudio.
3) From version control - this will "clone" an existing project onto your computer.
Describe 3 ways to start a new project in RStudio.
1) Open RStudio, and under File, select "New Project".
2) Using the Projects toolbar and selecting "New Project" in the drop down menu.
3) Use the "New Project" shortcut in the toolbar.
Describe the 6 steps to creating a new project in RStudio.
1) Open RStudio, and under File, select "New Project".
2) A window will appear. Select "New Directory".
3) When prompted about the Project type, select "New Project".
4) Pick a name for your project and for this time, save it to your Desktop. This will create a folder on your Desktop where all of the files associated with this Project will be kept.
5) Click "Create Project."
6) A blank RStudio session should open.
What 2 things should be noted when you create a new project in RStudio?
1) When you create a new project in RStudio, in the "Files" quadrant of the screen, you can see that RStudio has made this new directory your working directory and generated a single file with the extension ".Rproj".
2) In the upper-right of the window, there is a Projects toolbar that states the name of your current Project and has a drop down menu with a few different options.
Describe 3 ways to open a project in RStudio.
1) Double clicking the .Rproj file on your computer.
2) Within RStudio by opening RStudio and going to File > Open Project.
3) Use the Project toolbar in RStudio and open the drop down menu and select "Open Project..."
Describe 3 ways to quit a project in RStudio.
1) Close the RStudio window.
2) Go to File > Close Project.
3) Use the Project toolbar by clicking on the drop down menu and choosing "Close Project".
Describe what happens when you quit a project in RStudio.
Quitting a Project will cause RStudio to write which documents are currently open (so they can be restored when you start back up again) and it then closes the R session.
What is one thing to be mindful of concerning the quitting of a project in RStudio?
When you set up your Project, you can tell it to save environment (so, for example, all of your variables and data tables will be preloaded when you reopen the project), but this is NOT the default behavior.
Describe how to switch between projects in RStudio using the Projects toolbar.
On the Projects toolbar icon in the upper right-hand corner of RStudio, click on the drop down menu and choose "Open Project" and find your new Project you want to open - this will save the current Project, close it, and then open the new Project within the same window.
Describe how to have multiple projects open in RStudio using the Projects toolbar.
On the Projects toolbar icon in the upper right-hand corner of RStudio, click on the drop down menu and choose "Open Project in New Session" and find your new Project you want to open - this will open the new Project in a new window and will leave existing projects open.
Describe how to switch between projects in RStudio using the File menu.
Click on File in the upper left-hand corner of RStudio, click on the drop down menu and choose "Open Project" and find your new Project you want to open - this will save the current Project, close it, and then open the new Project within the same window.
Describe how to have multiple projects open in RStudio using the File menu.
Click on File in the upper left-hand corner of RStudio, click on the drop down menu and choose "Open Project in New Session" and find your new Project you want to open - this will open the new Project in a new window and will leave existing projects open.
Describe best practice for file structures and setting up directories or folder structure for a project in RStudio.
Most file structures are set-up around having 3 different types of directories:
1) A directory containing the raw data. (Called Data)
2) A directory that you keep scripts/R files in. (Called Scripts)
3) A directory for the output of your code. (Called Output)
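The three-folder layout above can be created from the console. This is a minimal sketch that builds the folders inside a temporary directory; the project name "my_project" is hypothetical, and in a real Project you would create the folders relative to the working directory instead.

```r
# Hypothetical project location inside R's temporary directory.
project_root <- file.path(tempdir(), "my_project")

# The three directories described above.
folders <- file.path(project_root, c("Data", "Scripts", "Output"))

# recursive = TRUE also creates the project folder itself if needed.
for (folder in folders) {
  dir.create(folder, recursive = TRUE, showWarnings = FALSE)
}

print(list.files(project_root))
```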
Why is a good folder structure important for a project?
It can save you organizational headaches later on in a project when you can't quite remember where something is.
What file extension do Projects in R use?
.Rproj
True or False. Creating a new project from scratch will initiate version control.
False. Creating a new project from scratch will create a new folder and open a blank RStudio window but it will NOT initiate version control.
What is Version Control?
Version control is a system that records changes that are made to a file or a set of files over time. As you make edits, the version control system takes snapshots of your files and the changes, and then saves those snapshots so you can refer or revert back to previous versions later if need be.
What Microsoft Office feature is Version Control similar to?
If you have ever used the "Track changes" feature in Microsoft Word, you have seen a rudimentary type of version control, in which the changes to a file are tracked, and you can either choose to keep those edits or revert to the original format.
How is Version Control superior to the "Track Changes" feature in Microsoft Word?
Version control systems are like a more sophisticated "Track changes" - in that they are far more powerful and are capable of meticulously tracking successive changes on many files, with potentially many people working simultaneously on the same groups of files.
Describe the 1st of 4 benefits of using Version Control.
Without version control, you might be keeping multiple, very similar copies of a file. And this could be dangerous - you might start editing the wrong version, not recognizing that the document labelled "FINAL" has been further edited to "FINAL2" - and now all your new changes have been applied to the wrong file! Version control systems help to solve this problem by keeping a single, updated version of each file, with a record of all previous versions AND a record of exactly what changed between the versions.
Describe the 2nd of 4 benefits of using Version Control.
It keeps a record of all changes made to the files. This can be of great help when you are collaborating with many people on the same files - the version control software keeps track of who, when, and why those specific changes were made. It's like "Track changes" to the extreme!
Describe the 3rd of 4 benefits of using Version Control.
The Version Control record is helpful when developing code, if you realize after some time that you made a mistake and introduced an error. You can find the last time you edited that particular bit of code, see the changes you made, and revert back to that original, unbroken code, leaving everything else you've done in the meanwhile untouched!
Describe the 4th of 4 benefits of using Version Control.
When working with a group of people on the same set of files, version control is helpful for ensuring that you aren't making changes to files that conflict with other changes. If you've ever shared a document with another person for editing, you know the frustration of integrating their edits with a document that has changed since you sent the original file - now you have two versions of that same original document. Version control allows multiple people to work on the same file and then helps merge all of the versions of the file and all of their edits into one cohesive file.
What is Git?
Git is a free and open source Version Control system.
True or False. Subversion is the most popular Version Control system.
False. Git, developed in 2005, is the most commonly used Version Control system available with almost 70% of users of Version Control using it.
Describe one of the main benefits of using Git as Version Control.
One of the main benefits of Git is that it keeps a local copy of your work and revisions, which you can then edit offline, and then once you return to internet service, you can sync your copy of the work, with all of your new edits and tracked changes to the main repository online. Additionally, since all collaborators on a project have their own local copy of the code, everybody can simultaneously work on their own parts of the code, without disturbing that common repository.
Describe another significant benefit of using Git as Version Control.
RStudio and Git interface with each other with ease.
In the context of Version Control, what is GitHub?
GitHub is an online interface for Git.
Explain the relationship between Git and GitHub.
Git is software used locally on your computer to record changes. GitHub is a host for your files and the records of the changes made. The files are on your computer, but they are also hosted online and are accessible from any computer. GitHub has the added benefit of interfacing with Git to keep track of all of your file versions and changes.
True or False. You can sort of think of GitHub as similar to DropBox.
True
In the context of Version Control vocabulary, what is a repository?
All of your version controlled files (and the recorded changes) are located in a repository. Repositories are what are hosted on GitHub.
True or False. A repository is equivalent to a project's folder/directory.
True
List the 3 different hosting options for repositories on GitHub.
1) Keep the repository private.
2) Share the repository with select collaborators.
3) Make the repository public so anybody can see the files and their history.
In the context of Version Control vocabulary, repository is often shortened to ___________.
repo
In the context of Version Control vocabulary, what is a commit?
To commit is to save your edits and the changes made. A commit is like a snapshot of your files: Git compares the previous version of all of your files in the repo to the current version and identifies those that have changed since then. Those that have not changed, it maintains that previously stored file, untouched. Those that have changed, it compares the files, logs the changes and stores the new version of your file.
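The snapshot idea can be imitated in a few lines of R. This is only a toy illustration, not how Git is implemented: it assumes a throwaway folder in the temporary directory, and the function name snapshot and the file names are hypothetical. It fingerprints each file, edits one, and reports which file changed.

```r
# Throwaway folder standing in for a repository.
repo <- file.path(tempdir(), "toy_repo")
dir.create(repo, showWarnings = FALSE)
writeLines("x <- 1", file.path(repo, "analysis.R"))
writeLines("project notes", file.path(repo, "README.txt"))

# Hypothetical helper: fingerprint every file in a folder by its contents.
snapshot <- function(dir) {
  files <- list.files(dir, full.names = TRUE)
  hashes <- tools::md5sum(files)
  names(hashes) <- basename(files)
  hashes
}

before <- snapshot(repo)

# Edit one file, then take a second snapshot.
writeLines("x <- 2", file.path(repo, "analysis.R"))
after <- snapshot(repo)

# Files whose fingerprint differs between the two snapshots.
changed_files <- names(after)[after != before[names(after)]]
print(changed_files)
```

Only analysis.R is reported, since README.txt was left untouched between the two snapshots.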
What is commonly done when you commit a file?
When you commit a file, typically you accompany that file change with a little note about what you changed and why.
Explain the usefulness of commits in the context of a Version Control system.
Commits are at the heart of a Version Control system. If you find a mistake, you revert your files to a previous commit. If you want to see what has changed in a file over time, you compare the commits and look at the messages to see why and who.
In the context of Version Control vocabulary, what is a push? Why would you want to do this?
Push means updating the repository with your edits. Since Git involves making changes locally, you need to be able to share your changes with the common, online repository. Pushing is sending those committed changes to that repository, so now everybody has access to your edits.
In the context of Version Control vocabulary, what is a pull? Why would you want to do this?
Pull means updating your local version of the repository to the current version, since others may have edited in the meanwhile. Because the shared repository is hosted online and any of your collaborators (or even yourself on a different computer!) could have made changes to the files and then pushed them to the shared repository, you are behind the times! The files you have locally on your computer may be outdated, so you pull to check if you are up to date with the main repository.
In the context of Version Control vocabulary, what is staging? Why is this important?
Staging is the act of preparing a file for a commit. For example, if since your last commit you have edited three files for completely different reasons, you don't want to commit all of the changes in one go; your message on why you are making the commit and what has changed will be complicated since three files have been changed for different reasons. So instead, you can stage just one of the files and prepare it for committing. Once you've committed that file, you can stage the second file and commit it. And so on. Staging allows you to separate out file changes into separate commits.
Files are hosted in a ___________ that is shared online with collaborators. You ___________ the repository's contents so that you have a local copy of the files that you can edit. Once you are happy with your changes to a file, you ___________ the file and then ___________ it. You ___________ this commit to the shared repository. This uploads your new file and all of the changes and is accompanied by a message explaining what changed, why and by whom.
repository, pull, stage, commit, push
In the context of Version Control vocabulary, what is a branch? How do you stop working on a branch?
A branch occurs when the same file has two simultaneous copies. When you are working locally and editing a file, you have created a branch where your edits are not shared with the main repository (yet) - so there are two versions of the file: the version that everybody has access to on the repository and your local edited version of the file. Until you push your changes and merge them back into the main repository, you are working on a branch. Following a branch point, the version history splits into two: one line tracks changes to the original file in the repository, which others may be editing, and the other tracks your changes on your branch. The two lines are later merged back together.
In the context of Version Control vocabulary, what is a merge? How can problems occur with a merge? How are they resolved?
In a merge, independent edits of the same file are incorporated into a single, unified file. Independent edits are identified by Git and are brought together into a single file, with both sets of edits incorporated. But, you can see a potential problem here - if both people made an edit to the same sentence that precludes one of the edits from being possible, we have a problem! Git recognizes this disparity (conflict) and asks for user assistance in picking which edit to keep.
In the context of Version Control vocabulary, what is a conflict? How is this resolved?
Conflict occurs when multiple people make changes to the same file and Git is unable to merge the edits. You are presented with the option to manually try and merge the edits or to keep one edit over the other.
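Merging and conflicts can be illustrated with a toy line-by-line merge. This sketch is a simplification and not Git's actual algorithm: it assumes all three versions of the file have the same number of lines, and the function name merge_lines is hypothetical. A line changed by only one person is accepted; a line changed differently by both is flagged as a conflict.

```r
# Toy three-way merge: base is the common original, ours and theirs are the
# two independently edited copies (all the same length).
merge_lines <- function(base, ours, theirs) {
  stopifnot(length(base) == length(ours), length(base) == length(theirs))
  merged <- character(length(base))
  conflict <- logical(length(base))
  for (i in seq_along(base)) {
    if (ours[i] == theirs[i]) {
      merged[i] <- ours[i]          # same result on both sides
    } else if (ours[i] == base[i]) {
      merged[i] <- theirs[i]        # only they edited this line
    } else if (theirs[i] == base[i]) {
      merged[i] <- ours[i]          # only we edited this line
    } else {
      merged[i] <- NA_character_    # both edited it differently
      conflict[i] <- TRUE
    }
  }
  list(merged = merged, conflicts = which(conflict))
}

base <- c("a", "b", "c")

# Edits on different lines merge cleanly.
clean <- merge_lines(base, ours = c("A", "b", "c"), theirs = c("a", "b", "C"))
print(clean$merged)

# Edits to the same line conflict and need a human decision.
clash <- merge_lines(base, ours = c("A", "b", "c"), theirs = c("X", "b", "c"))
print(clash$conflicts)
```

In the first call both edits survive in the merged result; in the second, line 1 is reported as a conflict because the two edits cannot both be kept.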
In the context of Version Control vocabulary, what is a clone?
A clone is making a copy of an existing Git repository. If you have just been brought on to a project that has been tracked with version control, you would clone the repository to get access to and create a local version of all of the repository's files and all of the tracked changes.
In the context of Version Control vocabulary, what is a fork?
A fork is a personal copy of a repository that you have taken from another person. If somebody is working on a cool project and you want to play around with it, you can fork their repository and then when you make changes, the edits are logged on your repository, not theirs.
When it comes to Version Control software like Git, describe the 1st of 3 best practices.
Make purposeful single issue commits. Each commit should only address a single issue. This way if you need to identify when you changed a certain line of code, there is only one place to look to identify the change and you can easily see how to revert the code.
When it comes to Version Control software like Git, describe the 2nd of 3 best practices.
Write informative messages on each commit; this is a helpful habit to get into. If each message is precise in what was being changed, anybody can examine the committed file and identify the purpose for your change. Additionally, if you are looking for a specific edit you made in the past, you can easily scan through all of your commits to identify those changes related to the desired edit.
When it comes to Version Control software like Git, describe the 3rd of 3 best practices.
Be cognizant of the version of files you are working on. Frequently check that you are up to date with the current repo by frequently pulling. Additionally, don't hoard your edited files - once you have committed your files (and written that helpful message!), you should push those changes to the common repository. If you are done editing a section of code and are planning on moving on to an unrelated problem, you need to share that edit with your collaborators! Pull and push often.
____________ is a cloud-based management system for your version controlled files.
GitHub
Describe how GitHub is similar to Dropbox.
Like Dropbox, when you use GitHub, your files are both locally on your computer and hosted online and easily accessible.
Describe how GitHub operates.
GitHub's interface allows you to manage version control and provides users with a web-based interface for creating projects, sharing them, updating code, etc.
Where do you go to create a GitHub account?
https://github.com/
After you have logged into GitHub where can you control your account and view your contribution histories and repositories?
This information can be found in your profile. To access, after logging in, go to User Settings in the upper right-hand corner and select Your profile in the drop-down menu that appears.
On the account page of GitHub, why must you be cautious about changing the Username?
There can be unintended consequences when you change your username. After changing your username, your old username becomes available for anyone else to claim. Most references to your repositories under the old username automatically change to the new username. However, some links to your profile won't automatically redirect.
When you are active on GitHub and are collaborating with others, where can you find messages and notifications for all the repositories, teams, and conversations you are a part of?
After logging into GitHub, click on the bell icon in the upper right-hand corner to view notifications.
If you get stuck after logging into GitHub what can you do?
Along the bottom of every. single. page. there is the "Help" button. GitHub has a great help system in place - if you ever have a question about GitHub, this should be your first point to search!
What is an additional resource for learning how to use GitHub?
After logging into GitHub you can click on the green button labeled Read the Guide. This is a mini tutorial to get you started with GitHub.
When using GitHub, describe what you can expect to see when you check out the commit history of a repository.
You can find all of the changes that have been made to the repository, and you can see who made the change, when they made the change, and provided you wrote an appropriate commit message, you can see why they made the change!
Describe what information can be found by clicking on the Repositories tab on the Profile page of GitHub.
By clicking on the Repositories tab of the Profile page you can see all of your repositories, a brief description, the time of the last edit, and along the right hand side, there is an activity graph, showing when and how many edits have been made on the repository.
How can you view the latest repository that was created in GitHub?
This information can be found in your profile. To access, after logging in to GitHub, go to User Settings in the upper right-hand corner and select Your profile in the drop-down menu that appears. On the Profile page you will see the latest repository that was created.
___________ is the free and open source version control system which GitHub is built on.
Git
For the purposes of this course, what is one of the main benefits of using the Git system as the version control?
One of the main benefits of using the Git system is its compatibility with RStudio.
Where do you go to download and install Git?
https://git-scm.com/download
Click on the appropriate download link for your operating system. This should initiate the download process.
Why must Git be configured after installing it on your computer?
After Git is installed, you need to configure it for use with GitHub, in preparation for linking it with RStudio.
___________ is a code hosting platform for version control and collaboration. It lets you and others work together on projects from anywhere.
GitHub
List the 9 steps after creating a repository in GitHub for creating and editing a branch and then creating a pull request to merge the branch with the main branch.
1) Create a new branch by giving it a name.
2) Make changes to the newly created branch which is a copy of the main branch.
3) Write a commit message which describes your changes.
4) Click Commit changes button.
5) Open a Pull Request to solicit feedback for your proposed changes. Look over your changes in the diffs on the Compare page and make sure they are what you want to submit.
6) Create a pull request, give it a title and brief description.
7) Merge pull request with the main branch.
8) Confirm merge.
9) After merge is completed, delete branch.
What is Git Bash?
Git Bash is the command line interface on your computer that is used to configure Git.
Why would you want to link GitHub and Git to RStudio?
Linking Git/GitHub with RStudio maximizes the benefits of using RStudio in your version control pipelines. Once linked, RStudio will recognize Git as your version control software.
True or False. Once GitHub and Git are linked to RStudio it is possible to push and pull your repositories from within RStudio.
True
In what quadrant of RStudio can you find the Git tab?
Environment
List, in order, the 4 steps to send a file to GitHub from within RStudio.
1) Stage
2) Commit message
3) Commit
4) Push
List 4 things you can do in the Commit window of RStudio.
1) See the differences between your original file and your updated file.
2) Stage files.
3) Pull and push content from the repository.
4) Write a commit message.
List the 3 options in the upper-right hand corner of the repository page of GitHub.
1) watch
2) star
3) fork
Provide the git command that will initiate a git repository locally.
git init
Provide the git command that you can use to see the Git configuration.
git config --list
What git command can you use to change the name associated with each of your commits to Jane Doe?
git config --global user.name "Jane Doe"
True or False. Version control software is difficult and time-consuming to use.
False
True or False. Version control software allows you to go back if you make mistakes.
True
True or False. Version control software is a way to allow multiple people to work together on a set of files for a project.
True
Why is it important to start version control in Git and create a repository on GitHub to link with RStudio before starting an R project?
RStudio and GitHub recognize this can happen and have steps in place to help you if you start an R project before creating a repository on GitHub (with Git for version control) and linking it with RStudio. However, admittedly, this is slightly more troublesome to do than just creating a repository first on GitHub (with Git version control) and then linking it with RStudio before starting the R project.
Describe how to create an R project in RStudio with no version control or Git repository.
In RStudio, go to File > New Project > New Directory > New Project and name your project. Do NOT click "Create a git repository" so there is no version control. Click Create Project.
Describe the steps of linking Git for version control after already starting an R project that has no version control.
1) Set up RStudio to interact with Git. Open Git Bash or Terminal and navigate to the directory containing your project files. Move around directories by typing:
cd ~/dir/name/of/path/to/file
When the command prompt in the line before the dollar sign says the correct directory location of your project, you are in the correct location.
2) Once here, type git init followed by git add . - this initializes (init) this directory as a git repository and adds all of the files in the directory (.) to your local repository.
3) Commit these changes to the git repository using git commit -m "Initial commit"
Describe the steps of linking to a GitHub repository after already starting an R project and having linked the R project to Git for version control.
1) Go to GitHub.com and create a new repository. Make sure the exact same name is used as your R project. Do NOT initialize a README file, .gitignore, or license.
2) Upon creating the repository, you should see that there is an option to "Push an existing repository from the command line" with instructions below containing code on how to do so. In Git Bash or Terminal, copy and paste these lines of code to link your repository with GitHub.
3) When you re-open your project in RStudio, you should now have access to the Git tab in the upper right quadrant and can push to GitHub from within RStudio any future changes. You can push your R project repository to your GitHub repository of the same name.
Describe the steps to clone an existing project from a GitHub repository to a new project in RStudio. This is useful when you are asked to contribute to an existing project that others are working on: you can link the existing project with your RStudio.
1) In RStudio, go to File > New Project > Version Control. Select Git as your version control system and provide the URL to the repository that you are attempting to clone and select a location on your computer to store the files locally.
2) Create the project. All the existing files in the repository should now be stored locally on your computer and you have the ability to push edits from your RStudio interface.
What is R Markdown?
R Markdown is a way of creating fully reproducible documents, in which both text and code can be combined.
List at least 5 things you can make using R Markdown.
1) bullets
2) bold
3) italics
4) links
5) run inline r code
R Markdown can be used to render plain documents in what 4 formats?
1) HTML pages
2) PDF
3) Word documents
4) Slides
True or False. The symbols in R Markdown you use to signal bold or italic letters are compatible with HTML, PDF, Word documents and slides.
True
List 2 major reasons to use R Markdown.
1) The reproducibility of using R Markdown.
2) Since R Markdown is plain text, it works very well with version control systems. It is easy to track what character changes occur between commits; unlike other formats that are not plain text.
What is meant by reproducibility? How does R Markdown make things reproducible?
Using R Markdown, you can easily combine text and code chunks in one document. You can integrate introductions, hypotheses, the code that you are running, the results of that code and your conclusions all in one place. Sharing what you did, why you did it and how it turned out becomes so simple - and the person you share it with can re-run your code and get the exact same answers you got.
Provide another reason why making things reproducible is a desirable feature of R Markdown.
Sometimes you will be working on a project that takes many weeks to complete; you want to be able to see what you did a long time ago (and perhaps be reminded exactly why you were doing this) and you can see exactly what you ran AND the results of that code - and R Markdown documents allow you to do that.
Describe how to install R Markdown.
To install, type the following into the console window of RStudio and press Enter:
install.packages("rmarkdown")
Describe how to create an R Markdown document in RStudio.
1) To create an R Markdown document, in RStudio, go to File > New File > R Markdown.
2) In the box that appears, make sure Document is selected on the left, type in a title, author, select a Default Output Format and click OK.
3) In the Source window you will see a little explanation on R Markdown files.
What are the three main sections of an R Markdown document?
1) Header
2) Text sections
3) Code chunks
The header section of an R Markdown document is at the __________ and is bounded by _______________.
top, 3 dashes
Describe the header section of an R Markdown document.
This is where you can specify details like the title, your name, the date, and what kind of document you want output. If you filled in the blanks in the window earlier, these should be filled out for you.
The text sections of an R Markdown document start with ___________.
"## R Markdown"
Describe the text sections of an R Markdown document.
This section will render as text when you produce the PDF (selected Default Output Format) of this file - and all of the formatting you will learn generally applies to this section.
Describe the code chunks sections of an R Markdown document.
These are bounded by the triple backticks. These are pieces of R code ("chunks") that you can run right from within your document - and the output of this code will be included in the PDF (selected Default Output Format) when you create it.
When you are done with a document, in R Markdown, you are said to ____________ your plain text and code into your final document.
"knit"
When you are done with a document, in R Markdown, how do you go about "knitting" your plain text and code into a final document?
1) To do so, click on the "Knit" button along the top of the source panel in RStudio.
2) When you do so, it will prompt you to save the document as an RMD file.
For an R Markdown document, to ___________ text, you surround it by two ___________ on either side. Similarly, to ___________ text, you surround the word with a single ___________ on either side.
bold, asterisks, italicise, asterisk
Describe how to make section headers in an R Markdown document.
To make a section header, you put a series of hash marks (#). The number of hash marks determines what level of heading it is. One hash mark is the highest level and will make the largest text, two hash marks is the next highest level and so on.
# Header level 1
## Header level 2
### Header level 3...
Describe how to make an R code chunk in an R Markdown document.
To make an R code chunk, you type 3 backticks, followed by curly brackets surrounding a lower case R, put your code on a new line and end the chunk with 3 more backticks.
Describe 2 ways that it can be easier to write R code chunks in RStudio.
RStudio recognizes you will be writing R code chunks frequently and it provides 2 short cuts:
1) Ctrl+Alt+I (Windows) or Cmd + Option + I (Mac).
2) Along the top of the Source quadrant in RStudio, there is a green "Insert" button, that produces an empty code chunk.
In RStudio, where do you see the output of running the code in the R Markdown document?
In the Console window of RStudio.
Say you are not ready to knit your R Markdown document, but want to see the output of your code, describe 2 ways you can view the output of a segment of code in the R Markdown document.
Select the line of code you want to run:
1) Use Ctrl+Enter
2) Click on the "Run" button along the top of the Source window in RStudio.
Describe 3 ways you can run multiple lines of code in a chunk all in one go from an R Markdown document using RStudio.
You can run the entire chunk by:
1) Using Ctrl+Shift+Enter
2) Selecting the line of code you want to run and hitting the green arrow button on the right side of the chunk.
3) Selecting the line of code you want to run, going to the Run menu and selecting Run current chunk.
Describe how to make bulleted lists in an R Markdown document.
Lists are easily created by preceding each prospective bullet point by a single dash, followed by a space. At the end of each bullet's line, end with 2 spaces.
For creating bulleted lists in an R Markdown document why is it important to end each bullet's line with 2 spaces?
This is a quirk of R Markdown that will cause spacing problems if not included.
What is a great resource for using R Markdown?
RStudio developers have produced an "R Markdown cheatsheet" that gives you everything you can do with R Markdown.
How would you strike through some text using R Markdown?
~~strikethrough~~
What is the format for including a link that appears as blue text in your R Markdown document?
[text that is shown](link.com)
How do you produce bold text in an R Markdown document?
**bold**
How do you produce italicized text in an R Markdown document?
*some text*
List, in order of difficulty, the 6 categories that data science and data analysis questions fall into.
1) Descriptive
2) Exploratory
3) Inferential
4) Predictive
5) Causal
6) Mechanistic
What is the goal of descriptive analysis?
The goal of descriptive analysis is to describe or summarize a set of data.
True or False. Descriptive analysis will generate simple summaries about the samples and their measurements.
True
What is usually the first type of analysis that is performed on a set of data?
Whenever you get a new dataset to examine, descriptive analysis is usually the first kind of analysis you will perform.
Name the 2 main types of common descriptive statistics.
1) Measures of Central Tendency
2) Measures of Variability
Name 3 examples of measures of central tendency.
1) Mean
2) Median
3) Mode
Name 3 examples of measures of variability.
1) Range
2) Variance
3) Standard deviation
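A minimal sketch of the measures listed above, using only Python's standard `statistics` module. The sample values are invented for illustration; note that `variance` and `stdev` here are the sample versions (n - 1 denominator), which is an assumption about which definition is wanted.

```python
import statistics

# Hypothetical sample of measurements (made up for illustration).
sample = [2, 4, 4, 5, 7, 9, 11]

# Measures of central tendency
mean = statistics.mean(sample)      # arithmetic average
median = statistics.median(sample)  # middle value of the sorted data
mode = statistics.mode(sample)      # most frequent value

# Measures of variability
value_range = max(sample) - min(sample)  # largest minus smallest value
variance = statistics.variance(sample)   # sample variance (n - 1 denominator)
std_dev = statistics.stdev(sample)       # sample standard deviation

print(mean, median, mode, value_range, variance, std_dev)
```

These numbers only summarize the sample itself; they say nothing yet about a larger population.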
True or False. Descriptive statistics or descriptive data can be used to generalize the results of the analysis to a larger population or trying to make conclusions.
False. Descriptive statistics is aimed at summarizing your sample - not for generalizing the results of the analysis to a larger population or trying to make conclusions. Description of data is separated from making interpretations; generalizations and interpretations require additional statistical steps.
Give a common example of purely descriptive analysis or descriptive statistics.
Purely descriptive analysis can be seen in censuses.
What makes the census a purely descriptive type of data analysis?
In the census, the government collects a series of measurements on all of the country's citizens, which can then be summarized, for example as the age distribution in the US, stratified by sex. The goal of this is just to describe the distribution. There are no inferences about what this means or predictions on how the data might trend in the future. It is just a summary of the data collected.
What is the goal of exploratory analysis?
The goal of exploratory analysis is to examine or explore the data and find relationships that weren't previously known. Exploratory analyses explore how different measures might be related to each other but do not confirm that relationship as causative.
Exploratory analysis lies at the root of what common saying?
"Correlation does not imply causation". Just because you observe a relationship between two variables during exploratory analysis, it does not mean that one necessarily causes the other.
True or False. Exploratory analysis is useful for discovering new connections and should be the final say in answering the question on how or why data might be related to each other.
False. Exploratory analyses, while useful for discovering new connections, should not be the final say in answering a question! Exploratory analysis alone should never be used as the final say on why or how data might be related to each other.
How can exploratory analysis be useful if it is not the final say on why or how data might be related to each other?
It can allow you to formulate hypotheses and drive the design of future studies and data collection. Exploratory analysis can look at how two or more variables might be related to each other. The cause of these relationships is not apparent from exploratory analysis. All exploratory analysis can tell us is that a relationship exists, not the cause.
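One common way to check whether two variables are related during exploration is a correlation coefficient. The sketch below computes Pearson's r by hand; the function name `pearson_r` and both data series are hypothetical, chosen only to illustrate the idea.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical measurements, invented for illustration only.
hours_of_sunshine = [1, 2, 3, 4, 5]
ice_cream_sales = [12, 15, 19, 22, 27]

r = pearson_r(hours_of_sunshine, ice_cream_sales)
print(f"r = {r:.3f}")
```

A value of r close to 1 or -1 indicates a strong linear relationship, but by itself it does not show that one variable causes the other.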
What is the goal of inferential analysis?
The goal of inferential analyses is to use a relatively small sample of data to infer or say something about the population at large.
Inferential analysis is commonly the goal of _______________, where you have a small amount of information to extrapolate and generalize that information to a larger group.
statistical modelling
What does inferential analysis typically involve?
Inferential analysis typically involves using the data you have to estimate that value in the population and then give a measure of your uncertainty about your estimate.
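As a rough sketch of "estimate plus uncertainty", the example below estimates a population mean from a small sample and attaches an approximate 95% interval. The sample values are invented, and the use of the normal critical value 1.96 is a simplifying assumption; a t-based interval would usually be preferred for a sample this small.

```python
import math
import statistics

# Hypothetical sample drawn from a larger population (values invented).
sample = [5.1, 4.8, 5.6, 5.0, 4.9, 5.3, 5.2, 4.7, 5.4, 5.0]

n = len(sample)
estimate = statistics.mean(sample)  # point estimate of the population mean
standard_error = statistics.stdev(sample) / math.sqrt(n)

# Approximate 95% interval using the normal critical value 1.96.
# Assumption: 1.96 keeps the sketch dependency-free; a t critical value
# would give a slightly wider interval for n = 10.
z = 1.96
lower = estimate - z * standard_error
upper = estimate + z * standard_error

print(f"estimate = {estimate:.2f}, approx. 95% interval = ({lower:.2f}, {upper:.2f})")
```

The interval is only meaningful if the sample is representative of the population being described.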
What must you be cautious about when doing inferential analysis? Why can this be an issue?
Since you are moving from a small amount of data and trying to generalize to a larger population, your ability to accurately infer information about the larger population depends heavily on your sampling scheme - if the data you collect is not from a representative sample of the population, the generalizations you infer won't be accurate for the population.
Why should you NOT use census data for inferential analysis?
A census already collects information on (functionally) the entire population, so there is nobody left to infer to.
Why can you not infer data from the US census to another country?
The US isn't necessarily representative of another country that you are trying to infer knowledge about.
Give an example of inferential analysis.
A good example of inferential analysis is a study in which a subset of the US population was assayed for their life expectancy given the level of air pollution they experienced. This study uses the data they collected from a sample of the US population to infer how air pollution might be impacting life expectancy in the entire US.
What is the goal of Predictive Analysis?
The goal of predictive analysis is to use current data to make predictions about future data. Essentially, you are using current and historical data to find patterns and predict the likelihood of future outcomes.
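A very simple instance of this idea is fitting a straight line to past observations and extrapolating one step ahead. The sketch below uses ordinary least squares; the function name `fit_line` and the data are hypothetical, and real predictive models are usually more elaborate than this.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical historical data (invented): a time index and an observed outcome.
years = [1, 2, 3, 4, 5]
outcome = [3, 5, 7, 9, 11]

intercept, slope = fit_line(years, outcome)

# Use the pattern found in past data to predict the next, unseen time point.
predicted_next = intercept + slope * 6
print(predicted_next)
```

The prediction is only as good as the assumption that the historical pattern continues.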
Describe how predictive analysis is similar to inferential analysis.
Like in inferential analysis, your accuracy in predictive analysis is dependent on measuring the right variables. If you aren't measuring the right variables to predict an outcome, your predictive analysis is not going to be accurate.
True or False. There are many ways to build up Predictive Analysis models with some being better or worse for specific cases.
True
List 2 characteristics of a better performing predictive analysis model.
1) Having more data.
2) A simple model.
What is one caveat about predictive analysis that is similar to exploratory analysis?
Much like in exploratory analysis, in predictive analysis, just because one variable may predict another, it does not mean that one causes the other; you are just capitalizing on this observed relationship to predict the second variable.
Describe the challenges of predictive analysis.
A common saying is that prediction is hard, especially about the future. There aren't easy ways to gauge how well you are going to predict an event until that event has come to pass; so evaluating different approaches or models is a challenge.
Describe how FiveThirtyEight uses predictive analysis.
Using historical polling data and trends and current polling, FiveThirtyEight builds models to predict the outcomes in the next US Presidential vote - and has been fairly accurate at doing so! FiveThirtyEight's models accurately predicted the 2008 and 2012 elections and were widely considered an outlier in the 2016 US elections, as they were among the few models to give Donald Trump a chance of winning.
Describe the common weakness of the following forms of data analysis or data science categories: descriptive analysis, exploratory analysis, inferential analysis and predictive analysis.
The caveat to these forms of analysis is that we can only see correlations and can't get at the cause of the relationships we observe.
What is the goal of causal analysis?
The goal of causal analysis is to see what happens to one variable when we manipulate another variable - looking at the cause and effect of a relationship.
List 4 challenges of causal analysis.
1) Fairly complicated to do with observed data alone.
2) There will always be questions as to whether it is correlation driving your conclusions.
3) There will always be questions as to whether the assumptions underlying your analysis are valid.
4) Getting appropriate data for doing a causal analysis is a challenge.
True or False. Causal analysis is often considered the gold standard in data analysis, and is seen frequently in scientific studies where scientists are trying to identify the cause of a phenomenon, but often getting appropriate data for doing a causal analysis is a challenge.
True
How is causal analysis often applied?
Causal analyses are often applied to the results of randomized studies that were designed to identify causation.
What is one important thing to note about causal analysis?
When using causal analysis, the data is usually analysed in aggregate and observed relationships are usually average effects; so, while on average giving a certain population a drug may alleviate the symptoms of a disease, this causal relationship may not hold true for every single affected individual.
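The point about average effects can be sketched with a difference in group means. The outcome scores below are invented for illustration and do not come from any real trial.

```python
import statistics

# Hypothetical outcome scores from a randomized study (values invented).
treatment = [14, 9, 12, 15, 10]
control = [10, 8, 11, 9, 12]

# The estimated effect is a difference between group averages...
average_effect = statistics.mean(treatment) - statistics.mean(control)

# ...but individual outcomes overlap: some treated subjects still score
# below the control group's average.
treated_below_control_mean = [t for t in treatment if t < statistics.mean(control)]

print(average_effect, treated_below_control_mean)
```

A positive average effect therefore does not guarantee that every treated individual benefited.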
Describe an example where causal analysis is used in scientific analysis.
Randomized controlled trials for drugs are a prime example of causal analysis. For example, one randomized controlled trial examined the effects of a new drug on treating infants with spinal muscular atrophy. Comparing a sample of infants receiving the drug with a sample receiving a mock control, the researchers measured various clinical outcomes in the babies and looked at how the drug affected the outcomes.
True or False. Mechanistic analyses are not nearly as commonly used as the previous analyses.
True
What is the goal of Mechanistic Analysis?
The goal of mechanistic analysis is to understand the exact changes in variables that lead to exact changes in other variables.
Mechanistic analyses are exceedingly hard to use to infer much, except in what 2 scenarios?
1) Simple situations
2) Situations that are nicely modeled by deterministic equations.
How is mechanistic analysis commonly applied?
Mechanistic analyses are most commonly applied to physical or engineering sciences.
Why is mechanistic analysis not commonly used for biological sciences?
Data sets in the biological sciences are far too noisy for mechanistic analysis.
When mechanistic analysis is applied what is the source of noise in the data?
When these Mechanistic analyses are applied, the only noise in the data is measurement error, which can be accounted for.
Describe a mechanistic analysis application to engineering.
Engineers apply mechanistic analyses through a careful balance of controlling and manipulating variables with very accurate measures of both those variables and the desired outcome.
True or False. As a data scientist, you are a scientist and as such, need to have the ability to design proper experiments to best answer your data science questions! Proper experimental design is important to a data scientist.
True
What does experimental design mean?
Experimental design is organizing an experiment so that you have the correct data (and enough of it!) to clearly and effectively answer your data science question.
List the 4 steps of the experimental design process.
1) Clearly formulating your question in advance of any data collection.
2) Designing the best set-up possible to gather the data to answer your question.
3) Identifying problems or sources of error in your design.
4) Collecting the appropriate data.
Why should you care about experimental design? What is the risk of not caring?
Going into an analysis, you need to have a plan in advance of what you are going to do and how you are going to analyze the data. If you do the wrong analysis, you can come to the wrong conclusions!
True or False. Poor scientific practices have caused scientific papers to be retracted, or removed from literature. Sometimes these are a result of poor experimental design and analysis.
True
True or False. When the stakes are this high, experimental design is paramount.
True
In experimental design, define the independent variable.
The variable that the experimenter manipulates; it does not depend on other variables being measured. Often displayed on the x-axis.
In experimental design, define the dependent variable.
The variable that is expected to change as a result of changes in the independent variable. Often displayed on the y-axis, so that changes in X, the independent variable, effect changes in Y.
What is another name for the independent variable?
factor
What 3 things must be done when you are designing an experiment?
1) You have to decide what variables you will measure, and which you will manipulate to effect changes in other measured variables.
2) You must develop your hypothesis.
3) Consider if there are problems with this experiment that might cause an erroneous result.
In experimental design, define the hypothesis.
A hypothesis is essentially an educated guess as to the relationship between your variables and the outcome of your experiment. A hypothesis is the expected outcome of your experiment.
In experimental design, define the sample size.
Sample size is the number of experimental subjects you will include in your experiment.
True or False. There are ways to pick an optimal sample size.
True
In experimental design, define the confounder.
The Confounder is an extraneous variable that may affect the relationship between the dependent and independent variables.
In experimental design, define the control group.
A control group is a group of experimental subjects that are not manipulated.
In experimental design paradigms where you have a control group, you would have a group that received the experiment (___________) and a group that did not (___________). This way, you can compare the effects of the drug in the ____________ versus ___________ group.
treatment, control, treatment, control
A _______________ is a group of subjects that do not receive the treatment, but still have their dependent variables measured.
control group
Name 3 strategies for controlling Confounding effects.
1) Blind the subjects to their assigned treatment group.
2) Balance Confounders - any potential confounding variables should be distributed between each group roughly equally. Also known as stratifying variables.
3) Randomization
In experimental design, define the placebo effect.
The placebo effect occurs when a subject who knows they are in the treatment group (e.g., receiving the experimental drug) feels better, not from the drug itself, but from knowing they are receiving treatment.
Describe how to control for the placebo effect.
To combat the placebo effect, participants are often blinded to the treatment group they are in; this is usually achieved by giving the control group a mock treatment (e.g., a sugar pill that they are told is the drug). If the placebo effect is causing a problem with your experiment, both groups should experience it equally.
True or False. Blinding your study means that your subjects don't know what group they belong to - all participants receive a "treatment".
True
How is the balancing of Confounders achieved?
The "balancing" of confounders is often achieved by randomization.
List 2 reasons why randomization or randomly assigning individuals to each of your groups equally to balance confounders is a good idea.
1) Lessen the risk of accidentally biasing one group to be enriched for a confounder.
2) Help eliminate/reduce systematic errors.
True or False. Randomizing subjects to either the control or treatment group is a great strategy to reduce confounders' effects.
True
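Random assignment as described above can be sketched in a few lines. The function name `randomize`, the subject IDs and the seed are all hypothetical; the seed is there only so that the assignment can be reproduced.

```python
import random

def randomize(subjects, seed=None):
    """Randomly split subjects into (near-)equal treatment and control groups."""
    rng = random.Random(seed)  # seeded generator so the assignment is reproducible
    shuffled = list(subjects)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

# Hypothetical subject IDs.
subjects = [f"S{i:02d}" for i in range(1, 11)]
groups = randomize(subjects, seed=42)
print(groups)
```

Because group membership is decided by chance alone, a confounder such as age is unlikely to be concentrated in one group, although with small samples it is still worth checking the balance afterwards.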
In experimental design, define replication.
Replication is repeating an experiment with different experimental subjects. If you can repeat the experiment and collect a whole new set of data and still come to the same conclusion, your study is much stronger.
Name 4 possible sources of error in an experiment.
1) A single experiment's results may have occurred by chance.
2) A confounder was unevenly distributed across your groups.
3) There was a systematic error in the data collection.
4) There were some outliers.
Outside of possibly coming to the same conclusion from an experiment again, how else is replication useful?
Replication allows you to measure the variability of your data more accurately, which allows you to better assess whether any differences you see in your data are significant.
True or False. Replication studies are a great way to bolster your experimental results and get measures of variability in your data.
True
Once you have collected and analyzed your data, what is one of the next steps of being a good citizen scientist?
Share your data and code for analysis.
Where is a great place to share your code?
GitHub
What is a great resource on how to best share data?
Consult PDF document "How to share data with a statistician."
To facilitate the most efficient and timely analysis what 4 items of information should you pass to a statistician?
1. The raw data.
2. A tidy data set.
3. A code book describing each variable and its values in the tidy data set.
4. An explicit and exact recipe you used to go from 1 -> 2, 3.
Why is it important to pass the raw data to a statistician?
Passing raw data ensures that data provenance can be maintained throughout the workflow.
List 4 characteristics of raw data in the right format.
You know the raw data are in the right format if you:
1) Ran no software on the data.
2) Did not modify any of the data values. If you made any modifications of the raw data it is not the raw form of the data.
3) You did not remove any data from the data set.
4) You did not summarize the data in any way.
List 2 reasons why reporting modified data as raw data is a bad idea.
1) Reporting modified data as raw data is a very common way to slow down the analysis process, since the analyst will often have to do a forensic study of your data to figure out why the raw data looks weird.
2) If new data arrives, there will be confusion between the mixing of old and new data and what data is modified and un-modified.
List 4 characteristics of a tidy data set.
1) Each variable you measure should be in one column.
2) Each different observation of that variable should be in a different row.
3) There should be one table for each "kind" of variable.
4) If you have multiple tables, they should include a column in the table that allows them to be joined or merged.
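A small sketch of what these rules can look like in practice, assuming two hypothetical tables that share a `patient_id` column. The table contents and the `join_on` helper are invented for illustration.

```python
# Two hypothetical tidy tables: one row per observation, one column per variable.
patients = [
    {"patient_id": "P1", "sex": "female", "age_years": 54},
    {"patient_id": "P2", "sex": "male", "age_years": 61},
]
visits = [
    {"patient_id": "P1", "visit": 1, "mass_kg": 70.2},
    {"patient_id": "P1", "visit": 2, "mass_kg": 69.8},
    {"patient_id": "P2", "visit": 1, "mass_kg": 82.5},
]

def join_on(left, right, key):
    """Inner-join two lists of row dicts on a shared key column."""
    index = {row[key]: row for row in left}
    return [{**index[r[key]], **r} for r in right if r[key] in index]

merged = join_on(patients, visits, "patient_id")
for row in merged:
    print(row)
```

The shared key column is what lets the per-patient table and the per-visit table be combined without duplicating patient details in every visit row.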
Describe one simple thing you can do to make a data set much easier to handle.
Include a row at the top of each data table/spreadsheet that contains full, descriptive column (variable) names.
List 5 rules for presenting tidy data to a statistician using Excel.
1) The tidy data should be in one Excel file per table.
2) There should not be multiple worksheets.
3) No macros should be applied to the data.
4) No columns/cells should be highlighted.
5) Share the data in a CSV or TAB-delimited text file.
What is the risk of using Excel to present data to a statistician?
Reading CSV files into Excel can sometimes lead to non-reproducible handling of date and time variables.
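One plausible way date handling can go wrong is shown below: the same text yields two different dates depending on the format a tool assumes. This is an illustration of the ambiguity in general, not a description of what Excel does internally, and the cell value is hypothetical.

```python
from datetime import datetime

raw = "03/04/2021"  # hypothetical cell value: March 4 or April 3?

as_month_first = datetime.strptime(raw, "%m/%d/%Y").date()
as_day_first = datetime.strptime(raw, "%d/%m/%Y").date()

print(as_month_first.isoformat())  # 2021-03-04
print(as_day_first.isoformat())    # 2021-04-03
```

Writing dates as unambiguous text such as `2021-03-04` and recording the format in the code book is one option for avoiding this kind of mix-up.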
At a minimum what 3 items of information must be included in the code book?
1) Information about the variables (including units!) in the data set not contained in the tidy data.
2) Information about the summary choices you made.
3) Information about the experimental study design you used.
Describe the structure of the code book.
1) A "Study Design" section that has a thorough description of how the data was collected and the study design.
2) A "Code Book" section that describes each variable and its units.
What is the common format of the document used for the code book?
A Word file.
Give 3 examples of questions you could address by providing information about how you did the data collection/study design in the code book section.
Is the study based on:
1) the first 20 observations received?
2) highly selected observations by some characteristic?
3) randomized experiments?
List the 5 data types for coding variables.
1) Continuous
2) Ordinal
3) Categorical
4) Missing
5) Censored
What are continuous variables?
Continuous variables are anything measured on a quantitative scale that could be any fractional number.
Give an example of a continuous variable.
An example would be something like mass measured in kg.
What is an ordinal variable?
Ordinal data are data that have a fixed, small (< 100) number of levels but are ordered.
Give an example of an ordinal variable.
This could be for example survey responses where the choices are: poor, fair, good.
What is a categorical variable?
Categorical data are data where there are multiple categories, but they aren't ordered.
Give an example of a categorical variable.
One example would be sex: male or female.
Why is a categorical variable attractive?
The categorical variable coding is attractive because it is self-documenting.
What is a missing variable?
Missing data are data that are unobserved and you don't know the mechanism.
How should missing data be coded?
You should code missing values as NA.
What is a censored variable?
Censored data are data where you know the missingness mechanism on some level.
Provide 2 examples of a censored variable.
1) A measurement being below a detection limit.
2) A patient being lost to follow-up.
Describe how censored data should be coded.
Censored data should be coded as NA when you don't have the data. But you should also add a new column to your tidy data called, "VariableNameCensored" which should have values of TRUE if censored and FALSE if not.
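A minimal sketch of this coding scheme, assuming a hypothetical `Concentration` variable whose raw readings mark below-detection values with a leading `<`. Both the variable name and that raw convention are assumptions made for the example.

```python
# Hypothetical raw readings; "<0.5" marks a value below the detection limit.
raw_readings = ["1.2", "<0.5", "0.9"]

tidy_rows = []
for reading in raw_readings:
    censored = reading.startswith("<")
    tidy_rows.append({
        # The value itself is NA when it was not observed.
        "Concentration": "NA" if censored else float(reading),
        # Companion column following the VariableNameCensored pattern.
        "ConcentrationCensored": censored,
    })

for row in tidy_rows:
    print(row)
```

Keeping the flag in its own column lets the analyst distinguish a censored NA from a value that is missing for unknown reasons.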
Regarding censored data or variables, how should it be reported in the code book?
In the code book you should explain why those values are missing. It is absolutely critical to report to the analyst if there is a reason you know about that some of the data are missing. You should also not impute/make up/throw away missing observations.
What is a good rule of thumb regarding categorical and ordinal variables?
In general, try to avoid coding categorical or ordinal variables as numbers. When you enter the value for sex in the tidy data, it should be "male" or "female". The ordinal values in the data set should be "poor", "fair", and "good", not 1, 2, 3.
Why is it a good idea in general, to avoid coding categorical or ordinal variables as numbers?
This will avoid potential mix-ups about which direction effects go and will help identify coding errors.
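The sketch below shows how text levels can help catch coding errors, assuming a hypothetical survey column and an invented `check_levels` helper. The level order is documented once in code rather than stored as numbers in the data.

```python
ORDINAL_LEVELS = ["poor", "fair", "good"]   # documented order, lowest first

def check_levels(values, levels=ORDINAL_LEVELS):
    """Return the entries that are not one of the documented text levels."""
    return [v for v in values if v not in levels]

responses = ["good", "fair", "3", "poor", "goood"]   # hypothetical survey column
print(check_levels(responses))                       # ['3', 'goood']

# Sort by the documented order without putting numbers in the data itself.
valid = [r for r in responses if r in ORDINAL_LEVELS]
print(sorted(valid, key=ORDINAL_LEVELS.index))       # ['poor', 'fair', 'good']
```

A stray "3" stands out immediately among text labels, whereas in a numeric coding it would be indistinguishable from a legitimate value.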
Why is it always important to encode every piece of information about your observations using text?
For example, if you are storing data in Excel and use a form of colored text or cell background formatting to indicate information about an observation ("red variable entries were observed in experiment 1.") then this information will not be exported (and will be lost!) when the data is exported as raw text. Every piece of data should be encoded as actual text that can be exported.
True or False. Reproducibility is a big deal in computational science.
True
What does reproducibility mean in the context of computational science?
This means, when you submit your paper, the reviewers and the rest of the world should be able to exactly replicate the analyses from raw data all the way to final results.
What must be done to data before it is considered tidy? Why would you want to do this?
You will need to perform some summarization/data analysis steps before the data can be considered tidy. It can be inefficient to work with raw data. It is better to work with tidy data for data analysis and save time.
What is the ideal thing to do when performing summarization of data?
The ideal thing for you to do when performing summarization is to create a computer script that takes the raw data as input and produces the tidy data you are sharing as output.
List 2 scripting languages you can use to write a script that summarizes raw data into the tidy data you are sharing as output.
1) Python
2) R
How can you test the effectiveness of computer scripts used to summarize raw data into tidy data you are sharing as output?
You can try running your script a couple of times and see if the code produces the same output.
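A rough sketch of that check, assuming a tiny hypothetical raw file and an invented `tidy` function that averages readings per sample. Comparing hashes of two runs is just one convenient way to confirm the outputs are identical.

```python
import csv
import hashlib
import io

RAW = "sample,reading\nA,1.5\nB,2.0\nA,2.5\n"   # hypothetical raw file contents

def tidy(raw_text):
    """Turn raw CSV text into tidy CSV text: mean reading per sample."""
    readings = {}
    for row in csv.DictReader(io.StringIO(raw_text)):
        readings.setdefault(row["sample"], []).append(float(row["reading"]))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["sample", "mean_reading"])
    for sample in sorted(readings):            # sorted for a stable row order
        values = readings[sample]
        writer.writerow([sample, sum(values) / len(values)])
    return out.getvalue()

def digest(text):
    """SHA-256 fingerprint of the output, used to compare runs."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

first, second = tidy(RAW), tidy(RAW)
print(digest(first) == digest(second))   # True when the script is deterministic
```

Matching output across runs shows the script is repeatable, but not that its logic is correct, so spot checks against the raw data are still worthwhile.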
In many cases, why would the person who collected the data (the scientist) have an incentive to make it tidy for a statistician?
In many cases, the person who collected the data has incentive to make it tidy for a statistician to speed the process of collaboration. When you turn over a properly tidied data set it dramatically decreases the workload on the statistician. So hopefully they will get back to you much sooner.
Describe what the person who collected the data (the scientist) can do if they do not know how to code in scripting language in order to make the data tidy.
In that case, they should provide the statistician something called pseudocode.
Describe the 3 steps of pseudocode or instruction list you (the scientist) can provide the statistician if you do not know how to code in scripting language.
Step 1 - take the raw file, run version 3.1.2 of summarize software with parameters a=1, b=2, c=3
Step 2 - run the software separately for each sample.
Step 3 - take column three of outputfile.out for each sample and that is the corresponding row in the output data set.
List 3 additional steps after the computer script or pseudocode/instruction list you should include in the reproducibility instructions you provide to exactly replicate the analyses from raw data all the way to final results.
1) Include information about which system (Mac/Windows/Linux) you used the software on.
2) Whether you tried it more than once to confirm it gave the same results.
3) Run this by a fellow student/labmate to confirm that they can obtain the same output file you did.
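If the analysis was run with Python, the system details can be captured automatically rather than typed by hand. The `environment_report` helper below is a hypothetical name; the `platform` and `sys` calls are from the standard library.

```python
import platform
import sys

def environment_report():
    """Collect basic details about the machine and interpreter used."""
    return {
        "system": platform.system(),          # e.g. 'Darwin', 'Windows', 'Linux'
        "release": platform.release(),
        "python_version": platform.python_version(),
        "executable": sys.executable,
    }

for key, value in environment_report().items():
    print(f"{key}: {value}")
```

Saving this output next to the tidy data gives a labmate the information needed to compare their setup with yours.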
List the 4 things a statistician will do when they receive data that was tidied from a raw data set from a scientist.
Most careful statisticians will:
1) Check your recipe.
2) Ask questions about steps you performed.
3) Try to confirm that they can obtain the same tidy data that you did.
4) At a minimum, conduct spot checks.
List the 3 things the statistician can provide to the scientist they partner with.
1. An analysis script that performs each of the analyses (not just instructions).
2. The exact computer code they used to run the analysis.
3. All output files/figures they generated.
Why is the information that the statistician provides important to your results?
This is the information you will use in the supplement to establish reproducibility and precision of your results.
True or False. Each of the steps in the analysis should be clearly explained and you should ask questions when you don't understand what the analyst or statistician did. It is the responsibility of both the statistician and the scientist to understand the statistical analysis.
True
True or False. You (the scientist) may not be able to perform the exact analyses without the statistician's code, but you (the scientist) should be able to explain why the statistician performed each step to a labmate/your principal investigator.
True
What is a p-value?
A p-value is the probability of observing results at least as extreme as yours if there were truly no effect (that is, if chance alone were at work). It is one of the many things often reported in experiments.
What does it mean when you have a p-value of less than 0.05?
When your p-value is less than 0.05 (in other words, differences at least as large as the ones you saw would arise by chance alone less than 5 percent of the time), a result is conventionally considered significant.
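To make the idea concrete, here is a sketch of one way a p-value can be computed: an exact permutation test on two small groups. The measurements are hypothetical, and the permutation test is a technique chosen for illustration, not one named in the original material.

```python
from itertools import combinations
from statistics import mean

treatment = [5.1, 5.8, 6.2, 6.0]   # hypothetical measurements
control = [4.2, 4.9, 4.5, 5.0]

def permutation_p_value(a, b):
    """Two-sided exact permutation test on the difference in group means."""
    pooled = a + b
    observed = abs(mean(a) - mean(b))
    extreme = total = 0
    # Try every way of relabeling the pooled values into two groups.
    for idx in combinations(range(len(pooled)), len(a)):
        group_a = [pooled[i] for i in idx]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in idx]
        total += 1
        # Small tolerance guards against floating-point round-off.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

p = permutation_p_value(treatment, control)
print(round(p, 4))   # 0.0286
```

Only 2 of the 70 possible relabelings produce a difference as large as the observed one, so the p-value is 2/70, which falls below the 0.05 threshold.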
What precautions must be taken with p-values? What is p-hacking?
What you need to look out for is when you manipulate p-values towards your own end. If you do 20 tests, by chance, you would expect one of the twenty (5%) to be significant. In the age of big data, testing twenty hypotheses is a very easy proposition. And this is where the term p-hacking comes from: This is when you exhaustively search a data set to find patterns and correlations that appear statistically significant by virtue of the sheer number of tests you have performed. These spurious correlations can be reported as significant and if you perform enough tests, you can find a data set and analysis that will show you what you wanted to see.
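The arithmetic behind the twenty-tests example can be sketched as follows, assuming the tests are independent and that no real effect exists in any of them. The function names are illustrative.

```python
ALPHA = 0.05   # significance threshold

def expected_false_positives(n_tests, alpha=ALPHA):
    """Expected number of significant results when no effect is real."""
    return n_tests * alpha

def prob_at_least_one(n_tests, alpha=ALPHA):
    """Chance of one or more false positives across independent tests."""
    return 1 - (1 - alpha) ** n_tests

print(expected_false_positives(20))        # 1.0
print(round(prob_at_least_one(20), 3))     # 0.642
```

So with twenty tests you expect one spurious hit on average, and there is roughly a 64 percent chance of seeing at least one.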
What is big data?
Big data are very large data sets.
What three qualities are commonly attributed to big data sets?
volume, velocity, variety
True or False. Big data involves large data sets of diverse data types that are being generated very rapidly.
True
Give 3 reasons why the concept of big data has been so recently popularized.
1) As technology and data storage have evolved to be able to hold larger and larger data sets, the definition of "big" has evolved too.
2) Our ability to collect and record data has improved with time such that the speed with which data is collected is unprecedented.
3) What is considered "data" has evolved, so that there is now more than ever - companies have recognized the benefits to collecting different sorts of information, and the rise of the internet and technology have allowed different and varied data sets to be more easily collected and available for analysis.
What is one of the main shifts in data science?
One of the main shifts in data science has been moving from structured data sets to tackling unstructured data.
What is structured data?
Structured data is what you traditionally might think of as data: long tables, spreadsheets, or databases with columns and rows of information that you can sum or average or analyse however you like within those confines.
True or False. Most of the data presented to you these days is structured data.
False. Unfortunately, structured data is rarely how data is presented to you in this day and age. The data sets we commonly encounter are much messier, and it is our job to extract the information we want and corral it into something tidy and structured.
Describe the rise of the availability of large amounts of unstructured data.
With the digital age and the advance of the internet, many pieces of information that weren't traditionally collected were suddenly able to be translated into a format that a computer could record, store, search, and analyze. And once this was appreciated, there was a proliferation of this unstructured data being collected from all of our digital interactions. The amount of data and the various sources that can record and transmit data has exploded.
Provide some examples of unstructured data being collected from all of our digital interactions.
emails, Facebook and other social media interactions, text messages, shopping habits, smartphones (and their GPS tracking), websites you visit, how long you are on that website and what you look at, CCTV cameras and other video sources, etc.
List 8 unstructured data types or examples of unstructured data sources.
1) Text files and documents
2) Websites and applications
3) Sensor data
4) Image files
5) Audio files
6) Video files
7) Email data
8) Social media data
Explain the issues with large unstructured data sources.
These unstructured data sets are now so large and complex that we need new tools and approaches to make the most of them. As you can guess given the variety of data types and sources, very rarely is the data stored in a neat, ordered spreadsheet that traditional methods for cleaning and analysis can be applied to!
List the 4 big challenges of working with big data.
1) It is big.
2) It is constantly changing and updating.
3) The variety can be overwhelming.
4) Big data is messy.
What is the issue with the size of big data?
There is a lot of raw data that you need to be able to store and analyse.
What is the issue with big data constantly changing and updating?
By the time you finish your analysis, there is even more new data you could incorporate into your analysis! Every second you are analyzing is another second of data you haven't used!
What is the issue with the overwhelming variety of big data?
There are so many sources of information that it can sometimes be difficult to determine what source of data may be best suited to answer your data science question!
What is the issue with big data being messy?
You don't have neat data tables to quickly analyze - you have messy data. Before you can start looking for answers, you need to turn your unstructured data into a format that you can analyze!
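As a small illustration of turning unstructured text into something tabular, the sketch below pulls fields out of free-text messages with a regular expression. The messages and their format are invented for the example, and real sources are usually far messier.

```python
import re

# Hypothetical free-text messages; the format is invented for illustration.
messages = [
    "2021-03-04 order #1001 shipped to Berlin",
    "call me later",
    "2021-03-05 order #1002 shipped to Oslo",
]

PATTERN = re.compile(r"(\d{4}-\d{2}-\d{2}) order #(\d+) shipped to (\w+)")

rows = []
for text in messages:
    match = PATTERN.search(text)
    if match:                       # messages that do not fit are skipped here
        date, order_id, city = match.groups()
        rows.append({"date": date, "order_id": int(order_id), "city": city})

print(rows)
```

Each extracted row has one value per variable, so the result can be handled like any other structured table.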
List 4 reasons why you would want to work with big data and not just stick to analyzing smaller, more manageable, curated datasets and arriving at your answers that way?
1) The sheer volume of big (and messy) data negates the effect of small errors.
2) Constantly updating data allows analyses that are accurate to the current state and on-the-spot, rapid, informed predictions and decisions.
3) Questions that previously were inaccessible now have newer, unconventional data sources that may allow you to answer these formerly unfeasible questions.
4) It can identify hidden correlations.
Explain how the sheer volume of big data can negate the effect of the small errors of messy data.
Sometimes questions are best addressed using these smaller datasets, but many questions benefit from having lots and lots of data, and if there is some messiness or inaccuracies in this data, the sheer volume of big data negates the effect of these small errors. So we are able to get closer to the truth even with these messier datasets.
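One way to see why volume helps is the standard error of a mean, which shrinks with the square root of the number of measurements. The sketch assumes the errors are random, independent, and unbiased; a systematic error would not average out this way.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean for n independent measurements."""
    return sigma / math.sqrt(n)

# With a per-measurement noise level of 1.0 (hypothetical units):
for n in (100, 10_000, 1_000_000):
    print(n, standard_error(1.0, n))   # 0.1, 0.01, 0.001
```

Going from a hundred to a million measurements cuts the uncertainty in the average by a factor of one hundred.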
Explain how analyses that are accurate to the current state and make on the spot, rapid, informed predictions and decisions support the use of big data.
When you have data that is constantly updating, while this can be a challenge to analyze, the ability to have real time, up to date information allows you to do analyses that are accurate to the current state and make on the spot, rapid, informed predictions and decisions.
Explain how big data answers questions that previously were inaccessible and now have newer, unconventional data sources that may allow you to answer these formerly unfeasible questions.
One of the benefits of having all these new sources of information is that questions that weren't previously able to be answered due to lack of information, suddenly have many more sources to glean information from and new connections and discoveries are now able to be made!
Explain how big data can identify hidden correlations.
Since we can collect data on a myriad of qualities on any one subject, we can look for qualities that may not be obviously related to our outcome variable, but the big data can identify a correlation there - instead of trying to understand precisely why an engine breaks down or why a drug's side effect disappears, researchers can instead collect and analyze massive quantities of information about such events and everything that is associated with them, looking for patterns that might help predict future occurrences. Big data helps answer what, not why, and often that's good enough.
True or False. Big data has now made it possible to collect vast amounts of data, very rapidly, from a variety of sources (and improvements in technology have made it cheaper to collect, store and analyse).
True
True or False. As long as you have big data, you can answer any question.
False. Regardless of the size of the data, you need the right data to answer a question. Essentially, any given data set may not be suited for your question, even if you really wanted it to; and big data does not fix this. Even the largest data sets around might not be big enough to be able to answer your question if it's not the right data.
True or False. Data science is question driven science and even the largest of data sets may not be appropriate for your case.
True