Revature Week 3

What was the "Hadoop Explosion"?
o Unix was an OS. A large community of hackers and academics modified the Unix source to create and share distributions, but AT&T, which owned Unix, eventually closed the source.
o In response, the community started the GNU (GNU's Not Unix) Project, an attempt to recreate Unix from the ground up as open source, copyleft software. Combined with the Linux kernel, it became GNU/Linux, the Linux we know today. Ubuntu is a distribution of GNU/Linux.
ls -al: displays the contents of a directory in long form, including hidden files. Lets us see permissions.
cd: change directory
pwd: print working directory (prints out current file path you're in)
mkdir: make directory (creates a new folder)
touch: creates a new empty file, e.g. touch filename
nano: basic command line text editor
man: shows the manual page for a given command, e.g. man ls
less: displays the contents of a file one screen at a time, with scrolling
cat: prints the contents of a file to the command line
mv: moves file from source to destination
cp: copies the specified file
rm: removes file
history: shows the history of commands. history | grep [old command] shows prior usage of some command. Very handy when you forget.
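The file commands listed above can be tried together in a short session. This is only a sketch: it works inside a throwaway directory, and the names workdir, notes, todo.txt, and archive.txt are made up for illustration.

```shell
# Work in a temporary directory so nothing in the real home directory changes.
workdir=$(mktemp -d)
cd "$workdir"
pwd                                   # print the directory we are now in

mkdir notes                           # make a new folder
touch notes/todo.txt                  # create an empty file
echo "review HDFS" > notes/todo.txt   # put one line of text in it
cat notes/todo.txt                    # print the file's contents

cp notes/todo.txt notes/todo.bak      # copy the file
mv notes/todo.bak notes/archive.txt   # move (rename) the copy
ls -al notes                          # long listing, including permissions
rm notes/archive.txt                  # remove the copy again
```

After the session only notes/todo.txt remains in the temporary directory.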
Groups can contain multiple users. All users belonging to a group have the same group permissions on a file.
A user or account of a system is uniquely identified by a number called the UID (user identifier).
The root user (superuser) can access all files, while a normal user has limited access to files.
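The user, group, and permission ideas above can be inspected from the shell. The following is a minimal sketch, assuming a Linux or other POSIX system; perm_file and mode_of are hypothetical names introduced only for this example.

```shell
# Show the numeric user ID (UID) and the group names of the current account.
id -u
id -Gn

# Helper: print the 10-character type and permission string from a long listing.
mode_of() { ls -l "$1" | cut -c1-10; }

perm_file=$(mktemp)

# Numeric form: owner read+write (6), group read (4), others nothing (0).
chmod 640 "$perm_file"
mode_of "$perm_file"

# Letter form: add write for the group and read for others.
chmod g+w,o+r "$perm_file"
mode_of "$perm_file"
```

The first call to mode_of shows rw- for the owner, r-- for the group, and --- for others; the second shows the group and others columns after the letter-form change.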
How does the chmod command change file permissions?
The chmod command explicitly assigns privileges to the owner (user), the group, and others. This can be done either in octal number format, e.g. 777 for all privileges to all three classes, or in letter format, e.g. u+rwx, g+rwx, o+rwx.

What is a package manager? What package manager do we have on Ubuntu?
Package managers download applications or parts of applications for you, installing and managing dependencies automatically. On Ubuntu we'll use APT (Advanced Package Tool).

What is ssh?
SSH, also known as Secure Shell or Secure Socket Shell, is a network protocol that gives users, particularly system administrators, a secure way to access a computer over an unsecured network.

Be able to explain the significance of Mapper[LongWritable, Text, Text, IntWritable] and Reducer[Text, IntWritable, Text, IntWritable]
The Mapper class is a generic type with four formal type parameters that specify the input key, input value, output key, and output value types of the map function. The Reducer's four type parameters play the same roles for the reduce function, and its input types must match the Mapper's output types (here Text and IntWritable).
o Rather than use built-in Java types, Hadoop provides its own set of basic types that are optimized for network serialization. These are found in the org.apache.hadoop.io package.
o Here we use LongWritable, which corresponds to a Java Long, Text (like Java String), and IntWritable (like Java Integer).

What are the 3 Vs of big data (Gartner's 3 Vs)?
Volume: big data processing involves large amounts of data, at least 1TB.
Velocity: big/fast data involves processing data that is produced rapidly and may need to be processed in near-real-time.
Variety: big data involves processing data in multiple formats from multiple sources.

What are some examples of structured data? Unstructured data?
Structured data is highly organized and formatted so that it's easily searchable in relational databases (dates, phone numbers, SSNs, addresses, etc.).
Unstructured data has no predefined format or organization, making it much more difficult to collect, process, and analyze (text files, reports, images, video files, etc.).

What is a daemon?
A daemon is just a long-running process. HDFS and YARN both involve multiple different daemons running on different machines. Typically, applications running on a cluster will have one or more master daemons responsible for coordinating work and many worker daemons responsible for actually doing the work.

What is data locality and why is it important?
In Hadoop, data locality is the practice of moving the computation close to the node where the data actually resides, instead of moving large amounts of data to the computation. This minimizes network congestion and increases the overall throughput of the system.

How many blocks will a 200MB file be stored in in HDFS, if we assume the default HDFS block size for Hadoop v2+?
2 (one of 128MB, the other of 72MB).

What is the default number of replications for each block? How are these replications typically distributed across the cluster?
3. Replication information and other metadata is stored on the NameNode, and the NameNode makes all decisions about where data/replicas will be stored on the cluster. Each block of each file is replicated across the cluster.

What is rack awareness?
Rack awareness is knowledge of the network structure, i.e. the location of the different DataNodes across the Hadoop cluster. While reading/writing data in HDFS, the NameNode chooses a DataNode in the same rack or, if none is available, in a nearby rack.

What is the job of the NameNode? What about the DataNode?
NameNode: master daemon. The NN keeps the image of the distributed filesystem. It doesn't store any of the actual data in the files/directories, but it does contain the metadata for files and directories.
DataNode: worker daemon. There are many of these in your cluster, typically 1 per machine, except on the machine running the NN.
The DN stores the actual data in the filesystem and communicates its status to the NN. It runs on "commodity hardware", which just means regular servers, nothing specialized.

How many NameNodes exist on a cluster?
There is one per cluster, unless your cluster is multiple thousands of machines.

How are DataNodes fault tolerant?
For DataNodes, fault tolerance is handled by the NameNode. DNs send heartbeats to the NN, so when a DN goes down, it stops sending those heartbeats, and the NN knows to make new replicas of all the data stored on the downed DN.

How does a Standby NameNode make the NameNode fault tolerant?
This is a daemon that runs on another machine and follows the same steps as the NameNode as they occur in real time: it receives the information in the EditLog and keeps its own FSImage. Then, if the real NameNode fails, the Standby NameNode steps in and becomes the new NameNode. This is called failover. This is the best option, but requires more resources.

What purpose does a Secondary NameNode serve?
This periodically (every hour) keeps backups of the NN metadata. It isn't capable of stepping in or functioning as a replacement NameNode. It just preserves FS information in a secondary location in case of total failure of the NN, which avoids catastrophic data loss.

How might we scale a HDFS cluster past a few thousand machines?
HDFS Federation, with multiple NameNodes, can be used if you need 10000s of machines.

In a typical Hadoop cluster, what's the relationship between HDFS data nodes and YARN node managers?
There is one NodeManager per machine, the YARN worker daemon, typically on the same machines that run DataNodes. Node managers manage bundles of resources called containers running on their machine and report the status back to the RM. We submit jobs to the ResourceManager. Tasks are the individual pieces jobs are broken up into. Tasks are what run inside of containers.
The machines running DataNodes are where these map and reduce tasks execute, so tasks can run close to the data they read.

When does the combine phase run, and where does each combine task run?
o The Combiner is a partial reduction that runs before shuffle and sort, on the same node as the map task whose output it reduces.
o The output of the combiner is sent over the network to the actual reduce task as input.

Know the input and output of the shuffle + sort phase
Takes the output from the mapper and sorts it by key, grouping the values associated with each key, before passing it to the reducer, which makes the data easier to process.

What does the NodeManager do?
Node managers manage bundles of resources called containers running on their machine and report the status back to the ResourceManager.

What does the ResourceManager do?
There is one ResourceManager per cluster; the RM is the master daemon. It is responsible for providing computing resources for jobs (i.e. RAM, cores, disk).

Which responsibilities does the Scheduler have?
The Scheduler is responsible for allocating resources (containers) across the cluster based on requests.

Which responsibilities does the ApplicationsManager have?
Accepts job submissions and creates the ApplicationMaster for each submitted job. Also responsible for the fault tolerance of ApplicationMasters.

What is an ApplicationMaster? How many of them are there per job?
There is 1 per job (managed by the ApplicationsManager). ApplicationMasters run in containers on the cluster and are responsible for communicating with the Scheduler to complete their jobs. This allows the ApplicationsManager to be ultimately responsible for job completion, while offloading most of the work to ApplicationMasters running on worker nodes.

What is a Container in YARN?
Containers are bundles of resources; tasks are what run inside them. The RM makes tasks run in containers across the cluster, and the Scheduler allocates containers across the cluster based on requests. ApplicationMasters also run in containers on the cluster.

How do we interact with the distributed filesystem?
Through fs shell commands.
jar: runs a JAR file. Users can bundle their MapReduce code in a JAR file and execute it using this command.

What do the following commands do?
1) hdfs dfs -get /user/adam/myfile ~
2) hdfs dfs -put ~/coolfile /user/adam/
1) Gets a file from HDFS and puts it into our local filesystem (here, the home directory).
2) Puts a file from our local environment into HDFS (here, under /user/adam/).
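
The map, shuffle + sort, and reduce phases from the cards above can be imitated on a single machine with ordinary shell tools. This is only an analogy, not how Hadoop runs a job: there is no cluster, no HDFS, and no Writable types, and the function names map_words, shuffle_sort, reduce_counts, and word_count are made up for this sketch.

```shell
# map: emit one lowercase word per line; each line stands for a (word, 1) pair.
map_words() {
  tr -s '[:space:]' '\n' | tr '[:upper:]' '[:lower:]' | grep -v '^$'
}

# shuffle + sort: order the pairs by key so identical words sit next to each other.
shuffle_sort() {
  LC_ALL=C sort
}

# reduce: collapse each run of identical keys into "word<TAB>count".
reduce_counts() {
  uniq -c | awk '{ print $2 "\t" $1 }'
}

# The whole job is the three phases chained together.
word_count() {
  map_words | shuffle_sort | reduce_counts
}

printf 'the cat saw the dog\nThe dog ran\n' | word_count
```

Sorting before counting matters for the same reason shuffle + sort exists in Hadoop: the reduce step only works because all values for one key arrive together.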