CAB401 Exam revision



What are the stages of the Fetch, Decode, Execute Cycle?
- Next Instruction (PC++) from memory to Instruction Register (IR)
- 3 types of operations:
-- Load (memory -> register), Store (register -> memory)
-- Arithmetic (reg + reg -> reg)
-- (conditional) branch (address -> PC)
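The cycle above can be sketched as a toy interpreter. The instruction format (tuples such as `("LOAD", reg, addr)`), the opcode names and the `run` function are invented for illustration and do not correspond to any real instruction set.

```python
def run(program, memory):
    """Toy fetch-decode-execute loop over a hypothetical instruction set."""
    regs = [0] * 4   # general-purpose registers
    pc = 0           # program counter
    while pc < len(program):
        ir = program[pc]      # fetch: next instruction into the instruction register
        pc += 1               # PC++
        op = ir[0]            # decode
        if op == "LOAD":      # memory -> register
            _, r, addr = ir
            regs[r] = memory[addr]
        elif op == "STORE":   # register -> memory
            _, r, addr = ir
            memory[addr] = regs[r]
        elif op == "ADD":     # reg + reg -> reg
            _, dst, a, b = ir
            regs[dst] = regs[a] + regs[b]
        elif op == "BNZ":     # conditional branch: address -> PC
            _, r, target = ir
            if regs[r] != 0:
                pc = target
        else:
            raise ValueError(f"unknown opcode {op!r}")
    return memory


program = [
    ("LOAD", 0, 0),    # r0 = mem[0]
    ("LOAD", 1, 1),    # r1 = mem[1]
    ("ADD", 2, 0, 1),  # r2 = r0 + r1
    ("STORE", 2, 2),   # mem[2] = r2
]
result = run(program, [2, 3, 0])
print(result)
```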
What could overclocking cause?
- Incorrect results can be computed
- The system may overheat.
What is the Von Neumann Bottleneck?
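A limitation on throughput caused by the standard (Von Neumann) computer architecture: instructions and data share the same path between CPU and memory, so the processor often has to wait on memory.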
Cache placement policies (associativity)
Fully associative - a memory block can be stored anywhere in the cache
Direct mapped - a memory block can be stored in only one place in the cache
N-way set associative - a memory block can be stored in one of N places
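A minimal sketch of the three schemes, assuming the common convention that the set is chosen by taking the block number modulo the number of sets and that the lines of a set are numbered contiguously. The function name `candidate_lines` is hypothetical.

```python
def candidate_lines(block_number, num_lines, ways):
    """Return the cache line indices where a memory block may be placed.

    ways == 1          -> direct mapped
    ways == num_lines  -> fully associative
    otherwise          -> N-way set associative
    """
    if num_lines % ways != 0:
        raise ValueError("num_lines must be a multiple of ways")
    num_sets = num_lines // ways
    set_index = block_number % num_sets   # assumed modulo mapping
    first = set_index * ways
    return list(range(first, first + ways))


print(candidate_lines(13, 8, 1))   # direct mapped: exactly one candidate
print(candidate_lines(13, 8, 2))   # 2-way: two candidates
print(candidate_lines(13, 8, 8))   # fully associative: any line
```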
Superscalar processors
Execute more than one instruction at the same time (ILP)
Have multiple functional units (ALU, FPU, SSE)
Typically instruction execution is pipelined
Hardware needs to determine dependencies between instructions to determine which instructions can be overlapped.
To keep pipeline full, need to fetch future instructions.
What is Moore's law?
The number of transistors on a chip will double approximately every two years.
What are the consequences of Moore's law?
As transistors and circuits get smaller, clock rate can be increased, resulting in faster processors.
Processor speeds have doubled every 2 years.
Not just CPUs have benefited, also camera megapixels and memory capacity.
Capacity has increased while cost has decreased or stayed the same.
Linked to pace of recent human development.
Possible solutions to the limits of miniaturisation?
3D chips, new materials, quantum computing
How can miniaturisation be exploited?
Allocate more of the chip to caching.
Create more complex instruction execution infrastructure.
Place more than one CPU on each chip.
How can cores on a multi-core chip communicate with each other?
via message passing (network on chip)
via shared memory
Cache coherence
when one core writes to its private cache, copies of that cache line held in other cores' private caches need to be invalidated
caches can be inclusive or exclusive
What is snooping?
Each cache monitors (snoops on) the memory traffic of the other cores to determine when its own cache lines need to be invalidated.
What is PC?
Program counter
What is IR?
Instruction Register
What is ALU?
Arithmetic Logic Unit
What is FPU?
Floating Point Unit
What is SSE?
Streaming SIMD Extensions
What is SIMD?
Single instruction, multiple data
What is Instruction level parallelism?
Multiple instructions can execute in each cycle
What is vector parallelism?
vector instructions operate on a list of data values
What is TLP?
Thread Level Parallelism
Explain TLP
Multiple threads execute at the same time
What is shared memory?
Threads (within one process on the same machine) communicate via shared memory.
What is Distributed memory?
Processes (on different machines) communicate via message passing.
Distributed memory programs contain explicit operations to send and receive messages.
What is Distributed shared memory?
Provide a shared memory programming model for distributed memory machines.
What is SISD?
Single Instruction, Single Data
What is MISD?
Multiple Instructions, Single Data
What is MIMD?
Multiple Instructions, Multiple Data
Explain SISD
Von Neumann sequential computer.
One thread of execution performing scalar operations.
Explain SIMD
Includes vector and stream processors (superscalar execution is usually classed as instruction level parallelism rather than SIMD)
Explain MISD
Includes fault tolerant redundant systems
Arguably includes pipelined processors.
Explain MIMD
Includes most thread and process level parallelism
Includes both shared and distributed memory systems
Name types of supercomputers
Vector Processors
Symmetric Multi-Processor (SMP)
Massively Parallel Processors (MPP)
Cluster ("Beowulf")
Asymmetric Multi-Processor (AMP)
Cycle Stealing Systems
What is inherent parallelism?
Often we don't need to fundamentally change the algorithm in order to parallelise it.
Even if the algorithm isn't expressed in an explicitly parallel fashion, we can through analysis determine computational steps within the algorithm which can be safely performed in parallel.
We call this exploiting Inherent Parallelism.
What does safe parallelism mean?
To preserve all control and data dependencies in the original program
What is a control dependency?
Where one statement can affect whether some other statement actually executes.
What is a data dependency?
Where one statement refers to the same data as some other statement.
What is a Flow (or true) Dependence?
(W -> R) One statement reads a value written by an earlier statement.
What is an Output Dependence?
(W -> W) One statement overwrites a value written by an earlier statement
What is an Anti Dependency?
(R -> W) One statement reads a value before it is overwritten by a later statement
What is Input Dependence?
(R -> R) One statement reads a value also read by an earlier statement (they don't need to be preserved)
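The four kinds of data dependence can be sketched as set intersections over what each statement reads and writes. This is a simplification under stated assumptions: the `classify` helper and the dict layout are hypothetical, and the check ignores any statements in between that might overwrite the variable.

```python
def classify(first, second):
    """Classify dependences from an earlier statement to a later one.

    Each statement is a dict with 'reads' and 'writes' sets of variable names.
    """
    deps = set()
    if first["writes"] & second["reads"]:
        deps.add("flow")      # W -> R
    if first["writes"] & second["writes"]:
        deps.add("output")    # W -> W
    if first["reads"] & second["writes"]:
        deps.add("anti")      # R -> W
    if first["reads"] & second["reads"]:
        deps.add("input")     # R -> R (need not be preserved)
    return deps


s1 = {"reads": {"b", "c"}, "writes": {"a"}}   # a = b + c
s2 = {"reads": {"a"}, "writes": {"b"}}        # b = a * 2
print(classify(s1, s2))
```

Here `s2` reads the `a` written by `s1` (flow) and overwrites the `b` that `s1` read (anti), so the two statements cannot safely be reordered or run in parallel.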
What are the limitations of Static Analysis?
Any form of static dependence analysis is going to be inexact in general.
- Alias analysis is undecidable
- Data dependence analysis is undecidable.
- Especially difficult to perform accurately inter-procedurally.
If in doubt must assume that a dependency might exist.
If a dependence might exist then we must not parallelise.
Safe, but conservative
- May not find all of the available parallelism
What is Automatic Parallelization?
When a compiler or tool identifies and implements parallel parts of code.
Name the logic steps of parallelization
Determine what could be safely parallelized
Decide if parallelization will increase performance
Transform program into an explicitly parallel form
What is the overhead in a parallel program?
Creating and managing threads
Synchronising parallel computations
Latency of messages sent between processors.
How do you ensure there will be a speedup?
The amount of computation in each "work unit" needs to be sufficiently large to offset this overhead.
What is coarse grained parallelism?
Data is communicated infrequently, after larger amounts of computation.
What is fine grained parallelism?
Fine-grained parallelism means individual tasks are relatively small in terms of code size and execution time. The data is transferred among processors frequently in amounts of one or a few memory words
What is speedup?
The execution time of the best sequential program divided by the execution time of the parallel program.
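As a worked example of this definition, combined with the earlier point that work units must be large enough to offset overhead. The cost model in `modelled_parallel_time` (work divided perfectly, plus a fixed overhead) is a made-up simplification for illustration, not a measured result.

```python
def speedup(t_sequential, t_parallel):
    """Speedup = best sequential time / parallel time."""
    return t_sequential / t_parallel


def modelled_parallel_time(work, processors, overhead):
    """Hypothetical model: perfectly divided work plus a fixed overhead."""
    return work / processors + overhead


# Large work unit: the overhead is amortised and we get a real speedup.
large = speedup(100, modelled_parallel_time(100, 4, 5))
# Small work unit: the overhead dominates and the "parallel" version is slower.
small = speedup(4, modelled_parallel_time(4, 4, 5))
print(large > 1, small < 1)
```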
What is scalable parallelism?
Scalable parallelism is where the speedup continues to increase as the number of processors increases.
If we execute each iteration of a loop that performs N iterations in parallel then it can potentially execute N times faster (if we had that many processors).
Ideally the amount of parallelism increases with the size of the problem.
What is the definition of NP?
Problems whose solutions can be checked easily (in polynomial time), even if no easy way of solving them is known.
What does NP stand for?
Non-deterministic polynomial time.
What is the definition of NP-Complete?
The intersection of NP and NP-Hard: problems that are at least as hard as every other problem in NP, but whose solutions can still be verified in polynomial time.
What are the steps of the parallelization methodology?
Obtain representative and realistic data sets
Time and profile sequential version
View source and understand high-level structure
Analyze dependencies
Determine sections that could be parallelized
Decide what parallelism might be worth exploiting
Consider restructuring program or replacing algorithms to expose more parallelism
Transform program into an explicitly parallel form.
Test and Debug parallel version
Time and profile parallel version
Determine issues inhibiting greater performance
What is the Heisenberg Uncertainty Principle?
Any of a variety of mathematical inequalities asserting a fundamental limit to the precision with which certain pairs of physical properties of a particle (complementary variables, such as position x and momentum p) can be known.
What is the Pareto Principle?
The principle states that 20% of the invested input is responsible for 80% of the results obtained.
What is premature optimization?
Programmers wasting time trying to optimise code before identifying whether the code they are improving is performance-critical. This can be avoided by analysing the complete program with profiling tools.
What is Communicating Sequential Processes (CSP)?
Processes operate independently, and interact with each other solely through message-passing communication
Programming model used by most distributed memory machines
Need to map computation and data to processors
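A minimal sketch of the message-passing style. As an assumption for the sake of a self-contained example, two threads with `queue.Queue` channels stand in for processes on separate machines; the worker shares no variables with the sender and interacts only through `put` (send) and `get` (receive).

```python
import queue
import threading


def summing_worker(inbox, outbox):
    """Receive numbers until a None sentinel arrives, then send back their sum."""
    total = 0
    while True:
        msg = inbox.get()      # receive a message
        if msg is None:        # sentinel: no more work
            break
        total += msg
    outbox.put(total)          # send the result back


inbox, outbox = queue.Queue(), queue.Queue()
worker_thread = threading.Thread(target=summing_worker, args=(inbox, outbox))
worker_thread.start()
for value in [1, 2, 3, 4]:
    inbox.put(value)           # send data to the worker
inbox.put(None)
csp_result = outbox.get()      # wait for the reply
worker_thread.join()
print(csp_result)
```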
What is Parallel Random Access Machine (PRAM)?
Processes operate asynchronously and have constant-time access to shared memory
Programming model used for Thread Level Parallelism (TLP)
Only need to map computation to processors.
What are explicit threads?
Threads that are manually created and managed by the programmer. It is easy to see the flow and control of each thread in the code.
What are implicit threads?
Threads that are created by a high level library, which encapsulates details such as how many threads are created and the flow and control of those threads.
What resources does a thread have?
Stack and registers (including Program counter)
All threads access the same shared memory
Thread local memory can also be allocated
What does the scheduler do?
Maps threads to processors.
What can programmers assign to threads?
a relative priority
an affinity mask to run only on certain processors
What are some scheduling algorithms?
pre-emptive, cooperative, round-robin, priority-based, fair
What is Processor Affinity?
A mask that can be applied to a process or thread to define what processors it can execute on.
What is a Thread Pool?
A thread pool is a collection of worker threads that efficiently execute asynchronous callbacks on behalf of the application. The thread pool is primarily used to reduce the number of application threads and provide management of the worker threads.
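For illustration, Python's standard `concurrent.futures.ThreadPoolExecutor` provides such a pool. The `square` task and the pool size of 4 are arbitrary choices for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def square(n):
    return n * n


# Four reusable worker threads service eight tasks; map() returns results
# in the order the inputs were submitted.
with ThreadPoolExecutor(max_workers=4) as pool:
    squares = list(pool.map(square, range(8)))
print(squares)
```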
What is load balancing?
Distributing work across processors so that none sits idle while others are still busy. There are different ways of allocating work to processors.
What is the goal of load balancing?
One goal is to equalise the work given to each processor.
What are some work balancing strategies?
Dynamic work allocation
- Good if execution times are unpredictable, but has higher overhead
- Over decomposition (create more work units than processors)
- Work pools (create a queue of tasks awaiting execution)
-- Can be centralized or decentralized.
-- Work stealing (steal from another processor's work queue)
Static work allocation
- Block or Cyclic
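A sketch of dynamic allocation with a centralized work pool and over decomposition (more tasks than workers). The helper name `run_work_pool` and the squaring task are placeholders; each worker repeatedly takes the next task from the shared queue until it is empty.

```python
import queue
import threading


def run_work_pool(tasks, num_workers):
    """Process tasks from a shared queue; return {worker_id: [results]}."""
    work = queue.Queue()
    for task in tasks:
        work.put(task)
    # Each worker appends only to its own list, so no extra locking is needed.
    results = {w: [] for w in range(num_workers)}

    def worker(worker_id):
        while True:
            try:
                task = work.get_nowait()   # grab the next available task
            except queue.Empty:
                return                     # pool drained: this worker is done
            results[worker_id].append(task * task)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


pool_results = run_work_pool(list(range(20)), 4)
```

Which worker ends up with which task varies from run to run; only the combined set of results is fixed.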
Difference between block and cyclic allocation?
Block allocation divides the iterations into contiguous chunks before the loop starts, one chunk per thread (with N iterations and P threads, each thread gets roughly N/P consecutive iterations).

Cyclic allocation deals the iterations out round-robin: thread t executes iterations t, t + P, t + 2P, and so on. Both schemes use the same, fixed number of threads.
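A sketch of the two static schemes, assuming the usual definitions: block gives each thread one contiguous chunk of iterations, cyclic deals iterations out round-robin. The function names are hypothetical.

```python
def block_allocation(n, num_threads):
    """Contiguous chunks; the first n % num_threads threads get one extra iteration."""
    base, extra = divmod(n, num_threads)
    chunks, start = [], 0
    for t in range(num_threads):
        size = base + (1 if t < extra else 0)
        chunks.append(list(range(start, start + size)))
        start += size
    return chunks


def cyclic_allocation(n, num_threads):
    """Thread t gets iterations t, t + num_threads, t + 2 * num_threads, ..."""
    return [list(range(t, n, num_threads)) for t in range(num_threads)]


print(block_allocation(10, 3))
print(cyclic_allocation(10, 3))
```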
What is the definition of a race condition?
A Race condition is said to exist if the behavior of the program depends on the non-deterministic scheduling and execution times of its threads.
What is Mutual Exclusion?
In computer science, mutual exclusion is a property of concurrency control, which is instituted for the purpose of preventing race conditions; it is the requirement that one thread of execution never enter its critical section at the same time that another concurrent thread of execution enters its own critical section.
What are the desired properties of mutual exclusion?
Safety: No more than one thread can be in a critical section at any time.
Liveness: A thread that is seeking to enter the critical section will eventually succeed
Fairness: If two threads are both trying to enter a critical section, they have equal chances of success
What are Mutual Exclusion Abstractions?
Locks (mutexes)
a binary flag (lock) used to protect a shared resource by ensuring mutual exclusion inside critical sections of code.
Semaphores
used to protect several equivalent resources
also used for signaling events between tasks.
Keeps a count of how many resources are available.
Supports two operations: P (try to reduce) and V (increase)
Monitors
monitors are objects implemented so that only one thread can be executing any of their methods at any given point in time.
Condition Variables
wait for a condition and signal when it becomes true.
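Two of these abstractions sketched with Python's standard `threading` module. `SafeCounter` and `peak_concurrency` are hypothetical names; the first uses a lock around a critical section, the second uses a counting semaphore to guard a fixed number of equivalent resources.

```python
import threading
import time


class SafeCounter:
    """Counter whose increment is a critical section protected by a lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:            # only one thread at a time past this point
            self.value += 1


def peak_concurrency(num_threads, num_resources):
    """Return the largest number of threads holding a resource at once."""
    slots = threading.Semaphore(num_resources)   # count of free resources
    state = {"in_use": 0, "peak": 0}
    state_lock = threading.Lock()

    def use_resource():
        with slots:                 # P on entry, V on exit
            with state_lock:
                state["in_use"] += 1
                state["peak"] = max(state["peak"], state["in_use"])
            time.sleep(0.01)        # pretend to use the resource
            with state_lock:
                state["in_use"] -= 1

    threads = [threading.Thread(target=use_resource) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return state["peak"]


counter = SafeCounter()
workers = [
    threading.Thread(target=lambda: [counter.increment() for _ in range(10_000)])
    for _ in range(4)
]
for t in workers:
    t.start()
for t in workers:
    t.join()
print(counter.value)
```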
What is sequential consistency?
All memory operations appear to execute one at a time, and
The operations of a single processor appear to execute in the order described by that processor's program
What are the 4 necessary conditions for a deadlock?
Mutual exclusion - non sharable resources
Hold and wait - acquire one resource then wait for another
No preemption - resources cannot be forcibly acquired
Circular wait - circular chain of threads each waiting for the next
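Breaking any one of the four conditions prevents deadlock. One common approach, sketched here, is to remove circular wait by always acquiring locks in a fixed global order. The `Account` class and `transfer` function are illustrative, and the sketch assumes the two accounts are distinct.

```python
import threading


class Account:
    def __init__(self, account_id, balance):
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Move money between two distinct accounts without risking deadlock."""
    # Always lock the account with the lower id first, regardless of the
    # direction of the transfer, so no circular chain of waiters can form.
    first, second = sorted((src, dst), key=lambda acc: acc.account_id)
    with first.lock:
        with second.lock:
            src.balance -= amount
            dst.balance += amount


a, b = Account(1, 100), Account(2, 100)
transfers = [threading.Thread(target=transfer, args=(a, b, 1)) for _ in range(50)]
transfers += [threading.Thread(target=transfer, args=(b, a, 1)) for _ in range(50)]
for t in transfers:
    t.start()
for t in transfers:
    t.join()
print(a.balance, b.balance)
```

If each thread instead locked `src` first and `dst` second, a transfer from `a` to `b` and a simultaneous transfer from `b` to `a` could each hold one lock while waiting for the other.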
What is Livelock?
A livelock is similar to a deadlock, except that the states of the processes involved in the livelock constantly change with regard to one another, none progressing. Livelock is a special case of resource starvation; the general definition only states that a specific process is not progressing.

A real-world example of livelock occurs when two people meet in a narrow corridor, and each tries to be polite by moving aside to let the other pass, but they end up swaying from side to side without making any progress because they both repeatedly move the same way at the same time.
What is starvation?
Starvation occurs when a thread is not blocked or livelocked but fails to make progress because it doesn't get scheduled: other threads are consistently scheduled ahead of it.
What is barrier synchronisation?
A barrier is a type of synchronisation method. A barrier for a group of threads or processes in the source code means that every thread/process must stop at this point and cannot proceed until all other threads/processes have reached the barrier.
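A small sketch using Python's standard `threading.Barrier`. The two-phase computation in `two_phase` is invented for illustration: every thread writes its own slot, waits at the barrier, and only then reads all the slots.

```python
import threading


def two_phase(num_threads):
    barrier = threading.Barrier(num_threads)
    partial = [0] * num_threads
    totals = [0] * num_threads

    def worker(i):
        partial[i] = i + 1          # phase 1: each thread writes its own slot
        barrier.wait()              # nobody continues until all slots are written
        totals[i] = sum(partial)    # phase 2: now safe to read every slot

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals


print(two_phase(4))
```

Without the barrier, a fast thread could sum `partial` before slower threads had written their slots.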
What is Temporal Locality?
when the same memory location is accessed again within a relatively short period of time.
- the line is likely to be still in the cache and so result in a cache hit.
What is Spatial Locality?
when a memory location is accessed which is located very close to another memory location that has recently been accessed.
- the other memory location is likely to have been loaded as part of the same cache line and so will likely already be in the cache.
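The effect of spatial locality can be sketched with a deliberately tiny cache model. As an assumption for illustration, the cache holds a single line of `line_size` consecutive addresses, and a 4 x 8 array is stored in row-major order.

```python
def count_misses(addresses, line_size):
    """Count misses for a hypothetical cache that holds just one line."""
    misses, current_line = 0, None
    for addr in addresses:
        line = addr // line_size
        if line != current_line:    # address is not in the cached line
            misses += 1
            current_line = line
    return misses


rows, cols, line_size = 4, 8, 4
# Address of element (r, c) in a row-major array is r * cols + c.
row_major = [r * cols + c for r in range(rows) for c in range(cols)]
col_major = [r * cols + c for c in range(cols) for r in range(rows)]

row_misses = count_misses(row_major, line_size)
col_misses = count_misses(col_major, line_size)
print(row_misses, col_misses)
```

Walking along rows touches neighbouring addresses and misses once per cache line; walking down columns jumps `cols` addresses each step and, in this model, misses on every access.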
What is False Sharing?
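When different threads frequently access different memory locations that are both part of the same cache line, and at least one of them is writing. Each write invalidates the other core's copy of the line even though no data is actually shared.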
What are methods for removing false dependencies?
- Variable renaming
- Using thread local memory
- Promoting scalar variables to arrays
- Creating copies of shared state.
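A sketch of two of these techniques applied to a sum. A single shared `total` would create a dependence between every pair of iterations, so the scalar is promoted to an array with one slot per thread and each thread accumulates into a private local variable. The function name `parallel_sum` is hypothetical, and in CPython the global interpreter lock means this shows the structure of the transformation rather than a real speedup.

```python
import threading


def parallel_sum(data, num_threads):
    partial = [0] * num_threads     # scalar 'total' promoted to one slot per thread

    def worker(t):
        local = 0                   # thread-private accumulator
        for x in data[t::num_threads]:   # this thread's share of the data
            local += x
        partial[t] = local          # single write to this thread's own slot

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return sum(partial)             # combine the per-thread results sequentially


print(parallel_sum(list(range(101)), 4))
```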
How can barriers to speedup be overcome?
1. Are overheads of parallelism killing speedup?
- look for larger (grain) sections of code to parallelize
2. Test for Load Balance
- do some processors finish before others?
- dynamic scheduling
- over decomposition
3. Test for excessive synchronization
- do threads spend much of their time waiting?
- is there excessive overhead from acquiring/releasing locks?
- lock at a different granularity
4. Apply loop transformations to expose parallelism
5. Alter data structures to improve data locality