Theoretical background: Parallel Architectures and Their Programming
Parallel computer architectures can be divided into different types. According to Flynn’s taxonomy there are four general types: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), Multiple Instructions Multiple Data (MIMD), and Multiple Instructions Single Data (MISD), a category in which pipeline computers are sometimes placed.
Most processors today are pipelined in the sense that they divide instruction execution into several stages, so that several instructions can be in progress at the same time. This increases the throughput and the utilization of the processor. A downside of pipelines is that they increase the latency, the time from when an instruction is started until it is finished. Another downside appears when the code contains many branches, meaning conditionals that lead to different instructions being executed. An instruction may already be in the pipeline when an earlier instruction finishes its last stage and determines the outcome of a conditional that precedes it. The instructions that entered the pipeline after the conditional may then turn out to be useless, because another branch should have been executed. In that case the pipeline has to be “flushed” and the process of filling it up restarts. The main advantage of a pipelined computer shows when executing a sequential stream of instructions. Then the throughput is close to or equal to one instruction per clock cycle for single-issue processors.
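The effect of pipeline fill and flushes on throughput can be illustrated with a simple cycle count. This is only a minimal sketch under stated assumptions: an idealized single-issue pipeline, one instruction entering per cycle, and a flush penalty of `stages - 1` cycles (the branch is assumed to be resolved in the last stage). The function names `pipeline_cycles` and `throughput` are hypothetical and do not describe any particular processor.

```python
def pipeline_cycles(instructions, stages, flushes=0):
    """Cycles needed to run `instructions` on an idealized single-issue pipeline.

    Assumptions (illustrative only): one instruction enters per cycle,
    the first instruction completes after `stages` cycles, and every
    flush costs `stages - 1` cycles to refill the pipeline.
    """
    if instructions == 0:
        return 0
    fill = stages - 1  # cycles spent before the first instruction completes
    return fill + instructions + flushes * (stages - 1)


def throughput(instructions, stages, flushes=0):
    """Average number of instructions completed per clock cycle."""
    return instructions / pipeline_cycles(instructions, stages, flushes)


# A sequential stream of 1000 instructions on a 5-stage pipeline.
print(throughput(1000, 5))
# The same stream when 100 branches each cause a flush.
print(throughput(1000, 5, flushes=100))
```

With no flushes the fill time is amortized over the whole stream, so the throughput approaches one instruction per cycle. Each flush adds a refill cost, which is why branch-heavy code lowers the throughput in this model.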
MISD implies that different operations are carried out on the same data. There are no good examples of such systems. According to some descriptions a pipeline computer can be considered MISD. With that assumption, systolic arrays could also be included in this category. A systolic array is basically a two- or multi-dimensional pipeline system.
SIMD computers, sometimes called array computers, have several processing elements (PEs) that execute operations in parallel. One controller decodes the instruction, and all the PEs perform the same operation synchronously. This kind of parallelism is called Data-Level Parallelism (DLP) and is typically used in vector processors, which are described in detail in section 2.4.
MIMD means that processors perform independent computations on independent data in parallel. There are a couple of variants of this type of architecture. The main difference between them is whether they have shared memory or distributed memory (each processor has full control over its own memory or memory area). In the shared memory case there has to be some kind of arbiter or controller that distributes access time among the processors.
Multicore and Multiprocessors
Multicore processors and multiprocessors are both of the MIMD type. The difference is that the cores of a multicore processor are on the same die and are usually more tightly connected when it comes to sharing resources such as the Front Side Bus (FSB), memory and cache. Multiprocessors usually don’t share a die but still interact with each other in one way or another.
The problem with using multiple cores is the demands it puts on the programmer. To make use of the system’s performance, the code must have parts that can be run in parallel. The minimum is to run at least one thread per core; some architectures allow many threads per core. The speedup from using multiple cores can be calculated with the formula derived from Amdahl’s law. It is presented in equation (1), where $P$ is the fraction of the code that can be run on all the cores in parallel and $N$ is the number of cores.

$$S = \frac{1}{(1 - P) + \frac{P}{N}} \qquad (1)$$
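Amdahl’s law can be evaluated directly to see how quickly the serial part of a program limits the speedup. The following is a minimal sketch, assuming the usual form of the law, speedup = 1 / ((1 - p) + p / n); the function name `amdahl_speedup` and the example parallel fraction of 0.9 are chosen here for illustration only.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Speedup predicted by Amdahl's law.

    parallel_fraction: share of the run time that can use all cores (0..1).
    cores: number of cores working on the parallel part (>= 1).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Illustrative case: 90 % of the run time can be parallelized.
for cores in (1, 2, 4, 8, 16):
    print(cores, round(amdahl_speedup(0.9, cores), 2))
```

Even with an unlimited number of cores the speedup in this example cannot exceed 1 / (1 - 0.9) = 10, because the serial part always has to run on a single core.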
It is not unusual that the different CPUs need to communicate with each other, and there are a couple of ways to do so. A common way is by using messages: one CPU sends a message to another CPU over some kind of interconnection infrastructure such as a bus or a Network on Chip (NoC). Another way is to use a shared memory and store data where both CPUs can access it. A CPU whose calculations another CPU depends on can then store its results at an agreed location in the shared memory. These two kinds of communication can of course be mixed, so that some communication is done by message passing and some by shared memory.
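The two communication styles can be contrasted in a small software analogy. This is a sketch only: Python threads stand in for the CPUs, a `queue.Queue` stands in for the interconnect, and a dictionary protected by a lock stands in for the shared memory. The names `producer_message`, `producer_shared` and the key `"result"` are hypothetical.

```python
import queue
import threading


def producer_message(outbox):
    """Worker that sends its result to the consumer as a message."""
    outbox.put(sum(range(10)))


def producer_shared(shared, lock):
    """Worker that stores its result at a location both sides agreed on."""
    with lock:
        shared["result"] = sum(range(10))


# Message passing: the consumer blocks until the message arrives.
channel = queue.Queue()
worker = threading.Thread(target=producer_message, args=(channel,))
worker.start()
received = channel.get()
worker.join()

# Shared memory: the consumer reads the agreed location after the
# producer has finished writing it.
shared = {}
lock = threading.Lock()
worker = threading.Thread(target=producer_shared, args=(shared, lock))
worker.start()
worker.join()
with lock:
    stored = shared["result"]

print(received, stored)
```

In the message-passing case the arrival of the message is itself the synchronization. In the shared-memory case the two sides need a separate agreement on where the data is stored and when it is safe to read it.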
A coprocessor is a processor whose purpose is to assist a CPU (Central Processing Unit). A coprocessor is sometimes called an accelerator. It is commonly specialized and therefore gives the CPU an extra feature or a speedup. The specialization could for example be bit-based, integer or floating-point arithmetic, vector arithmetic, graphics or encryption. A coprocessor cannot be the main unit and therefore needs a host to control it in some way. Some coprocessors can’t fetch instructions on their own and need the host to send instructions to them. An example of a coprocessor is a GPU (Graphics Processing Unit), which basically all PCs have. It performs graphics calculations for the CPU and is used for gaming (3D processing), photo editing (image processing), video (decoding), etc.
A vector processor is a processor where one instruction can represent tens or hundreds of operations. Therefore, the area, time and energy overheads associated with fetching and decoding instructions are a lot smaller in a vector processor than in a scalar processor. To compare a scalar processor with a vector processor, consider an addition of two vectors of ten numbers that produces a new vector of ten numbers. In a scalar processor it could look something like this:
- execute this loop 10 times
- fetch the next element of first vector
- fetch the next element of second vector
- add them
- put the results in the next element of the third vector
- end loop
But when using a vector processor it could look like this:
- read the vector instruction and decode it
- fetch 10 numbers of first vector
- fetch 10 numbers of second vector
- add corresponding elements
- put the results back in memory
Another advantage is that a vector instruction like this guarantees that the ten numbers have no data dependencies, so checking for data hazards between elements is unnecessary.
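The difference in instruction overhead between the two step lists can be made concrete by counting how many instructions have to be fetched and decoded in each case. This is a rough model under stated assumptions, not a description of a real instruction set: every step in the lists counts as one instruction, loop-control instructions are ignored, and the function names `scalar_add` and `vector_add` are hypothetical.

```python
def scalar_add(a, b):
    """Add two vectors element by element, one instruction per step."""
    c = [0] * len(a)
    fetched = 0  # number of instructions fetched and decoded
    for i in range(len(a)):
        x = a[i]
        fetched += 1  # load element of the first vector
        y = b[i]
        fetched += 1  # load element of the second vector
        s = x + y
        fetched += 1  # add them
        c[i] = s
        fetched += 1  # store the result
    return c, fetched


def vector_add(a, b):
    """Add two vectors using one instruction per whole-vector step."""
    fetched = 0
    va = list(a)
    fetched += 1  # vector load of the first vector
    vb = list(b)
    fetched += 1  # vector load of the second vector
    vc = [x + y for x, y in zip(va, vb)]
    fetched += 1  # single vector add over all elements
    c = vc
    fetched += 1  # vector store
    return c, fetched


a = list(range(10))
b = list(range(10, 20))
print(scalar_add(a, b)[1], vector_add(a, b)[1])
```

Both functions produce the same result vector, but in this model the scalar version fetches four instructions per element while the vector version fetches four instructions in total, independent of the vector length.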
The architecture of a vector processor has two major variants, register-memory and memory-memory. The former assumes that data is located in a register file, while the latter works directly on memory. The register-memory type is the most common. One benefit of the register-memory type is the lower access latency of a register inside the architecture compared to a memory.
A typical architecture of a register-memory vector processor is presented in Figure 2. It includes some basic parts typical for a vector processor:
- Vector registers: Registers containing data in form of a vector. This is the data that is going to be processed.
- Vector functional units: Represent the different operations/functions the vector processor can perform on data.
- Vector load/store unit: Handles transfers between main memory and vector registers.
- Controller: Ensures the correct functionality of the vector processor (VP).
What gives the vector processor its parallel features are its vector lanes. The vector lanes contain computing elements that perform calculations in parallel with the other lanes. Figure 3 (a) shows an example of a single lane and Figure 3 (b) an example of multiple lanes. The figure shows the execution of an addition between the elements of a vector A and a vector B. The example shows how the additions are queued in the different setups. In the example with only one lane we can see that all the elements are queued, while in the four-lane example the elements are distributed between the lanes.
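How the elements could be spread over the lanes can be sketched in a few lines. The text does not state the distribution scheme, so this example assumes a round-robin interleaving in which element i is handled by lane i mod (number of lanes), and it assumes that each lane completes one addition per step. The function names `distribute` and `lane_add` are hypothetical.

```python
def distribute(n_elements, lanes):
    """Assign element indices to lanes round-robin (element i -> lane i % lanes)."""
    queues = [[] for _ in range(lanes)]
    for i in range(n_elements):
        queues[i % lanes].append(i)
    return queues


def lane_add(a, b, lanes):
    """Add vectors a and b, returning the result and the number of steps used.

    In every step each lane handles at most one queued element;
    conceptually all lanes work at the same time within a step.
    """
    queues = distribute(len(a), lanes)
    c = [None] * len(a)
    steps = max((len(q) for q in queues), default=0)
    for step in range(steps):
        for q in queues:
            if step < len(q):
                i = q[step]
                c[i] = a[i] + b[i]
    return c, steps


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [10] * 8
print(lane_add(a, b, lanes=1)[1], lane_add(a, b, lanes=4)[1])
```

With one lane all eight additions are queued behind each other, while with four lanes each lane only has two elements in its queue, so the same vector addition finishes in a quarter of the steps in this model.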