Assignment #1
Fall 2019

1. For a certain important application involving matrices, analysis reveals that floating-point instructions account for 40% of the total execution time, and the floating-point transpose function alone is responsible for 50% of the total floating-point time. As a result of this analysis, two mutually exclusive enhancements are being considered: 1) add an FPTRANS instruction to the current floating-point unit that will accelerate the transpose computation by a factor of 5, or 2) speed up the entire floating-point unit by a factor of 2. Which proposal will provide the higher performance?
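One way to set up this comparison is Amdahl's law. The sketch below is only an illustration, not a required method: the function name `amdahl_speedup` and the variable names are assumptions introduced here, and the fractions are taken directly from the problem statement (40% floating-point time, half of which is transpose).

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when `fraction_enhanced` of the original execution
    time is accelerated by a factor of `speedup_enhanced` (Amdahl's law)."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Fractions of total execution time, from the problem statement.
fp_fraction = 0.40                        # all floating-point instructions
transpose_fraction = fp_fraction * 0.50   # transpose is 50% of the FP time

# Option 1: FPTRANS speeds up only the transpose portion by 5x.
speedup_fptrans = amdahl_speedup(transpose_fraction, 5.0)

# Option 2: the whole floating-point unit becomes 2x faster.
speedup_fp_unit = amdahl_speedup(fp_fraction, 2.0)

print(f"Option 1 (FPTRANS): overall speedup = {speedup_fptrans:.3f}")
print(f"Option 2 (FP unit): overall speedup = {speedup_fp_unit:.3f}")
```

Comparing the two printed values answers the question; the key step is converting "50% of the floating-point time" into a fraction of the total execution time before applying the formula.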

2. Your task is to compare the memory efficiency of four different styles of instruction set architectures. The architecture styles are:

1. **Accumulator** - All operations occur between a single register and a memory location.

2. **Memory-memory** - All three operands of each instruction are in memory.

3. **Stack** - All operations occur on the top of the stack. Only `push` and `pop` access memory, and all other instructions remove their operands from the top of the stack and replace them with the result. The implementation uses a pair of hardware registers to hold the top two entries on the stack; accesses that use other stack positions require memory references.

4. **Load-store** - All operations occur in registers, and register-to-register instructions have three operands per instruction. There are 16 general-purpose registers, making the register specifiers 4 bits long.

To measure memory efficiency, assume the following about all four instruction sets:

- The opcode is always one byte (8 bits).
- All memory addresses are 2 bytes (16 bits).
- All data operands are 4 bytes (32 bits).
- All instructions are an integral number of bytes in length.

There are no other optimizations to reduce memory traffic, and the variables A, B, C and D are initially in memory.
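The assumptions above determine the length of each instruction format. As a rough sketch, the helper below (the name `instruction_bytes` is hypothetical) adds up the opcode, address, and register-specifier bits and rounds up to a whole number of bytes. The load-store `load`/`store` format shown, one register specifier plus one absolute 16-bit address, is an assumption; the problem statement does not fix it.

```python
OPCODE_BITS = 8     # opcode is always one byte
ADDRESS_BITS = 16   # memory addresses are 2 bytes
REGISTER_BITS = 4   # 16 general-purpose registers (load-store style only)


def instruction_bytes(addresses: int = 0, registers: int = 0) -> int:
    """Length in bytes of an instruction with the given number of memory
    addresses and register specifiers, rounded up to a whole byte."""
    bits = OPCODE_BITS + addresses * ADDRESS_BITS + registers * REGISTER_BITS
    return (bits + 7) // 8  # integer ceiling of bits / 8


sizes = {
    "accumulator op (1 address)": instruction_bytes(addresses=1),
    "memory-memory op (3 addresses)": instruction_bytes(addresses=3),
    "stack push/pop (1 address)": instruction_bytes(addresses=1),
    "stack ALU op (no explicit operands)": instruction_bytes(),
    # Assumed format: one register specifier + one absolute address.
    "load-store load/store (1 register, 1 address)": instruction_bytes(addresses=1, registers=1),
    "load-store ALU op (3 registers)": instruction_bytes(registers=3),
}

for name, size in sizes.items():
    print(f"{name}: {size} bytes")
```

Note that the two load-store formats need 28 and 20 bits respectively, so the integral-bytes rule pads them to 4 and 3 bytes.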

Invent your own assembly language mnemonics and write the best equivalent assembly language code for the high-level-language fragments given. Write the four code sequences for:

```
A = B + C;
B = A + C;
D = A - B;
```

For each sequence, calculate the instruction bytes fetched and the memory-data bytes transferred. Which architecture is most efficient as measured by code size? Which architecture is most efficient as measured by total memory bandwidth (code + data) required?
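The bookkeeping for one architecture might be organized as in the sketch below, shown for the memory-memory style only. The mnemonics `ADD` and `SUB`, the destination-first operand order, and the helper `tally` are invented for illustration. Each instruction is taken to be 7 bytes (a 1-byte opcode plus three 2-byte addresses) and to move three 4-byte operands (two reads, one write).

```python
OPERAND_BYTES = 4  # all data operands are 32 bits

# Each entry: (assembly text, instruction bytes, memory operands touched).
# Hypothetical memory-memory mnemonics, destination operand first.
memory_memory = [
    ("ADD A, B, C", 7, 3),  # A = B + C
    ("ADD B, A, C", 7, 3),  # B = A + C
    ("SUB D, A, B", 7, 3),  # D = A - B
]


def tally(program):
    """Return (code bytes fetched, data bytes transferred, total bytes)."""
    code = sum(size for _, size, _ in program)
    data = sum(operands * OPERAND_BYTES for _, _, operands in program)
    return code, data, code + data


code, data, total = tally(memory_memory)
print(f"code = {code} bytes, data = {data} bytes, total = {total} bytes")
```

The same table-and-tally approach applies to the accumulator, stack, and load-store sequences once their instructions are written out; only the per-instruction sizes and memory-operand counts change.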