VLIW Data Path

How storage is organized inside the processor, and how data moves between its components.

CISC and DSP
-Memory-to-memory operations and complex addressing modes
-Accumulator: a dedicated target register of the ALU
-Such storage specialties force the compiler to make binding choices and optimizations too early

RISC and VLIW
-Register-to-register operations: large register files
-Decoupling of scheduling and register allocation

Processor multiplier examples: 16×16-bit, 32×32-bit
e.g., ARM7, ARM9E (ARMv5TE DSP extensions)

VLIW Machine: a VLIW machine with 8 independent 32-bit datapaths

In the embedded world, the characteristics of the application domain are also very important.
Simple Integer and Compare Operations
Carry, Overflow, and other flags

Fixed Point Multiplication
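Fixed-point multiplication is the classic reason DSPs provide a widening 16×16 → 32-bit multiplier. A sketch of the common Q15 case (illustrative only; real DSPs add saturation and selectable rounding modes):

```c
#include <stdint.h>

/* Q15 fixed-point multiply: take the full 32-bit product of two
 * Q15 values, round, and shift right by 15 to restore the scaling. */
static int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t prod = (int32_t)a * (int32_t)b;   /* widening multiply */
    prod += 1 << 14;                          /* round to nearest */
    return (int16_t)(prod >> 15);             /* rescale to Q15 */
}
```

For example, 0.5 × 0.5: `q15_mul(0x4000, 0x4000)` yields 0x2000 (8192), i.e. 0.25 in Q15.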

Partial interconnect clusters promise better scalability
The bypass network of a clustered VLIW is also partitioned.

Indexed Register Files
-register file divided into regions, e.g., input, output, local, and static
-compiler can explicitly allocate a variable-sized section of the register file
-used in procedure call and return for stack management
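A toy model of this idea (all names hypothetical), in the spirit of a register stack: a call allocates a variable-sized frame of registers from the large register file, and return frees it, so register spills to memory are avoided for shallow call chains:

```c
#include <assert.h>
#include <stdint.h>

#define RF_SIZE 128                 /* assumed register-file size */

static uint32_t regfile[RF_SIZE];   /* the physical registers */
static int frame_top = 0;           /* first free physical register */

/* Allocate n registers for a callee's frame; returns the base
 * index, so the callee's "virtual" register r_i maps to
 * regfile[base + i]. */
static int frame_alloc(int n)
{
    assert(frame_top + n <= RF_SIZE);
    int base = frame_top;
    frame_top += n;
    return base;
}

/* On return, release the callee's frame. */
static void frame_free(int base)
{
    frame_top = base;
}
```

Nested calls simply stack frames: after `frame_alloc(8)` and `frame_alloc(4)` the second frame starts at register 8, and freeing it makes those registers available again.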

The compiler groups independent primitive operations into VLIW instructions.
During each instruction cycle, one VLIW word is fetched from the cache and decoded.
Its operations are then issued to the functional units and executed in parallel.
Since the primitive operations come from the same VLIW word, they are guaranteed to be independent.
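A VLIW word can be pictured as a fixed set of operation slots, one per datapath. A minimal sketch for the 8-datapath machine above (slot layout and opcodes assumed, not any real ISA):

```c
#include <stdint.h>

/* One RISC-like primitive operation: opcode plus register operands. */
typedef struct {
    uint8_t opcode;              /* 1 = ADD, 2 = MUL, 0 = NOP */
    uint8_t dest, src1, src2;
} Operation;

/* A VLIW word: all 8 slots issue in the same cycle; the compiler
 * guarantees they are independent, filling unused slots with NOPs. */
typedef struct {
    Operation slot[8];
} VLIWWord;

/* In hardware every slot dispatches in parallel; this model walks
 * them sequentially, which is safe precisely because the compiler
 * guaranteed the slots do not depend on each other. */
static void issue(const VLIWWord *w, uint32_t regs[32])
{
    for (int i = 0; i < 8; i++) {
        const Operation *op = &w->slot[i];
        switch (op->opcode) {
        case 1: regs[op->dest] = regs[op->src1] + regs[op->src2]; break;
        case 2: regs[op->dest] = regs[op->src1] * regs[op->src2]; break;
        default: break;          /* NOP */
        }
    }
}
```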

VLIW is a load/store architecture
– mostly reuses RISC memory-addressing concepts
– special addressing modes
– registers: data and address
– memory may be banked for power efficiency
– X-Y memory
– cf. GPUs: constant memory, shader memory, shared memory, local memory
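X-Y memory exists so that both operands of a multiply-accumulate can be fetched in the same cycle, one from each bank. In C the bank placement is toolchain-specific (pragmas or linker sections, assumed here); `restrict` tells the compiler the two arrays do not alias, which is what licenses the dual fetch. A sketch of the access pattern:

```c
#include <stdint.h>

/* Dot product over two arrays that a DSP toolchain would place in
 * the X and Y memory banks; one MAC per iteration, with both
 * operand fetches issued in parallel on dual-bank hardware. */
static int32_t dot_q15(const int16_t *restrict x,   /* X bank */
                       const int16_t *restrict y,   /* Y bank */
                       int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)x[i] * y[i];
    return acc;
}
```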

ISA Design

Overview: What to hide
Basic: VLIW Design Principle
Designing a VLIW ISA for Embedded Systems
Instruction-set Encoding

Terminology
Operation: fundamental RISC-like minimal unit of work
Instruction:
-fundamental unit of encoding
-refers to a parallel set of operations
Bundle: a memory-aligned encoding unit
-VEX calls the minimum-sized encoding unit a syllable
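In encodings of this style, each fixed-size syllable carries a stop bit that marks the last syllable of an instruction, so the decoder can split a fetched bundle into parallel instructions. A sketch (the stop-bit position is assumed for illustration):

```c
#include <stdint.h>

#define STOP_BIT (1u << 31)   /* assumed stop-bit position */

/* Count how many parallel instructions a bundle of syllables
 * contains: each set stop bit closes one instruction. */
static int count_instructions(const uint32_t *syllables, int n)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (syllables[i] & STOP_BIT)
            count++;
    return count;
}
```

For example, a 4-syllable bundle with stop bits on the second and fourth syllables encodes two instructions of two operations each.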

The ISA allows the designer to treat the processor as a black box.
Understanding the ISA helps determine which processor to use for a given system.

Delay slots of early RISC machines
Illusion of instant register updates in superscalars

VLIW Architecture
-exposes scheduling to the compiler
-conversely, superscalar architecture hides it

Memory: off-chip, not specialized
Registers: fast, on-chip, connected to the logic
Baseline model: sequential execution
Pipelining: parallelism in time
-whether the implementation is hidden or exposed is a design choice
-hidden: out-of-order execution
Pipelines in modern processors

Modern high-performance processors:
15 to 20 stages; the Pentium 4 had a 20-stage pipeline

Sequential program semantics
-the processor tries to issue an instruction every clock cycle
-but there are dependencies, control hazards, and long-latency instructions
-delays result in execution of < 1 instruction per cycle on average

VEX example: clock cycle and instruction

Modern CPU Techniques: Pipelining
Execution is divided into several stages.
A later operation can share the resources used by an earlier operation in previous cycles.
Shared hardware can be pipelined, e.g., the integer multiplier is pipelined.
Instructions can be overlapped.

In embedded systems, the choice of ISA is crucial!
data or memory access: e.g., set a register to a fixed constant value
control flow: branch, call functions
arithmetic or logic: add, multiply, subtract, divide