Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L...

UNIVERSITAT POLITÈCNICA DE CATALUNYAUNIVERSITAT POLITÈCNICA DE CATALUNYAUNIVERSITAT POLITÈCNICA DE CATALUNYAUNIVERSITAT POLITÈCNICA DE CATALUNYADepartament d’Arquitectura de Computadors

Heterogeneous Clustered VLIW

Microarchitectures

CGO’07, San Jose, California - March 2007

Northeastern UniversityNortheastern UniversityNortheastern UniversityNortheastern University

Microarchitectures

Alex Aletà, Alex Aletà, Alex Aletà, Alex Aletà, Josep M. Codina, Antonio González and David KaeliJosep M. Codina, Antonio González and David KaeliJosep M. Codina, Antonio González and David KaeliJosep M. Codina, Antonio González and David Kaeli

Clustered Microarchitectures

� Challenges Challenges Challenges Challenges in processor designin processor designin processor designin processor design

� Wire delays

� Power consumption

� Clustering: Clustering: Clustering: Clustering: divide the system into semi-independent

� Each unit ⇒⇒⇒⇒ Cluster

� Fast interconnects intra-cluster

� Slow interconnects inter-clusters

� Common trend in commercial VLIW processors

� DSP/embedded domain

Clustered VLIW Architecture

Register buses

Clustered VLIW processorClustered VLIW processorClustered VLIW processorClustered VLIW processor

REGISTER FILE

I-CACHE

CLUSTER

1CLUSTER

2CLUSTER

MAIN MEMORY

DATA CACHE

INT INT FP FP MEM MEM

REGISTER FILE

Motivation

� Not all instructions have the same impact Not all instructions have the same impact Not all instructions have the same impact Not all instructions have the same impact

on execution timeon execution timeon execution timeon execution time

� Divide resourcesDivide resourcesDivide resourcesDivide resources

� Performance oriented clusters

� Higher voltages

� Faster

� Place critical instructions

� Power oriented clusters

� Lower voltages

� Consume less power

� Place non-critical instructions

Motivation

CycleCycleCycleCycle C0C0C0C0 C1C1C1C1 C2C2C2C2 BusBusBusBus

C0C0C0C0 C1C1C1C1 C2C2C2C2 ICNICNICNICN

Cycle timeCycle timeCycle timeCycle time 1111 1111 1111 1111Homogeneous

0000 AAAA

1111 BBBB Com ACom ACom ACom A

2222 CCCC IIII LLLL

3333 DDDD JJJJ MMMM

4444 EEEE KKKK NNNN

5555 FFFF

6666 GGGG

7777 HHHH

Scheduling

Motivation

C0C0C0C0 C1C1C1C1 C2C2C2C2 ICNICNICNICN

Cycle timeCycle timeCycle timeCycle time 1111 2222 2222 1111Heterogeneous

0000 AAAA

1111 BBBB Com ACom ACom ACom A

2222 CCCCIIII LLLL

3333 DDDD

4444 EEEEJJJJ MMMM

5555 FFFF

6666 GGGGKKKK NNNN

7777 HHHH

Scheduling

Talk Outline

� Heterogeneous Clustered ArchitectureHeterogeneous Clustered ArchitectureHeterogeneous Clustered ArchitectureHeterogeneous Clustered Architecture

� Proposed Compiler TechniquesProposed Compiler TechniquesProposed Compiler TechniquesProposed Compiler Techniques

� Experimental EvaluationExperimental EvaluationExperimental EvaluationExperimental Evaluation

� ConclusionsConclusionsConclusionsConclusions

Heterogeneous Architecture

� Configured similar to a multiple clock domain designConfigured similar to a multiple clock domain designConfigured similar to a multiple clock domain designConfigured similar to a multiple clock domain design

� Domain boundaries:

� Each cluster

� Inter-connection Network

� Memory hierarchy

� Each domain can use a different voltage / frequencyEach domain can use a different voltage / frequencyEach domain can use a different voltage / frequencyEach domain can use a different voltage / frequency

� Performance oriented: higher voltage/frequency

� Power oriented: lower voltage/frequency

� Communication between domains: synchronization Communication between domains: synchronization Communication between domains: synchronization Communication between domains: synchronization

queuesqueuesqueuesqueues

I-CACHE

Register buses

Clustered VLIW processorClustered VLIW processorClustered VLIW processorClustered VLIW processor synchronization

queues

DATA CACHE

INT INT FP FP MEM MEM

REGISTER FILECLUSTER

1CLUSTER

2CLUSTER

MAIN MEMORY

DATA CACHE

Distributed Control Path

� Fetch and decode units distributed among clustersFetch and decode units distributed among clustersFetch and decode units distributed among clustersFetch and decode units distributed among clusters

� Instruction lay-out

� Grouped by cluster

I1C1 I1C4I1C3I1C2 I2C1 I2C4I2C3I2C2 I3C1 I3C4I3C3I3C2

Centralized Control Path

I1C1 I3C1I2C1 I1C2 I3C2I2C2

I1C3 I3C3I2C3 I1C4 I3C4I2C4

I C1 I C4I C3I C2 I C1 I C4I C3I C2 I C1 I C4I C3I C2

... ...

......Distributed

Control Path

PC3 PC4

PC1 PC2

[Zhong et al. PACT’05]

Talk Outline

Statically Scheduled Processors

� Performance relies on the compilerPerformance relies on the compilerPerformance relies on the compilerPerformance relies on the compiler

� Instruction scheduling

� Register allocation

� Clustered microarchitectures

� Cluster assignment

� Communications

� Multimedia and numeric codeMultimedia and numeric codeMultimedia and numeric codeMultimedia and numeric code

� Majority of the execution in loop bodies

� Modulo scheduling

cycle 2cycle 2cycle 2cycle 2

Modulo Scheduling

� Effective technique for Effective technique for Effective technique for Effective technique for scheduling loopsscheduling loopsscheduling loopsscheduling loops

� Overlaps loop iterations

� Increases parallelism

Iteration space

3rd It

1st It

4th ItA

3rd ItB

Kernel: pattern repeated every II cyclesA

2nd It

II ≥MII (Minimum Initiation Interval)

� Constrained by resources and recurrences

�MII = max {resMII, recMII}

MS for Heterogeneous

EAcycle 4cycle 4cycle 4cycle 4

Iteration space

1st It

2nd It

KernelEach cluster has its own II

Constant Iteration Time (IT)

3rd It

MIT = max {resMIT, recMIT}

Proposed Technique

Profile homogenous

architecture

Select frequencies

and voltages

Modulo Scheduling

Select Voltages and Frequencies

� For each domainFor each domainFor each domainFor each domain

� At program levelAt program levelAt program levelAt program level

� Consider different delays between fast and slow Consider different delays between fast and slow Consider different delays between fast and slow Consider different delays between fast and slow

domainsdomainsdomainsdomains

� Estimate execution time

� Estimate energy consumption

� Select minimum ED2

Estimate Execution Time

� Use profilingUse profilingUse profilingUse profiling

� Texec= niters · (II + SC - 1) · Tcycle

� Estimated Estimated Estimated Estimated IT

� Enough to accommodate

� All instructions

� All recurrences

� Value lifetimes of the profiled homogeneous

� Communications required by the profiled homogeneous

Estimate Energy Consumption

� Same microarchitecture, different voltagesSame microarchitecture, different voltagesSame microarchitecture, different voltagesSame microarchitecture, different voltages

� Relative dynamic power

� Relative static power

⋅⋅⋅

⋅⋅⋅=

� Relative static power

⋅⋅⋅

MS Algorithm

Compute

IT:=MIT

Increase

Select IIs

& freqs

ClusterClusterClusterCluster Supported cycle timesSupported cycle timesSupported cycle timesSupported cycle times

C0C0C0C0 1ns1ns1ns1ns

C1C1C1C1 5/4 ns ; 4/3 ns ; 3/2 ns5/4 ns ; 4/3 ns ; 3/2 ns5/4 ns ; 4/3 ns ; 3/2 ns5/4 ns ; 4/3 ns ; 3/2 ns

IT= 7ns II(C0)= 7 /1 = 7 cycles

II(C1)= 7 / 1.3333 = 5.25

II(C1)= 7 / 1.25 = 5.6

II(C1)= 7 / 1.5 = 4.667

IT= 8ns II(C0)= 8 / 1= 8 cycles

II(C1)= 8 / 1.3333 = 6

II(C1)= 8 / 1.25 = 6.4

MS Algorithm

Compute

IT:=MIT

Increase

Select IIs

& freqs

Partition

Schedule

OK?YES

MS for Heterogeneous

�Multilevel graph partitioning algorithmMultilevel graph partitioning algorithmMultilevel graph partitioning algorithmMultilevel graph partitioning algorithm

� Coarsening

� Place critical recurrences in fast clusters

� Refinement

� Estimate execution time and energy consumption

� Optimize for ED2

� SchedulingSchedulingSchedulingScheduling

� Instructions delayed due to synchronization hazards

Recurrence Constrained Loops

� Large benefits expectedLarge benefits expectedLarge benefits expectedLarge benefits expected

� A small number of instructions are critical for execution time

� 2222----cycle latency instructionscycle latency instructionscycle latency instructionscycle latency instructions

� 4 clusters, 1 FU per cluster4 clusters, 1 FU per cluster4 clusters, 1 FU per cluster4 clusters, 1 FU per cluster

� HomogeneousHomogeneousHomogeneousHomogeneous� Recurrence: 6 cycles

� II= 6 cycles (all clusters!) lots of unused slots

� HeterogeneousHeterogeneousHeterogeneousHeterogeneous� 1 fast cluster, cycle time= 1ns

� 3 slow clusters, cycle time= 2ns

IT= 6ns

II(fast cluster)= 6

II(slow clusters)= 3

Talk Outline

Experimental Environment

�MicroarchitectureMicroarchitectureMicroarchitectureMicroarchitecture

� Clusters

� 4 clusters

� 1 FP-unit, 1 INT-unit, 1 memory port, 16 registers per cluster

� Inter-connection Network

� 1-cycle latency broadcast buses

� Heterogeneity

� Clusters: 1 performance oriented / 3 power oriented

� Benchmarks:Benchmarks:Benchmarks:Benchmarks:

� SpecFP2k Fortran programs

� Loops obtained with ORC

� Baseline: homogeneous architectureBaseline: homogeneous architectureBaseline: homogeneous architectureBaseline: homogeneous architecture

Results

2 buses

35%30%

High ILP Programs

� All instructions have a similar impact on execution timeAll instructions have a similar impact on execution timeAll instructions have a similar impact on execution timeAll instructions have a similar impact on execution time

� No benefit using different frequencies

� Higher IPC

� Dynamic energy accounts for a majority of the total energy

consumption

� Use lower Vdd and lower frequencies

� Memory hierarchy and inter-connection network can use different voltages

Talk Outline

Conclusions

� Heterogeneous clustered architecturesHeterogeneous clustered architecturesHeterogeneous clustered architecturesHeterogeneous clustered architectures

� Clusters run at different voltages / frequencies

� Instructions that impact execution time scheduled in fast

clusters

� Remaining instructions in power-oriented clusters

� Proposed compiler techniquesProposed compiler techniquesProposed compiler techniquesProposed compiler techniques

� Algorithm to select the voltages / frequencies

� MS for heterogeneous configurations

� EDEDEDED2222 : 15% improvement on average: 15% improvement on average: 15% improvement on average: 15% improvement on average

� Up to 35% for selected programs

UNIVERSITAT POLITÈCNICA DE CATALUNYAUNIVERSITAT POLITÈCNICA DE CATALUNYAUNIVERSITAT POLITÈCNICA DE CATALUNYAUNIVERSITAT POLITÈCNICA DE CATALUNYADepartament d’Arquitectura de Computadors

Heterogeneous Clustered VLIW

Microarchitectures

CGO’07, San Jose, California - March 2007

Microarchitectures

Alex Aletà, Alex Aletà, Alex Aletà, Alex Aletà, Josep M. Codina, Antonio González and David KaeliJosep M. Codina, Antonio González and David KaeliJosep M. Codina, Antonio González and David KaeliJosep M. Codina, Antonio González and David Kaeli

Motivating Example

C1C1C1C1 BusBusBusBus C2C2C2C2

0000 AAAA

1111 BBBB Comm AComm AComm AComm A

2222 CCCC EEEE

3333 DDDD

C1C1C1C1 BusBusBusBus C2C2C2C2

0000 AAAA

1111 BBBB Comm AComm AComm AComm A

2222 CCCCEEEE

3333 DDDD

C1C1C1C1 C2C2C2C2

Cycle timeCycle timeCycle timeCycle time 1 ns1 ns1 ns1 ns 1 ns1 ns1 ns1 ns2 ns

� SignalsSignalsSignalsSignals

enable_C1

enable_mem

enable_CN

Multiplier

Divider

syncqueues

enable_all

On Chip

Memory

Branch Instructions

� Branches decoupled in several instructions Branches decoupled in several instructions Branches decoupled in several instructions Branches decoupled in several instructions

(Unbundled Branch Architecture)(Unbundled Branch Architecture)(Unbundled Branch Architecture)(Unbundled Branch Architecture)

� Branch target computation

� Independent in each cluster

� Branch condition evaluation

� Computed in one cluster

� Broadcasted to the rest

� Control transfer

� Different in each cluster

Results

ED2 ED21 bus 2 buses

Recurrence Constrained Loops

� Largest benefits obtainedLargest benefits obtainedLargest benefits obtainedLargest benefits obtained

� A small number of instructions are critical for execution time

� Improvement: 30% - 35% for 189.lucas and 200.sixtrack

� 2222----cycle latency instructionscycle latency instructionscycle latency instructionscycle latency instructions

� 4 clusters, 1 FU per cluster4 clusters, 1 FU per cluster4 clusters, 1 FU per cluster4 clusters, 1 FU per cluster

� HomogeneousHomogeneousHomogeneousHomogeneous� Recurrence: 6 cycles

� II= 6 cycles (all clusters!) lots of unused slots

� HeterogeneousHeterogeneousHeterogeneousHeterogeneous� 1 fast cluster, cycle time= 1ns

� 3 slow clusters, cycle time= 2ns

IT= 6ns

II(fast cluster)= 6

II(slow clusters)= 3

Smallest Benefits

� Around 5%Around 5%Around 5%Around 5%

� 168.wupwise

� Loops have different characteristics

� 173.applu

� Low number of iterations

� Schedule length has a big impact

Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L...

Documents

Transcript of Heterogeneous Clustered VLIW Microarchitectures · Motivation Cycle C0CC00C0 C1CC11C1 CC22C2 Bus L...

PASSAGE (mm) HEIGHT (mm) CYCLE CYCLE VAJILLA (mm) …

Brayton Cycle

Clase 3 Rankine Cycle

UNIVERSIDAD ICN

Microorganismes, microbes, virus au Cycle 3

ICN Members’ Address List - 2001 · 2019-03-04 · ICN Members’ Address List - 2019 ICN Members’ Address List - 2019 Bermuda Nurses' Association PO Box HM 1466 Hamilton HM FX

UNIVERSIDAD ICN PREPARATORIA ICN BIOLOGIA: LOS ECOSISTEMAS Y SU INTERACCION CON EL MEDIO AMBIENTE MAESTRO: RAUL RAMIREZ POSADA.

Collins-Based Nitrogen Liquefaction Cycle

Fish Life Cycle - sairatrev.com

Cycle City 35 digital

Cycle City 12

Program Cycle 4

JUAN CARLOS D’OLIVO, ICN-UNAM

Bàrbara Baro Cell Cycle Group

Customer life cycle #DDayEADA

Cycle City 26 digital

Cell Cycle Arrest and Cell Apoptosis

SYLLABUS CYCLE II.pdf

UNIVERSIDAD ICN La Fotosíntesis y las células de las plantas UNIVERSIDAD ICN La Fotosíntesis y las células de las plantas Raul Ramirez Posada.

Cycle City 29 Digital