Post on 17-Oct-2020
UNIVERSITAT POLITÈCNICA DE CATALUNYA
Departament d’Arquitectura de Computadors
Northeastern University
Heterogeneous Clustered VLIW Microarchitectures
Alex Aletà, Josep M. Codina, Antonio González and David Kaeli
CGO’07, San Jose, California - March 2007
Clustered Microarchitectures
• Challenges in processor design
  • Wire delays
  • Power consumption
• Clustering: divide the system into semi-independent units
  • Each unit ⇒ Cluster
  • Fast interconnects intra-cluster
  • Slow interconnects inter-cluster
• Common trend in commercial VLIW processors
  • DSP/embedded domain
Clustered VLIW Architecture
[Figure: clustered VLIW processor. Labels: I-CACHE; CLUSTER 1, CLUSTER 2, ..., CLUSTER N, each with a REGISTER FILE and INT, FP and MEM units; register buses; DATA CACHE; MAIN MEMORY]
Motivation
• Not all instructions have the same impact on execution time
• Divide resources
  • Performance oriented clusters
    • Higher voltages
    • Faster
    • Place critical instructions
  • Power oriented clusters
    • Lower voltages
    • Consume less power
    • Place non-critical instructions
[Figure: data dependence graph with nodes A to N]
Motivation
[Figure: data dependence graph with nodes A to N]

Homogeneous
              C0   C1   C2   ICN
Cycle time    1    1    1    1

Scheduling
Cycle   C0   C1   C2   Bus
0       A
1       B              Com A
2       C    I    L
3       D    J    M
4       E    K    N
5       F
6       G
7       H
Motivation
[Figure: data dependence graph with nodes A to N]

Heterogeneous
              C0   C1   C2   ICN
Cycle time    1    2    2    1

Scheduling (each instruction on C1 and C2 spans two rows, since their cycle time is 2)
Cycle   C0   C1   C2   Bus
0       A
1       B              Com A
2       C    I    L
3       D    (I)  (L)
4       E    J    M
5       F    (J)  (M)
6       G    K    N
7       H    (K)  (N)
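The two schedules can be checked with a short calculation. This is only a sketch under assumptions read off the schedule tables, not stated on the slides: A to H form a dependence chain on C0, I-J-K and L-M-N form chains on C1 and C2, every operation takes one cycle of its cluster, and both side chains start once the value of A has crossed the bus. The function names are illustrative.

```python
def finish_time(start, n_ops, cycle_time):
    """Finish time of a chain of n_ops single-cycle operations."""
    return start + n_ops * cycle_time


def loop_body_length(c0_cycle, side_cycle, bus_cycle=1):
    # Critical chain A..H: 8 operations on C0, starting at time 0.
    critical = finish_time(0, 8, c0_cycle)
    # A finishes after one C0 cycle, then its value crosses the bus.
    side_start = c0_cycle + bus_cycle
    # Side chains of 3 operations each (I, J, K on C1 and L, M, N on C2).
    side = finish_time(side_start, 3, side_cycle)
    return max(critical, side)


homogeneous = loop_body_length(c0_cycle=1, side_cycle=1)
heterogeneous = loop_body_length(c0_cycle=1, side_cycle=2)
print(homogeneous, heterogeneous)
```

Under these assumptions both configurations finish in the same time, because the slower clusters only hold instructions with slack.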
Talk Outline
• Heterogeneous Clustered Architecture
• Proposed Compiler Techniques
• Experimental Evaluation
• Conclusions
Heterogeneous Architecture
• Configured similarly to a multiple clock domain design
  • Domain boundaries:
    • Each cluster
    • Inter-connection network
    • Memory hierarchy
• Each domain can use a different voltage / frequency
  • Performance oriented: higher voltage/frequency
  • Power oriented: lower voltage/frequency
• Communication between domains: synchronization queues
Heterogeneous Architecture
[Figure: clustered VLIW processor with synchronization queues. Labels: I-CACHE; CLUSTER 1, CLUSTER 2, ..., CLUSTER N, each with a REGISTER FILE and INT, FP and MEM units; register buses; synchronization queues; DATA CACHE; MAIN MEMORY]
Distributed Control Path
• Fetch and decode units distributed among clusters
• Instruction layout
  • Grouped by cluster

Centralized control path (single PC):
  I1C1 I1C2 I1C3 I1C4 | I2C1 I2C2 I2C3 I2C4 | I3C1 I3C2 I3C3 I3C4 ...

Distributed control path (PC1, PC2, PC3, PC4):
  PC1: I1C1 I2C1 I3C1 ...    PC2: I1C2 I2C2 I3C2 ...
  PC3: I1C3 I2C3 I3C3 ...    PC4: I1C4 I2C4 I3C4 ...

[Zhong et al. PACT’05]
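The difference between the two instruction layouts can be sketched in a few lines. The labels follow the slide (instruction word i, cluster c is written IiCc); the function names are illustrative, and the code only models the ordering of instructions in memory, not the fetch hardware.

```python
def centralized_layout(n_words, n_clusters):
    """Single stream: each VLIW word holds one operation per cluster."""
    return [f"I{i}C{c}"
            for i in range(1, n_words + 1)
            for c in range(1, n_clusters + 1)]


def distributed_layout(n_words, n_clusters):
    """One stream per cluster, each followed by that cluster's own PC."""
    return {f"PC{c}": [f"I{i}C{c}" for i in range(1, n_words + 1)]
            for c in range(1, n_clusters + 1)}


central = centralized_layout(3, 4)
distributed = distributed_layout(3, 4)
print(central)
print(distributed)
```

With the layout grouped by cluster, each cluster can fetch at its own rate, which is what allows clusters to run at different frequencies.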
Statically Scheduled Processors
• Performance relies on the compiler
  • Instruction scheduling
  • Register allocation
  • Clustered microarchitectures
    • Cluster assignment
    • Communications
• Multimedia and numeric code
  • Majority of the execution in loop bodies
  • Modulo scheduling
Modulo Scheduling
• Effective technique for scheduling loops
  • Overlaps loop iterations
  • Increases parallelism
[Figure: iteration space against time (cycles 1 to 6). Iterations 1 to 4 of a loop body A, B, C overlap. Kernel: pattern repeated every II cycles]
• II ≥ MII (Minimum Initiation Interval)
  • Constrained by resources and recurrences
  • MII = max {resMII, recMII}
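A minimal sketch of the MII bound, assuming the usual textbook definitions (the slides do not spell them out): resMII is the operation count of the busiest functional-unit class divided by the number of units of that class, and recMII is the largest ratio of total latency to iteration distance over all recurrences. The function names and the example numbers are made up for illustration.

```python
import math


def res_mii(op_counts, fu_counts):
    """Resource bound: the busiest functional-unit class sets the limit."""
    return max(math.ceil(op_counts[k] / fu_counts[k]) for k in op_counts)


def rec_mii(recurrences):
    """Recurrence bound; each recurrence is (total latency, iteration distance)."""
    return max((math.ceil(lat / dist) for lat, dist in recurrences), default=0)


def mii(op_counts, fu_counts, recurrences):
    return max(res_mii(op_counts, fu_counts), rec_mii(recurrences))


ops = {"int": 5, "fp": 3, "mem": 4}   # operations per loop iteration
fus = {"int": 2, "fp": 2, "mem": 2}   # functional units per class
example = mii(ops, fus, [(6, 1)])     # one recurrence: latency 6, distance 1
print(example)
```

In this example the recurrence, not the resources, determines MII, which is the situation the heterogeneous design targets.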
MS for Heterogeneous
• Each cluster has its own II
• Constant Iteration Time (IT)
[Figure: iteration space against time (cycles 1 to 6). Iterations 1 to 3 of a loop body A, B, C, D, E overlap; the kernel is marked]
• MIT = max {resMIT, recMIT}
Proposed Technique
1. Profile homogeneous architecture
2. Select frequencies and voltages
3. Modulo scheduling
Select Voltages and Frequencies
• For each domain
• At program level
• Consider different delays between fast and slow domains
  • Estimate execution time
  • Estimate energy consumption
  • Select minimum ED²
Estimate Execution Time
• Use profiling
  • Texec = niters · (II + SC - 1) · Tcycle
• Estimated IT
  • Enough to accommodate
    • All instructions
    • All recurrences
    • Value lifetimes of the profiled homogeneous architecture
    • Communications required by the profiled homogeneous architecture
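The execution-time estimate can be written as a one-line function. The sketch follows the formula exactly as it appears on the slide, taking SC to be the stage count of the modulo schedule; the textbook estimate is usually written (niters + SC - 1) · II · Tcycle, so the slide form is kept here as given rather than corrected. The function name and the sample values are illustrative.

```python
def estimate_exec_time(n_iters, ii, stage_count, t_cycle_ns):
    """Texec = niters * (II + SC - 1) * Tcycle, as written on the slide."""
    return n_iters * (ii + stage_count - 1) * t_cycle_ns


# Hypothetical profile data: 100 iterations, II = 6, SC = 3, 1 ns cycle.
t_exec = estimate_exec_time(n_iters=100, ii=6, stage_count=3, t_cycle_ns=1.0)
print(t_exec)
```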
Estimate Energy Consumption
• Same microarchitecture, different voltages
  • Relative dynamic power

\[
\frac{P_{dyn_1}}{P_{dyn_2}}
  = \frac{p_t \cdot f_1 \cdot C_L \cdot V_{dd_1}^2}{p_t \cdot f_2 \cdot C_L \cdot V_{dd_2}^2}
  = \frac{f_1 \cdot V_{dd_1}^2}{f_2 \cdot V_{dd_2}^2}
\]

  • Relative static power

\[
\frac{P_{stat_1}}{P_{stat_2}}
  = \frac{\frac{I_0}{W_0} \cdot W_t \cdot 10^{-V_{th_1}/S} \cdot V_{dd_1}}
         {\frac{I_0}{W_0} \cdot W_t \cdot 10^{-V_{th_2}/S} \cdot V_{dd_2}}
  = \frac{V_{dd_1}}{V_{dd_2}} \cdot 10^{(V_{th_2} - V_{th_1})/S}
\]
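The relative power expressions lend themselves to a small numeric sketch. It assumes the slide's ratios read as P_dyn1/P_dyn2 = (f1 · Vdd1²)/(f2 · Vdd2²) and P_stat1/P_stat2 = (Vdd1/Vdd2) · 10^((Vth2 - Vth1)/S), with S the subthreshold swing; the function names and all numeric values below are illustrative assumptions, not data from the talk.

```python
def relative_dynamic_power(f1, vdd1, f2, vdd2):
    """P_dyn1 / P_dyn2 = (f1 * Vdd1^2) / (f2 * Vdd2^2)."""
    return (f1 * vdd1 ** 2) / (f2 * vdd2 ** 2)


def relative_static_power(vdd1, vth1, vdd2, vth2, swing):
    """P_stat1 / P_stat2 = (Vdd1 / Vdd2) * 10^((Vth2 - Vth1) / S)."""
    return (vdd1 / vdd2) * 10 ** ((vth2 - vth1) / swing)


def ed2(energy, delay):
    """Energy-delay-squared product used to rank configurations."""
    return energy * delay ** 2


# Hypothetical slow domain: half the frequency and 0.8x the supply voltage
# of the fast domain, same threshold voltage, S = 0.1 V per decade.
dyn = relative_dynamic_power(0.5, 0.8, 1.0, 1.0)
stat = relative_static_power(0.8, 0.3, 1.0, 0.3, swing=0.1)
print(dyn, stat, ed2(2.0, 3.0))
```

Lowering voltage and frequency together cuts dynamic power more than linearly, which is why the selection step minimizes ED² rather than energy or delay alone.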
MS Algorithm
[Flowchart: Compute MIT → IT := MIT → Select IIs & freqs → OK? → (NO) Increase IT]

Cluster   Supported cycle times
C0        1 ns
C1        5/4 ns ; 4/3 ns ; 3/2 ns

IT = 7 ns
  II(C0) = 7 / 1 = 7 cycles
  II(C1) = 7 / 1.3333 = 5.25
  II(C1) = 7 / 1.25 = 5.6
  II(C1) = 7 / 1.5 = 4.667
  No supported cycle time gives C1 a whole number of cycles, so IT is increased.
IT = 8 ns
  II(C0) = 8 / 1 = 8 cycles
  II(C1) = 8 / 1.3333 = 6 cycles
  II(C1) = 8 / 1.25 = 6.4
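The search for an IT at which every cluster has an integer II can be sketched as follows, using the cycle times from the table. Two points are assumptions of the sketch rather than statements from the slides: IT is increased in steps of 1 ns, and when several cycle times fit, the slowest one is preferred. Exact fractions avoid rounding problems with values such as 4/3 ns.

```python
from fractions import Fraction


def select_iis(it_ns, supported):
    """Return {cluster: (cycle_time, II)}, or None if a cluster has no integer II."""
    choice = {}
    for cluster, cycle_times in supported.items():
        # Assumption: prefer the slowest cycle time that divides IT evenly.
        for t in sorted(cycle_times, reverse=True):
            ii = Fraction(it_ns) / t
            if ii.denominator == 1:
                choice[cluster] = (t, int(ii))
                break
        else:
            return None  # no supported cycle time fits this IT
    return choice


def first_feasible_it(mit_ns, supported, step_ns=1, limit_ns=64):
    """Increase IT from MIT until select_iis succeeds (or the limit is hit)."""
    it = mit_ns
    while it <= limit_ns:
        choice = select_iis(it, supported)
        if choice is not None:
            return it, choice
        it += step_ns
    return None


supported = {
    "C0": [Fraction(1)],
    "C1": [Fraction(5, 4), Fraction(4, 3), Fraction(3, 2)],
}
result = first_feasible_it(7, supported)
print(result)
```

Starting from IT = 7 ns, the sketch rejects 7 ns and settles on 8 ns with C1 at 4/3 ns and II = 6, matching the slide's example.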
MS Algorithm
[Flowchart: Compute MIT → IT := MIT → Select IIs & freqs → OK? → (YES) Partition DDG → Schedule → OK? → (YES) Done. A NO at either check leads to Increase IT and back to Select IIs & freqs]
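One reading of the flowchart is a loop that raises IT until every phase succeeds. The sketch below is only a skeleton: `select`, `partition` and `schedule` are hypothetical callbacks standing in for the real phases, each returning None on failure, and the step size and upper limit on IT are assumptions.

```python
def modulo_schedule(mit, select, partition, schedule, step=1, max_it=64):
    """Raise IT from MIT until all phases succeed; returns (IT, result) or None."""
    it = mit
    while it <= max_it:
        config = select(it)                           # Select IIs & freqs
        if config is not None:
            parts = partition(it, config)             # Partition DDG
            if parts is not None:
                result = schedule(it, config, parts)  # Schedule
                if result is not None:
                    return it, result                 # Done
        it += step                                    # Increase IT
    return None


# Stub phases for illustration only: selection fails below 8 ns,
# scheduling fails below 9 ns.
demo = modulo_schedule(
    mit=7,
    select=lambda it: {"IT": it} if it >= 8 else None,
    partition=lambda it, cfg: ["all operations in one cluster"],
    schedule=lambda it, cfg, parts: "scheduled" if it >= 9 else None,
)
print(demo)
```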
MS for Heterogeneous
• Multilevel graph partitioning algorithm
  • Coarsening
    • Place critical recurrences in fast clusters
  • Refinement
    • Estimate execution time and energy consumption
    • Optimize for ED²
• Scheduling
  • Instructions delayed due to synchronization hazards
Recurrence Constrained Loops
• Large benefits expected
  • A small number of instructions are critical for execution time
• 2-cycle latency instructions
• 4 clusters, 1 FU per cluster
[Figure: data dependence graph with nodes A to K]
• Homogeneous
  • Recurrence: 6 cycles
  • II = 6 cycles (all clusters!): lots of unused slots
• Heterogeneous
  • 1 fast cluster, cycle time = 1 ns
  • 3 slow clusters, cycle time = 2 ns
  • IT = 6 ns
  • II(fast cluster) = 6
  • II(slow clusters) = 3
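The "unused slots" remark can be quantified with a small sketch. It assumes one issue slot per functional unit per cluster cycle and that the example graph has 11 single-operation nodes (A to K); both are readings of the slide, and the function name is illustrative.

```python
def issue_slots(it_ns, cycle_times_ns, fus_per_cluster=1):
    """Issue slots per kernel: each cluster offers II = IT / cycle time slots per FU."""
    slots = 0
    for t in cycle_times_ns:
        ii, rem = divmod(it_ns, t)
        assert rem == 0, "IT must be a whole number of cycles in every cluster"
        slots += ii * fus_per_cluster
    return slots


N_INSTRUCTIONS = 11  # nodes A..K in the example graph (assumed one op each)

homogeneous = issue_slots(6, [1, 1, 1, 1])    # four clusters at 1 ns
heterogeneous = issue_slots(6, [1, 2, 2, 2])  # one fast, three slow clusters
print(N_INSTRUCTIONS / homogeneous, N_INSTRUCTIONS / heterogeneous)
```

Under these assumptions the homogeneous kernel offers 24 slots for 11 instructions, while the heterogeneous one offers 15, so far fewer slots sit idle at the same iteration time.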
Experimental Environment
• Microarchitecture
  • Clusters
    • 4 clusters
    • 1 FP unit, 1 INT unit, 1 memory port, 16 registers per cluster
  • Inter-connection network
    • 1-cycle latency broadcast buses
  • Heterogeneity
    • Clusters: 1 performance oriented / 3 power oriented
• Benchmarks:
  • SpecFP2k Fortran programs
  • Loops obtained with ORC
• Baseline: homogeneous architecture
Results
[Figure: ED² per benchmark for the 1 bus and 2 buses configurations, y axis from 0.5 to 1. Benchmarks: wupwise, swim, mgrid, applu, galgel, facerec, lucas, fma3d, sixtrack, apsi, mean. Callouts on the plot: 35%, 30%, 17%]
High ILP Programs
[Figure: data dependence graph with nodes A to I]
• All instructions have a similar impact on execution time
  • No benefit using different frequencies
• Higher IPC
  • Dynamic energy accounts for a majority of the total energy consumption
• Use lower Vdd and lower frequencies
  • Memory hierarchy and inter-connection network can use different voltages
Conclusions
• Heterogeneous clustered architectures
  • Clusters run at different voltages / frequencies
  • Instructions that impact execution time scheduled in fast clusters
  • Remaining instructions in power-oriented clusters
• Proposed compiler techniques
  • Algorithm to select the voltages / frequencies
  • MS for heterogeneous configurations
• ED²: 15% improvement on average
  • Up to 35% for selected programs
Motivating Example
[Figure: data dependence graph with nodes A, B, C, D and E]

Cycle time: C1 = 1 ns, C2 = 1 ns
Cycle   C1   Bus      C2
0       A
1       B    Comm A
2       C             E
3       D

Cycle time: C1 = 1 ns, C2 = 2 ns
Cycle   C1   Bus      C2
0       A
1       B    Comm A
2       C             E
3       D             (E)
Heterogeneous Architecture
• Signals
[Figure: clock and enable signals. Labels: clock, Freq. Multiplier, Divider, enable_C1, ..., enable_CN, enable_mem, enable_all, C1, ..., CN, sync queues, bus, On Chip Memory]
Branch Instructions
• Branches decoupled into several instructions (Unbundled Branch Architecture)
  • Branch target computation
    • Independent in each cluster
  • Branch condition evaluation
    • Computed in one cluster
    • Broadcast to the rest
  • Control transfer
    • Different in each cluster
Results
[Figure: ED² per benchmark, two panels (1 bus, 2 buses), y axis from 0.5 to 1. Benchmarks: wupwise, swim, mgrid, applu, galgel, facerec, lucas, fma3d, sixtrack, apsi, mean]
Recurrence Constrained Loops
• Largest benefits obtained
  • A small number of instructions are critical for execution time
  • Improvement: 30% - 35% for 189.lucas and 200.sixtrack
• 2-cycle latency instructions
• 4 clusters, 1 FU per cluster
[Figure: data dependence graph with nodes A to K]
• Homogeneous
  • Recurrence: 6 cycles
  • II = 6 cycles (all clusters!): lots of unused slots
• Heterogeneous
  • 1 fast cluster, cycle time = 1 ns
  • 3 slow clusters, cycle time = 2 ns
  • IT = 6 ns
  • II(fast cluster) = 6
  • II(slow clusters) = 3
Smallest Benefits
• Around 5%
  • 168.wupwise
    • Loops have different characteristics
  • 173.applu
    • Low number of iterations
    • Schedule length has a big impact