7/25/2019 Mtbf Presentation
1/81
2004 Cisco Systems, Inc. All rights reserved. Printed in USA.
Presentation_ID.scr
1 2004 Cisco Systems, Inc. All rights reserved.
NMS-22019627_05_2004_c2
AVAILABILITY MEASUREMENT
SESSION NMS-2201
Agenda
Introduction
Availability Measurement Methodologies
Trouble Ticketing
Device Reachability: ICMP (Ping), SA Agent, COOL
SNMP: Uptime, Ping-MIB, COOL, EEM, SA Agent
Application
Developing an Availability Culture
Associated Sessions
NMS-1N01: Intro to Network Management
NMS-1N02: Intro to SNMP and MIBs
NMS-1N04: Intro to Service Assurance Agent
NMS-1N41: Introduction to Performance Management
NMS-2042: Performance Measurement with Cisco IOS
ACC-2010: Deploying Mobility in HA Wireless LANs
NMS-2202: How Cisco Achieved HA in Its LAN
RST-2514: HA in Campus Network Deployments
NMS-4043: Advanced Service Assurance Agent
RST-4312: High Availability in Routing
INTRODUCTION
WHY MEASURE AVAILABILITY?
Why Measure Availability?
1. Baseline the network
2. Identify areas for network improvement
3. Measure the impact of improvement projects
Why Should We Care About Network Availability?
Where are we now? (baseline)
Where are we going? (business objectives)
How best do we get from where we are now to where we are going? (improvements)
What if we can't get there from here?
Why Should We Care About Network Availability?
Recent studies by Sage Research determined that US-based service providers encountered:
Percent of downtime that is unscheduled: 44%
18% of customers experience over 100 hours of unscheduled downtime, or an availability of 98.5%
Average cost of network downtime per year: $21.6 million, or $2,169 per minute!
Downtime Costs Too Much!!!
SOURCE: Sage Research, IP Service Provider Downtime Study: Analysis of Downtime Causes, Costs and Containment Strategies, August 17, 2001; prepared for Cisco SP LOB
Cause of Network Outages
Source: Gartner Group
[Pie chart:]
Software and Application: 40% (software issues, performance and load, scaling)
User Error and Process: 40% (change management, process consistency)
Technology: 20% (hardware, links, design, environmental issues, natural disasters)
Top Three Causes of Network Outages
Congestive degradation
Capacity (unanticipated peaks)
Solutions validation
Software quality
Inadvertent configuration change
Change management
Network design
WAN failure (e.g., major fiber cut or carrier failure)
Power
Critical services failure (e.g., DNS/DHCP)
Protocol implementations and misbehavior
Hardware fault
Method for Attaining a Highly Available Network
Or a Road to Five Nines
Establish a standard measurement method
Define business goals as related to metrics
Categorize failures, root causes, and improvements
Take action for root cause resolution and improvement implementation
Where Are We Going? Or What Are Your Business Goals?
Financial
ROI, Economic Value Added, Revenue/Employee
Productivity
Time to market
Organizational mission
Customer perspective
Satisfaction, Retention, Market Share
Define Your End State. What Is Your Goal?
Why Availability for Business Requirements?
Availability as a basis for productivity data
Measurement of total-factor productivity
Benchmarking the organization
Overall organizational performance metric
Availability as a basis for organizational competency
Availability as a core competency
Availability improvement as an innovation metric
Resource allocation information
Identify defects
Identify root cause
Measure MTTR tied to process
It Takes a Design Effort to Achieve HA
Hardware and Software Design
Network and Physical Plant Design
Process Design
INTRODUCTION
WHAT IS NETWORK AVAILABILITY?
What Is High Availability?
Availability    Downtime per Year (24x7x365)
99.9999%        30 seconds
99.999%         5 minutes
99.990%         53 minutes
99.950%         4 hours, 23 minutes
99.900%         8 hours, 46 minutes
99.500%         1 day, 19 hours, 48 minutes
99.000%         3 days, 15 hours, 36 minutes
High Availability Means an Average End User Will Experience Less than Five Minutes Downtime per Year
Availability Definition
Availability definition is based on business objectives
Is it the user experience you are interested in measuring?
Are some users more important than others?
Availability groups?
Definitions of different groups
Exceptions to the availability definition
e.g., the CEO should never experience a network problem
How You Define Availability
Define availability perspective (customer, business, etc.)
Define availability groups and levels of redundancy
Define an outage
Define impact to network
Ensure SLAs are compatible with outage definition
Understand how maintenance windows affect outage definition
Identify how to handle DNS and DHCP within the definition of a Layer 3 outage
Examine component level sparing strategy
Define what to measure
Define measurement accuracy requirements
Network Design
What Is Reliability?
Reliability is often used as a general term that refers to the quality of a product
Failure rate
MTBF (Mean Time Between Failures) or
MTTF (Mean Time To Failure)
Engineered availability
Reliability is defined as the probability of survival (or no failure) for a stated length of time
MTBF Defined
MTBF stands for Mean Time Between Failures
MTTF stands for Mean Time To Failure
This is the average length of time between failures (MTBF) or to a failure (MTTF)
More technically, it is the mean time to go from an OPERATIONAL STATE to a NON-OPERATIONAL STATE
MTBF is usually used for repairable systems, and MTTF is used for non-repairable systems
MTTR stands for Mean Time To Repair
One Method of Calculating Availability
Availability = MTBF / (MTBF + MTTR)
What is the availability of a computer with MTBF = 10,000 hrs. and MTTR = 12 hrs?
A = 10000 / (10000 + 12) = 99.88%
Annual uptime:
8,760 hrs/year x 0.9988 = 8,749.5 hrs
Conversely, annual DOWN time is:
8,760 hrs/year x (1 - 0.9988) = 10.5 hrs
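The arithmetic above can be sketched in a few lines; this is a minimal illustration of the A = MTBF / (MTBF + MTTR) formula using the slide's example figures:

```python
# Availability from MTBF and MTTR, using the example figures above.
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

a = availability(10_000, 12)
annual_uptime = 8_760 * a          # hours of uptime per year
annual_downtime = 8_760 * (1 - a)  # hours of downtime per year

print(f"A = {a:.4%}")                            # ~99.88%
print(f"uptime   = {annual_uptime:.1f} h/yr")    # ~8749.5
print(f"downtime = {annual_downtime:.1f} h/yr")  # ~10.5
```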
Networks Consist of Series-Parallel
Combinations of in-series and redundant components
[RBD (Reliability Block Diagram) figure: component A in series with a redundant pair B1/B2 (1-of-2), C, a redundant group D1/D2/D3 (2-of-3), then E and F]
More Complex Redundancy
Pure active parallel
All components are on
Standby redundant
Backup components are not operating
Perfect switching
Switch-over is immediate and without fail
Switch-over reliability
The probability of switch-over when it is not perfect
Load sharing
All units are on and workload is distributed
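The series-parallel combinations and active-parallel redundancy described above can be combined numerically. A minimal sketch, with made-up per-component availabilities, mirroring the RBD structure on the previous slide:

```python
# Combining per-component availabilities for series, pure active parallel,
# and n-of-m redundant structures. The 0.99/0.999 figures are illustrative.
from math import comb

def series(*avails):
    """All components must be up."""
    p = 1.0
    for a in avails:
        p *= a
    return p

def parallel(*avails):
    """Pure active parallel: up unless every component is down."""
    p = 1.0
    for a in avails:
        p *= (1 - a)
    return 1 - p

def n_of_m(a, n, m):
    """At least n of m identical components up (e.g. a 2-of-3 group)."""
    return sum(comb(m, k) * a**k * (1 - a)**(m - k) for k in range(n, m + 1))

# A in series with a 1-of-2 pair (B1/B2) and a 2-of-3 group (D1/D2/D3)
a_total = series(0.999, parallel(0.99, 0.99), n_of_m(0.99, 2, 3))
print(f"{a_total:.6f}")
```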
MEASURING THE PRODUCTION NETWORK
Reliability or Engineered Availability vs. Measured Availability
1. Reliability is an engineered probability of the network being available
2. Measured availability is the actual outcome produced by physically measuring the engineered system over time
Calculations Are Similar; Both Are Based on MTBF and MTTR
Some Types of Availability Metrics
Mean Time to Repair (MTTR)
Impacted User Minutes (IUM)
Defects per Million (DPM)
MTBF (Mean Time Between Failure)
Performance (e.g. latency, drops)
Back to How Availability Is Calculated
Availability (%) is calculated by tabulating end-user outage time, typically on a monthly basis
Some customers prefer to use DPM (Defects per Million) to represent network availability
Availability (%) = (Total User Time - Total User Outage Time) x 100 / Total User Time
DPM = Total User Outage Time x 10^6 / Total User Time
Total User Time = Total # of End Users x Time in Reporting Period
Total User Outage Time = Sum of (# of End Users x Outage Time) over All the Incidents in the Reporting Period
Ports or Connections May Be Substituted for End Users
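The two calculations above can be sketched as follows; the user counts and incident list are hypothetical:

```python
# Availability (%) and DPM from tabulated end-user outage time.
def total_user_time(num_users, period_minutes):
    return num_users * period_minutes

def availability_pct(tut, outage_user_minutes):
    return (tut - outage_user_minutes) / tut * 100

def dpm(tut, outage_user_minutes):
    return outage_user_minutes / tut * 1e6

# 1,000 users over a 30-day month; incidents as (users impacted, minutes down)
tut = total_user_time(1_000, 30 * 24 * 60)
outage = sum(users * minutes for users, minutes in [(50, 120), (200, 15)])
print(f"availability = {availability_pct(tut, outage):.4f}%")
print(f"DPM = {dpm(tut, outage):.1f}")
```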
Defects per Million
Started with mass-produced items like toasters
For PVCs:
DPM = Sum of (#conns x outage minutes) x 10^6 / Sum of (#conns x total minutes)
For SVCs or phone calls:
DPM = (#existing calls lost + #new calls blocked) x 10^6 / total calls attempted
For connectionless traffic (application dependent):
DPM = Sum of (#end users x outage minutes) x 10^6 / Sum of (#end users x total minutes)
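The per-service DPM variants above can be expressed as small helpers; the connection and call counts below are hypothetical:

```python
# DPM variants for connection-oriented (PVC) and call-oriented (SVC) services.
def dpm_pvc(conn_outage_minutes, conn_total_minutes):
    # connection-minutes lost vs. connection-minutes offered
    return conn_outage_minutes / conn_total_minutes * 1e6

def dpm_svc(calls_lost, calls_blocked, calls_attempted):
    # existing calls dropped + new calls blocked vs. call attempts
    return (calls_lost + calls_blocked) / calls_attempted * 1e6

print(dpm_pvc(100 * 30, 100 * 43_200))  # 100 conns, 30 outage min each, 30-day month
print(dpm_svc(12, 38, 1_000_000))
```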
NETWORK AVAILABILITY COLLECTION METHODS
TROUBLE TICKETING METHODS
Availability Improvement Process
Step I
Validate data collection/calculation methodology
Establish network availability baseline
Set high availability goals
Step II
Measure uptime ongoing
Track defects per million (DPM) or IUM or availability (%)
Step III
Track customer impact for each ticket/MTTR
Categorize DPM by reason code and begin trending
Identify initiatives/areas for a focus to eliminate defects
Data Collection/Analysis Process
Understand current data collection methodology
Customer internal ticket database
Manual
Monthly collection of network performance data and export of the following fields to a spreadsheet or database system:
Outage start time (date/time)
Service restore time (date/time)
Problem description
Root cause
Resolution
Number of customers impacted
Equipment model
Component/part
Planned maintenance activity/unplanned activity
Total customers/ports on network
Network Availability Results
Methodology and assumptions must be documented
Network availability should include:
Overall % network availability (baseline/trending)
Conversion of downtime to DPM by:
Planned and unplanned
Root cause
Resolution
Equipment type
Overall MTTR
MTTR by:
Root cause
Resolution
Equipment type
Results are not necessarily limited to the above but should be customized based on your network and requirements
Availability Metrics: Reviewed
Network has 100 customers
Time in reporting period is one year, or 24 hours x 365 days
8 customers have 24 hours down time per year
Availability = 1 - (8 x 24) / (100 x 24 x 365) = 0.999781
DPM = (8 x 24) x 10^6 / (100 x 24 x 365) = 219.2 failures for every 1 million user hours
MTBF = (24 x 365) / 8 = 1095 (hours)
MTTR = 1095 x (1 - 0.999781) / 0.999781 = 0.24 (hours)
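Recomputing the review example from its stated inputs (100 customers, one year, 8 customers down 24 hours each) reproduces the DPM, MTBF, and MTTR figures and gives an availability of 0.999781:

```python
# Worked availability-metrics review: 100 users, 1 year, 8 users down 24 h.
users, hours_per_year = 100, 24 * 365      # 8,760 h per user
down_users, down_hours = 8, 24

tut = users * hours_per_year               # total user hours
out = down_users * down_hours              # total user outage hours

availability = 1 - out / tut               # ~0.999781
dpm = out / tut * 1e6                      # ~219.2
mtbf = hours_per_year / down_users         # 1,095 h
mttr = mtbf * (1 - availability) / availability  # ~0.24 h

print(availability, dpm, mtbf, round(mttr, 2))
```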
TROUBLE TICKETING METHOD
SAMPLE OUTPUT
Network Availability
[Illustrative chart: Overall Network Availability (Planned/Unplanned) by month, July through June; y-axis from 99.50% to 100.00%]
Key takeaways
Platform Related DPM Comparison (Illustrative)
Platform related DPM contributed 13% of total DPM in September
Platform DPM includes events from: Backbone, NAS, PG, POP, Radius Server, VPN Radius Server
All other events are included in the Other category
Breakdown of Platform Related DPM
Network Access Server (NAS) accounts for 50% of the total Platform related DPM in September
Private Access Gateway (PG) showing significant decrease over the past 3 months

Platform related DPM by source:
                          June    July    Aug     Sept
Backbone                   1.5      .8    15.7     2.3
NAS                       21.7    19.4    27      26.1
PG                        26      59.6    56.8    18.9
POP                        0       3.9      .5     1.6
Radius Server              0       0       1.2      .3
VPN Radius                 0       8.8     2.8     3.4
Total Platform Related    49.2    82.5   104      52.6

Total DPM breakdown:
                          June    July    Aug     Sept    Oct    Nov    Dec
Other                    339.5   424.9   394.7   362.2
Platform Related          49.2    82.5   104      52.6
Total DPM                388.7   507.4   498.7   414.8
99.99% Target              --      --      --     100     100    100    100
DPM by Cause (Illustrative)
[Illustrative chart: DPM by cause, December through May; y-axis 0 to 2500]
                 Dec      Jan      Feb      Mar      Apr      May
Config/SW       406      101.6    117.5     20.2    106.6    201
HW              884.3    512.7    553.6    474.3    422.5    240
Other           145.7    136.2    212.4     37      314.2    604.4
Power           566.1     31.4     11.1     89.7     19       14.8
Environmental    36.1     68.8     18.4      0      133.4    127
Human Error      18.2     23.6      8.9     87.7    106      115.2
Unknown           --       --       --       --      80       95.2
TOTAL          3789.3   1202.2   1226     1293.1   1641.9   1964.8
MTTR Analysis: Hardware Faults
Number of faults increased slightly in September; however, MTTR decreased
49% of faults resolved in < 1 hour in September
11% of faults resolved in > 24 hours, with an additional 3% > 100 hours
Produce for each fault type (Router HW shown)
[Illustrative charts, June through December: monthly MTTR in hours (12.42, 15.1, 8.49, 7.19, ...) and number of faults by resolution-time bucket: < 1 Hr, 1-4 Hr, 4-12 Hr, 12-24 Hr, > 24 Hr, > 100 Hr]
Trouble Ticketing Method
Pros
Easy to get started
No network overhead
Outages can be categorized based on event
Cons
Some internal subjective/consistency process issues
Outages may occur that are not included in the trouble ticketing systems
Resources needed to scrub data and create reports
May not work with existing trouble ticketing system/process
Network Availability Collection Methods
AUTOMATED FAULT MANAGEMENT EVENTS METHOD
Availability Improvement Process
Step I
Determine availability goals
Validate fault management data collection
Determine a calculation methodology
Build software package to use customer event log
Step II
Establish network availability baseline
Measure uptime on an ongoing basis
Step III
Track root cause and customer impact
Begin trending of availability issues
Identify initiatives and areas of focus to eliminate defects
Event Log Example
Fri Jun 15 11:05:31 2001 Debug: Looking for message header ...
Fri Jun 15 11:05:33 2001 Debug: Message header is okay
Fri Jun 15 11:05:33 2001 Debug: $(LDT) -> "06152001110532"
Fri Jun 15 11:05:33 2001 Debug: $(MesgID) -> "100013"
Fri Jun 15 11:05:33 2001 Debug: $(NodeName) -> "ixc00asm"
Fri Jun 15 11:05:33 2001 Debug: $(IPAddr) -> "10.25.0.235"
Fri Jun 15 11:05:33 2001 Debug: $(ROCom) -> "xlr8ed!"
Fri Jun 15 11:05:33 2001 Debug: $(RWCom) -> "s39o!d%"
Fri Jun 15 11:05:33 2001 Debug: $(NPG) -> "CISCO-Large-special"
Fri Jun 15 11:05:33 2001 Debug: $(AlrmDN) -> "aSnmpStatus"
Fri Jun 15 11:05:33 2001 Debug: $(AlrmProp) -> "system"
Fri Jun 15 11:05:33 2001 Debug: $(OSN) -> "Testing"
Fri Jun 15 11:05:33 2001 Debug: $(OSS) -> "Normal"
Fri Jun 15 11:05:33 2001 Debug: $(DSN) -> "SNMP_Down"
Fri Jun 15 11:05:33 2001 Debug: $(DSS) -> "Agent_Down"
Fri Jun 15 11:05:33 2001 Debug: $(TrigName) -> "NodeStateUp"
Fri Jun 15 11:05:33 2001 Debug: $(BON) -> "nl-ping"
Fri Jun 15 11:05:33 2001 Debug: $(TrapGN) -> "-2"
Fri Jun 15 11:05:33 2001 Debug: $(TrapSN) -> "-2"
Event Log
Analysis of events received from the network devices
Analysis of accuracy of the data
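A sketch of how the $(VAR) -> "value" pairs in an event log like the one above could be extracted for analysis; the parsing approach and field names are illustrative, not the tool's actual implementation:

```python
# Pull $(VAR) -> "value" fields out of event-log debug lines.
import re

LINE = re.compile(r'\$\((?P<var>\w+)\)\s*->\s*"(?P<val>[^"]*)"')

def parse_event(lines):
    """Return a dict of variable -> value for one event's log lines."""
    fields = {}
    for line in lines:
        m = LINE.search(line)
        if m:
            fields[m.group("var")] = m.group("val")
    return fields

sample = [
    'Fri Jun 15 11:05:33 2001 Debug: $(NodeName) -> "ixc00asm"',
    'Fri Jun 15 11:05:33 2001 Debug: $(IPAddr) -> "10.25.0.235"',
    'Fri Jun 15 11:05:33 2001 Debug: $(TrigName) -> "NodeStateUp"',
]
print(parse_event(sample))
```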
Calculation Methodology: Example
Primary events are device down/up
Down time is calculated based on device-type outage duration
Availability is calculated based on the total number of device types, the total time, and the total down time
MTTR numbers are calculated from average duration of downtime
With MTTR, the shortest and longest outage provide a simplified curve
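The methodology above (pair down/up events, then derive per-device-type availability and MTTR) can be sketched as follows; the event list and device names are hypothetical:

```python
# Pair device down/up events and summarize availability and MTTR per type.
from collections import defaultdict

def summarize(events, device_types, period_hours):
    """events: list of (device, 'down'|'up', timestamp_hours)."""
    down_at, outages = {}, defaultdict(list)
    for device, state, t in sorted(events, key=lambda e: e[2]):
        if state == "down":
            down_at[device] = t
        elif device in down_at:
            outages[device_types[device]].append(t - down_at.pop(device))
    summary = {}
    for dtype, durations in outages.items():
        n_devices = sum(1 for d in device_types.values() if d == dtype)
        total_time = n_devices * period_hours
        down = sum(durations)
        summary[dtype] = {
            "availability": 1 - down / total_time,  # per device type
            "mttr_hours": down / len(durations),    # average outage duration
        }
    return summary

types = {"r1": "router", "r2": "router"}
evts = [("r1", "down", 10.0), ("r1", "up", 12.0),
        ("r2", "down", 50.0), ("r2", "up", 50.5)]
print(summarize(evts, types, period_hours=720))
```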
Automated Fault Management Methodology
Pros
Outage duration and scope can be fairly accurate
Can be implemented within an NMS fault management system
No additional network overhead
Cons
Requires an excellent change management/provisioning process
Requires an efficient and effective fault management system
Requires custom development
Does not account for routing problems
Not a true end-to-end measure
NETWORK AVAILABILITY DATA COLLECTION
SAMPLE OUTPUT
Automated Fault Management: Example Reports

Device Type      # of      Count of    Total Down Time   % Down    % Up       Shortest    Mean Time   Longest     Events per
                 Devices   Incidents   hhh:mm:ss                              Outage      to Repair   Outage      Device
Host Totals       2389       801       202:27:27         .0673%    99.9327%   0:00:19     0:20:47     27:48:46    24.4
Network Totals    4732      1673       430:02:03         .1309%    99.8691%   0:00:24     0:22:36     09:49:35    14.9
Other Totals       897       173       212:29:46         .0509%    99.9491%   0:00:17     0:26:07     42:16:10    16.8
GRAND TOTAL       8018      2647       844:59:16         .0830%    99.9170%   0:00:20     0:23:10     26:38:11    18.7
Automated Fault Management: Example Reports (2)
[Pie charts:]
Number of Managed Devices: Network Totals 59%, Host Totals 30%, Other Totals 11%
Count of Incidents: Network Totals 63%, Host Totals 30%, Other Totals 7%
Total Down Time: Network Totals 51%, Host Totals 24%, Other Totals 25%
Network Availability Collection Methods
ICMP ECHO (PING) AND SNMP AS DATA GATHERING TECHNIQUES
Data Gathering Techniques
ICMP ping
Link and device polling (SNMP)
Embedded RMON
Embedded event management
Syslog messages
COOL
Data Gathering Techniques: ICMP Reachability
Method definition:
Central workstation or computer configured to send ping packets to the network edges (devices or ports) to determine reachability
How:
Edge interfaces and/or devices are defined and pinged on a determined interval
Unavailability:
Pre-defined, non-response from the interface
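The method above reduces to counting answered vs. unanswered polls per device. A minimal sketch with simulated poll results (a real collector would issue ICMP echoes on the configured interval instead of reading a canned list):

```python
# Availability from periodic reachability polls: non-response counts as down.
def availability_from_polls(poll_results):
    """poll_results: list of booleans, True = device answered the ping."""
    if not poll_results:
        return None
    return sum(poll_results) / len(poll_results)

# One device polled every 5 minutes for an hour; the last two polls missed.
polls = [True] * 10 + [False, False]
a = availability_from_polls(polls)
print(f"{a:.4f}")  # granularity is limited by the 5-minute poll interval
```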
Availability Measurement Through ICMP
Periodic ICMP Test
Periodic Pings to Network Devices; Periodic Pings to Network Leaf Nodes
Data Gathering Techniques: ICMP Reachability
Pros
Fairly accurate network availability
Accounts for routing problems
Can be implemented for fairly low network overhead
Cons
Point-to-multipoint implies not a true end-to-end measure
Availability granularity limited by ping frequency
Maintenance of device database: must have a solid change management and provisioning process
Data Gathering Techniques: Link and Device Status
Method definition:
SNMP polling and trapping on links, edge ports, or edge devices
How:
An agent is configured to SNMP poll and tabulate outage times for defined devices or links; a database maintains outage times and total service time; sometimes trap information is used to augment this method by providing more accurate information on outages
Unavailability:
Pre-defined, non-redundant links, ports, or devices that are down
Polling Interval vs. Sample Size
Polling interval is the rate at which data is collected from the network
Polling interval = 1 / Sampling Rate
The smaller the polling interval, the more detailed (granular) the data collected
Example: polling data once every 15 minutes provides 4 times the detail (granularity) of polling once an hour
A smaller polling interval does not necessarily provide a better margin of error
Example: polling once every 15 minutes for one hour has the same margin of error as polling once an hour for 4 hours
Link and Device Status Method
Method definition:
SNMP polling and trapping on links, edge ports, or edge devices
How:
Utilizing existing NMS systems that are currently SNMP polling to tabulate outage times for defined devices or links
A database maintains outage times and total service time
SNMP trap information is also used to augment this method by providing more accurate information on outages
Link and Device Status Method
Pros
Outage duration and scope can be fairly accurate
Utilizes existing NMS systems
Low network overhead
Cons
No canned SW to do this; custom development
Maintaining element device database is challenging
Requires an excellent change management and provisioning process
Does not account for routing problems
Not a true end-to-end measure
CISCO SERVICE ASSURANCE AGENT (SA AGENT)
Service Assurance Agent
Method definition:
SA Agent is an embedded feature of Cisco IOS Software and requires configuration of the feature on routers within the customer network; use of the SA Agent can provide for a rapid, cost-effective deployment without additional hardware probes
How:
A data collector creates SA Agents on the routers to monitor certain network/service performances; the data collector then collects this data from the routers, aggregates it, and makes it available
Unavailability:
Pre-defined paths with reporting on non-redundant links, ports, or devices that are down within a path
Case Study: Financial Institution (Collection)
SA Agent Collectors
Remote Sites
DNS
InternetWeb Sites
Availability Using Network-Based Probes
Availability = 1 - (Probes with No Response / Total Probes Sent)
DPM = (Probes with No Response x 10^6) / Total Probes Sent
DPM equations used with network-based probes as input data
Probes can be:
Simple ICMP Ping probe, modified Ping to test specific applications, Cisco IOS SA Agent
DPM will be for connectivity between 2 points on the network, the source and destination of the probe
Source of probe is usually a management system, and the destinations are the devices managed
Can calculate DPM for every device managed
Availability Using Network-Based Probes: Example
Network probe is a ping
10000 probes are sent between management system and managed device
1 probe failed to respond
DPM = (1 x 10^6) / 10000 = 100 probes out of 1 million will fail
Availability = 1 - (1 / 10000) = 0.9999
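The probe example above, recomputed:

```python
# Probe-based availability and DPM: 1 failed probe out of 10,000 sent.
failed, sent = 1, 10_000
availability = 1 - failed / sent
dpm = failed / sent * 1e6
print(availability, dpm)  # ~0.9999 and ~100 DPM
```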
Sample Size
Sample size is the number of samples that have been collected
The more samples collected, the higher the confidence that the data accurately represents the network
Confidence (margin of error) is defined by:
m = 1 / sqrt(sample size)
Example: data is collected from the network every 1 hour
After one day: m = 1 / sqrt(24) = 0.2041
After one month: m = 1 / sqrt(744) = 0.0367
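The margin-of-error figures above can be recomputed directly (assuming hourly collection: 24 samples after one day, 744 after a 31-day month):

```python
# Margin of error m = 1 / sqrt(sample size).
from math import sqrt

def margin_of_error(sample_size):
    return 1 / sqrt(sample_size)

print(round(margin_of_error(24), 4))       # after one day   -> ~0.2041
print(round(margin_of_error(24 * 31), 4))  # after one month -> ~0.0367
```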
Service Assurance Agent
Pros
Accurate network availability for defined paths
Accounts for routing problems
Implementation with very low network overhead
Cons
Requires a system to collect the SAA data
Requires implementation in the router configurations
Availability granularity limited by polling frequency
Definition of the critical network paths to be measured
COMPONENT OUTAGE ONLINE MEASUREMENT (COOL)
COOL Objectives
To automate the measurement to increase operational efficiency and reduce operational cost
To measure the outage as close to the source of outage events as possible, to pinpoint the cause of the outages
To cope with a large number of network elements without causing system and network performance degradation
To maintain measurement data reliably in the presence of element failure or network partition
To support simplicity in deployment, configuration, and data collection (autonomous measurement)
COOL Features
COOL embedded in router
Automated real-time measurement; autonomous measurement
Outage data stored in router
Open access via Outage Monitor MIB
Event notification filtering
[Figure: access router running COOL with customer equipment attached; NMS, NetTools, 3rd-party tools, C-NOTE, and PNL access the outage data via the Outage Monitor MIB]
COOL Features (Cont.)
Support NMS or tools for such applications as:
Calculation of software or hardware MTBF, MTTR, availability per object, device, or network
Verification of customers' SLAs
Troubleshooting in real time
Two-tier framework
Reduces performance impact on the router
Provides scalability to the NMS
Makes it easy to deploy
Provides flexibility in availability calculation
[Figure: two-tier framework; outage monitoring and measurement run in the access and core routers (COOL, Outage Monitor MIB), while outage correlation and calculation run in the NMS]
Outage Model
[Figure: access router with RPs, power, fan, physical and logical interfaces; links to customer equipment (MUX/hub/switch) and a peer router; monitored by the network management system]
Type   Objects Monitored         Failure Modes
A      Physical entity objects   Component hardware or software failure, including the failure of line cards, power supplies, fans, switch fabric, and so on
B      Interface objects         Interface hardware or software failure, loss of signal
C      Remote objects            Failure of remote device (customer equipment or peer networking device) or link in between
D      Software objects          Failure of software processes running on the RPs and line cards
Outage Characterization
Data definition:
Defect threshold: a value across which the object is considered to be defective (service degradation or complete outage)
Duration threshold: the minimum period beyond which an outage needs to be reported (given SLA)
Start time: when the object outage starts
End time: when the outage ends
[Figure: timeline showing a down event crossing the defect threshold at the start time and an up event at the end time; the outage duration between them is reported only if it exceeds the duration threshold]
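The duration-threshold rule above can be sketched as a filter over down intervals; the event tuples and threshold value are hypothetical:

```python
# Report only down intervals that exceed the duration threshold.
def record_outages(events, duration_threshold):
    """events: list of (start_time, end_time) down intervals, in seconds."""
    return [
        {"start": s, "end": e, "duration": e - s}
        for s, e in events
        if e - s >= duration_threshold
    ]

downs = [(100, 103), (500, 560), (900, 1800)]
print(record_outages(downs, duration_threshold=30))  # drops the 3 s blip
```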
Architecture
Outage Manager
Internal Component Outage Detector: callbacks and syslog from the IOS fault manager (event source)
Remote Component Outage Detector: customer equipment detection function using ping and SAA APIs
Data table structure: outage component table, event history table, event map table, process map table, remote component map table
HA and persistent data store: time stamp, temp event data, crash reason, and outage data kept in NVRAM/ATA Flash
Outage Monitor MIB: SNMP polling and SNMP notification
Configuration: CLI with customer authentication; baseline plus optional CPU usage detection
[Figure sections: measurement metrics, customer interfaces, measurement methods]
Outage Data: AOT and NAF
Requirements of measurement metrics:
Enable calculation of MTTR, MTBF, availability, and SLA assessment
Ensure measurement efficiency in terms of resources (CPU, memory, and network bandwidth)
Measurement metrics per object:
AOT: Accumulated Outage Time since measurement started
NAF: Number of Accumulated Failures since measurement started
Example: Router 1 suffers two 10-minute system crashes, giving AOT = 20 and NAF = 2.
[Diagram: up/down timeline for Router 1 with two 10-minute outages.]
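The AOT and NAF counters can be sketched as a simple fold over a list of outage intervals. The data model here (a list of start/end pairs) is illustrative, not the actual COOL implementation:

```python
# Sketch: accumulate AOT (total outage time) and NAF (failure count)
# from a list of (down_time, up_time) outage intervals, in minutes.
# Hypothetical data model, not the actual COOL implementation.

def aot_naf(outages):
    """Return (AOT, NAF) for a list of (start, end) outage intervals."""
    aot = sum(end - start for start, end in outages)
    naf = len(outages)
    return aot, naf

# Two 10-minute system crashes, as in the Router 1 example:
print(aot_naf([(100, 110), (200, 210)]))  # (20, 2)
```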
Outage Data: AOT and NAF
Object containment model: Router Device contains Line Card, which contains Physical Interface, which contains Logical Interface
Containment-independent property:
Router Device: AOT = 20, NAF = 2 (two 10-minute system crashes)
Interface 1: AOT = 7, NAF = 1 (one 7-minute interface failure)
Service affecting (combined): AOT = 27, NAF = 3
[Diagram: up/down timelines for Router 1 (two 10-minute crashes) and Interface 1 (one 7-minute failure).]
Example: MTTR
Find MTTR for Object i
MTTRi = AOTi/NAFi
= 14/2
= 7 min
[Diagram: Object i over measurement interval (T2 - T1) with two failures; TTRs of 10 min and 4 min, so AOT = 14 and NAF = 2.]
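The MTTR formula above is a one-liner; a minimal sketch:

```python
# MTTR: Mean Time to Repair = accumulated outage time / number of failures.

def mttr(aot, naf):
    """Mean Time to Repair for one object, given its AOT and NAF."""
    return aot / naf

# Object i: TTRs of 10 min and 4 min -> AOT = 14, NAF = 2
print(mttr(14, 2))  # 7.0 (minutes)
```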
Example: MTBF and MTTF
Find MTBF and MTTF for Object i
MTBFi = (T2 - T1)/NAFi = 1,400,000/2 = 700,000 min
MTTFi = MTBFi - MTTRi = (T2 - T1 - AOTi)/NAFi = 700,000 - 7 = 699,993 min
[Diagram: Object i over measurement interval (T2 - T1) = 1,400,000 min with two failures; TTR, TTF, and TBF spans marked.]
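The MTBF and MTTF formulas can be sketched directly from the definitions above:

```python
# MTBF and MTTF per object over a measurement interval [t1, t2].

def mtbf(t1, t2, naf):
    """Mean Time Between Failures: interval length / number of failures."""
    return (t2 - t1) / naf

def mttf(t1, t2, aot, naf):
    """Mean Time To Failure: MTBF minus MTTR = (interval - AOT) / NAF."""
    return (t2 - t1 - aot) / naf

# Object i: interval of 1,400,000 min, AOT = 14, NAF = 2
print(mtbf(0, 1_400_000, 2))      # 700000.0
print(mttf(0, 1_400_000, 14, 2))  # 699993.0
```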
Example: Availability and DPM
Find availability and DPM for Object i
Availability (%) = MTBF/(MTBF + MTTR) * 100
Availability = (700,000/700,007) * 100 = 99.999%
DPMi = [AOTi/(T2 - T1)] x 10^6 = 10 DPM
[Diagram: Object i over a measurement interval of 1,400,000 min with two failures of 10 min and 4 min.]
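Both formulas above are easily checked in code:

```python
# Availability (%) and Defects Per Million from the slide's formulas.

def availability_pct(mtbf, mttr):
    """Availability as a percentage: MTBF / (MTBF + MTTR) * 100."""
    return mtbf / (mtbf + mttr) * 100

def dpm(aot, interval):
    """Defects Per Million: outage fraction of the interval scaled to 10**6."""
    return aot / interval * 1_000_000

# Object i: MTBF = 700,000 min, MTTR = 7 min, AOT = 14 min over 1,400,000 min
print(round(availability_pct(700_000, 7), 3))  # 99.999
print(round(dpm(14, 1_400_000), 6))            # 10.0
```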
Planned Outage Measurement
Captures operational CLI commands: both reload and forced switchover
A simple rule derives an upper bound of the planned outage:
If there is no NVRAM soft-crash file, check the reboot reason or switchover reason
If it is reload or forced switchover, the outage can be counted toward an upper bound of the planned outage
[Diagram: operation-caused outages (send break, reload, forced switchover) contain the planned outages (reload, forced switchover), which form the upper bound of the planned outage.]
Event Filtering
Flapping interface detection and filtering:
A faulty interface can keep changing state between up and down
May cause virtual network disconnection
May cause an event storm, with hundreds of messages for each flapping episode
May make the object MTBF unreasonably low due to frequent short failures
This unstable condition needs to get the operator's attention
COOL detects the flapping status by:
Catching very short outage events (less than the duration threshold)
Incrementing the event counter
Flapping status: if the counter exceeds the flapping threshold (3 events) within a short period (1 sec), a notification is sent
Stable status: if it falls back below the threshold, another notification is sent
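The flapping heuristic above can be sketched as a sliding-window count over event timestamps. The threshold names and data model are illustrative, not the actual COOL implementation:

```python
# Sketch of flapping detection: flag an object as flapping when more than
# FLAP_THRESHOLD short outage events fall inside a WINDOW-second span.
# Threshold values follow the slide (3 events, 1 sec) but the code is
# an illustrative sketch, not the actual COOL implementation.

FLAP_THRESHOLD = 3   # event counter threshold
WINDOW = 1.0         # seconds

def is_flapping(event_times):
    """Return True if any WINDOW-second span holds > FLAP_THRESHOLD events."""
    events = sorted(event_times)
    start = 0
    for end in range(len(events)):
        while events[end] - events[start] > WINDOW:
            start += 1
        if end - start + 1 > FLAP_THRESHOLD:
            return True
    return False

print(is_flapping([0.0, 0.2, 0.5, 0.9]))  # True: 4 events within 1 sec
print(is_flapping([0.0, 2.0, 4.0]))       # False: events are well spaced
```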
Data Persistency and Redundancy
Data persistency
To avoid data loss due to a link outage or a crash of the router itself
Data redundancy
To continue the outage measurement after the switchover
To retain the outage data even if the RP is physically replaced
[Diagram: on each RP (active and standby), COOL keeps outage data in RAM, with event-driven updates to NVRAM and periodic updates to persistent outage data in Flash; data is copied between the active and standby RPs.]
Outage Monitor MIB
CISCO-OUTAGE-MONITOR-MIB: iso.org.dod.internet.private.enterprise.cisco.ciscoMgmt.ciscoOutageMIB (1.3.6.1.4.1.9.9.280)
cOutageObjectTable: Object-Type; Object-Index; Object-Status; Object-AOT; Object-NAF
cOutageHistoryTable: Object-Type; Object-Index; Event-Reason-Index; Event-Time; Event-Interval
Event Reason Map Table (event description)
Remote Object Map Table (remote object description)
Maps to other MIBs: ifTable in IF-MIB (interface object description), entPhysicalTable in ENTITY-MIB (physical entity object description), cpmProcessTable in CISCO-PROCESS-MIB (process object description)
Configuration
Config CLI: run; add; removal; filtering-enable
Show CLI: show event-table, show object-table (MIB display of the Event Table and Object Table)
COOL updates the Object Table and Event Table through Cisco IOS configuration; the customer equipment detection function feeds updates as well
Enabling COOL
Obtain the authorization file, then enable COOL:

ari# dir
Directory of disk0:/
1 -rw- 19014056 Oct 29 2003 16:09:28 +00:00 gsr-k4p-mz.120-26.S.bin
128057344 bytes total (109051904 bytes free)
ari# copy tftp disk0:
Address or name of remote host []? 88.1.88.9
Source filename []? auth_file
Destination filename [auth_file]?
Accessing tftp://88.1.88.9/auth_file...
Loading auth_file from 88.1.88.9 (via FastEthernet1/2): !
[OK - 705 bytes]
705 bytes copied in 0.532 secs (1325 bytes/sec)
ari# clear cool persist-files
ari# conf t
Enter configuration commands, one per line. End with CNTL/Z.
ari(config)# cool run
ari(config)# ^Z
ari# wr mem
Building configuration...
[OK]
COOL
Pros
Accurate network availability for devices, components, and software
Accounts for routing problems
Implementation with low network overhead.
Enables correlation between active and passive availabilitymethodologies
Cons
Only a few systems currently have the COOL feature
Requires implementation in the router configurations of production devices
Availability granularity limited by polling frequency
New Cisco IOS Feature
Network Availability Collection Methods
APPLICATION LAYER MEASUREMENT
Application Reachability
Similar to ICMP Reachability
Method definition:
Central workstation or computer configured to send packets that mimic application packets
How:
Agents on client and server computers send, receive, and collect data
Examples: Fire Runner, Ganymede Chariot, Gyra Research, Response Networks, Vital Signs Software, NetScout, custom application queries on customer systems
Installing special probes on user and server subnets to send, receive, and collect data; NikSun and NetScout
Unavailability:
Pre-defined QoS definition
Application Reachability
Pros
Actual application availability can be understood
QoS, by application, can be factored into the availability measurement
Cons
Depending on scale, potentially high overhead and cost can be expected
DATA COLLECTION FOR ROOT CAUSE ANALYSIS (RCA) OF NETWORK OR DEVICE DOWNTIME
Data Gathering Techniques
Alarm and event
History and statistics
Set thresholds in router configuration
Configure SNMP traps to be sent when a MIB variable rises above and/or falls below a given threshold
Alleviates need for frequent polling
Not an availability methodology by itself, but can add valuable information and customization to the data collection method
Cisco IOS Embedded RMON
Data Gathering Techniques
Provide information on what the router is doing
Categorized by feature and severity level
User can configure Syslog logging levels
User can configure Syslog messages to be sent asSNMP traps
Not an availability methodology by itself, but can add valuable information and customization to the data collection method
Syslog Messages
Expression and Event MIB
Expression MIB
Allows you to create new SNMP objects based upon formulas
MIB persistence is supported: a MIB's SNMP data persists across reloads
Delta and wildcard support allows you to:
Calculate utilization for all interfaces with one expression
Calculate errors as a percentage of traffic
Event MIB
Allows you to create custom notifications and log them and/or send them as SNMP traps or informs
MIB persistence is supported: a MIB's SNMP data persists across reloads
Can be used to test objects on other devices
More flexible than RMON events/alarms
RMON is tailored for use with counter objects
Data Gathering Techniques
Underlying philosophy:
Embed intelligence in routers and switches to enable a scalable and distributed solution, with OPEN interfaces for NMS/EMS leverage of the features
Mission statement:
Provide robust, scalable, powerful, and easy-to-use embedded managers to solve problems such as syslog and event management within Cisco routers and switches
Embedded Event Manager
Embedded Event Manager (Cont.)
Development goal: predictable, consistent, scalable management
Distributed
Independent of central management system
Control is in the customer's hands
Customization
Local programmable actions:
Triggered by specific events
Cisco IOS Embedded Event Manager: Basic Architecture (v1)
[Diagram: event detectors (syslog, SNMP, other) feed syslog events, SNMP data, and other events to the Embedded Event Manager; EEM policies combine them with network knowledge and trigger actions such as notify, switch-over, and reload.]
EEM Versions
EEM Version 1
Allows policies to be defined using the Cisco IOS CLI applet
The following policy actions can be established:
Generate prioritized syslog messages
Generate a CNS event for upstream processing by Cisco CNS devices
Reload the Cisco IOS software
Switch to a secondary processor in a fully redundant hardware configuration
EEM Version 2
EEM Version 2 adds programmable actions using the Tcl subsystem within Cisco IOS
Includes more event detectors and capabilities
EEM Version 2 Architecture
More event detectors!
Define policies or programmable local actions using Tcl
Register policies with the EEM server
Events trigger policy execution
Tcl extensions for CLI control and defined actions
[Diagram: IOS subsystems (POSIX process manager, IOS process watchdog, syslog daemon, system manager, watchdog sysmon, HA redundancy facility, timer services, counters, interface counters and stats, SNMP) publish events through event detectors to the Embedded Event Manager server; EEM policies and the Tcl shell subscribe to receive events and implement policy actions; applications publish events via an application-specific event detector.]
What Does This Mean to the Business?
Better problem determination: widely applicable scripts from Cisco engineering and TAC
Automated local action triggered by events
Automated data collection
Faster problem resolution: reduces the "next time it happens, please collect..." cycle
Better diagnostic data to Cisco engineering
Faster identification and repair
Less downtime: reduced susceptibility and Mean Time to Repair (MTTR)
Better service: responsiveness
Prevents recurrence: higher availability
Not an availability methodology by itself, but can add valuable information and customization to the data collection method
INSTILLING AN AVAILABILITY CULTURE
Putting an Availability Program into Practice
Track network availability
Identify defects
Identify root cause andimplement fix
Reduce operating expense by eliminating non-value-added work
How much does an outage cost today? How much can I save through process and product enhancements?
How Do I Start?
1. What are you using now?
a. Add or modify trouble ticketing analysis
b. Add or improve active monitoring method
2. Process: analyze the data!
a. What caused an outage?
b. Can a root cause be identified and addressed?
3. Implement improvements or fixes
4. Measure the results
5. Back to step 1: are other metrics needed?
If You Have a Network Availability Method
Use the current method and metric for improvement
Don't try to change it completely
Use incremental improvements
Develop additional methods to gather data as identified
Concentrate on understanding unavailability causes; all unavailability causes should be classified at a minimum under:
Change, SW, HW, power/facility, or link
Identify the actions to correct unavailability causes
e.g., network design, customer process change, HW MTBF improvement, etc.
Multilayer Network Design
[Diagram: core/backbone, distribution, and access layers with server farm, WAN, Internet, and PSTN building blocks; SA Agent measures between access and distribution.]
Multilayer Network Design
[Diagram: same multilayer design; SA Agent measures between servers and WAN users.]
Multilayer Network Design
[Diagram: same multilayer design; COOL on high-end core devices.]
Multilayer Network Design
[Diagram: same multilayer design; trouble ticketing methodology applied across the network.]
AVAILABILITY MEASUREMENT SUMMARY
Summary
The availability metric is governed by your business objectives
Availability measurement's primary goals are:
To provide an availability baseline (maintain)
To help identify where to improve the network
To monitor and control improvement projects
Can you identify "Where are we now?" for your network?
Do you know "Where are we going?" as network-oriented business objectives?
Do you have a plan to take you there?
Complete Your Online Session Evaluation!
WHAT: Complete an online session evaluation and your name will be entered into a daily drawing
WHY: Win fabulous prizes! Give us your feedback!
WHERE: Go to the Internet stations located throughout the Convention Center
HOW: Winners will be posted on the onsite Networkers Website; four winners per day
Recommended Reading
Performance and Fault Management
ISBN: 1-57870-180-5
High Availability Network Fundamentals
ISBN: 1-58713-017-3
Network Performance Baselining
ISBN: 1-57870-240-2
The Practical Performance Analyst
ISBN: 0-07-912946-3
Recommended Reading (Cont.)
The Visual Display of Quantitative Information
by Edward Tufte (ISBN: 0-9613921-0)
Practical Planning for Network Growth
by John Blommers (ISBN: 0-13-206111-2)
The Art of Computer Systems Performance Analysis
by Raj Jain (ISBN: 0-471-50336-3)
Implementing Global Networked Systems Management: Strategies and Solutions
by Raj Ananthanpillai (ISBN: 0-07-001601-1)
Information Systems in Organizations: Improving Business Processes
by Richard Maddison and Geoffrey Darnton (ISBN: 0-412-62530-X)
Integrated Management of Networked Systems: Concepts, Architectures, and Their Operational Application
by Hegering, Abeck, Neumair (ISBN: 1558605711)
Appendix A: Acronyms
AVG: Average
ATM: Asynchronous Transfer Mode
DPM: Defects Per Million
FCAPS: Fault, Config, Acct, Perf, Security
GE: Gigabit Ethernet
HA: High Availability
HDLC: High-Level Data Link Control
HSRP: Hot Standby Routing Protocol
IPM: Internet Performance Monitor
IUM: Impacted User Minutes
MIB: Management Information Base
MTBF: Mean Time Between Failure
MTTR: Mean Time to Repair
RME: Resource Manager Essentials
RMON: Remote Monitor
SA Agent: Service Assurance Agent
SNMP: Simple Network Management Protocol
SPF: Single Point of Failure; Shortest Path First (routing protocol)
TCP: Transmission Control Protocol
BACKUP SLIDES
ADDITIONAL RELIABILITY SLIDES
Network Design: What Is Reliability?
Reliability is often used as a general term that refers to the quality of a product
Failure Rate
MTBF (Mean Time Between Failures) or
MTTF (Mean Time to Failure)
Availability
Reliability Defined
1. The probability of survival (or no failure) for a stated length of time
2. Or, the fraction of units that will not fail in the stated length of time
A mission time must be stated
Annual reliability is the probability of survival for one year
Reliability:
Availability Defined
1. The probability that an item (or network, etc.) is operational, and ready to go, at any point in time
2. Or, the expected fraction of time it is operational; annual uptime is the amount of time (in days, hrs., min., etc.) the item is operational in a year
Example: for 98% availability, the annual uptime is 0.98 x 365 days = 357.7 days
Availability:
MTBF Defined
MTBF stands for Mean Time Between Failure
MTTF stands for Mean Time to Failure
This is the average length of time between failures (MTBF) or, to a failure (MTTF)
More technically, it is the mean time to go from an operational state to a non-operational state
MTBF is usually used for repairable systems, and MTTF is used for non-repairable systems
How Reliable Is It?
Reliability at t = MTBF:
R = e^(-MTBF/MTBF)
R = e^(-1) = 36.8%
Reliability at the MTBF is only about 37%; that is, roughly 63% of your HARDWARE fails before the MTBF!
But remember, failures are still random!
MTTR Defined
MTTR stands for Mean Time to Repair
or
MRT (Mean Restore Time)
This is the average length of time it takes to repair an item
More technically, it is the mean time to go from a non-operational state to an operational state
One Method of Calculating Availability
Availability = MTBF / (MTBF + MTTR)
What is the availability of a computer with MTBF = 10,000 hrs. and MTTR = 12 hrs.?
A = 10000 / (10000 + 12) = 99.88%
Uptime
Annual uptime
8,760 hrs/year x 0.9988 = 8,749.5 hrs
Conversely, annual DOWNtime is
8,760 hrs/year x (1 - 0.9988) = 10.5 hrs
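The availability, uptime, and downtime arithmetic above can be checked with a few lines:

```python
# Availability from MTBF/MTTR, and annual uptime/downtime from availability.

HOURS_PER_YEAR = 8760

def availability(mtbf, mttr):
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

a = availability(10_000, 12)               # computer: MTBF 10,000 h, MTTR 12 h
print(round(a * 100, 2))                   # 99.88 (percent)
print(round(HOURS_PER_YEAR * a, 1))        # 8749.5 annual uptime hours
print(round(HOURS_PER_YEAR * (1 - a), 1))  # 10.5 annual downtime hours
```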
Systems (RBD: reliability block diagram)
Components in-series: Component 1 followed by Component 2
Components in-parallel (redundant): Component 1 alongside Component 2
In-Series
[Diagram: up/down timelines for Part 1 and Part 2 in series; the system is down whenever either part is down.]
In-Parallel
[Diagram: up/down timelines for Part 1 and Part 2 in parallel; the system is down only when both parts are down at the same time.]
In-Series MTBF
Component 1: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component 2: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component failure rate = 1/2500 = 0.0004
System failure rate = 0.0004 + 0.0004 = 0.0008
System MTBF = 1/0.0008 = 1,250 hrs.
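The in-series MTBF calculation (sum the failure rates, then invert) can be sketched as:

```python
# In-series system MTBF: failure rates add, so the system MTBF is the
# reciprocal of the sum of the component failure rates.

def series_mtbf(mtbfs):
    """System MTBF of components in series, given their individual MTBFs."""
    return 1 / sum(1 / m for m in mtbfs)

# Two components, each with MTBF = 2,500 hrs:
print(series_mtbf([2500, 2500]))  # ~1250 hrs
```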
In-Series Reliability
Component 1: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component 2: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component ANNUAL reliability: R = e^(-8760/2500) = 0.03
System ANNUAL reliability: R = 0.03 x 0.03 = 0.0009
In-Series Availability
Component 1: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component 2: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component availability: A = 2500 / (2500 + 10) = 0.996
System availability: A = 0.996 x 0.996 = 0.992
In-Parallel MTBF
Component 1: MTBF = 2,500 hrs.
Component 2: MTBF = 2,500 hrs.
System MTBF* = 2500 + 2500/2 = 3,750 hrs.
In general*, system MTBF = sum for i = 1 to n of MTBF/i
*For 1-of-n redundancy of n identical components with NO repair or replacement of failed components
1-of-4 Example
In general*, system MTBF = sum for i = 1 to n of MTBF/i
Here: 2500/1 + 2500/2 + 2500/3 + 2500/4 = 5,208 hrs.
*For 1-of-n redundancy of n identical components with NO repair or replacement of failed components
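The 1-of-n summation formula above is easy to check in code:

```python
# 1-of-n parallel MTBF (n identical components, no repair or replacement):
# system MTBF = sum over i = 1..n of MTBF/i.

def parallel_mtbf(mtbf, n):
    """System MTBF for 1-of-n redundancy of n identical components."""
    return sum(mtbf / i for i in range(1, n + 1))

print(parallel_mtbf(2500, 2))         # 3750.0 hrs (the two-component slide)
print(round(parallel_mtbf(2500, 4)))  # 5208 hrs (the 1-of-4 example)
```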
In-Parallel Reliability
Component 1: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component 2: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component ANNUAL reliability: R = e^(-8760/2500) = 0.03 (unreliability = 0.97)
System ANNUAL reliability: R = 1 - [(1 - 0.03) x (1 - 0.03)] = 1 - 0.94 = 0.06
In-Parallel Availability
Component 1: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component 2: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component availability: A = 2500 / (2500 + 10) = 0.996 (unavailability = 0.004)
System availability: A = 1 - [(1 - 0.996) x (1 - 0.996)] = 1 - 0.000016 = 0.999984
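The series and parallel availability rules (multiply availabilities in series; multiply unavailabilities in parallel) can be sketched together:

```python
# Series/parallel availability combination rules from the slides above.

def avail(mtbf, mttr):
    """Component availability: MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

def series(avails):
    """Series system: product of component availabilities."""
    p = 1.0
    for a in avails:
        p *= a
    return p

def parallel(avails):
    """Parallel system: one minus the product of unavailabilities."""
    q = 1.0
    for a in avails:
        q *= (1 - a)
    return 1 - q

a = avail(2500, 10)               # each component: MTBF 2,500 h, MTTR 10 h
print(round(series([a, a]), 3))   # 0.992
print(round(parallel([a, a]), 6)) # 0.999984
```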
Complex Redundancy
Pure active parallel, m-of-n
[Diagram: components 1, 2, 3, ..., n in parallel; m of the n must be operational.]
Examples: 1-of-2, 2-of-3, 2-of-4, 8-of-10
More Complex Redundancy
Pure active parallel
All components are on
Standby redundant
Backup components are not operating
Perfect switching
Switch-over is immediate and without fail
Switchover reliability: the probability of successful switchover when switching is not perfect
Load sharing
All units are on and workload is distributed
Networks Consist of Series-Parallel
Combinations of in-series and redundant components
[Diagram: RBD with A in series with a 1-of-2 block (B1, B2), then C, a 2-of-3 block (D1, D2, D3), E, and F.]
Failure Rate
The number of failures per time:
Failures/hour
Failures/day
Failures/week
Failures/10^6 hours
Failures/10^9 hours, called FITs (Failures in Time)
Approximating MTBF
13 units are tested in a lab for 1,000 hours, with 2 failures occurring
Another 4 units were tested for 6,000 hours, with 1 failure occurring
The failed units are repaired (or replaced)
What is the approximate MTBF?
Approximating MTBF (Cont.)
MTBF = (13 x 1000 + 4 x 6000) / (2 + 1)
= 37,000 / 3
= 12,333 hours
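The point estimate above (total accumulated unit-hours divided by total failures) can be sketched as:

```python
# Approximate MTBF: total accumulated unit-hours / total observed failures.

def approx_mtbf(unit_hours, failures):
    """Point estimate of MTBF from pooled test data."""
    return unit_hours / failures

# 13 units x 1,000 hrs (2 failures) plus 4 units x 6,000 hrs (1 failure):
total_hours = 13 * 1000 + 4 * 6000        # 37,000 unit-hours
print(round(approx_mtbf(total_hours, 3))) # 12333 hours
```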
Modeling
Distributions: Normal, Log-Normal, Weibull, Exponential
[Diagram: frequency vs. time-to-failure curves for these distributions, with the MTBF marked on each.]
Constant Failure Rate
The Exponential Distribution
The exponential function:
f(t) = λe^(-λt), t > 0
Failure rate, λ, IS CONSTANT
λ = 1/MTBF
If MTBF = 2,500 hrs., what is the failure rate?
λ = 1/2500 = 0.0004 failures/hr.
The Bathtub Curve
[Diagram: failure rate vs. time; a DECREASING failure rate during infant mortality, a CONSTANT failure rate during the useful life period, and an INCREASING failure rate during wear-out.]
The Exponential Reliability Formula
Commonly used for electronic equipment
The exponential reliability formula:
R(t) = e^(-λt) or R(t) = e^(-t/MTBF)
Calculating Reliability
A certain Cisco router has an MTBF of 100,000 hrs.; what is the annual reliability?
Annual reliability is the reliability for one year, or 8,760 hrs.
R = e^(-8760/100000) = 91.6%
This says that the probability of no failure in one year is 91.6%; or, 91.6% of all units will survive one year
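The exponential reliability formula R(t) = e^(-t/MTBF) can be checked directly:

```python
import math

# Exponential reliability: probability of surviving t hours given the MTBF.

def reliability(t, mtbf):
    """R(t) = e^(-t/MTBF) for a constant failure rate."""
    return math.exp(-t / mtbf)

# Annual reliability of a router with MTBF = 100,000 hrs:
print(round(reliability(8760, 100_000) * 100, 1))  # 91.6 (percent)
```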
ADDITIONAL TROUBLE TICKETING SLIDES
Essential Data Elements

Parameter                  Format             Description
Date                       dd/mmm/yy          Date ticket issued
Ticket                     Alphanumeric       Trouble ticket number
Start Date                 dd/mmm/yy          Date of fault
Start Time                 hh:mm              Time of fault
Resolution Date            dd/mmm/yy          Date of resolution
Resolution Time            hh:mm              Time of resolution
Customers Impacted         Integer            Number of customers that lost service; number impacted or names of customers impacted
Problem Description        String             Outline of the problem
Root Cause                 String             HW, SW, process, environmental, etc.
Component/Part/SW Version  Alphanumeric       For HW problems include product ID; for SW include release version
Type                       Planned/Unplanned  Identify if the event was due to planned maintenance activity or unplanned outage
Resolution                 String             Description of action taken to fix the problem

Note: the above is the minimum data set; however, if other information is captured it should be provided
HA Metrics/NAIS Synergy
[Flow diagram; recoverable labels:]
Trouble tickets: definitions, data accuracy, collection processes
Operational (process and procedures): problem management, fault management, change management, performance management, availability management; analyzed trouble ticket data is referred for process/procedural improvement
Analysis (data analysis): network reliability improvement analysis, resiliency assessment, baseline availability; determine DPM (Defects Per Million) by: planned/unplanned, root cause, resolution, equipment, MTTR; referral for analysis
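The slide lists the dimensions to break DPM down by; the DPM figure itself is commonly computed as impacted units per million service units. A minimal sketch, assuming customer-minutes as the unit (the numbers are invented):

```python
def defects_per_million(impacted_units: float, total_units: float) -> float:
    """DPM = (impacted units / total units) x 1,000,000.

    Units are commonly customer-minutes: impacted_units is customer-minutes
    of lost service, total_units is customer-minutes of scheduled service.
    """
    return impacted_units / total_units * 1_000_000

# e.g. 500 customers down for 30 min, out of 10,000 customers over one week
impacted = 500 * 30
total = 10_000 * 7 * 24 * 60
print(round(defects_per_million(impacted, total), 1))  # → 148.8
```

Each DPM value can then be computed per category (planned/unplanned, root cause, equipment, and so on) by filtering the ticket data before summing the impacted units.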
ADDITIONAL SA AGENT SLIDES
SA Agent: How It Works
1. User configures Collectors through Mgmt Application GUI
2. Mgmt Application provisions Source routers with Collectors
3. Source router measures and stores performance data, e.g.: response time, availability
4. Source router evaluates SLAs, sends SNMP Traps
5. Source router stores latest data point and 2 hours of aggregated points
6. Application retrieves data from Source routers once an hour
7. Data is written to a database
8. Reports are generated
[Diagram: Management Application communicating with the SA Agent Source routers via SNMP]
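The management-application side of steps 6-8 might be sketched as follows; the table layout is invented, and `fetch` is a hypothetical stand-in for the SNMP retrieval of SA Agent statistics (not shown here):

```python
import sqlite3
import time

def poll_sources(sources, fetch, db_path=":memory:"):
    """Steps 6-8: retrieve the latest SA Agent data from each Source
    router and write it to a database; call once an hour from a scheduler.
    `fetch(router)` is assumed to return (rtt_ms, ok)."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS samples "
                "(router TEXT, ts INTEGER, rtt_ms REAL, ok INTEGER)")
    for router in sources:
        rtt_ms, ok = fetch(router)  # hypothetical SNMP GET against the router
        con.execute("INSERT INTO samples VALUES (?, ?, ?, ?)",
                    (router, int(time.time()), rtt_ms, ok))
    con.commit()
    return con

# a simple report (step 8): per-router average response time
con = poll_sources(["r1", "r2"], lambda r: (42.0, 1))
for row in con.execute("SELECT router, AVG(rtt_ms) FROM samples GROUP BY router"):
    print(row)
```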
SAA Monitoring IP Core
[Diagram: a Management System provisions probes P1-P3 on routers R1-R3 across the IP Core]
Monitoring Customer IP Reachability
P1-Pn Service Assurance Agent ICMP polls to a test point in the IP core
[Diagram: probes P1-PN on customer networks Nw1-NwN polling test points TP1-TPx in the IP core]
Service Assurance Agent Features
Measures Service Level Agreement (SLA) metrics:
Packet loss
Response time
Throughput
Availability
Jitter
Evaluates SLAs
Proactively sends notification of SLA violations
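The SLA-evaluation step amounts to comparing each measured metric against its contracted bound. A minimal sketch; the metric and threshold names are illustrative, not the SA Agent MIB's:

```python
def violates_sla(metrics: dict, sla: dict) -> list:
    """Return the names of SLA metrics that are out of bounds.
    Metric/threshold names are illustrative, not the SA Agent MIB's."""
    checks = {
        "loss_pct":  lambda v, t: v > t,  # packet loss above limit
        "rtt_ms":    lambda v, t: v > t,  # response time above limit
        "jitter_ms": lambda v, t: v > t,  # jitter above limit
        "avail_pct": lambda v, t: v < t,  # availability below limit
    }
    return [m for m, check in checks.items()
            if m in sla and check(metrics[m], sla[m])]

measured = {"loss_pct": 0.5, "rtt_ms": 120, "jitter_ms": 4, "avail_pct": 99.9}
contract = {"rtt_ms": 100, "avail_pct": 99.95}
print(violates_sla(measured, contract))  # → ['rtt_ms', 'avail_pct']
```

In the SA Agent itself the equivalent check happens on the Source router, which then sends the SNMP trap proactively rather than waiting for a poll.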
SA Agent Impact on Devices
Low impact on CPU utilization
Approximately 18 KB of memory per SA Agent collector
The SAA rtr low-memory command can be used to bound SA Agent memory usage
Monitored Network Availability Calculation
Not calculated:
Already have availability baseline
Fault type, frequency and downtime may be more useful
Faults directly measured from management system(s)
Monitored Network Availability
Assumptions
All connections below IP are fixed
Management systems can be notified of all fixed connection state changes
All (L2) events impact the IP (L3) service
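Under these assumptions, downtime for a fixed connection can be accumulated directly from the reported state-change notifications. A minimal sketch, with invented event tuples:

```python
def downtime_seconds(events, t_start, t_end):
    """Sum outage time from (timestamp, state) notifications, where state
    is 'down' or 'up'.  Assumes the connection is up at t_start and that
    every state change is reported (the notification assumption above)."""
    total, down_since = 0, None
    for ts, state in sorted(events):
        if state == "down" and down_since is None:
            down_since = ts
        elif state == "up" and down_since is not None:
            total += ts - down_since
            down_since = None
    if down_since is not None:  # still down at the end of the window
        total += t_end - down_since
    return total

events = [(100, "down"), (160, "up"), (500, "down"), (530, "up")]
print(downtime_seconds(events, 0, 1000))  # → 90
```

Availability over the window is then simply 1 minus downtime divided by window length.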
ADDITIONAL COOL SLIDES
CLIs
Configuration CLI commands:
[no] cool run
[no] cool interface interface-name (idb)
[no] cool physical-FRU-entity entity-index (int)
[no] cool group-interface group-objectID (string)
[no] cool add-cpu objectID threshold duration
[no] cool remote-device dest-IP (paddr) obj-descr (string) rate (int) repeat (int) [local-ip (paddr) mode (int)]
[no] cool if-filter group-objectID (string)
Display CLI commands:
Router#show cool event-table []   (displays all if not specified)
Router#show cool object-table []  (displays all object types if not specified)
Router#show cool fru-entity
Exec CLI commands:
Router#clear cool event-table
Router#clear cool persistent-files
Measurement Example: Router Device Outage
Reload (operational), power outage, or device H/W failure
Object table fields:
Type: interface(1), physicalEntity(2), process(3), remoteObject(4)
Index: the corresponding MIB table index; for physicalEntity(2), the index in the ENTITY-MIB
Status: up(1), down(2)
Last-change: time of the last object status change
AOT: Accumulated Outage Time (sec)
NAF: Number of Accumulated Failures
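Given AOT and NAF as defined above, per-object availability and a rough MTBF over a measurement window follow directly. A sketch; the 30-day window is an assumption for the example:

```python
def cool_availability(aot_seconds: float, window_seconds: float) -> float:
    """Availability (%) of a COOL-tracked object over a measurement
    window, from its Accumulated Outage Time (AOT)."""
    return (1 - aot_seconds / window_seconds) * 100

def mtbf_hours(window_seconds: float, naf: int) -> float:
    """Rough MTBF estimate from the Number of Accumulated Failures (NAF)."""
    return window_seconds / naf / 3600

# e.g. 42 s of outage and 1 failure over a 30-day window (assumed)
window = 30 * 24 * 3600
print(round(cool_availability(42, window), 5))  # → 99.99838
print(mtbf_hours(window, 1))                    # → 720.0
```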
Measurement Example: Cisco IOS S/W Outage
Standby RP in slot 0 crash using Address Error (4) test crash; the AdEL exception is caused purely by Cisco IOS S/W
Standby RP crash using Jump to Zero (5) test crash; the Bp exception can be caused by S/W, H/W, or operation
Measurement Example: Linecard Outage
Add a Linecard
Reset the Linecard
Down Event Captured
Up Event Captured
AOT and NAF Updated
Measurement Example: Interface Outage
Configure to monitor all interfaces whose names include the string ATM2/0., except ATM2/0.3:
12406-R1202(config)#cool group-interface ATM2/0.
12406-R1202(config)#no cool group-interface ATM2/0.3
Object table:
sh cool object 1 | include ATM2/0.
33 1 1054859087 0 0 0 ATM2/0.1
35 1 1054859088 0 0 0 ATM2/0.2
39 1 1054859090 0 0 0 ATM2/0.4
41 1 1054859090 0 0 0 ATM2/0.5
Shut the ATM2/0 interface down; down events are captured:
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut
show cool event-table
**** COOL Event Table ****
type index event time-stamp interval hist_id object-name
1 33 1 1054859105 18 1 ATM2/0.1
1 35 1 1054859106 18 2 ATM2/0.2
1 39 1 1054859107 17 3 ATM2/0.4
1 41 1 1054859108 18 4 ATM2/0.5
No shut the ATM2/0 interface; up events are captured:
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut
show cool event-table
**** COOL Event Table ****
type index event time-stamp interval hist_id object-name
1 33 0 1054859146 41 1 ATM2/0.1
1 35 0 1054859147 41 2 ATM2/0.2
1 39 0 1054859149 42 3 ATM2/0.4
1 41 0 1054859150 42 4 ATM2/0.5
Object table now shows AOT and NAF:
sh cool object 1 | include ATM2/0.
33 1 1054859087 0 41 1 ATM2/0.1
35 1 1054859088 0 41 1 ATM2/0.2
39 1 1054859090 0 42 1 ATM2/0.4
41 1 1054859090 0 42 1 ATM2/0.5
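Transcript output like the above can be post-processed off-box. A sketch of a parser for the `show cool event-table` rows; the column order is taken from the transcript (type index event time-stamp interval hist_id object-name), and for interface objects event 1 appears on shut and event 0 on no-shut:

```python
def parse_cool_events(output: str):
    """Parse `show cool event-table` rows into dicts, skipping the
    banner and header lines."""
    rows = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 7 and parts[0].isdigit():
            typ, idx, event, ts, interval, hist, name = parts
            rows.append({"type": int(typ), "index": int(idx),
                         "event": int(event), "timestamp": int(ts),
                         "interval": int(interval), "hist_id": int(hist),
                         "object": name})
    return rows

sample = """**** COOL Event Table ****
type index event time-stamp interval hist_id object-name
1 33 1 1054859105 18 1 ATM2/0.1
1 33 0 1054859146 41 1 ATM2/0.1"""
print(len(parse_cool_events(sample)))  # → 2
```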
Measurement Example: Remote Device Outage
3 remote devices are added:
12406-R1202(config)#cool remote-device 1 50.1.1.2 remobj.1 30 2 50.1.1.1 1
12406-R1202(config)#cool remote-device 2 50.1.2.2 remobj.2 30 2 50.1.2.1 1
12406-R1202(config)#cool remote-device 3 50.1.3.2 remobj.3 30 2 50.1.3.1 1
Object table:
sh cool object-table 4 | include remobj
1 1 1054867061 0 0 remobj.1
2 1 1054867063 0 0 remobj.2
3 1 1054867065 0 0 remobj.3
Shut down the interface link between the remote devices and the router; down events are captured:
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut
4 2 5 1054867105 42 2 remobj.2
4 1 5 1054867108 47 3 remobj.1
4 3 5 1054867130 65 10 remobj.3
No shut the interface link; up events are captured:
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut
4 1 4 1054867171 63 1 remobj.1
4 3 4 1054867193 63 8 remobj.3
4 2 4 1054867200 95 10 remobj.2
Object table shows AOT and NAF:
sh cool object-table 4 | include remobj
1 1 1054867061 63 1 remobj.1
2 1 1054867063 63 1 remobj.2
3 1 1054867065 95 1 remobj.3