7/25/2019 Mtbf Presentation
1/81
2004 Cisco Systems, Inc. All rights reserved. Printed in USA.
Presentation_ID.scr
1 2004 Cisco Systems, Inc. All rights reserved.
NMS-22019627_05_2004_c2
AVAILABILITY MEASUREMENT
SESSION NMS-2201
Agenda
Introduction
Availability Measurement Methodologies
Trouble Ticketing
Device Reachability: ICMP (Ping), SA Agent, COOL
SNMP: Uptime, Ping-MIB, COOL, EEM, SA Agent
Application
Developing an Availability Culture
Associated Sessions
NMS-1N01: Intro to Network Management
NMS-1N02: Intro to SNMP and MIBs
NMS-1N04: Intro to Service Assurance Agent
NMS-1N41: Introduction to Performance Management
NMS-2042: Performance Measurement with Cisco IOS
ACC-2010: Deploying Mobility in HA Wireless LANs
NMS-2202: How Cisco Achieved HA in Its LAN
RST-2514: HA in Campus Network Deployments
NMS-4043: Advanced Service Assurance Agent
RST-4312: High Availability in Routing
INTRODUCTION
WHY MEASURE AVAILABILITY?
Why Measure Availability?
1. Baseline the network
2. Identify areas for network improvement
3. Measure the impact of improvement projects
Why Should We Care About Network Availability?
Where are we now? (baseline)
Where are we going? (business objectives)
How best do we get from where we are now to where we are going? (improvements)
What if we can't get there from here?
Why Should We Care About Network Availability?
Recent studies by Sage Research determined that US-based service providers encountered:
Percent of downtime that is unscheduled: 44%
18% of customers experience over 100 hours of unscheduled downtime, or an availability of 98.5%
Average cost of network downtime per year: $21.6 million, or $2,169 per minute!
Downtime Costs Too Much!!!
SOURCE: Sage Research, IP Service Provider Downtime Study: Analysis of Downtime Causes, Costs and Containment Strategies, August 17, 2001; prepared for Cisco SP LOB
Cause of Network Outages
Source: Gartner Group
[Pie chart:]
Software and Application: 40% (software issues, performance and load, scaling)
User Error and Process: 40% (change management, process consistency)
Technology: 20% (hardware, links, design, environmental issues, natural disasters)
Top Three Causes of Network Outages
Congestive degradation
Capacity (unanticipated peaks)
Solutions validation
Software quality
Inadvertent configuration change
Change management
Network design
WAN failure (e.g., major fiber cut or carrier failure)
Power
Critical services failure (e.g., DNS/DHCP)
Protocol implementations and misbehavior
Hardware fault
Method for Attaining a Highly Available Network
Or a Road to Five Nines
Establish a standard measurement method
Define business goals as related to metrics
Categorize failures, root causes, and improvements
Take action for root cause resolution and improvement implementation
Where Are We Going? Or What Are Your Business Goals?
Financial
ROI, Economic Value Added, Revenue/Employee
Productivity
Time to market
Organizational mission
Customer perspective
Satisfaction, Retention, Market Share
Define Your End State. What Is Your Goal?
Why Availability for Business Requirements?
Availability as a basis for productivity data
Measurement of total-factor productivity
Benchmarking the organization
Overall organizational performance metric
Availability as a basis for organizational competency
Availability as a core competency
Availability improvement as an innovation metric
Resource allocation information
Identify defects
Identify root cause
Measure MTTR tied to process
It Takes a Design Effort to Achieve HA
Hardware and Software Design
Network and Physical Plant Design
Process Design
INTRODUCTION
WHAT IS NETWORK AVAILABILITY?
What Is High Availability?
Availability    Downtime per Year (24x7x365)
99.9999%        30 seconds
99.999%         5 minutes
99.990%         53 minutes
99.950%         4 hours, 23 minutes
99.900%         8 hours, 46 minutes
99.500%         1 day, 19 hours, 48 minutes
99.000%         3 days, 15 hours, 36 minutes
High Availability Means an Average End User Will Experience Less than Five Minutes Downtime per Year
Availability Definition
Availability definition is based on business objectives
Is it the user experience you are interested in measuring?
Are some users more important than others?
Availability groups?
Definitions of different groups
Exceptions to the availability definition
e.g., the CEO should never experience a network problem
How You Define Availability
Define availability perspective (customer, business, etc.)
Define availability groups and levels of redundancy
Define an outage
Define impact to network
Ensure SLAs are compatible with outage definition
Understand how maintenance windows affect outage definition
Identify how to handle DNS and DHCP within the definition of a Layer 3 outage
Examine component level sparing strategy
Define what to measure
Define measurement accuracy requirements
Network Design
What Is Reliability?
Reliability is often used as a general term that refers to the quality of a product
Failure rate
MTBF (Mean Time Between Failures) or
MTTF (Mean Time To Failure)
Engineered availability
Reliability is defined as the probability of survival (or no failure) for a stated length of time
MTBF Defined
MTBF stands for Mean Time Between Failures
MTTF stands for Mean Time To Failure
This is the average length of time between failures (MTBF) or to a failure (MTTF)
More technically, it is the mean time to go from an OPERATIONAL STATE to a NON-OPERATIONAL STATE
MTBF is usually used for repairable systems, and MTTF is used for non-repairable systems
MTTR stands for Mean Time To Repair
One Method of Calculating Availability
Availability = MTBF / (MTBF + MTTR)
What is the availability of a computer with MTBF = 10,000 hrs. and MTTR = 12 hrs?
A = 10000 / (10000 + 12) = 99.88%
Annual uptime:
8,760 hrs/year x 0.9988 = 8,749.5 hrs
Conversely, annual DOWN time is:
8,760 hrs/year x (1 - 0.9988) = 10.5 hrs
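The arithmetic above can be sketched in a few lines; this is a minimal illustration of the A = MTBF / (MTBF + MTTR) formula using the slide's example figures:

```python
# Availability from MTBF and MTTR, using the example figures above.
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

a = availability(10_000, 12)
annual_uptime = 8_760 * a          # hours of uptime per year
annual_downtime = 8_760 * (1 - a)  # hours of downtime per year

print(f"A = {a:.4%}")                            # ~99.88%
print(f"uptime   = {annual_uptime:.1f} h/yr")    # ~8749.5
print(f"downtime = {annual_downtime:.1f} h/yr")  # ~10.5
```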
Networks Consist of Series-Parallel
Combinations of in-series and redundant components
[RBD (Reliability Block Diagram) figure: component A in series with a redundant pair B1/B2 (1-of-2), C, a redundant group D1/D2/D3 (2-of-3), then E and F]
More Complex Redundancy
Pure active parallel
All components are on
Standby redundant
Backup components are not operating
Perfect switching
Switch-over is immediate and without fail
Switch-over reliability
The probability of switch-over when it is not perfect
Load sharing
All units are on and workload is distributed
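The series-parallel combinations and active-parallel redundancy described above can be combined numerically. A minimal sketch, with made-up per-component availabilities, mirroring the RBD structure on the previous slide:

```python
# Combining per-component availabilities for series, pure active parallel,
# and n-of-m redundant structures. The 0.99/0.999 figures are illustrative.
from math import comb

def series(*avails):
    """All components must be up."""
    p = 1.0
    for a in avails:
        p *= a
    return p

def parallel(*avails):
    """Pure active parallel: up unless every component is down."""
    p = 1.0
    for a in avails:
        p *= (1 - a)
    return 1 - p

def n_of_m(a, n, m):
    """At least n of m identical components up (e.g. a 2-of-3 group)."""
    return sum(comb(m, k) * a**k * (1 - a)**(m - k) for k in range(n, m + 1))

# A in series with a 1-of-2 pair (B1/B2) and a 2-of-3 group (D1/D2/D3)
a_total = series(0.999, parallel(0.99, 0.99), n_of_m(0.99, 2, 3))
print(f"{a_total:.6f}")
```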
MEASURING THE PRODUCTION NETWORK
Reliability or Engineered Availability vs. Measured Availability
1. Reliability is an engineered probability of the network being available
2. Measured availability is the actual outcome produced by physically measuring the engineered system over time
Calculations Are Similar; Both Are Based on MTBF and MTTR
Some Types of Availability Metrics
Mean Time to Repair (MTTR)
Impacted User Minutes (IUM)
Defects per Million (DPM)
MTBF (Mean Time Between Failure)
Performance (e.g. latency, drops)
Back to How Availability Is Calculated
Availability (%) is calculated by tabulating end-user outage time, typically on a monthly basis
Some customers prefer to use DPM (Defects per Million) to represent network availability
Availability (%) = (Total User Time - Total User Outage Time) x 100 / Total User Time
DPM = Total User Outage Time x 10^6 / Total User Time
Total User Time = Total # of End Users x Time in Reporting Period
Total User Outage Time = Sum of (# of End Users x Outage Time) over All the Incidents in the Reporting Period
Ports or Connections May Be Substituted for End Users
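The two calculations above can be sketched as follows; the user counts and incident list are hypothetical:

```python
# Availability (%) and DPM from tabulated end-user outage time.
def total_user_time(num_users, period_minutes):
    return num_users * period_minutes

def availability_pct(tut, outage_user_minutes):
    return (tut - outage_user_minutes) / tut * 100

def dpm(tut, outage_user_minutes):
    return outage_user_minutes / tut * 1e6

# 1,000 users over a 30-day month; incidents as (users impacted, minutes down)
tut = total_user_time(1_000, 30 * 24 * 60)
outage = sum(users * minutes for users, minutes in [(50, 120), (200, 15)])
print(f"availability = {availability_pct(tut, outage):.4f}%")
print(f"DPM = {dpm(tut, outage):.1f}")
```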
Defects per Million
Started with mass-produced items like toasters
For PVCs:
DPM = Sum of (#conns x outage minutes) x 10^6 / Sum of (#conns x total minutes)
For SVCs or phone calls:
DPM = (#existing calls lost + #new calls blocked) x 10^6 / total calls attempted
For connectionless traffic (application dependent):
DPM = Sum of (#end users x outage minutes) x 10^6 / Sum of (#end users x total minutes)
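The per-service DPM variants above can be expressed as small helpers; the connection and call counts below are hypothetical:

```python
# DPM variants for connection-oriented (PVC) and call-oriented (SVC) services.
def dpm_pvc(conn_outage_minutes, conn_total_minutes):
    # connection-minutes lost vs. connection-minutes offered
    return conn_outage_minutes / conn_total_minutes * 1e6

def dpm_svc(calls_lost, calls_blocked, calls_attempted):
    # existing calls dropped + new calls blocked vs. call attempts
    return (calls_lost + calls_blocked) / calls_attempted * 1e6

print(dpm_pvc(100 * 30, 100 * 43_200))  # 100 conns, 30 outage min each, 30-day month
print(dpm_svc(12, 38, 1_000_000))
```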
NETWORK AVAILABILITY COLLECTION METHODS
TROUBLE TICKETING METHODS
Availability Improvement Process
Step I
Validate data collection/calculation methodology
Establish network availability baseline
Set high availability goals
Step II
Measure uptime ongoing
Track defects per million (DPM) or IUM or availability (%)
Step III
Track customer impact for each ticket/MTTR
Categorize DPM by reason code and begin trending
Identify initiatives/areas for a focus to eliminate defects
Data Collection/Analysis Process
Understand current data collection methodology
Customer internal ticket database
Manual
Monthly collection of network performance data and export of the following fields to a spreadsheet or database system:
Outage start time (date/time)
Service restore time (date/time)
Problem description
Root cause
Resolution
Number of customers impacted
Equipment model
Component/part
Planned maintenance activity/unplanned activity
Total customers/ports on network
Network Availability Results
Methodology and assumptions must be documented
Network availability should include:
Overall % network availability (baseline/trending)
Conversion of downtime to DPM by:
Planned and unplanned
Root cause
Resolution
Equipment type
Overall MTTR
MTTR by:
Root cause
Resolution
Equipment type
Results are not necessarily limited to the above but should be customized based on your network and requirements
Availability Metrics: Reviewed
Network has 100 customers
Time in reporting period is one year, or 24 hours x 365 days
8 customers have 24 hours down time per year
Availability = 1 - (8 x 24) / (100 x 24 x 365) = 0.999781
DPM = (8 x 24) x 10^6 / (100 x 24 x 365) = 219.2 failures for every 1 million user hours
MTBF = (24 x 365) / 8 = 1095 (hours)
MTTR = 1095 x (1 - 0.999781) / 0.999781 = 0.24 (hours)
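Recomputing the review example from its stated inputs (100 customers, one year, 8 customers down 24 hours each) reproduces the DPM, MTBF, and MTTR figures and gives an availability of 0.999781:

```python
# Worked availability-metrics review: 100 users, 1 year, 8 users down 24 h.
users, hours_per_year = 100, 24 * 365      # 8,760 h per user
down_users, down_hours = 8, 24

tut = users * hours_per_year               # total user hours
out = down_users * down_hours              # total user outage hours

availability = 1 - out / tut               # ~0.999781
dpm = out / tut * 1e6                      # ~219.2
mtbf = hours_per_year / down_users         # 1,095 h
mttr = mtbf * (1 - availability) / availability  # ~0.24 h

print(availability, dpm, mtbf, round(mttr, 2))
```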
TROUBLE TICKETING METHOD
SAMPLE OUTPUT
Network Availability
[Illustrative chart: Overall Network Availability (Planned/Unplanned) by month, July through June; y-axis from 99.50% to 100.00%]
Key takeaways
Platform Related DPM Comparison (Illustrative)
Platform related DPM contributed 13% of total DPM in September
Platform DPM includes events from: Backbone, NAS, PG, POP, Radius Server, VPN Radius Server
All other events are included in the Other category
Breakdown of Platform Related DPM
Network Access Server (NAS) accounts for 50% of the total Platform related DPM in September
Private Access Gateway (PG) showing significant decrease over the past 3 months

Platform related DPM by source:
                          June    July    Aug     Sept
Backbone                   1.5      .8    15.7     2.3
NAS                       21.7    19.4    27      26.1
PG                        26      59.6    56.8    18.9
POP                        0       3.9      .5     1.6
Radius Server              0       0       1.2      .3
VPN Radius                 0       8.8     2.8     3.4
Total Platform Related    49.2    82.5   104      52.6

Total DPM breakdown:
                          June    July    Aug     Sept    Oct    Nov    Dec
Other                    339.5   424.9   394.7   362.2
Platform Related          49.2    82.5   104      52.6
Total DPM                388.7   507.4   498.7   414.8
99.99% Target              --      --      --     100     100    100    100
DPM by Cause (Illustrative)
[Illustrative chart: DPM by cause, December through May; y-axis 0 to 2500]
                 Dec      Jan      Feb      Mar      Apr      May
Config/SW       406      101.6    117.5     20.2    106.6    201
HW              884.3    512.7    553.6    474.3    422.5    240
Other           145.7    136.2    212.4     37      314.2    604.4
Power           566.1     31.4     11.1     89.7     19       14.8
Environmental    36.1     68.8     18.4      0      133.4    127
Human Error      18.2     23.6      8.9     87.7    106      115.2
Unknown           --       --       --       --      80       95.2
TOTAL          3789.3   1202.2   1226     1293.1   1641.9   1964.8
MTTR Analysis: Hardware Faults
Number of faults increased slightly in September; however, MTTR decreased
49% of faults resolved in < 1 hour in September
11% of faults resolved in > 24 hours, with an additional 3% > 100 hours
Produce for each fault type (Router HW shown)
[Illustrative charts, June through December: monthly MTTR in hours (12.42, 15.1, 8.49, 7.19, ...) and number of faults by resolution-time bucket: < 1 Hr, 1-4 Hr, 4-12 Hr, 12-24 Hr, > 24 Hr, > 100 Hr]
Trouble Ticketing Method
Pros
Easy to get started
No network overhead
Outages can be categorized based on event
Cons
Some internal subjective/consistency process issues
Outages may occur that are not included in the trouble ticketing systems
Resources needed to scrub data and create reports
May not work with existing trouble ticketing system/process
Network Availability Collection Methods
AUTOMATED FAULT MANAGEMENT EVENTS METHOD
Availability Improvement Process
Step I
Determine availability goals
Validate fault management data collection
Determine a calculation methodology
Build software package to use customer event log
Step II
Establish network availability baseline
Measure uptime on an ongoing basis
Step III
Track root cause and customer impact
Begin trending of availability issues
Identify initiatives and areas of focus to eliminate defects
Event Log Example
Fri Jun 15 11:05:31 2001 Debug: Looking for message header ...
Fri Jun 15 11:05:33 2001 Debug: Message header is okay
Fri Jun 15 11:05:33 2001 Debug: $(LDT) -> "06152001110532"
Fri Jun 15 11:05:33 2001 Debug: $(MesgID) -> "100013"
Fri Jun 15 11:05:33 2001 Debug: $(NodeName) -> "ixc00asm"
Fri Jun 15 11:05:33 2001 Debug: $(IPAddr) -> "10.25.0.235"
Fri Jun 15 11:05:33 2001 Debug: $(ROCom) -> "xlr8ed!"
Fri Jun 15 11:05:33 2001 Debug: $(RWCom) -> "s39o!d%"
Fri Jun 15 11:05:33 2001 Debug: $(NPG) -> "CISCO-Large-special"
Fri Jun 15 11:05:33 2001 Debug: $(AlrmDN) -> "aSnmpStatus"
Fri Jun 15 11:05:33 2001 Debug: $(AlrmProp) -> "system"
Fri Jun 15 11:05:33 2001 Debug: $(OSN) -> "Testing"
Fri Jun 15 11:05:33 2001 Debug: $(OSS) -> "Normal"
Fri Jun 15 11:05:33 2001 Debug: $(DSN) -> "SNMP_Down"
Fri Jun 15 11:05:33 2001 Debug: $(DSS) -> "Agent_Down"
Fri Jun 15 11:05:33 2001 Debug: $(TrigName) -> "NodeStateUp"
Fri Jun 15 11:05:33 2001 Debug: $(BON) -> "nl-ping"
Fri Jun 15 11:05:33 2001 Debug: $(TrapGN) -> "-2"
Fri Jun 15 11:05:33 2001 Debug: $(TrapSN) -> "-2"
Event Log
Analysis of events received from the network devices
Analysis of accuracy of the data
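A sketch of how the $(VAR) -> "value" pairs in an event log like the one above could be extracted for analysis; the parsing approach and field names are illustrative, not the tool's actual implementation:

```python
# Pull $(VAR) -> "value" fields out of event-log debug lines.
import re

LINE = re.compile(r'\$\((?P<var>\w+)\)\s*->\s*"(?P<val>[^"]*)"')

def parse_event(lines):
    """Return a dict of variable -> value for one event's log lines."""
    fields = {}
    for line in lines:
        m = LINE.search(line)
        if m:
            fields[m.group("var")] = m.group("val")
    return fields

sample = [
    'Fri Jun 15 11:05:33 2001 Debug: $(NodeName) -> "ixc00asm"',
    'Fri Jun 15 11:05:33 2001 Debug: $(IPAddr) -> "10.25.0.235"',
    'Fri Jun 15 11:05:33 2001 Debug: $(TrigName) -> "NodeStateUp"',
]
print(parse_event(sample))
```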
Calculation Methodology: Example
Primary events are device down/up
Down time is calculated based on device-type outage duration
Availability is calculated based on the total number of device types, the total time, and the total down time
MTTR numbers are calculated from average duration of downtime
With MTTR, the shortest and longest outage provide a simplified curve
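The methodology above (pair down/up events, then derive per-device-type availability and MTTR) can be sketched as follows; the event list and device names are hypothetical:

```python
# Pair device down/up events and summarize availability and MTTR per type.
from collections import defaultdict

def summarize(events, device_types, period_hours):
    """events: list of (device, 'down'|'up', timestamp_hours)."""
    down_at, outages = {}, defaultdict(list)
    for device, state, t in sorted(events, key=lambda e: e[2]):
        if state == "down":
            down_at[device] = t
        elif device in down_at:
            outages[device_types[device]].append(t - down_at.pop(device))
    summary = {}
    for dtype, durations in outages.items():
        n_devices = sum(1 for d in device_types.values() if d == dtype)
        total_time = n_devices * period_hours
        down = sum(durations)
        summary[dtype] = {
            "availability": 1 - down / total_time,  # per device type
            "mttr_hours": down / len(durations),    # average outage duration
        }
    return summary

types = {"r1": "router", "r2": "router"}
evts = [("r1", "down", 10.0), ("r1", "up", 12.0),
        ("r2", "down", 50.0), ("r2", "up", 50.5)]
print(summarize(evts, types, period_hours=720))
```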
Automated Fault Management Methodology
Pros
Outage duration and scope can be fairly accurate
Can be implemented within an NMS fault management system
No additional network overhead
Cons
Requires an excellent change management/provisioning process
Requires an efficient and effective fault management system
Requires custom development
Does not account for routing problems
Not a true end-to-end measure
NETWORK AVAILABILITY DATA COLLECTION
SAMPLE OUTPUT
Automated Fault Management: Example Reports

Device Type      # of      Count of    Total Down Time   % Down    % Up       Shortest    Mean Time   Longest     Events per
                 Devices   Incidents   hhh:mm:ss                              Outage      to Repair   Outage      Device
Host Totals       2389       801       202:27:27         .0673%    99.9327%   0:00:19     0:20:47     27:48:46    24.4
Network Totals    4732      1673       430:02:03         .1309%    99.8691%   0:00:24     0:22:36     09:49:35    14.9
Other Totals       897       173       212:29:46         .0509%    99.9491%   0:00:17     0:26:07     42:16:10    16.8
GRAND TOTAL       8018      2647       844:59:16         .0830%    99.9170%   0:00:20     0:23:10     26:38:11    18.7
Automated Fault Management: Example Reports (2)
[Pie charts:]
Number of Managed Devices: Network Totals 59%, Host Totals 30%, Other Totals 11%
Count of Incidents: Network Totals 63%, Host Totals 30%, Other Totals 7%
Total Down Time: Network Totals 51%, Host Totals 24%, Other Totals 25%
Network Availability Collection Methods
ICMP ECHO (PING) AND SNMP AS DATA GATHERING TECHNIQUES
Data Gathering Techniques
ICMP ping
Link and device polling (SNMP)
Embedded RMON
Embedded event management
Syslog messages
COOL
Data Gathering Techniques: ICMP Reachability
Method definition:
Central workstation or computer configured to send ping packets to the network edges (devices or ports) to determine reachability
How:
Edge interfaces and/or devices are defined and pinged on a determined interval
Unavailability:
Pre-defined, non-response from the interface
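The method above reduces to counting answered vs. unanswered polls per device. A minimal sketch with simulated poll results (a real collector would issue ICMP echoes on the configured interval instead of reading a canned list):

```python
# Availability from periodic reachability polls: non-response counts as down.
def availability_from_polls(poll_results):
    """poll_results: list of booleans, True = device answered the ping."""
    if not poll_results:
        return None
    return sum(poll_results) / len(poll_results)

# One device polled every 5 minutes for an hour; the last two polls missed.
polls = [True] * 10 + [False, False]
a = availability_from_polls(polls)
print(f"{a:.4f}")  # granularity is limited by the 5-minute poll interval
```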
Availability Measurement Through ICMP
Periodic ICMP Test
Periodic Pings to Network Devices; Periodic Pings to Network Leaf Nodes
Data Gathering Techniques: ICMP Reachability
Pros
Fairly accurate network availability
Accounts for routing problems
Can be implemented for fairly low network overhead
Cons
Point-to-multipoint implies not a true end-to-end measure
Availability granularity limited by ping frequency
Maintenance of device database: must have a solid change management and provisioning process
Data Gathering Techniques: Link and Device Status
Method definition:
SNMP polling and trapping on links, edge ports, or edge devices
How:
An agent is configured to SNMP poll and tabulate outage times for defined devices or links; a database maintains outage times and total service time; sometimes trap information is used to augment this method by providing more accurate information on outages
Unavailability:
Pre-defined, non-redundant links, ports, or devices that are down
Polling Interval vs. Sample Size
Polling interval is the rate at which data is collected from the network
Polling interval = 1 / Sampling Rate
The smaller the polling interval, the more detailed (granular) the data collected
Example: polling data once every 15 minutes provides 4 times the detail (granularity) of polling once an hour
A smaller polling interval does not necessarily provide a better margin of error
Example: polling once every 15 minutes for one hour has the same margin of error as polling once an hour for 4 hours
Link and Device Status Method
Method definition:
SNMP polling and trapping on links, edge ports, or edge devices
How:
Utilizing existing NMS systems that are currently SNMP polling to tabulate outage times for defined devices or links
A database maintains outage times and total service time
SNMP trap information is also used to augment this method by providing more accurate information on outages
Link and Device Status Method
Pros
Outage duration and scope can be fairly accurate
Utilizes existing NMS systems
Low network overhead
Cons
No canned SW to do this; custom development
Maintaining element device database is challenging
Requires an excellent change management and provisioning process
Does not account for routing problems
Not a true end-to-end measure
CISCO SERVICE ASSURANCE AGENT (SA AGENT)
Service Assurance Agent
Method definition:
SA Agent is an embedded feature of Cisco IOS Software and requires configuration of the feature on routers within the customer network; use of the SA Agent can provide for a rapid, cost-effective deployment without additional hardware probes
How:
A data collector creates SA Agents on the routers to monitor certain network/service performances; the data collector then collects this data from the routers, aggregates it, and makes it available
Unavailability:
Pre-defined paths with reporting on non-redundant links, ports, or devices that are down within a path
Case Study: Financial Institution (Collection)
SA Agent Collectors
Remote Sites
DNS
InternetWeb Sites
Availability Using Network-Based Probes
Availability = 1 - (Probes with No Response / Total Probes Sent)
DPM = (Probes with No Response x 10^6) / Total Probes Sent
DPM equations used with network-based probes as input data
Probes can be:
Simple ICMP Ping probe, modified Ping to test specific applications, Cisco IOS SA Agent
DPM will be for connectivity between 2 points on the network, the source and destination of the probe
Source of probe is usually a management system, and the destinations are the devices managed
Can calculate DPM for every device managed
Availability Using Network-Based Probes: Example
Network probe is a ping
10000 probes are sent between management system and managed device
1 probe failed to respond
DPM = (1 x 10^6) / 10000 = 100 probes out of 1 million will fail
Availability = 1 - (1 / 10000) = 0.9999
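The probe example above, recomputed:

```python
# Probe-based availability and DPM: 1 failed probe out of 10,000 sent.
failed, sent = 1, 10_000
availability = 1 - failed / sent
dpm = failed / sent * 1e6
print(availability, dpm)  # ~0.9999 and ~100 DPM
```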
Sample Size
Sample size is the number of samples that have been collected
The more samples collected, the higher the confidence that the data accurately represents the network
Confidence (margin of error) is defined by:
m = 1 / sqrt(sample size)
Example: data is collected from the network every 1 hour
After one day: m = 1 / sqrt(24) = 0.2041
After one month: m = 1 / sqrt(744) = 0.0367
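The margin-of-error figures above can be recomputed directly (assuming hourly collection: 24 samples after one day, 744 after a 31-day month):

```python
# Margin of error m = 1 / sqrt(sample size).
from math import sqrt

def margin_of_error(sample_size):
    return 1 / sqrt(sample_size)

print(round(margin_of_error(24), 4))       # after one day   -> ~0.2041
print(round(margin_of_error(24 * 31), 4))  # after one month -> ~0.0367
```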
Service Assurance Agent
Pros
Accurate network availability for defined paths
Accounts for routing problems
Implementation with very low network overhead
Cons
Requires a system to collect the SAA data
Requires implementation in the router configurations
Availability granularity limited by polling frequency
Definition of the critical network paths to be measured
COMPONENT OUTAGE ONLINE MEASUREMENT (COOL)
COOL Objectives
To automate the measurement to increase operational efficiency and reduce operational cost
To measure the outage as close to the source of outage events as possible, to pinpoint the cause of the outages
To cope with a large number of network elements without causing system and network performance degradation
To maintain measurement data reliably in the presence of element failure or network partition
To support simplicity in deployment, configuration, and data collection (autonomous measurement)
COOL Features
COOL embedded in router
Automated real-time measurement; autonomous measurement
Outage data stored in router
Open access via Outage Monitor MIB
Event notification filtering
[Figure: access router running COOL with customer equipment attached; NMS, NetTools, 3rd-party tools, C-NOTE, and PNL access the outage data via the Outage Monitor MIB]
COOL Features (Cont.)
Support NMS or tools for such applications as:
Calculation of software or hardware MTBF, MTTR, availability per object, device, or network
Verification of customers' SLAs
Troubleshooting in real time
Two-tier framework
Reduces performance impact on the router
Provides scalability to the NMS
Makes it easy to deploy
Provides flexibility in availability calculation
[Figure: two-tier framework; outage monitoring and measurement run in the access and core routers (COOL, Outage Monitor MIB), while outage correlation and calculation run in the NMS]
Outage Model
[Figure: access router with RPs, power, fan, physical and logical interfaces; links to customer equipment (MUX/hub/switch) and a peer router; monitored by the network management system]
Type   Objects Monitored         Failure Modes
A      Physical entity objects   Component hardware or software failure, including the failure of line cards, power supplies, fans, switch fabric, and so on
B      Interface objects         Interface hardware or software failure, loss of signal
C      Remote objects            Failure of remote device (customer equipment or peer networking device) or link in between
D      Software objects          Failure of software processes running on the RPs and line cards
Outage Characterization
Data definition:
Defect threshold: a value across which the object is considered to be defective (service degradation or complete outage)
Duration threshold: the minimum period beyond which an outage needs to be reported (given SLA)
Start time: when the object outage starts
End time: when the outage ends
[Figure: timeline showing a down event crossing the defect threshold at the start time and an up event at the end time; the outage duration between them is reported only if it exceeds the duration threshold]
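The duration-threshold rule above can be sketched as a filter over down intervals; the event tuples and threshold value are hypothetical:

```python
# Report only down intervals that exceed the duration threshold.
def record_outages(events, duration_threshold):
    """events: list of (start_time, end_time) down intervals, in seconds."""
    return [
        {"start": s, "end": e, "duration": e - s}
        for s, e in events
        if e - s >= duration_threshold
    ]

downs = [(100, 103), (500, 560), (900, 1800)]
print(record_outages(downs, duration_threshold=30))  # drops the 3 s blip
```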
Architecture
Outage Manager
Internal Component Outage Detector: callbacks and syslog from the IOS fault manager (event source)
Remote Component Outage Detector: customer equipment detection function using ping and SAA APIs
Data table structure: outage component table, event history table, event map table, process map table, remote component map table
HA and persistent data store: time stamp, temp event data, crash reason, and outage data kept in NVRAM/ATA Flash
Outage Monitor MIB: SNMP polling and SNMP notification
Configuration: CLI with customer authentication; baseline plus optional CPU usage detection
[Figure sections: measurement metrics, customer interfaces, measurement methods]
Outage Data: AOT and NAF
Requirements of measurement metrics:
Enable calculation of MTTR, MTBF, availability, and SLA assessment
Ensure measurement efficiency in terms of resources (CPU, memory, and network bandwidth)
Measurement metrics per object:
AOT: Accumulated Outage Time since measurement started
NAF: Number of Accumulated Failures since measurement started
Example: Router 1 suffers two 10-minute system crashes, giving AOT = 20 and NAF = 2.
[Diagram: up/down timeline for Router 1 with two 10-minute outages.]
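The AOT and NAF counters can be sketched as a simple fold over a list of outage intervals. The data model here (a list of start/end pairs) is illustrative, not the actual COOL implementation:

```python
# Sketch: accumulate AOT (total outage time) and NAF (failure count)
# from a list of (down_time, up_time) outage intervals, in minutes.
# Hypothetical data model, not the actual COOL implementation.

def aot_naf(outages):
    """Return (AOT, NAF) for a list of (start, end) outage intervals."""
    aot = sum(end - start for start, end in outages)
    naf = len(outages)
    return aot, naf

# Two 10-minute system crashes, as in the Router 1 example:
print(aot_naf([(100, 110), (200, 210)]))  # (20, 2)
```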
Outage Data: AOT and NAF
Object containment model: Router Device contains Line Card, which contains Physical Interface, which contains Logical Interface
Containment-independent property:
Router Device: AOT = 20, NAF = 2 (two 10-minute system crashes)
Interface 1: AOT = 7, NAF = 1 (one 7-minute interface failure)
Service affecting (combined): AOT = 27, NAF = 3
[Diagram: up/down timelines for Router 1 (two 10-minute crashes) and Interface 1 (one 7-minute failure).]
Example: MTTR
Find MTTR for Object i
MTTRi = AOTi/NAFi
= 14/2
= 7 min
[Diagram: Object i over measurement interval (T2 - T1) with two failures; TTRs of 10 min and 4 min, so AOT = 14 and NAF = 2.]
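The MTTR formula above is a one-liner; a minimal sketch:

```python
# MTTR: Mean Time to Repair = accumulated outage time / number of failures.

def mttr(aot, naf):
    """Mean Time to Repair for one object, given its AOT and NAF."""
    return aot / naf

# Object i: TTRs of 10 min and 4 min -> AOT = 14, NAF = 2
print(mttr(14, 2))  # 7.0 (minutes)
```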
Example: MTBF and MTTF
Find MTBF and MTTF for Object i
MTBFi = (T2 - T1)/NAFi = 1,400,000/2 = 700,000 min
MTTFi = MTBFi - MTTRi = (T2 - T1 - AOTi)/NAFi = 700,000 - 7 = 699,993 min
[Diagram: Object i over measurement interval (T2 - T1) = 1,400,000 min with two failures; TTR, TTF, and TBF spans marked.]
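The MTBF and MTTF formulas can be sketched directly from the definitions above:

```python
# MTBF and MTTF per object over a measurement interval [t1, t2].

def mtbf(t1, t2, naf):
    """Mean Time Between Failures: interval length / number of failures."""
    return (t2 - t1) / naf

def mttf(t1, t2, aot, naf):
    """Mean Time To Failure: MTBF minus MTTR = (interval - AOT) / NAF."""
    return (t2 - t1 - aot) / naf

# Object i: interval of 1,400,000 min, AOT = 14, NAF = 2
print(mtbf(0, 1_400_000, 2))      # 700000.0
print(mttf(0, 1_400_000, 14, 2))  # 699993.0
```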
Example: Availability and DPM
Find availability and DPM for Object i
Availability (%) = MTBF/(MTBF + MTTR) * 100
Availability = (700,000/700,007) * 100 = 99.999%
DPMi = [AOTi/(T2 - T1)] x 10^6 = 10 DPM
[Diagram: Object i over a measurement interval of 1,400,000 min with two failures of 10 min and 4 min.]
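Both formulas above are easily checked in code:

```python
# Availability (%) and Defects Per Million from the slide's formulas.

def availability_pct(mtbf, mttr):
    """Availability as a percentage: MTBF / (MTBF + MTTR) * 100."""
    return mtbf / (mtbf + mttr) * 100

def dpm(aot, interval):
    """Defects Per Million: outage fraction of the interval scaled to 10**6."""
    return aot / interval * 1_000_000

# Object i: MTBF = 700,000 min, MTTR = 7 min, AOT = 14 min over 1,400,000 min
print(round(availability_pct(700_000, 7), 3))  # 99.999
print(round(dpm(14, 1_400_000), 6))            # 10.0
```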
Planned Outage Measurement
Captures operational CLI commands: both reload and forced switchover
A simple rule derives an upper bound of the planned outage:
If there is no NVRAM soft-crash file, check the reboot reason or switchover reason
If it is reload or forced switchover, the outage can be counted toward an upper bound of the planned outage
[Diagram: operation-caused outages (send break, reload, forced switchover) contain the planned outages (reload, forced switchover), which form the upper bound of the planned outage.]
Event Filtering
Flapping interface detection and filtering:
A faulty interface can keep changing state between up and down
May cause virtual network disconnection
May cause an event storm, with hundreds of messages for each flapping episode
May make the object MTBF unreasonably low due to frequent short failures
This unstable condition needs to get the operator's attention
COOL detects the flapping status by:
Catching very short outage events (less than the duration threshold)
Incrementing the event counter
Flapping status: if the counter exceeds the flapping threshold (3 events) within a short period (1 sec), a notification is sent
Stable status: if it falls back below the threshold, another notification is sent
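The flapping heuristic above can be sketched as a sliding-window count over event timestamps. The threshold names and data model are illustrative, not the actual COOL implementation:

```python
# Sketch of flapping detection: flag an object as flapping when more than
# FLAP_THRESHOLD short outage events fall inside a WINDOW-second span.
# Threshold values follow the slide (3 events, 1 sec) but the code is
# an illustrative sketch, not the actual COOL implementation.

FLAP_THRESHOLD = 3   # event counter threshold
WINDOW = 1.0         # seconds

def is_flapping(event_times):
    """Return True if any WINDOW-second span holds > FLAP_THRESHOLD events."""
    events = sorted(event_times)
    start = 0
    for end in range(len(events)):
        while events[end] - events[start] > WINDOW:
            start += 1
        if end - start + 1 > FLAP_THRESHOLD:
            return True
    return False

print(is_flapping([0.0, 0.2, 0.5, 0.9]))  # True: 4 events within 1 sec
print(is_flapping([0.0, 2.0, 4.0]))       # False: events are well spaced
```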
Data Persistency and Redundancy
Data persistency
To avoid data loss due to a link outage or a crash of the router itself
Data redundancy
To continue the outage measurement after the switchover
To retain the outage data even if the RP is physically replaced
[Diagram: on each RP (active and standby), COOL keeps outage data in RAM, with event-driven updates to NVRAM and periodic updates to persistent outage data in Flash; data is copied between the active and standby RPs.]
Outage Monitor MIB
CISCO-OUTAGE-MONITOR-MIB: iso.org.dod.internet.private.enterprise.cisco.ciscoMgmt.ciscoOutageMIB (1.3.6.1.4.1.9.9.280)
cOutageObjectTable: Object-Type; Object-Index; Object-Status; Object-AOT; Object-NAF
cOutageHistoryTable: Object-Type; Object-Index; Event-Reason-Index; Event-Time; Event-Interval
Event Reason Map Table (event description)
Remote Object Map Table (remote object description)
Maps to other MIBs: ifTable in IF-MIB (interface object description), entPhysicalTable in ENTITY-MIB (physical entity object description), cpmProcessTable in CISCO-PROCESS-MIB (process object description)
Configuration
Config CLI: run; add; removal; filtering-enable
Show CLI: show event-table, show object-table (MIB display of the Event Table and Object Table)
COOL updates the Object Table and Event Table through Cisco IOS configuration; the customer equipment detection function feeds updates as well
Enabling COOL
Obtain the authorization file, then enable COOL:

ari# dir
Directory of disk0:/
1 -rw- 19014056 Oct 29 2003 16:09:28 +00:00 gsr-k4p-mz.120-26.S.bin
128057344 bytes total (109051904 bytes free)
ari# copy tftp disk0:
Address or name of remote host []? 88.1.88.9
Source filename []? auth_file
Destination filename [auth_file]?
Accessing tftp://88.1.88.9/auth_file...
Loading auth_file from 88.1.88.9 (via FastEthernet1/2): !
[OK - 705 bytes]
705 bytes copied in 0.532 secs (1325 bytes/sec)
ari# clear cool persist-files
ari# conf t
Enter configuration commands, one per line. End with CNTL/Z.
ari(config)# cool run
ari(config)# ^Z
ari# wr mem
Building configuration...
[OK]
COOL
Pros
Accurate network availability for devices, components, and software
Accounts for routing problems
Implementation with low network overhead.
Enables correlation between active and passive availabilitymethodologies
Cons
Only a few systems currently have the COOL feature
Requires implementation in the router configurations of production devices
Availability granularity limited by polling frequency
New Cisco IOS Feature
Network Availability Collection Methods
APPLICATION LAYER MEASUREMENT
Application Reachability
Similar to ICMP Reachability
Method definition:
Central workstation or computer configured to send packets that mimic application packets
How:
Agents on client and server computers send, receive, and collect data
Examples: Fire Runner, Ganymede Chariot, Gyra Research, Response Networks, Vital Signs Software, NetScout, custom application queries on customer systems
Installing special probes on user and server subnets to send, receive, and collect data; NikSun and NetScout
Unavailability:
Pre-defined QoS definition
Application Reachability
Pros
Actual application availability can be understood
QoS, by application, can be factored into the availability measurement
Cons
Depending on scale, potentially high overhead and cost can be expected
DATA COLLECTION FOR ROOT CAUSE ANALYSIS (RCA) OF NETWORK OR DEVICE DOWNTIME
Data Gathering Techniques
Alarm and event
History and statistics
Set thresholds in router configuration
Configure SNMP traps to be sent when a MIB variable rises above and/or falls below a given threshold
Alleviates need for frequent polling
Not an availability methodology by itself, but can add valuable information and customization to the data collection method
Cisco IOS Embedded RMON
Data Gathering Techniques
Provide information on what the router is doing
Categorized by feature and severity level
User can configure Syslog logging levels
User can configure Syslog messages to be sent asSNMP traps
Not an availability methodology by itself, but can add valuable information and customization to the data collection method
Syslog Messages
Expression and Event MIB
Expression MIB
Allows you to create new SNMP objects based upon formulas
MIB persistence is supported: a MIB's SNMP data persists across reloads
Delta and wildcard support allows you to:
Calculate utilization for all interfaces with one expression
Calculate errors as a percentage of traffic
Event MIB
Allows you to create custom notifications and log them and/or send them as SNMP traps or informs
MIB persistence is supported: a MIB's SNMP data persists across reloads
Can be used to test objects on other devices
More flexible than RMON events/alarms
RMON is tailored for use with counter objects
Data Gathering Techniques
Underlying philosophy:
Embed intelligence in routers and switches to enable a scalable and distributed solution, with OPEN interfaces for NMS/EMS leverage of the features
Mission statement:
Provide robust, scalable, powerful, and easy-to-use embedded managers to solve problems such as syslog and event management within Cisco routers and switches
Embedded Event Manager
Embedded Event Manager (Cont.)
Development goal: predictable, consistent, scalable management
Distributed
Independent of central management system
Control is in the customer's hands
Customization
Local programmable actions:
Triggered by specific events
Cisco IOS Embedded Event Manager: Basic Architecture (v1)
[Diagram: event detectors (syslog, SNMP, other) feed syslog events, SNMP data, and other events to the Embedded Event Manager; EEM policies combine them with network knowledge and trigger actions such as notify, switch-over, and reload.]
EEM Versions
EEM Version 1
Allows policies to be defined using the Cisco IOS CLI applet
The following policy actions can be established:
Generate prioritized syslog messages
Generate a CNS event for upstream processing by Cisco CNS devices
Reload the Cisco IOS software
Switch to a secondary processor in a fully redundant hardware configuration
EEM Version 2
EEM Version 2 adds programmable actions using the Tcl subsystem within Cisco IOS
Includes more event detectors and capabilities
EEM Version 2 Architecture
More event detectors!
Define policies or programmable local actions using Tcl
Register policies with the EEM server
Events trigger policy execution
Tcl extensions for CLI control and defined actions
[Diagram: IOS subsystems (POSIX process manager, IOS process watchdog, syslog daemon, system manager, watchdog sysmon, HA redundancy facility, timer services, counters, interface counters and stats, SNMP) publish events through event detectors to the Embedded Event Manager server; EEM policies and the Tcl shell subscribe to receive events and implement policy actions; applications publish events via an application-specific event detector.]
What Does This Mean to the Business?
Better problem determination: widely applicable scripts from Cisco engineering and TAC
Automated local action triggered by events
Automated data collection
Faster problem resolution: reduces the "next time it happens, please collect..." cycle
Better diagnostic data to Cisco engineering
Faster identification and repair
Less downtime: reduced susceptibility and Mean Time to Repair (MTTR)
Better service: responsiveness
Prevents recurrence: higher availability
Not an availability methodology by itself, but can add valuable information and customization to the data collection method
INSTILLING AN AVAILABILITY CULTURE
Putting an Availability Program into Practice
Track network availability
Identify defects
Identify root cause andimplement fix
Reduce operating expense by eliminating non-value-added work
How much does an outage cost today? How much can I save through process and product enhancements?
How Do I Start?
1. What are you using now?
a. Add or modify trouble ticketing analysis
b. Add or improve active monitoring method
2. Process: analyze the data!
a. What caused an outage?
b. Can a root cause be identified and addressed?
3. Implement improvements or fixes
4. Measure the results
5. Back to step 1: are other metrics needed?
If You Have a Network Availability Method
Use the current method and metric for improvement
Don't try to change it completely
Use incremental improvements
Develop additional methods to gather data as identified
Concentrate on understanding unavailability causes; all unavailability causes should be classified at a minimum under:
Change, SW, HW, power/facility, or link
Identify the actions to correct unavailability causes
e.g., network design, customer process change, HW MTBF improvement, etc.
Multilayer Network Design
[Diagram: core/backbone, distribution, and access layers with server farm, WAN, Internet, and PSTN building blocks; SA Agent measures between access and distribution.]
Multilayer Network Design
[Diagram: same multilayer design; SA Agent measures between servers and WAN users.]
Multilayer Network Design
[Diagram: same multilayer design; COOL on high-end core devices.]
Multilayer Network Design
[Diagram: same multilayer design; trouble ticketing methodology applied across the network.]
AVAILABILITY MEASUREMENT SUMMARY
Summary
The availability metric is governed by your business objectives
Availability measurement's primary goals are:
To provide an availability baseline (maintain)
To help identify where to improve the network
To monitor and control improvement projects
Can you identify "Where are we now?" for your network?
Do you know "Where are we going?" as network-oriented business objectives?
Do you have a plan to take you there?
Complete Your Online Session Evaluation!
WHAT: Complete an online session evaluation and your name will be entered into a daily drawing
WHY: Win fabulous prizes! Give us your feedback!
WHERE: Go to the Internet stations located throughout the Convention Center
HOW: Winners will be posted on the onsite Networkers Website; four winners per day
Recommended Reading
Performance and Fault Management
ISBN: 1-57870-180-5
High Availability Network Fundamentals
ISBN: 1-58713-017-3
Network Performance Baselining
ISBN: 1-57870-240-2
The Practical Performance Analyst
ISBN: 0-07-912946-3
Recommended Reading (Cont.)
The Visual Display of Quantitative Information
by Edward Tufte (ISBN: 0-9613921-0)
Practical Planning for Network Growth
by John Blommers (ISBN: 0-13-206111-2)
The Art of Computer Systems Performance Analysis
by Raj Jain (ISBN: 0-471-50336-3)
Implementing Global Networked Systems Management: Strategies and Solutions
by Raj Ananthanpillai (ISBN: 0-07-001601-1)
Information Systems in Organizations: Improving Business Processes
by Richard Maddison and Geoffrey Darnton (ISBN: 0-412-62530-X)
Integrated Management of Networked Systems: Concepts, Architectures, and Their Operational Application
by Hegering, Abeck, Neumair (ISBN: 1558605711)
Appendix A: Acronyms
AVG: Average
ATM: Asynchronous Transfer Mode
DPM: Defects Per Million
FCAPS: Fault, Config, Acct, Perf, Security
GE: Gigabit Ethernet
HA: High Availability
HDLC: High-Level Data Link Control
HSRP: Hot Standby Routing Protocol
IPM: Internet Performance Monitor
IUM: Impacted User Minutes
MIB: Management Information Base
MTBF: Mean Time Between Failure
MTTR: Mean Time to Repair
RME: Resource Manager Essentials
RMON: Remote Monitor
SA Agent: Service Assurance Agent
SNMP: Simple Network Management Protocol
SPF: Single Point of Failure; Shortest Path First (routing protocol)
TCP: Transmission Control Protocol
BACKUP SLIDES
ADDITIONAL RELIABILITY SLIDES
Network Design: What Is Reliability?
Reliability is often used as a general term that refers to the quality of a product
Failure Rate
MTBF (Mean Time Between Failures) or
MTTF (Mean Time to Failure)
Availability
Reliability Defined
1. The probability of survival (or no failure) for a stated length of time
2. Or, the fraction of units that will not fail in the stated length of time
A mission time must be stated
Annual reliability is the probability of survival for one year
Reliability:
Availability Defined
1. The probability that an item (or network, etc.) is operational, and ready to go, at any point in time
2. Or, the expected fraction of time it is operational; annual uptime is the amount of time (in days, hrs., min., etc.) the item is operational in a year
Example: for 98% availability, the annual uptime is 0.98 x 365 days = 357.7 days
Availability:
MTBF Defined
MTBF stands for Mean Time Between Failure
MTTF stands for Mean Time to Failure
This is the average length of time between failures (MTBF) or, to a failure (MTTF)
More technically, it is the mean time to go from an operational state to a non-operational state
MTBF is usually used for repairable systems, and MTTF is used for non-repairable systems
How Reliable Is It?
Reliability at t = MTBF:
R = e^(-MTBF/MTBF)
R = e^(-1) = 36.8%
Reliability at the MTBF is only about 37%; that is, roughly 63% of your HARDWARE fails before the MTBF!
But remember, failures are still random!
MTTR Defined
MTTR stands for Mean Time to Repair
or
MRT (Mean Restore Time)
This is the average length of time it takes to repair an item
More technically, it is the mean time to go from a non-operational state to an operational state
One Method of Calculating Availability
Availability = MTBF / (MTBF + MTTR)
What is the availability of a computer with MTBF = 10,000 hrs. and MTTR = 12 hrs.?
A = 10000 / (10000 + 12) = 99.88%
Uptime
Annual uptime
8,760 hrs/year x 0.9988 = 8,749.5 hrs
Conversely, annual DOWNtime is
8,760 hrs/year x (1 - 0.9988) = 10.5 hrs
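The availability, uptime, and downtime arithmetic above can be checked with a few lines:

```python
# Availability from MTBF/MTTR, and annual uptime/downtime from availability.

HOURS_PER_YEAR = 8760

def availability(mtbf, mttr):
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

a = availability(10_000, 12)               # computer: MTBF 10,000 h, MTTR 12 h
print(round(a * 100, 2))                   # 99.88 (percent)
print(round(HOURS_PER_YEAR * a, 1))        # 8749.5 annual uptime hours
print(round(HOURS_PER_YEAR * (1 - a), 1))  # 10.5 annual downtime hours
```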
Systems (RBD: reliability block diagram)
Components in-series: Component 1 followed by Component 2
Components in-parallel (redundant): Component 1 alongside Component 2
In-Series
[Diagram: up/down timelines for Part 1 and Part 2 in series; the system is down whenever either part is down.]
In-Parallel
[Diagram: up/down timelines for Part 1 and Part 2 in parallel; the system is down only when both parts are down at the same time.]
In-Series MTBF
Component 1: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component 2: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component failure rate = 1/2500 = 0.0004
System failure rate = 0.0004 + 0.0004 = 0.0008
System MTBF = 1/0.0008 = 1,250 hrs.
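The in-series MTBF calculation (sum the failure rates, then invert) can be sketched as:

```python
# In-series system MTBF: failure rates add, so the system MTBF is the
# reciprocal of the sum of the component failure rates.

def series_mtbf(mtbfs):
    """System MTBF of components in series, given their individual MTBFs."""
    return 1 / sum(1 / m for m in mtbfs)

# Two components, each with MTBF = 2,500 hrs:
print(series_mtbf([2500, 2500]))  # ~1250 hrs
```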
In-Series Reliability
Component 1: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component 2: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component ANNUAL reliability: R = e^(-8760/2500) = 0.03
System ANNUAL reliability: R = 0.03 x 0.03 = 0.0009
In-Series Availability
Component 1: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component 2: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component availability: A = 2500 / (2500 + 10) = 0.996
System availability: A = 0.996 x 0.996 = 0.992
In-Parallel MTBF
Component 1: MTBF = 2,500 hrs.
Component 2: MTBF = 2,500 hrs.
System MTBF* = 2500 + 2500/2 = 3,750 hrs.
In general*, system MTBF = sum for i = 1 to n of MTBF/i
*For 1-of-n redundancy of n identical components with NO repair or replacement of failed components
1-of-4 Example
In general*, system MTBF = sum for i = 1 to n of MTBF/i
Here: 2500/1 + 2500/2 + 2500/3 + 2500/4 = 5,208 hrs.
*For 1-of-n redundancy of n identical components with NO repair or replacement of failed components
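The 1-of-n summation formula above is easy to check in code:

```python
# 1-of-n parallel MTBF (n identical components, no repair or replacement):
# system MTBF = sum over i = 1..n of MTBF/i.

def parallel_mtbf(mtbf, n):
    """System MTBF for 1-of-n redundancy of n identical components."""
    return sum(mtbf / i for i in range(1, n + 1))

print(parallel_mtbf(2500, 2))         # 3750.0 hrs (the two-component slide)
print(round(parallel_mtbf(2500, 4)))  # 5208 hrs (the 1-of-4 example)
```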
In-Parallel Reliability
Component 1: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component 2: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component ANNUAL reliability: R = e^(-8760/2500) = 0.03 (unreliability = 0.97)
System ANNUAL reliability: R = 1 - [(1 - 0.03) x (1 - 0.03)] = 1 - 0.94 = 0.06
In-Parallel Availability
Component 1: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component 2: MTBF = 2,500 hrs.; MTTR = 10 hrs.
Component availability: A = 2500 / (2500 + 10) = 0.996 (unavailability = 0.004)
System availability: A = 1 - [(1 - 0.996) x (1 - 0.996)] = 1 - 0.000016 = 0.999984
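The series and parallel availability rules (multiply availabilities in series; multiply unavailabilities in parallel) can be sketched together:

```python
# Series/parallel availability combination rules from the slides above.

def avail(mtbf, mttr):
    """Component availability: MTBF / (MTBF + MTTR)."""
    return mtbf / (mtbf + mttr)

def series(avails):
    """Series system: product of component availabilities."""
    p = 1.0
    for a in avails:
        p *= a
    return p

def parallel(avails):
    """Parallel system: one minus the product of unavailabilities."""
    q = 1.0
    for a in avails:
        q *= (1 - a)
    return 1 - q

a = avail(2500, 10)               # each component: MTBF 2,500 h, MTTR 10 h
print(round(series([a, a]), 3))   # 0.992
print(round(parallel([a, a]), 6)) # 0.999984
```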
Complex Redundancy
Pure active parallel, m-of-n
[Diagram: components 1, 2, 3, ..., n in parallel; m of the n must be operational.]
Examples: 1-of-2, 2-of-3, 2-of-4, 8-of-10
More Complex Redundancy
Pure active parallel
All components are on
Standby redundant
Backup components are not operating
Perfect switching
Switch-over is immediate and without fail
Switchover reliability: the probability of successful switchover when switching is not perfect
Load sharing
All units are on and workload is distributed
Networks Consist of Series-Parallel
Combinations of in-series and redundant components
[Diagram: RBD with A in series with a 1-of-2 block (B1, B2), then C, a 2-of-3 block (D1, D2, D3), E, and F.]
Failure Rate
The number of failures per time:
Failures/hour
Failures/day
Failures/week
Failures/10^6 hours
Failures/10^9 hours, called FITs (Failures in Time)
Approximating MTBF
13 units are tested in a lab for 1,000 hours, with 2 failures occurring
Another 4 units were tested for 6,000 hours, with 1 failure occurring
The failed units are repaired (or replaced)
What is the approximate MTBF?
Approximating MTBF (Cont.)
MTBF = (13 x 1000 + 4 x 6000) / (2 + 1)
= 37,000 / 3
= 12,333 hours
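The point estimate above (total accumulated unit-hours divided by total failures) can be sketched as:

```python
# Approximate MTBF: total accumulated unit-hours / total observed failures.

def approx_mtbf(unit_hours, failures):
    """Point estimate of MTBF from pooled test data."""
    return unit_hours / failures

# 13 units x 1,000 hrs (2 failures) plus 4 units x 6,000 hrs (1 failure):
total_hours = 13 * 1000 + 4 * 6000        # 37,000 unit-hours
print(round(approx_mtbf(total_hours, 3))) # 12333 hours
```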
Modeling
Distributions: Normal, Log-Normal, Weibull, Exponential
[Diagram: frequency vs. time-to-failure curves for these distributions, with the MTBF marked on each.]
Constant Failure Rate
The Exponential Distribution
The exponential function:
f(t) = λe^(-λt), t > 0
Failure rate, λ, IS CONSTANT
λ = 1/MTBF
If MTBF = 2,500 hrs., what is the failure rate?
λ = 1/2500 = 0.0004 failures/hr.
The Bathtub Curve
[Diagram: failure rate vs. time; a DECREASING failure rate during infant mortality, a CONSTANT failure rate during the useful life period, and an INCREASING failure rate during wear-out.]
The Exponential Reliability Formula
Commonly used for electronic equipment
The exponential reliability formula:
R(t) = e^(-λt) or R(t) = e^(-t/MTBF)
Calculating Reliability
A certain Cisco router has an MTBF of 100,000 hrs.; what is the annual reliability?
Annual reliability is the reliability for one year, or 8,760 hrs.
R = e^(-8760/100000) = 91.6%
This says that the probability of no failure in one year is 91.6%; or, 91.6% of all units will survive one year
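The exponential reliability formula R(t) = e^(-t/MTBF) can be checked directly:

```python
import math

# Exponential reliability: probability of surviving t hours given the MTBF.

def reliability(t, mtbf):
    """R(t) = e^(-t/MTBF) for a constant failure rate."""
    return math.exp(-t / mtbf)

# Annual reliability of a router with MTBF = 100,000 hrs:
print(round(reliability(8760, 100_000) * 100, 1))  # 91.6 (percent)
```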
ADDITIONAL TROUBLE TICKETING SLIDES
Essential Data Elements

Parameter                  Format             Description
Date                       dd/mmm/yy          Date ticket issued
Ticket                     Alphanumeric       Trouble ticket number
Start Date                 dd/mmm/yy          Date of fault
Start Time                 hh:mm              Time of fault
Resolution Date            dd/mmm/yy          Date of resolution
Resolution Time            hh:mm              Time of resolution
Customers Impacted         Integer            Number of customers that lost service; number impacted or names of customers impacted
Problem Description        String             Outline of the problem
Root Cause                 String             HW, SW, process, environmental, etc.
Component/Part/SW Version  Alphanumeric       For HW problems include product ID; for SW include release version
Type                       Planned/Unplanned  Identify if the event was due to planned maintenance activity or unplanned outage
Resolution                 String             Description of action taken to fix the problem

Note: the above is the minimum data set; however, if other information is captured it should be provided
HA Metrics/NAIS Synergy
[Flow diagram; recoverable labels:]
Trouble tickets: definitions, data accuracy, collection processes
Operational (process and procedures): problem management, fault management, change management, performance management, availability management; analyzed trouble ticket data is referred for process/procedural improvement
Analysis (data analysis): network reliability improvement analysis, resiliency assessment, baseline availability; determine DPM (Defects Per Million) by: planned/unplanned, root cause, resolution, equipment, MTTR; referral for analysis
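The slide lists the dimensions to break DPM down by; the DPM figure itself is commonly computed as impacted units per million service units. A minimal sketch, assuming customer-minutes as the unit (the numbers are invented):

```python
def defects_per_million(impacted_units: float, total_units: float) -> float:
    """DPM = (impacted units / total units) x 1,000,000.

    Units are commonly customer-minutes: impacted_units is customer-minutes
    of lost service, total_units is customer-minutes of scheduled service.
    """
    return impacted_units / total_units * 1_000_000

# e.g. 500 customers down for 30 min, out of 10,000 customers over one week
impacted = 500 * 30
total = 10_000 * 7 * 24 * 60
print(round(defects_per_million(impacted, total), 1))  # → 148.8
```

Each DPM value can then be computed per category (planned/unplanned, root cause, equipment, and so on) by filtering the ticket data before summing the impacted units.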
ADDITIONAL SA AGENT SLIDES
SA Agent: How It Works
1. User configures Collectors through Mgmt Application GUI
2. Mgmt Application provisions Source routers with Collectors
3. Source router measures and stores performance data, e.g.: response time, availability
4. Source router evaluates SLAs, sends SNMP Traps
5. Source router stores latest data point and 2 hours of aggregated points
6. Application retrieves data from Source routers once an hour
7. Data is written to a database
8. Reports are generated
[Diagram: Management Application communicating with the SA Agent Source routers via SNMP]
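The management-application side of steps 6-8 might be sketched as follows; the table layout is invented, and `fetch` is a hypothetical stand-in for the SNMP retrieval of SA Agent statistics (not shown here):

```python
import sqlite3
import time

def poll_sources(sources, fetch, db_path=":memory:"):
    """Steps 6-8: retrieve the latest SA Agent data from each Source
    router and write it to a database; call once an hour from a scheduler.
    `fetch(router)` is assumed to return (rtt_ms, ok)."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS samples "
                "(router TEXT, ts INTEGER, rtt_ms REAL, ok INTEGER)")
    for router in sources:
        rtt_ms, ok = fetch(router)  # hypothetical SNMP GET against the router
        con.execute("INSERT INTO samples VALUES (?, ?, ?, ?)",
                    (router, int(time.time()), rtt_ms, ok))
    con.commit()
    return con

# a simple report (step 8): per-router average response time
con = poll_sources(["r1", "r2"], lambda r: (42.0, 1))
for row in con.execute("SELECT router, AVG(rtt_ms) FROM samples GROUP BY router"):
    print(row)
```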
SAA Monitoring IP Core
[Diagram: a Management System provisions probes P1-P3 on routers R1-R3 across the IP Core]
Monitoring Customer IP Reachability
P1-Pn Service Assurance Agent ICMP polls to a test point in the IP core
[Diagram: probes P1-PN on customer networks Nw1-NwN polling test points TP1-TPx in the IP core]
Service Assurance Agent Features
Measures Service Level Agreement (SLA) metrics:
Packet loss
Response time
Throughput
Availability
Jitter
Evaluates SLAs
Proactively sends notification of SLA violations
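The SLA-evaluation step amounts to comparing each measured metric against its contracted bound. A minimal sketch; the metric and threshold names are illustrative, not the SA Agent MIB's:

```python
def violates_sla(metrics: dict, sla: dict) -> list:
    """Return the names of SLA metrics that are out of bounds.
    Metric/threshold names are illustrative, not the SA Agent MIB's."""
    checks = {
        "loss_pct":  lambda v, t: v > t,  # packet loss above limit
        "rtt_ms":    lambda v, t: v > t,  # response time above limit
        "jitter_ms": lambda v, t: v > t,  # jitter above limit
        "avail_pct": lambda v, t: v < t,  # availability below limit
    }
    return [m for m, check in checks.items()
            if m in sla and check(metrics[m], sla[m])]

measured = {"loss_pct": 0.5, "rtt_ms": 120, "jitter_ms": 4, "avail_pct": 99.9}
contract = {"rtt_ms": 100, "avail_pct": 99.95}
print(violates_sla(measured, contract))  # → ['rtt_ms', 'avail_pct']
```

In the SA Agent itself the equivalent check happens on the Source router, which then sends the SNMP trap proactively rather than waiting for a poll.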
SA Agent Impact on Devices
Low impact on CPU utilization
Approximately 18 KB of memory per SA Agent collector
The SAA rtr low-memory command can be used to bound SA Agent memory usage
Monitored Network Availability Calculation
Not calculated:
Already have availability baseline
Fault type, frequency and downtime may be more useful
Faults directly measured from management system(s)
Monitored Network Availability
Assumptions
All connections below IP are fixed
Management systems can be notified of all fixed connection state changes
All (L2) events impact the IP (L3) service
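Under these assumptions, downtime for a fixed connection can be accumulated directly from the reported state-change notifications. A minimal sketch, with invented event tuples:

```python
def downtime_seconds(events, t_start, t_end):
    """Sum outage time from (timestamp, state) notifications, where state
    is 'down' or 'up'.  Assumes the connection is up at t_start and that
    every state change is reported (the notification assumption above)."""
    total, down_since = 0, None
    for ts, state in sorted(events):
        if state == "down" and down_since is None:
            down_since = ts
        elif state == "up" and down_since is not None:
            total += ts - down_since
            down_since = None
    if down_since is not None:  # still down at the end of the window
        total += t_end - down_since
    return total

events = [(100, "down"), (160, "up"), (500, "down"), (530, "up")]
print(downtime_seconds(events, 0, 1000))  # → 90
```

Availability over the window is then simply 1 minus downtime divided by window length.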
ADDITIONAL COOL SLIDES
CLIs
Configuration CLI commands:
[no] cool run
[no] cool interface interface-name (idb)
[no] cool physical-FRU-entity entity-index (int)
[no] cool group-interface group-objectID (string)
[no] cool add-cpu objectID threshold duration
[no] cool remote-device dest-IP (paddr) obj-descr (string) rate (int) repeat (int) [local-ip (paddr) mode (int)]
[no] cool if-filter group-objectID (string)
Display CLI commands:
Router#show cool event-table []   (displays all if not specified)
Router#show cool object-table []  (displays all object types if not specified)
Router#show cool fru-entity
Exec CLI commands:
Router#clear cool event-table
Router#clear cool persistent-files
Measurement Example: Router Device Outage
Reload (operational), power outage, or device H/W failure
Object table fields:
Type: interface(1), physicalEntity(2), process(3), remoteObject(4)
Index: the corresponding MIB table index; for physicalEntity(2), the index in the ENTITY-MIB
Status: up(1), down(2)
Last-change: time of the last object status change
AOT: Accumulated Outage Time (sec)
NAF: Number of Accumulated Failures
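Given AOT and NAF as defined above, per-object availability and a rough MTBF over a measurement window follow directly. A sketch; the 30-day window is an assumption for the example:

```python
def cool_availability(aot_seconds: float, window_seconds: float) -> float:
    """Availability (%) of a COOL-tracked object over a measurement
    window, from its Accumulated Outage Time (AOT)."""
    return (1 - aot_seconds / window_seconds) * 100

def mtbf_hours(window_seconds: float, naf: int) -> float:
    """Rough MTBF estimate from the Number of Accumulated Failures (NAF)."""
    return window_seconds / naf / 3600

# e.g. 42 s of outage and 1 failure over a 30-day window (assumed)
window = 30 * 24 * 3600
print(round(cool_availability(42, window), 5))  # → 99.99838
print(mtbf_hours(window, 1))                    # → 720.0
```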
Measurement Example: Cisco IOS S/W Outage
Standby RP in slot 0 crash using Address Error (4) test crash; the AdEL exception is caused purely by Cisco IOS S/W
Standby RP crash using Jump to Zero (5) test crash; the Bp exception can be caused by S/W, H/W, or operation
Measurement Example: Linecard Outage
Add a Linecard
Reset the Linecard
Down Event Captured
Up Event Captured
AOT and NAF Updated
Measurement Example: Interface Outage
Configure to monitor all interfaces whose names include the string ATM2/0., except ATM2/0.3:
12406-R1202(config)#cool group-interface ATM2/0.
12406-R1202(config)#no cool group-interface ATM2/0.3
Object table:
sh cool object 1 | include ATM2/0.
33 1 1054859087 0 0 0 ATM2/0.1
35 1 1054859088 0 0 0 ATM2/0.2
39 1 1054859090 0 0 0 ATM2/0.4
41 1 1054859090 0 0 0 ATM2/0.5
Shut the ATM2/0 interface down; down events are captured:
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut
show cool event-table
**** COOL Event Table ****
type index event time-stamp interval hist_id object-name
1 33 1 1054859105 18 1 ATM2/0.1
1 35 1 1054859106 18 2 ATM2/0.2
1 39 1 1054859107 17 3 ATM2/0.4
1 41 1 1054859108 18 4 ATM2/0.5
No shut the ATM2/0 interface; up events are captured:
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut
show cool event-table
**** COOL Event Table ****
type index event time-stamp interval hist_id object-name
1 33 0 1054859146 41 1 ATM2/0.1
1 35 0 1054859147 41 2 ATM2/0.2
1 39 0 1054859149 42 3 ATM2/0.4
1 41 0 1054859150 42 4 ATM2/0.5
Object table now shows AOT and NAF:
sh cool object 1 | include ATM2/0.
33 1 1054859087 0 41 1 ATM2/0.1
35 1 1054859088 0 41 1 ATM2/0.2
39 1 1054859090 0 42 1 ATM2/0.4
41 1 1054859090 0 42 1 ATM2/0.5
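Transcript output like the above can be post-processed off-box. A sketch of a parser for the `show cool event-table` rows; the column order is taken from the transcript (type index event time-stamp interval hist_id object-name), and for interface objects event 1 appears on shut and event 0 on no-shut:

```python
def parse_cool_events(output: str):
    """Parse `show cool event-table` rows into dicts, skipping the
    banner and header lines."""
    rows = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 7 and parts[0].isdigit():
            typ, idx, event, ts, interval, hist, name = parts
            rows.append({"type": int(typ), "index": int(idx),
                         "event": int(event), "timestamp": int(ts),
                         "interval": int(interval), "hist_id": int(hist),
                         "object": name})
    return rows

sample = """**** COOL Event Table ****
type index event time-stamp interval hist_id object-name
1 33 1 1054859105 18 1 ATM2/0.1
1 33 0 1054859146 41 1 ATM2/0.1"""
print(len(parse_cool_events(sample)))  # → 2
```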
Measurement Example: Remote Device Outage
3 remote devices are added:
12406-R1202(config)#cool remote-device 1 50.1.1.2 remobj.1 30 2 50.1.1.1 1
12406-R1202(config)#cool remote-device 2 50.1.2.2 remobj.2 30 2 50.1.2.1 1
12406-R1202(config)#cool remote-device 3 50.1.3.2 remobj.3 30 2 50.1.3.1 1
Object table:
sh cool object-table 4 | include remobj
1 1 1054867061 0 0 remobj.1
2 1 1054867063 0 0 remobj.2
3 1 1054867065 0 0 remobj.3
Shut down the interface link between the remote devices and the router; down events are captured:
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut
4 2 5 1054867105 42 2 remobj.2
4 1 5 1054867108 47 3 remobj.1
4 3 5 1054867130 65 10 remobj.3
No shut the interface link; up events are captured:
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut
4 1 4 1054867171 63 1 remobj.1
4 3 4 1054867193 63 8 remobj.3
4 2 4 1054867200 95 10 remobj.2
Object table shows AOT and NAF:
sh cool object-table 4 | include remobj
1 1 1054867061 63 1 remobj.1
2 1 1054867063 63 1 remobj.2
3 1 1054867065 95 1 remobj.3