ITIL OSA Problem Management

What is the purpose of Problem Management
-manage the lifecycle of all problems from identification to removal
-understand, document, and communicate known errors and initiate actions to improve or correct the situation
-reactively minimize the adverse impacts of incidents and problems
-proactively prevent recurrence of incidents related to errors
Objectives of problem management
-prevent problems and resulting incidents from happening
-eliminate recurring incidents
-minimize the impact of incidents that cannot be prevented
scope of problem management
is primarily within scope of service operation:
-proactive problem management supports CSI
-problems may be identified in transition activities
reactive problem management (scope)
resolve problems in response to specific incidents or events
proactive problem management (scope)
-triggered by improvement activities
-identify and solve problems and known errors before the incident recurs
-performed on an ongoing basis
-conduct a review and trend analysis of incident records looking for signs of common causes that can be corrected
-conduct major incident reviews
-conduct reviews of operational and event logs to identify problems
-use brainstorming sessions and check sheets to identify problems
business value of problem management
1. higher availability of services
2. higher IT staff productivity by reducing unplanned (reactive) work
3. reduced costs related to ineffective workarounds
4. reduced costs of effort related to resolving repeat incidents
policies of problem management
-problems should be tracked separately from incidents
-all problems should be stored and tracked in a single management system
-all problems should use a standard classification schema that is consistent across the enterprise
problem
-the unknown cause of one or more incidents
problem models
predefined steps for known types of problems
Criteria for raising a problem record
-incidents cannot be matched to existing problems and known errors
-trending reveals incidents that may share a common cause
-major incidents require root cause analysis
-other IT functions or suppliers identify a problem
-service desk restores service, but recognizes that the underlying cause was not identified or corrected
-analysis of an incident reveals an underlying problem
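A minimal sketch of the trending criterion above, assuming incident records are plain dictionaries. The field names (`ci`, `category`, `problem_id`), the sample records, and the threshold of three are all hypothetical and not prescribed by ITIL.

```python
from collections import defaultdict

def problem_candidates(incidents, threshold=3):
    """Group unmatched incidents by (CI, category); keep groups at or above the threshold."""
    groups = defaultdict(list)
    for inc in incidents:
        # Only consider incidents not yet matched to a problem or known error.
        if inc["problem_id"] is None:
            groups[(inc["ci"], inc["category"])].append(inc["id"])
    return {key: ids for key, ids in groups.items() if len(ids) >= threshold}

# Hypothetical incident records for illustration only.
incidents = [
    {"id": "INC-1", "ci": "mail-01", "category": "outage", "problem_id": None},
    {"id": "INC-2", "ci": "mail-01", "category": "outage", "problem_id": None},
    {"id": "INC-3", "ci": "mail-01", "category": "outage", "problem_id": None},
    {"id": "INC-4", "ci": "vpn-02", "category": "performance", "problem_id": None},
    {"id": "INC-5", "ci": "print-03", "category": "outage", "problem_id": "PRB-7"},
]

candidates = problem_candidates(incidents)
print(candidates)
```

Each group returned is only a candidate for a new problem record; a person still decides whether the incidents really share a cause.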
chronological analysis
-documents all events in chronological order
-provides a single timeline of all related events
-compare with incident and change records to identify or rule out potential causes
-useful when there are conflicting reports of what happened and when
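A minimal sketch of building the single timeline described above, assuming each record is a (timestamp, source, description) tuple. The incident and change records are hypothetical examples.

```python
from datetime import datetime

def build_timeline(*sources):
    """Merge events from several record sources into one time-ordered list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event[0])

# Hypothetical records: (timestamp, source, description)
incident_records = [
    (datetime(2024, 3, 1, 10, 15), "incident", "users report login failures"),
    (datetime(2024, 3, 1, 10, 40), "incident", "service desk restarts auth service"),
]
change_records = [
    (datetime(2024, 3, 1, 9, 55), "change", "certificate renewed on auth server"),
]

timeline = build_timeline(incident_records, change_records)
for when, source, description in timeline:
    print(when.isoformat(), source, description)
```

Reading the merged list top to bottom shows which changes preceded the first incident, which helps identify or rule out potential causes.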
pain value analysis
-seeks to understand the total impact of incidents and problems to the business
-enables better prioritization of problems by focusing on the areas of greatest pain
-some considerations: number of people affected; duration of downtime; cost to the business
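A minimal sketch of a pain value score built from the considerations above. The formula (people affected x downtime hours x cost per person-hour) is an assumption for illustration; ITIL does not fix a formula, and the problem IDs and figures are hypothetical.

```python
def pain_value(people_affected, downtime_hours, cost_per_person_hour):
    """Illustrative score: people affected x downtime x cost per person-hour."""
    return people_affected * downtime_hours * cost_per_person_hour

# Hypothetical problems and figures.
scores = {
    "PRB-1": pain_value(people_affected=200, downtime_hours=2, cost_per_person_hour=50),
    "PRB-2": pain_value(people_affected=20, downtime_hours=8, cost_per_person_hour=300),
}

# Highest pain first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Here PRB-2 ranks first even though fewer people are affected, because the longer downtime and higher cost outweigh the head count.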
Kepner and Tregoe Analysis
-formally investigate deeply rooted problems
-define the problem
-describe the problem (in terms of identity, location, time and size)
-establish possible causes
-test the most probable cause
-verify the true root cause
brainstorming
-gather relevant people
-brainstorm potential causes and resolutions
-document the outcomes and next steps
Affinity mapping
-organizes large amounts of data
-groups ideas by common characteristics
-assists in pattern recognition and assignment of next steps
5-Whys
-starts with a description of the problem
-asks "Why?" up to five times in succession to deduce the root cause
fault isolation
-re-execute the steps that led to the problem
-eliminate variables (CIs) until fault is identified
-helpful for persistent problems
hypothesis testing
-generate possible causes based on educated guesses
-convert each cause into testable statements
-test each statement to confirm or reject each hypothesis
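A minimal sketch of the three steps above, assuming each hypothesis is a (statement, check) pair and that the evidence has been gathered into a dictionary. The statements, field names, and values are hypothetical.

```python
def evaluate_hypotheses(hypotheses, evidence):
    """Run each testable statement against the evidence and label the outcome."""
    return {
        statement: "confirmed" if check(evidence) else "rejected"
        for statement, check in hypotheses
    }

# Hypothetical evidence collected during investigation.
evidence = {"disk_used_pct": 97, "recent_change": False}

# Each educated guess is converted into a statement that can be tested.
hypotheses = [
    ("disk is more than 90% full", lambda e: e["disk_used_pct"] > 90),
    ("a recent change introduced the fault", lambda e: e["recent_change"]),
]

results = evaluate_hypotheses(hypotheses, evidence)
print(results)
```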
technical observation post
-focuses on intermittent incidents and unknown causes
-cross functional IT team with specialist staff
-monitors events in real-time to catch the incidents and problems as they occur
Ishikawa Diagrams
-generated by brainstorming and affinity mapping sessions
-goal is represented as the trunk with primary factors represented as branches and secondary factors as stems
-can lead to a greater and more consistent level of understanding among group members and facilitate the assignment of follow-up actions
-shows possible causes
Pareto Analysis
-separates important causes from trivial issues
-can help organizations prioritize which problems should be addressed first
-useful for analyzing incident trends and volumes as part of proactive problem management
-helps focus the service provider and problem management team on the problems generating the highest number of incidents
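A minimal sketch of a Pareto table: causes ranked by incident count with a cumulative percentage. The cause labels and counts are hypothetical, and the 80% cut-off used to pick the "vital few" is a common convention rather than a rule.

```python
from collections import Counter

def pareto(causes):
    """Return (cause, count, cumulative %) rows, most frequent cause first."""
    counts = Counter(causes).most_common()
    total = sum(n for _, n in counts)
    rows, running = [], 0
    for cause, n in counts:
        running += n
        rows.append((cause, n, round(100 * running / total, 1)))
    return rows

# Hypothetical cause label for each of ten incidents.
incident_causes = ["network"] * 5 + ["storage"] * 3 + ["software"] + ["user error"]

table = pareto(incident_causes)
vital_few = [cause for cause, _, cumulative in table if cumulative <= 80]
print(table)
print(vital_few)
```

In this sample, two causes account for 80% of incidents, so they would be addressed first.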
Problem situation: complex problems where a sequence of events needs to be assembled to determine exactly what happened
Analysis Techniques:
-chronological analysis
-technical observation post
Problem situation: uncertainty over which problems should be addressed first
Analysis Technique:
-Pain value analysis
Problem situation: uncertain whether a presented root cause is truly the root cause
Analysis Technique:
-hypothesis testing
problem situation: intermittent problems that appear to come and go and cannot be recreated or repeated in a test environment
Analysis Techniques:
-technical observation post
-hypothesis testing
problem situation: uncertainty over where to start for problems that appear to have multiple causes
Analysis Techniques:
-Pareto analysis
-Ishikawa diagrams
problem situation: struggling to identify the exact point of failure for a problem
Analysis Techniques:
-fault isolation
-Ishikawa diagrams
-affinity mapping
problem situation: uncertain where to start when trying to find root cause
Analysis Technique:
-affinity mapping
problem management process activities
1. problem detection
2. problem logging
3. problem categorization
4. problem prioritization
5. problem investigation and diagnosis
6. workarounds
7. raising a known error record
8. problem resolution
9. problem closure
10. major problem review
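A minimal sketch that treats the ten steps above as a strictly ordered sequence. The strict ordering is a simplifying assumption: in practice, workarounds and known error records may be raised at any point during investigation, and a major problem review applies only to major problems.

```python
ACTIVITIES = [
    "problem detection",
    "problem logging",
    "problem categorization",
    "problem prioritization",
    "problem investigation and diagnosis",
    "workarounds",
    "raising a known error record",
    "problem resolution",
    "problem closure",
    "major problem review",
]

def next_activity(current):
    """Return the activity that follows `current`, or None after the last one."""
    index = ACTIVITIES.index(current)
    return ACTIVITIES[index + 1] if index + 1 < len(ACTIVITIES) else None

print(next_activity("problem logging"))
```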
details included in problem ticket
-user details
-service details
-equipment (CI) details
-date and time initially logged
-priority and category
-description of incident
-related incident numbers
-details of troubleshooting and recovery steps attempted
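A minimal sketch of a problem record carrying the details listed above. The class name, field names, types, and sample values are assumptions for illustration, not a schema from any particular service management tool.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class ProblemRecord:
    user_details: str
    service_details: str
    ci_details: str                 # equipment (CI) details
    logged_at: datetime             # date and time initially logged
    priority: int
    category: str
    description: str                # description of the incident
    related_incidents: list = field(default_factory=list)
    steps_attempted: list = field(default_factory=list)

# Hypothetical record.
record = ProblemRecord(
    user_details="finance team",
    service_details="email",
    ci_details="mail-01",
    logged_at=datetime(2024, 3, 1, 11, 0),
    priority=2,
    category="outage",
    description="mail service stops accepting connections",
    related_incidents=["INC-1", "INC-2"],
)
record.steps_attempted.append("restarted mail service")
print(record)
```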
triggers for problem management process
-reactive problem management: incidents, results from testing, supplier known error data
-proactive problem management: patterns and trends identified in analysis and reporting; review of operation logs; review of event logs; operational communications
inputs for problem management
-incident records
-incident reports
-CI information
-communication and feedback from incident management
-communication and feedback about RFCs
-events and event related communications
-operational and service level objectives
-customer feedback
-agreed criteria for prioritizing and escalating problems
-outputs from risk management activities
outputs for problem management
-resolved problems
-updated problem management records
-RFCs to remove infrastructure errors
-workarounds for incidents
-known error records
-problem management reports
-improvement recommendations from major problem reviews
problem management process owner
-carrying out the generic process owner role for problem management
-designing problem management models and workflows
-working with other process owners to ensure proper process integration
problem management process manager
-carrying out the generic process manager role for problem management
-planning and managing support for problem management tools and processes
-coordinating interfaces between problem management and other service management processes
-liaising with problem resolution groups to ensure resolution of problems within agreed timeframes
-ownership and maintenance of the KEDB
-gatekeeper for known error records and management of search algorithms
-formal closure of all problems
-liaising with external resources for problem resolution as needed
-major problem reviews
problem analyst
-reviewing incident data to analyze assigned problems
-analyzing problems for correct prioritization and classification
-investigating problems through to resolution and root cause
-raising RFCs to resolve problems as necessary
-monitoring the progress of known errors
-updating the KEDB with new or updated known errors and workarounds
-assisting with major incidents to identify root causes