Reliability Engineering Coaching

Programme Code: D289
Domain: Cloud Infrastructure
Level: Intermediate
Learning Partner(s): TaUB Solutions
Duration: 3 Days
Format: Online
Competencies: Ops Excellence
Job Roles: Cloud Infrastructure Engineer, Cloud Infrastructure Architect, Cloud Infrastructure Manager, DevOps Engineer, ICT&SS Professional

Overview

Enhance your team's capabilities with the Reliability Engineering Coaching programme. Improve performance, scalability, and reliability for smooth, efficient operations.

Develop the skills needed to maintain robust systems via personalised coaching and a hands-on, practical approach. Foster continuous improvement, drive innovation and achieve operational excellence with expert guidance.

Key Takeaways

At the end of this programme, you will be able to:
  • implement a successful reliability culture in your organisation
  • understand reliability principles and recognise anti-patterns to avoid them
  • assess the organisational impact of introducing reliability
  • improve SLIs and SLOs in a distributed ecosystem, and extend Error Budgets to innovate and mitigate risks
  • build security and resilience by design in a zero-trust environment
  • implement full stack observability, distributed tracing, and foster an Observability-driven development culture
  • curate data using AI for proactive and predictive incident management, and use DataOps for clean data lineage
  • recognise the importance of Platform Engineering for consistency and reliability
  • apply practical Chaos Engineering techniques
  • manage major incident response using an incident command framework and understand unmanaged incidents
  • appreciate why Reliability Engineering is a pure implementation of DevOps
  • execute the Reliability Engineering model effectively
  • understand that reliability is everyone's responsibility
  • learn from success stories in Reliability Engineering

Who Should Attend

  • Please refer to the job roles section.
  • Public Officers who are interested in modern IT leadership and organisational change approaches, and who want to improve the scalability and reliability of large-scale services.

Prerequisites

  • You should have completed the SRE Foundation Programme.

What To Bring

  • A laptop with a reliable internet connection and the Zoom application installed.

This programme will cover the following topics:

Module 1: Anti-Patterns

  • Rebranding Ops as Reliability Engineering
  • Users notice an issue before you do
  • Measuring until my Edge
  • False positives are worse than no alerts
  • Configuration management trap for snowflakes
  • The Dogpile: Mob incident response
  • Point fixing
  • Production Readiness Gatekeeper
  • Fail-Safe really?
  • Use Case Discussion

Module 2: SLO is a Proxy for Customer Happiness

  • Define SLIs that meaningfully measure the reliability of a service from a user’s perspective
  • Choose appropriate SLO targets, including how to perform statistical and probabilistic analysis
  • Use error budgets to help your team have better discussions and make better data-driven decisions
  • Use Case Discussion 
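
As a rough illustration of the SLI, SLO and error budget ideas listed in this module, the sketch below computes an availability SLI from request counts and reports how much of the error budget is left. It is a minimal example under stated assumptions, not programme material: the function name `error_budget`, the `ErrorBudgetReport` fields, and the sample request counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudgetReport:
    sli: float               # observed fraction of good events
    allowed_bad: float       # bad events the SLO tolerates in the window
    budget_remaining: float  # fraction of the error budget still unspent


def error_budget(good_events: int, total_events: int, slo_target: float) -> ErrorBudgetReport:
    """Hypothetical helper: compare an event-based SLI against an SLO target."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")

    bad_events = total_events - good_events
    sli = good_events / total_events
    # The error budget is the share of events the SLO allows to fail.
    allowed_bad = (1.0 - slo_target) * total_events
    # Goes negative once more bad events occurred than the SLO allows.
    budget_remaining = 1.0 - bad_events / allowed_bad
    return ErrorBudgetReport(sli, allowed_bad, budget_remaining)


# Assumed sample window: 1,000,000 requests, 500 failures, 99.9% SLO.
report = error_budget(good_events=999_500, total_events=1_000_000, slo_target=0.999)
print(f"SLI: {report.sli:.4%}")
print(f"Error budget remaining: {report.budget_remaining:.1%}")
```

With these assumed numbers the SLO tolerates about 1,000 failed requests, so 500 failures leave roughly half of the budget. A negative `budget_remaining` would mean the budget is overspent, which is the kind of signal a team might use when deciding whether to prioritise reliability work over new releases.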

Module 3: Building Secure, Scalable and Reliable Systems

  • Reliability Engineering and its role in Building Secure and Reliable systems
  • Design for Changing Architecture
  • Fault-Tolerant Design
  • Design for Security
  • Design for Resiliency
  • Design for Reliability
  • Use Case Discussion

Module 4: Full-Stack Observability

  • Modern applications are Complex & Unpredictable
  • Slow is the new down
  • Pillars of Observability
  • Using OpenTelemetry
  • Use Case Discussion

Module 5: Platform Engineering and AIOps

  • Taking a Platform Centric View
  • AIOps: A big data view to go from reactive to proactive to predictive management
  • Technology becomes more human through ML, allowing ubiquitous self-service
  • Use Case Discussion

Module 6: Incident Response Management

  • Key responsibilities towards incident response
  • DevOps & ITIL
  • OODA and Reliability Incident Response
  • Closed Loop Remediation and the Advantages
  • Swarming – Food for Thought
  • Use Case Discussion

Module 7: DiRT and Chaos Engineering

  • Disaster Recovery Testing
  • Fault Injection
  • Chaos Engineering
  • Tools that can be instrumented for Chaos Engineering
  • Use Case Discussion

Module 8: Reliability is the Purest form of DevOps

  • Key Principles of Reliability Engineering
  • How to increase Reliability across the spectrum
  • Metrics for Success
  • Possible Implementation Model
  • Cultural and Behavioural Skills are key
  • Case Study
  • Use Case Discussion


Full Fee

Full programme fee: S$1515
9% GST on nett programme fee: S$136.35
Total nett programme fee payable, including GST: S$1651.35
With effect from 1 Jan 2024

NOTE

Funding is available for this programme. Please visit the Learning Partner’s website to find out about the updated programme fee funding breakdown and eligibility, terms and conditions.

Upcoming Classes

Class 1
09 Sep 2024 to 11 Sep 2024 (Full Time)
Duration: 3 days
Time: 9:00 AM to 4:00 PM

Agency-sponsored

Step 1: Apply through your organisation's training request system.

Step 2: Your organisation's training request system (or relevant HR staff) confirms your organisation's approval for you to take the programme. Your organisation will then send registration information to the academy.

Organisation HR L&D or equivalent staff can click here for details of the registration submission process.

Step 3: GovTech Digital Academy will inform you whether you have been successful in enrolment.