Major Incident Commander - EMEA
Oracle, a global provider of enterprise cloud computing, empowers businesses of all sizes on their digital transformation journey. With 430,000 customers in 175 countries, Oracle provides leading-edge capabilities in Software as a Service, Platform as a Service, Infrastructure as a Service, and Data as a Service.
Our mission is to transform our world for the better through innovative technologies.
The Oracle Cloud Infrastructure (OCI) Operations team is seeking accomplished and passionate individuals to lead and evolve our Incident Management practice to become a best-in-class service offering.
The primary function of a Major Incident Commander is to direct Subject Matter Experts (SMEs) and service leaders to restore service as quickly as possible during Major Incidents, while keeping accurate and timely data on the progress of such incidents and keeping senior leaders, stakeholders and end users updated.
Incident Commanders are also responsible for building and evolving the practice of Incident Management across OCI: running Post Incident Reviews and developing processes and systems that use the related metrics to identify and drive process and procedural improvements globally.
Who are you?
• Passionate about Cloud and customer focused, with experience in incident management and problem management, and you thrive in a dynamic team culture.
• A technologist at heart, curious about how things work and how things break - likely to be someone who enjoys finding a better way to do things using automation
• Able to build, maintain and leverage key relationships with internal stakeholders and service leaders to drive increased engagement and accountability for your work.
• Love technology and how to apply it. Maybe you have set up your own environment in the cloud or have spent time developing apps or games that you share with others
• Strong communicator who is passionate about the customer’s experience
• Motivated to be resourceful, innovative and entrepreneurial
• Driven to learn about cloud infrastructure and its inter-dependencies
• Humble and committed to always improving
• Provides leadership in responding to and resolving major incidents that impact business-critical services, applications and infrastructure for OCI
• Leverages broad technical expertise to convene appropriate SMEs (resolvers) and to direct Major Incident response, with focus on impact mitigation and service restoration
• Works closely with SMEs to quickly identify customer impact (who, how, when)
• Conducts escalation to service teams, senior management and leaders to ensure appropriate awareness, engagement and focus
• Produces accurate and timely communications tailored to the relevant audience (Senior Leaders and internal Stakeholders)
• Leads and/or participates in Post Incident Review and Problem Management meetings with key stakeholders and service owners to review events and opportunities for ongoing improvement
• Documents pertinent information relating to Incidents that aids process improvement, identifies deviations and enables the creation of an Incident Knowledge Base
• Monitors and evaluates high-level service and infrastructure dashboards and takes action to address identified anomalies
• Collates and analyses incident-based data for team metrics and KPIs
• Identifies opportunities and takes ownership for automation and/or continuous improvement of Incident Management process steps and best practices
• Proactively engages with Service teams to identify and evaluate gaps in operational capabilities and improvements to support Cloud scalability and resiliency
• Represents Incident Management at relevant software team Roadmap planning and backlog reviews, influencing the prioritisation of automation and tooling enhancements
• Works as part of the Major Incident Management team to ensure that the team achieves its defined performance targets and KPIs
• Have a broad and deep knowledge of cloud infrastructure and related technologies
• Experience in technical troubleshooting, with broad expertise in core infrastructure technologies (e.g. server, compute, storage, network, authentication, databases)
• Experience in managing and tuning systems and/or applications, with ability to review and validate system test output
• Understand IP networking fundamentals and be familiar with Data Center network architectures and standard protocols (e.g. BGP, OSPF)
• Experience in influencing internal and external teams within a large, diverse organisation, and skilled at building strong relationships to deliver required and improved results
• Strong leadership skills to direct service teams during Major Incidents that have the potential for significant business impact, remaining calm, professional and focused in high-pressure situations
• Excellent Incident and Problem Management knowledge and experience.
• Exceptional written and verbal communication skills with meticulous attention to detail
• Able to work unsupervised, independently and within a global team
• Experienced user of a trouble ticketing system (Jira, Remedy or similar)
• Flexibility to work within a “Follow the Sun” global shift rota, covering local daytime hours, including holidays and weekends, on a rotational basis
• Ability to be “on-call” as part of an on-call rotation shared across all team members
• Ability to manage multiple tasks in a fast-paced, ever-changing environment
• Ability to think strategically and tactically, and to work in both a reactive (incident response) and a proactive engagement model.