Senior HPC/Supercomputing Site Reliability Engineer - February 2024
Multiple Locations
Feb 11, 2026
About Senior HPC/Supercomputing Site Reliability Engineer

  Working with one of the most exciting products in Microsoft Azure, you will help advance Microsoft's cloud-first strategy. The Azure Customer Experience (CXP) team is searching for a customer-obsessed HPC/Supercomputing Site Reliability Engineer who can drive reliability and observability engineering excellence and embody our culture of inclusiveness, growth mindset, and unwavering dedication to diversity.

  We are a fast-paced agile team with a start-up-like culture where you are empowered to help shape the future. We apply a software engineering approach to running operations. More specifically, you will be responsible for defining, instrumenting, and measuring SLOs, SLIs, and SLAs, and for improving service availability, latency, scalability, performance, observability, and efficiency.

  Our “no dead-ends”, “whatever it takes”, “biased for action”, “make it better than ever” philosophy ensures that every customer can realize their full potential through the Microsoft Cloud. We are a fast-growing team, but we remain committed to staying agile. Customer first, nurturing trust, high responsiveness, automation, SLO/SLI/SLA, blameless post-mortems, observability, monitoring, alerting, and toil reduction form the foundations of our code, and we work with teams across Microsoft and with external customers to ensure success. We work on exciting engineering challenges in a fun and supportive environment, with access to cutting-edge technology and surrounded by world-class engineers.

  Responsibilities

  Distributed systems architecture – understand and manage the most complex systems

  Continual reliability and performance optimisation – enhancing the observability stack to improve proactive detection and resolution of issues

  Working at the bleeding edge – adopting new approaches and technologies, iterating on existing tooling to drive improvements

  Problem-solving capabilities – troubleshooting complex issues and proactively reducing toil through automation

  Collaboration skills – working across teams to drive change and provide guidance

  Technical expertise – deep skills and the ability to act as a subject matter expert in High Performance Computing

  Capacity planning – effectively forecasting demand and reacting to changes

  Incident response – rapidly detecting and resolving critical incidents, and minimising customer impact through effective collaboration, escalation (including periodic on-call shifts) and post-incident reviews.

  Regular travel to the customer site in the Southwest of England should be expected on at least a monthly basis.

  Candidates must be eligible for Security Clearance.

  Qualifications

  Required qualifications/experience:

  Proven build and operational experience in HPC/Supercomputer environments, preferably labs with 100+ users.

  Deploying and configuring large-scale HPC clusters for parallel jobs and AI workloads.

  Orchestration tools, e.g. Azure CycleCloud, Bright Cluster Manager, AWS ParallelCluster

  Provisioning Linux-based compute nodes, drivers, software, application licensing and networking

  Workload scheduling: PBS, Slurm, LSF, Kubernetes, or similar

  Parallel filesystems (e.g. Lustre), network filesystems (e.g. NFSv4), and cloud storage services

  Linux/OSS: Administration and scripting languages (e.g., Bash, Python, Perl)

  Monitoring and performance tuning: setting up monitoring tools, analyzing system performance metrics, and implementing improvements.

  Effective diagnosis of complex technical issues. Experience with hardware/software system interrupts, node interrupts and application job failures.

  Reliability engineering knowledge: designing and implementing systems for fault tolerance, scalability, and resilience.

  Preferred qualifications/experience:

  Familiarity with tools such as Terraform, Bicep, Ansible, Spack

  Public cloud infrastructure, preferably Azure, including compute, networking, security, identity, governance and storage

  Managing and utilizing version control systems such as Git, GitHub, or Azure DevOps.

  Knowledge of continuous integration and continuous deployment (CI/CD) practices, including pipeline configuration and automation.

  Experience with tools such as Prometheus, Grafana, or Azure Monitor.

  Microsoft is an equal opportunity employer. Consistent with applicable law, all qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations (https://careers.microsoft.com/v2/global/en/accessibility.html).
