What is Viva Engage?
Viva Engage is the industry-defining social network for the enterprise. We provide a platform for millions of employees, including those from 85% of Fortune 500 companies, to build community and culture, share knowledge, and connect with their leaders and each other.
Why Viva Engage?
Acquired by Microsoft in 2012, Viva Engage combines the benefits of a startup - rapid innovation, cutting-edge technology, outsized individual impact - with the advantages of working for one of the most successful software companies in the world. We believe in mission-driven work and our platform has become more indispensable than ever as it fosters connection and a sense of belonging among remote teams. #VivaEngage
You will have:
Autonomy and freedom to innovate
Choice of the best of open source and Microsoft-internal technology
The ability to experiment, A/B test, and make data-driven decisions
Tons of opportunity for outsized impact as part of a small but mighty team on a rapidly growing product needed now more than ever
As Principal Site Reliability Engineering Manager in Viva Engage, you will have two critical accountabilities:
The first is leading efforts to fully embrace site reliability engineering principles while building critical infrastructure, optimizing existing systems, and eliminating toil. You will oversee efforts that combine software and systems engineering to build, scale, and operate the large-scale conversation platform that powers Viva Engage experiences. With our origins as a startup but now part of Microsoft, your purview spans our own open-source-based tech stack, Azure managed services, and M365 technology.
The second expectation is to improve overall reliability for Viva Engage. This means guiding engineering teams to develop missing capabilities, and driving changes to our culture and processes to make reliability a critical aspect of how we work. We have been growing rapidly to become a critical workload for many of the world’s largest organizations and are looking for you to help us get to the next level.
You should have a well-established playbook developed through years of experience operating world-class systems on a huge scale. You should be able to paint a vision of the future and build consensus across the organization while still being able to dive into details. The day-to-day responsibilities include a blend of technical, hands-on leadership with demonstrated people management and partnership skills.
Location: This is a U.S.-based position. Relocation is not provided for the role.
Microsoft’s mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
Responsibilities
Mentor engineers within the infrastructure team and in partner teams in improving service reliability and evangelize reliability practices across the organization
Drive accountability across the entire engineering organization with well-defined processes, metrics, and goals for reliability. This may include retooling existing rituals and creating new ones.
Collaborate across various teams to provide input into capacity planning; failure/reliability analysis; performance analysis; security and customer privacy analysis
Participate in the incident manager on-call rotation to coordinate responses to incidents that impact Service Level Agreements (SLAs), keeping relevant stakeholders and leadership apprised of incident impact and resolution status
In addition, you have people management responsibilities including driving employee growth and development, executing projects, and managing performance, while continuing to evolve our infrastructure
Embody our culture (https://careers.microsoft.com/v2/global/en/culture) and values (https://www.microsoft.com/en-us/about/corporate-values)
Qualifications
Required/Minimum Qualifications:
8+ years technical experience in software engineering, network engineering, systems administration, or Site Reliability Engineering
o OR Bachelor's Degree in Computer Science, Information Technology, or related field AND 5+ years technical experience in software engineering, network engineering, systems administration, or Site Reliability Engineering
o OR Master's Degree in Computer Science, Information Technology, or related field AND 3+ years technical experience in software engineering, network engineering, systems administration, or Site Reliability Engineering
o OR Doctorate Degree in Computer Science, Information Technology, or related field AND 2+ years technical experience in software engineering, network engineering, systems administration, or Site Reliability Engineering
3+ years of people management experience leading Site Reliability Engineers or livesite teams.
6+ years of experience in a Site Reliability Engineering role building and operating systems with world-class reliability at huge scale (100m+ Monthly Active Usage).
6+ years technical engineering experience building large-scale distributed systems using technologies including, but not limited to, Golang, Java, Python, containers and container orchestration systems (such as Docker, Kubernetes, Apache Mesos), infrastructure as code (such as Terraform), databases (such as Postgres, data sharding), and cloud platforms (such as Microsoft Azure, Amazon Web Services, Google Cloud Platform).
Additional/Preferred Qualifications:
Demonstrated experience growing and coaching people, and acting as a role model for others.
6+ years technical engineering experience with coding in languages including, but not limited to, Golang, Java, or Python.
6+ years of experience with containers and container orchestration systems
6+ years of experience operating and evolving large-scale distributed systems in a cloud infrastructure (such as Kubernetes, Apache Mesos, Docker)
6+ years of experience with infrastructure as code (Terraform)
6+ years of experience with large-scale databases (Postgres, data sharding)
6+ years of experience with Linux, Ubuntu, Microsoft Azure, Amazon Web Services, or Google Cloud Platform
Site Reliability Engineering M5 - The typical base pay range for this role across the U.S. is USD $133,600 - $256,800 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $173,200 - $282,200 per year.
Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay
Microsoft is an equal opportunity employer. Consistent with applicable law, all qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations (https://careers.microsoft.com/v2/global/en/accessibility.html).