Posted May 23, 2026

Lead Engineer Site Reliability

Our vision for the future is based on the idea that transforming financial lives starts by giving our people the freedom to transform their own. We have a flexible work environment, and fluid career paths. We not only encourage but celebrate internal mobility. We also recognize the importance of purpose, well-being, and work-life balance. Within Empower and our communities, we work hard to create a welcoming and inclusive environment, and our associates dedicate thousands of hours to volunteering for causes that matter most to them. Chart your own path and grow your career while helping more customers achieve financial freedom. Empower Yourself.

As a Lead Site Reliability Engineer at Empower, you'll combine deep technical expertise with team leadership to drive reliability across our financial services platform. You'll lead other SREs in solving complex operational challenges, establish technical standards, and serve as a key advisor to engineering leadership on infrastructure strategy and reliability initiatives.

ESSENTIAL FUNCTIONS:

Technical Leadership & Strategy:
- Lead cross-functional reliability initiatives spanning multiple value streams, coordinating efforts across teams
- Define and evolve SRE best practices, tools, and methodologies for the organization
- Architect enterprise-scale infrastructure solutions that balance reliability, cost, performance, and security
- Establish Service Level Objectives (SLOs) and error budgets for critical services, using them to drive prioritization decisions
- Lead major incident response as incident commander, coordinating resolution across multiple teams
- Drive strategic improvements to observability, identifying gaps and implementing solutions at scale
- Design and implement disaster recovery plans for critical financial services infrastructure
- Evaluate and introduce new technologies and practices that improve team effectiveness

Operational Excellence:
- Lead the design of foundational infrastructure patterns using Terraform, creating reusable modules adopted across teams
- Architect multi-region, highly available AWS infrastructure supporting millions of daily transactions
- Design and implement sophisticated Kubernetes patterns, including multi-tenancy, security policies, and advanced scheduling
- Build comprehensive observability strategies using Datadog and Splunk, establishing standards for metrics, logging, and tracing
- Establish CI/CD standards and patterns, implementing pipeline-as-code and progressive delivery at scale
- Lead initiatives to implement chaos engineering practices and systematic reliability testing
- Drive FinOps initiatives, optimizing cloud spend while maintaining reliability targets

Team Leadership & Development:
- Lead a functional team of SREs (without direct reports) on projects and operational initiatives
- Mentor Senior, Intermediate, and Entry-level SREs, accelerating their technical growth
- Conduct design reviews and architecture discussions, providing expert guidance
- Lead training sessions on SRE practices, new technologies, and operational procedures
- Coordinate on-call schedules and drive improvements to reduce on-call burden
- Facilitate postmortems for high-severity incidents, ensuring organizational learning occurs

Collaboration & Influence:
- Partner with Engineering Managers and Directors to align SRE work with business priorities
- Collaborate with Security teams on implementing zero-trust architecture and compliance controls
- Work with Product teams to balance feature velocity with reliability requirements
- Influence architectural decisions across the engineering organization
- Represent SRE in cross-functional initiatives and planning discussions
- Evangelize SRE culture and practices across Empower

QUALIFICATIONS:

Required:
- 6-10 years of experience in Site Reliability Engineering (or equivalent), with demonstrated technical leadership
- Proven ability to lead technical teams and drive complex projects to completion
- Expert-level knowledge of AWS, with experience designing large-scale, multi-region architectures
- Deep Kubernetes expertise, including advanced features, security, and production-scale operations
- Mastery of Infrastructure as Code using Terraform, with experience building shared platforms and frameworks
- Strong software engineering background with production experience in Python and/or Go
- Extensive experience with observability platforms (Datadog, Splunk) and implementing monitoring at scale
- Deep understanding of CI/CD principles and experience implementing enterprise-grade pipelines
- Proven track record leading major incidents and conducting effective postmortems
- Strong communication skills with ability to explain complex technical concepts to diverse audiences
- Experience mentoring engineers and building technical capabilities in teams

Preferred:
- Previous technical leadership roles (Lead, Staff, or similar) in SRE or Operational Excellence
- Financial services industry experience with understanding of regulatory requirements
- Expert knowledge of compliance frameworks (SOC 2, PCI DSS, FINRA)
- AWS certifications (Professional level)
- Kubernetes certifications (CKA, CKAD, CKS)
- Experience implementing SRE at organizations with 500+ engineers
- Background in chaos engineering, game days, and reliability testing practices
- Contributions to open-source projects with demonstrated community leadership
- Experience with service mesh implementation and management
- Track record of speaking at conferences or writing technical content

Technical Environment
AWS | EKS | Kubernetes | Terraform | Datadog | Splunk | GitOps | ArgoCD | FluxCD | GitLab CI | Jenkins | Python | Go | Helm | Prometheus | Grafana | Istio | Linkerd

What Success Looks Like
- Platform reliability consistently exceeds 99.99% availability
- Successful delivery of major infrastructure initiatives on time and within scope
- Demonstrable improvement in team capabilities and productivity
- Reduction in incident frequency and severity through proactive reliability work
- Positive relationships with engineering leadership and cross-functional partners
- Technical decisions that prove sound over time and scale effectively
- Team members successfully promoted or grown in their capabilities

Work Environment & Disclaimer
This job operates in a professional office environment. This job description is not intended to be an exhaustive list of all duties, responsibilities and qualifications of the job. The employer has the right to revise this job description at any time. You will be evaluated in part based on your performance of the responsibilities and/or tasks listed in this job description. You may be required to perform other duties that are not included in this job description. The job description is not a contract for employment, and either you or the employer may terminate employment at any time, for any reason, as per terms and conditions of your employment contract.

We are an equal opportunity employer with a commitment to diversity. All individuals, regardless of personal characteristics, are encouraged to apply.
All qualified applicants will receive consideration for employment without regard to age, race, color, national origin, ancestry, sex, sexual orientation, gender, gender identity, gender expression, marital status, pregnancy, religion, physical or mental disability, military or veteran status, genetic information, or any other status protected by applicable state or local law.
