Our vision for the future is based on the idea that transforming financial lives starts by giving our people the freedom to transform their own. We have a flexible work environment and fluid career paths. We not only encourage but also celebrate internal mobility. We also recognize the importance of purpose, well-being, and work-life balance. Within Empower and our communities, we work hard to create a welcoming and inclusive environment, and our associates dedicate thousands of hours to volunteering for the causes that matter most to them.
Chart your own path and grow your career while helping more customers achieve financial freedom. Empower Yourself.
As a Lead Site Reliability Engineer at Empower, you'll combine deep technical expertise with team leadership to drive reliability across our financial services platform. You'll lead other SREs in solving complex operational challenges, establish technical standards, and serve as a key advisor to engineering leadership on infrastructure strategy and reliability initiatives.
ESSENTIAL FUNCTIONS:
Technical Leadership & Strategy:
Lead cross-functional reliability initiatives spanning multiple value streams, coordinating efforts across teams
Define and evolve SRE best practices, tools, and methodologies for the organization
Architect enterprise-scale infrastructure solutions that balance reliability, cost, performance, and security
Establish Service Level Objectives (SLOs) and error budgets for critical services, using them to drive prioritization decisions
Lead major incident response as incident commander, coordinating resolution across multiple teams
Drive strategic improvements to observability, identifying gaps and implementing solutions at scale
Design and implement disaster recovery plans for critical financial services infrastructure
Evaluate and introduce new technologies and practices that improve team effectiveness
Operational Excellence:
Lead the design of foundational infrastructure patterns using Terraform, creating reusable modules adopted across teams
Architect multi-region, highly available AWS infrastructure supporting millions of daily transactions
Design and implement sophisticated Kubernetes patterns, including multi-tenancy, security policies, and advanced scheduling
Build comprehensive observability strategies using Datadog and Splunk, establishing standards for metrics, logging, and tracing
Establish CI/CD standards and patterns, implementing pipeline-as-code and progressive delivery at scale
Lead initiatives to implement chaos engineering practices and systematic reliability testing
Drive FinOps initiatives, optimizing cloud spend while maintaining reliability targets
Team Leadership & Development:
Lead a functional team of SREs (without direct reports) on projects and operational initiatives
Mentor Senior, Intermediate, and Entry-level SREs, accelerating their technical growth
Conduct design reviews and architecture discussions, providing expert guidance
Lead training sessions on SRE practices, new technologies, and operational procedures
Coordinate on-call schedules and drive improvements to reduce on-call burden
Facilitate postmortems for high-severity incidents, ensuring organizational learning occurs
Collaboration & Influence:
Partner with Engineering Managers and Directors to align SRE work with business priorities
Collaborate with Security teams on implementing zero-trust architecture and compliance controls
Work with Product teams to balance feature velocity with reliability requirements
Influence architectural decisions across the engineering organization
Represent SRE in cross-functional initiatives and planning discussions
Evangelize SRE culture and practices across Empower
QUALIFICATIONS:
Required:
6-10 years of experience in Site Reliability Engineering (or equivalent), with demonstrated technical leadership
Proven ability to lead technical teams and drive complex projects to completion
Expert-level knowledge of AWS, with experience designing large-scale, multi-region architectures
Deep Kubernetes expertise, including advanced features, security, and production-scale operations
Mastery of Infrastructure as Code using Terraform, with experience building shared platforms and frameworks
Strong software engineering background with production experience in Python and/or Go
Extensive experience with observability platforms (Datadog, Splunk) and implementing monitoring at scale
Deep understanding of CI/CD principles and experience implementing enterprise-grade pipelines
Proven track record leading major incidents and conducting effective postmortems
Strong communication skills, with the ability to explain complex technical concepts to diverse audiences
Experience mentoring engineers and building technical capabilities in teams
Preferred:
Previous technical leadership roles (Lead, Staff, or similar) in SRE or Operational Excellence
Financial services industry experience, with an understanding of regulatory requirements
Expert knowledge of compliance frameworks (SOC 2, PCI DSS, FINRA)
AWS certifications (Professional level)
Kubernetes certifications (CKA, CKAD, CKS)
Experience implementing SRE at organizations with 500+ engineers
Background in chaos engineering, game days, and reliability testing practices
Contributions to open-source projects with demonstrated community leadership
Experience with service mesh implementation and management
Track record of speaking at conferences or writing technical content
Technical Environment:
AWS | EKS | Kubernetes | Terraform | Datadog | Splunk | GitOps | ArgoCD | FluxCD | GitLab CI | Jenkins | Python | Go | Helm | Prometheus | Grafana | Istio | Linkerd
What Success Looks Like:
Platform availability consistently exceeds 99.99%
Successful delivery of major infrastructure initiatives on time and within scope
Demonstrable improvement in team capabilities and productivity
Reduction in incident frequency and severity through proactive reliability work
Positive relationships with engineering leadership and cross-functional partners
Technical decisions that prove sound over time and scale effectively
Team members successfully promoted or grown in their capabilities
Work Environment & Disclaimer:
This job operates in a professional office environment.
This job description is not intended to be an exhaustive list of all duties, responsibilities, and qualifications of the job. The employer has the right to revise this job description at any time. You will be evaluated in part based on your performance of the responsibilities and/or tasks listed in this job description. You may be required to perform other duties that are not included in this job description. The job description is not a contract for employment, and either you or the employer may terminate employment at any time, for any reason, per the terms and conditions of your employment contract.
We are an equal opportunity employer with a commitment to diversity. All individuals, regardless of personal characteristics, are encouraged to apply. All qualified applicants will receive consideration for employment without regard to age, race, color, national origin, ancestry, sex, sexual orientation, gender, gender identity, gender expression, marital status, pregnancy, religion, physical or mental disability, military or veteran status, genetic information, or any other status protected by applicable state or local law.