Site Reliability Engineering Manager (Markham)

Markham, Canada
Posted: yesterday

Description

About CarltonOne

CarltonOne is a global B2B technology leader and part of the Goldman Sachs portfolio, helping organizations reward and inspire exceptional employees. Our solutions empower productive employees, high-performing sales teams, and loyal customers. Our platform powers the global engagement industry, enabling companies to deliver impactful employee recognition, customer loyalty, rewards, sales, and channel incentive programs. We partner with more than 450 clients, 500 vendors, and serve 14 million members across 185 countries. Each solution also supports eco-action: we have funded over 20 million trees and are on track to plant millions more each year.

About the Role

We are seeking a strategic and technically adept Site Reliability Engineering Manager to lead our SRE team. The role ensures reliability, scalability, and performance of our cloud-native infrastructure and services.

Responsibilities

Lead, mentor, and grow a team of SREs, fostering ownership, continuous learning, and operational excellence.
Define and drive SRE strategy, including SLIs, SLOs, and error-budget management.
Collaborate with cross-functional teams to align reliability goals with business objectives.
Build and maintain solid stakeholder relationships.
Establish and continuously improve the incident management lifecycle, from detection through post-incident review.
Lead coordination of incident response across engineering, DevOps, and support teams during major outages.
Implement and maintain runbooks, playbooks, blameless post-mortems, and track incident metrics such as MTTR, MTTD, frequency, and severity.
Drive automation initiatives to reduce toil, eliminate manual effort, and improve system resilience.
Design and implement comprehensive monitoring and observability strategies using Datadog, Grafana, CloudWatch, Prometheus, Rapid7 InsightCloudSec, Wiz, and Cloudflare.
Establish actionable alerting systems with proper thresholds and escalation paths.
Analyze performance, availability metrics, and capacity trends to proactively identify and resolve issues.
Create and maintain dashboards that provide visibility into system health and business-critical metrics.
Lead root-cause analysis for recurring issues and implement long-term preventative solutions.
Optimize cloud resource usage and costs through automation, right-sizing, and performance tuning.
Oversee disaster recovery planning and testing to meet RPO and RTO requirements.
Implement and maintain IaC practices using Terraform, CloudFormation, and Helm.
Champion security best practices, including RBAC, IAM policies, encryption, and vulnerability management.
Drive capacity planning initiatives to ensure infrastructure scales with business growth.

Qualifications

Bachelor’s degree in computer science, engineering, or related field.
7+ years of experience in cloud infrastructure, DevOps, or SRE roles, with at least 2 years in leadership.
Proven experience managing incident response and reliability programs at scale.
Deep expertise in AWS services (EKS, EC2, S3, VPC, IAM, RDS Aurora, Lambda).
Strong background in Kubernetes, container orchestration, and service meshes.
Proficiency in Infrastructure-as-Code (Terraform, CloudFormation, Helm).
Experience with CI/CD pipelines and automation (Bamboo, Jenkins, Ansible).
Solid understanding of networking concepts (TCP/IP, DNS, load balancing, CDN).
Familiarity with monitoring and observability platforms (Datadog, Grafana, CloudWatch).
Excellent communication, stakeholder management, and cross-functional collaboration skills.
Strong incident management and crisis leadership capabilities.
Strategic thinking with a focus on long-term reliability and scalability goals.

Nice to Have

AWS Certified Solutions Architect or SRE-related certifications (SRE Practitioner, CKA, CKAD).
Experience with ITIL or other incident management frameworks.
Solid understanding of security frameworks and tools (RBAC, IAM, KMS, Wiz, Rapid7).
Experience with multi-cloud environments (Azure, GCP).
Familiarity with Cloudflare, Ubuntu Server, VMware vSphere, and on-premises hosting.
Experience with observability tools such as OpenTelemetry, Honeycomb, or New Relic.
Familiarity with chaos engineering principles and tools (Chaos Monkey, Gremlin).
Background in high-scale, high-availability systems (99.99%+ uptime SLOs).

Benefits and Compensation

Competitive salary within $120,000 to $130,000, based on market conditions.
Health, dental, and vision coverage.
3 weeks of vacation plus personal days.
Employee benefits portal with exclusive discounts.
Monthly company-wide events, celebrations, and team activities.
Bravo reward points program for recognition and appreciation.
Convenient office location close to public transit.

Equal Opportunity Statement

We value diversity and inclusion and encourage all qualified people to apply. If we can make this easier through accommodation in the recruitment process, or if you need assistance to accommodate a disability, please contact us with the "Help" button in the application.

Apply on Kit Job: kitjob.ca/job/2prew3

Highlights

Company name: CarltonOne
Job position: Site Reliability Engineering Manager (Markham)
Ad ID: 8818255523