Site Reliability Engineer, Ontario

Job Description

Key Responsibilities:

Incident Management and Reliability: Lead the incident management process, ensuring high availability and performance of applications. Develop and implement SRE practices to improve system reliability and resilience.
Monitoring and Observability: Utilize Dynatrace, Splunk, and Grafana to monitor system health, detect anomalies, and provide actionable insights for performance optimization.
Root Cause Analysis: Conduct thorough root cause analysis of incidents and outages, developing long-term solutions to prevent recurrence.
DevOps Practices: Collaborate with development and operations teams to streamline CI/CD pipelines, automate workflows, and implement infrastructure as code (IaC) for efficient service deployment and management.
Networking Expertise: Provide expertise in networking technologies (Cisco, Arista, AVI, etc.), ensuring robust network infrastructure design, implementation, and troubleshooting. Utilize tools like Wireshark for in-depth network analysis and debugging.
Collaboration and Leadership: Work closely with cross-functional teams to share knowledge, mentor junior engineers, and lead by example in adopting best practices in SRE, DevOps, and networking.
Innovation and Continuous Improvement: Stay abreast of industry trends and new technologies, advocating for and implementing innovative solutions to enhance system reliability and performance.

Qualifications:

Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
10+ years of experience in an SRE/DevOps role, with a proven track record in managing high-availability systems.
Strong expertise in monitoring and observability tools (Dynatrace, Splunk, Grafana).
Proficient in network debugging and analysis tools, including Wireshark.
Solid understanding of on-prem and hybrid cloud infrastructure (VMware, Linux, Windows, Azure) and container orchestration (Kubernetes, Docker).
Certifications in relevant technologies (Dynatrace, Splunk) are a plus.
Excellent communication and leadership skills, capable of leading incident response initiatives and collaborating effectively across teams.
Excellent problem-solving skills, with the ability to conduct comprehensive root cause analysis and troubleshooting.
