Site Reliability Engineer, Ontario
Ontario, Canada
Last edited: today
Job Description
Key Responsibilities:
- Incident Management and Reliability: Lead the incident management process, ensuring high availability and performance of applications. Develop and implement SRE practices to improve system reliability and resilience.
- Monitoring and Observability: Utilize Dynatrace, Splunk, and Grafana to monitor system health, detect anomalies, and provide actionable insights for performance optimization.
- Root Cause Analysis: Conduct thorough root cause analysis of incidents and outages, developing long-term solutions to prevent recurrence.
- DevOps Practices: Collaborate with development and operations teams to streamline CI/CD pipelines, automate workflows, and implement infrastructure as code (IaC) for efficient service deployment and management.
- Networking Expertise: Provide expertise in networking technologies (Cisco, Arista, AVI, etc.), ensuring robust network infrastructure design, implementation, and troubleshooting. Utilize tools like Wireshark for in-depth network analysis and debugging.
- Collaboration and Leadership: Work closely with cross-functional teams to share knowledge, mentor junior engineers, and lead by example in adopting best practices in SRE, DevOps, and networking.
- Innovation and Continuous Improvement: Stay abreast of industry trends and new technologies, advocating for and implementing innovative solutions to enhance system reliability and performance.
Qualifications:
- Bachelor's or Master's degree in Computer Science, Information Technology, or a related field.
- 10+ years of experience in an SRE/DevOps role, with a proven track record in managing high-availability systems.
- Strong expertise in monitoring and observability tools (Dynatrace, Splunk, Grafana).
- Proficiency in network debugging and analysis tools, including Wireshark.
- Solid understanding of on-prem and hybrid cloud infrastructure (VMware, Linux, Windows, Azure), containers (Docker), and container orchestration (Kubernetes).
- Certifications in relevant technologies (Dynatrace, Splunk) are a plus.
- Excellent communication and leadership skills, capable of leading incident response initiatives and collaborating effectively across teams.
- Excellent problem-solving skills, with the ability to conduct comprehensive root cause analysis and troubleshooting.
Highlights
- Company name: E-IT
- Job position: Site Reliability Engineer
More info about this ad
Site Reliability Engineer has been posted in the Bradford West Gwillimbury Engineering category on Locanto.