Site Reliability Engineer (Lac-Brome)

Lac-Brome, Canada
Last edited: less than a week ago

Description

Title: SRE Operations Engineer (Canada)
Location: 100% Remote

Role Summary
- L1 Site Reliability Engineer responsible for monitoring, triaging, and executing standard operational tasks across enterprise applications
- Supports Kubernetes, APIs, WAF, databases, API gateways (Gloo, Apigee), Kafka, and multi-cloud environments (AWS/Azure/GCP)
- First line of defense for incident detection, troubleshooting, and escalation using runbooks and automation

Key Responsibilities
- Monitoring & Infrastructure
  - Monitor systems using Grafana, Datadog, Splunk, Prometheus, and AIOps tools
  - Detect anomalies and follow alert workflows for resolution or escalation
  - Validate Kubernetes issues using monitoring dashboards and logs
- Runbook Execution
  - Follow predefined runbooks for incident resolution
  - Restart services, validate system health, and escalate when procedures fail
  - Ensure adherence to operational standards
- Incident Triage & Communication
  - Perform initial incident triage and severity classification
  - Collect logs, metrics, and system data for analysis
  - Communicate clearly with stakeholders and escalation teams
- Kubernetes Operations
  - Use kubectl to inspect pods, deployments, and services
  - Validate service health and troubleshoot cluster-level issues
- Scripting & Automation
  - Read and modify scripts in Python, Bash, or PowerShell
  - Support automation of repetitive operational tasks
- Networking & Security Troubleshooting
  - Use tools like ping, curl, netstat, and traceroute
  - Identify DNS, firewall, WAF, or proxy-related issues
- Documentation & Knowledge Management
  - Document incident resolution steps and system issues
  - Identify gaps in runbooks and suggest improvements

Preferred Skills
- Familiarity with AWS, Azure, or GCP cloud platforms
- Basic SQL/NoSQL knowledge (e.g., simple query validation like SELECT 1)
- Experience with ITSM tools such as ServiceNow, Jira, or xMatters
- Exposure to observability tools (ELK, Prometheus, Grafana, Splunk)
- Understanding of AI-assisted operational support tools
- Solid automation mindset and process optimization awareness

Qualifications
- 2–5 years (or more) in IT operations, NOC, or SRE/DevOps roles
- Strong understanding of Linux, networking, and Kubernetes fundamentals
- Knowledge of cloud-ready applications and observability tools
- Strong troubleshooting skills using structured methods (5 Whys, Fishbone analysis)

Deliverables
- Continuous monitoring of infrastructure, applications, dashboards, and logs
- Execution of standardized runbooks for incidents and routine tasks
- First-level incident triage and escalation to L2/L3 teams
- Documentation of incidents, gaps, and automation opportunities
- Clear communication during operational incidents
- Support onboarding of applications into operations framework

Apply on Kit Job: kitjob.ca/job/2o5kli

Highlights

Company name

Net2Source (N2S)
Job position

Site Reliability Engineer (Lac-Brome)

Ad ID:

8762548530
