Aarorn Technologies Inc

Site Reliability Engineer - Production Support

Canada · Full-time · Entry

Role: Site Reliability Engineer - Production Support
Maximum Rate: $50/hr
Position Overview

We are seeking a skilled and experienced Production Support Engineer, engaged through vendor staffing, to support our digital applications. This role combines hands-on production support with Site Reliability Engineering (SRE) principles, focusing on toil elimination, infrastructure automation, and high availability of critical digital applications and backend systems.

Primary Responsibilities

1. Toil Removal & Infrastructure Maintenance (15%)

· Execute SSL/TLS certificate updates and renewals across production environments

· Perform Windows and Linux server patching and security updates

· Manage NPID password updates and credential rotation protocols

· Implement security vulnerability remediation in production systems

· Identify, document, and eliminate repetitive manual operational tasks

2. Infrastructure & Database Cluster Management (20%)

· Manage and support Elasticsearch cluster operations (deployment, scaling, monitoring, troubleshooting, performance tuning)

· Administer MongoDB clusters including replication, sharding, backup, recovery, and maintenance

· Operate and maintain Redis instances for caching and session management

· Monitor cluster health and perform capacity planning and optimization

· Execute failover and disaster recovery procedures

· Ensure data integrity and backup compliance

3. Automation & SRE Activities (15%)

· Develop, maintain, and enhance Ansible playbooks for infrastructure automation

· Build infrastructure-as-code solutions to reduce manual intervention

· Create and maintain comprehensive runbooks and operational playbooks

· Design monitoring, alerting, and observability solutions

· Implement automated remediation for common operational issues

· Quantify and prioritize toil reduction opportunities

4. Production Application Support (50%)

· Troubleshoot and resolve production incidents affecting digital applications

· Collaborate with application development and support teams on issue diagnosis

· Participate in incident response, root cause analysis, and post-mortems

· Monitor and respond to application performance degradation

---

Technical Requirements

Required Expertise (Must-Have)

· Ansible: 2+ years hands-on experience writing playbooks, roles, and automation workflows

· Elasticsearch: 2+ years managing and troubleshooting Elasticsearch clusters in production

· MongoDB: 2+ years with replica sets, sharding, backup/recovery, and performance tuning

· Redis: Proficiency in deployment, configuration, and operational support

· OpenShift: Experience deploying and managing containerized applications on OpenShift

· Azure: Knowledge of Azure cloud services, resource management, and deployments

· Linux Administration: 3+ years with RHEL, CentOS, or Ubuntu in production environments

· Windows Server Administration: Experience with patching, certificate management, and maintenance

· Shell Scripting: Bash scripting for automation and operational tasks

· Incident Management: Experience responding to and resolving critical production incidents

Preferred Skills

· Kubernetes or container orchestration platforms

· Python or Go scripting for automation

· CI/CD pipeline experience (Jenkins, GitLab CI, Azure DevOps)

· Monitoring and observability tools (Prometheus, Grafana, ELK Stack, Datadog)

· Infrastructure-as-Code tools (Terraform, CloudFormation)

· Security best practices and vulnerability management

· Relevant certifications (AZ-900, CKA, Elasticsearch, etc.)

---

Required Qualifications

· Minimum 5 years of production infrastructure support or SRE experience

· Minimum 3 years with at least 2 of the core technologies (Elasticsearch, MongoDB, Ansible, OpenShift)

· Experience working in a regulated financial services environment (preferred)

· Ability to work independently and in teams

· Strong troubleshooting and analytical capabilities

· Excellent documentation and communication skills

· Must be available for on-call support rotation (with reasonable notice)

---

Operational Expectations

· On-Call Rotation: Participates in the production support on-call schedule

· Incident Response: Available for critical incident resolution outside standard business hours as required

· Availability: Core business hours, plus flexibility for critical production issues

· Response Time: First response to critical incidents within 30 minutes

· Documentation: Maintains detailed runbooks, playbooks, and knowledge base articles

· Collaboration: Regular communication with infrastructure, development, and operations teams

Key Skills

Ranked by relevance

elasticsearch, ansible, incident response, server, redis, linux, high availability, shell scripting, windows server, kubernetes, prometheus, gitlab ci, terraform, jenkins, grafana, python, gitlab, cloud, bash, cicd, elk
Posted: Mar 13, 2026
Type: Full-time
Level: Entry
Location: Toronto

Industries

Wireless Services, Telecommunications, Communications Equipment Manufacturing

Categories

Information Technology
