Aarorn Technologies Inc
Site Reliability Engineer - Production Support
Aarorn Technologies Inc, Canada, 1 day ago
Full-time, Information Technology

Role: Site Reliability Engineer - Production Support
Maximum rate: $50/hr.
Position Overview

We are seeking a skilled and experienced Production Support Engineer, engaged through vendor staffing, to support our digital applications. This role combines hands-on production support with Site Reliability Engineering (SRE) principles, focusing on toil elimination, infrastructure automation, and high availability of critical digital applications and backend systems.

Primary Responsibilities

1. Toil Removal & Infrastructure Maintenance (15%)

· Execute SSL/TLS certificate updates and renewals across production environments

· Perform Windows and Linux server patching and security updates

· Manage NPID password updates and credential rotation protocols

· Implement security vulnerability remediation in production systems

· Identify, document, and eliminate repetitive manual operational tasks

2. Infrastructure & Database Cluster Management (20%)

· Manage and support Elasticsearch cluster operations (deployment, scaling, monitoring, troubleshooting, performance tuning)

· Administer MongoDB clusters including replication, sharding, backup, recovery, and maintenance

· Operate and maintain Redis instances for caching and session management

· Monitor cluster health and perform capacity planning and optimization

· Execute failover and disaster recovery procedures

· Ensure data integrity and backup compliance

3. Automation & SRE Activities (15%)

· Develop, maintain, and enhance Ansible playbooks for infrastructure automation

· Build infrastructure-as-code solutions to reduce manual intervention

· Create and maintain comprehensive runbooks and operational playbooks

· Design monitoring, alerting, and observability solutions

· Implement automated remediation for common operational issues

· Quantify and prioritize toil reduction opportunities

4. Production Application Support (50%)

· Troubleshoot and resolve production incidents affecting digital applications

· Collaborate with application development and support teams on issue diagnosis

· Participate in incident response, root cause analysis, and post-mortems

· Monitor and respond to application performance degradation

---

Technical Requirements

Required Expertise (Must-Have)

· Ansible: 2+ years hands-on experience writing playbooks, roles, and automation workflows

· Elasticsearch: 2+ years managing and troubleshooting Elasticsearch clusters in production

· MongoDB: 2+ years with replica sets, sharding, backup/recovery, and performance tuning

· Redis: Proficiency in deployment, configuration, and operational support

· OpenShift: Experience deploying and managing containerized applications on OpenShift

· Azure: Knowledge of Azure cloud services, resource management, and deployments

· Linux Administration: 3+ years with RHEL, CentOS, or Ubuntu in production environments

· Windows Server Administration: Experience with patching, certificate management, and maintenance

· Shell Scripting: Bash scripting for automation and operational tasks

· Incident Management: Experience responding to and resolving critical production incidents

Preferred Skills

· Kubernetes or container orchestration platforms

· Python or Go scripting for automation

· CI/CD pipeline experience (Jenkins, GitLab CI, Azure DevOps)

· Monitoring and observability tools (Prometheus, Grafana, ELK Stack, Datadog)

· Infrastructure-as-Code tools (Terraform, CloudFormation)

· Security best practices and vulnerability management

· Relevant certifications (AZ-900, CKA, Elasticsearch, etc.)

---

Required Qualifications

· Minimum 5 years of production infrastructure support or SRE experience

· Minimum 3 years with at least 2 of the core technologies (Elasticsearch, MongoDB, Ansible, OpenShift)

· Experience working in a regulated financial services environment (preferred)

· Ability to work independently and in teams

· Strong troubleshooting and analytical capabilities

· Excellent documentation and communication skills

· Must be available for on-call support rotation (with reasonable notice)

---

Operational Expectations

· On-Call Rotation: Participates in the production support on-call schedule

· Incident Response: Available for critical incident resolution outside standard business hours as required

· Availability: Core business hours, plus flexibility for critical production issues

· Response Time: First response to critical incidents within 30 minutes

· Documentation: Maintains detailed runbooks, playbooks, and knowledge base articles

· Collaboration: Regular communication with infrastructure, development, and operations teams
