Tekgence Inc
Site Reliability Engineer
Tekgence Inc | Canada | 2 days ago
Contract | Remote Friendly | Information Technology

Hybrid: 3 days work from office; face-to-face interview required


Skills:

• Production experience in SRE, infrastructure, or operations for large-scale systems

• Strong programming/scripting skills (Python, Go, Java, or equivalent)

• Deep experience with containerization (Docker), orchestration (Kubernetes, etc.)

• Infrastructure-as-code (Terraform, Helm, CloudFormation, Ansible, etc.)

• Familiarity with GPU / AI compute clusters, high-performance data storage, and distributed architectures

• Experience with monitoring / observability / logging / alerting tools (Prometheus, Grafana, ELK / EFK, Datadog, etc.)

• Networking & systems engineering knowledge (TCP/IP, DNS, routing, load balancing, distributed storage)

• Solid experience in capacity planning, performance tuning, scaling, and incident response

• Demonstrated ability to lead RCAs, deploy fixes, and drive reliability improvements

• Experience in regulated environments (financial services, compliance, audit, security) is a strong plus

• Excellent communication, documentation, and cross-team collaboration skills

• Proven track record of reducing operational toil via automation
