With over 260 employees across our London headquarters, Europe, and the US, $93m in Series C funding secured, and more than £15bn in processed transactions, we are only just getting started.
We are collaborative and customer-centric, and we work with integrity, partnering with some of the biggest insurance leaders, including Lloyd's of London and Many Pets. We take huge pride in our company culture, ensuring that everyone has a part to play, an opportunity to be heard and involved, and the ability to make a real difference. As we continue to scale up, we want like-minded humans to join us on this exciting journey.
Are you ready?
Your mission:
As a Site Reliability Engineer (SRE), you will play an important role in designing, building, and maintaining the infrastructure and tools necessary to support our software applications and services. You will collaborate closely with the product engineering squads, technical operations, and security teams to ensure the reliability, scalability, and security of our platform. Your responsibilities will include automating infrastructure provisioning, configuration management, and deployment pipelines, utilizing best practices and modern technologies to streamline processes and improve efficiency. You will also be responsible for monitoring system performance, identifying bottlenecks, and implementing solutions to enhance system reliability and performance.
Your responsibilities
- Cloud Platform Management: Using Azure/AWS to manage and optimize infrastructure components, ensuring scalability, reliability, and cost management.
- Infrastructure Design and Implementation: Designing, building, and maintaining the cloud-based infrastructure that supports our software applications and services.
- System Reliability: Ensuring the reliability, availability, and performance of systems and services by designing, implementing, and maintaining robust infrastructure.
- Infrastructure as Code (IaC): Implementing and maintaining tools for automation, monitoring, and deployment to improve efficiency and reduce manual intervention.
- Collaboration and Support: Working closely with product engineering to ensure efficient workflows and support continuous integration and delivery pipelines (CI/CD).
- Capacity Planning and Scalability: Assessing system capacity requirements and planning for future growth to ensure the system can scale and is cost efficient.
- Incident Response and Management: Monitoring system health, promptly responding to incidents, and assisting with the resolution process.
- Risk Management: Identifying potential risks and vulnerabilities in systems and implementing measures to mitigate these risks effectively.
- Monitoring and Observability: Implementing and overseeing monitoring tools to proactively detect and mitigate issues, ensuring high application and system availability.
- Documentation and Knowledge Sharing: Maintaining documentation and sharing knowledge with the team to ensure transparency and facilitate cross-functional collaboration.
Your skills and experience
- 3+ years of experience in an SRE, Platform/Cloud Engineer, or similar role.
- Strong knowledge of and experience with cloud platforms. We primarily host in Azure and AWS but recognize that skills are transferable.
- Experience in running and maintaining highly available and scalable platforms.
- Expertise in containerisation tools like Docker and orchestration tools such as Kubernetes.
- Experience with infrastructure as code (IaC) tools such as Terraform, Ansible, or Chef for automation and configuration management.
- Strong understanding of monitoring and observability tools.
- Knowledge of networking, security principles, and best practices in a cloud environment. Cloudflare experience would be a bonus.
- Demonstrated experience with CI/CD tools such as GitHub Actions, GitLab CI/CD, or Azure DevOps.
- Problem-solving mindset and meticulous attention to detail.
- Strong collaboration and communication skills to work effectively with cross-functional, internationally distributed teams.
- Comfortable working in a fast-paced environment, handling incidents, and participating in on-call rotations.
- Adaptability to evolving technologies and eagerness to learn new tools and methodologies.
Your benefits
- 25 days' holiday per year, plus bank holidays
- Hybrid working arrangements
- Contributory pension scheme
- Enhanced parental leave
- Cycle to Work scheme
- Private medical insurance through Vitality
- Access to Oliva, our mental health therapy partner
- Discounted gym membership
- Financial coaching with Octopus Wealth
- 2 days of volunteering leave per year
- Sabbatical after 5 years' service
- Ongoing learning and development to help you reach your career goals
We are committed to creating an inclusive environment that enables everyone to perform at their best, where we recognise the rights of all individuals to mutual respect and where there is an unbiased acceptance of others. Our policies and practices aim to promote an environment that is free from all forms of unfair discrimination and values the diversity of all people. At the heart of our policy, we seek to treat people fairly and with dignity and respect.
Join Vitesse and take your career to the next level!