Lead Site Reliability Engineer

Overview:

About AIQ:

AIQ is an Abu Dhabi-based joint venture between Presight and ADNOC that focuses on developing artificial intelligence technologies. AIQ develops and commercializes AI products and applications for the energy sector. It aims to provide end-to-end solutions, using its data, cloud, and talent to develop AI solutions that reduce costs and generate revenue for its clients. AIQ embodies an innovative and entrepreneurial spirit that embraces challenges and pushes boundaries, and it welcomes professionals who share the desire to make meaningful and impactful contributions to its mission. Always on the cutting edge of technology, AIQ gives its people every opportunity to thrive and excel. Working at AIQ means dealing with massive data sets, an AI infrastructure powered by the latest NVIDIA GPU cloud computing platform, and access to extensive computing, storage, and network resources.

About the role:

AIQ is looking for a Lead Site Reliability Engineer to drive reliability, performance, and scalability across our infrastructure. This role will lead SRE initiatives, mentor team members, and collaborate with engineering and product teams to build robust systems that can scale globally.

Responsibilities:

  • Architect and lead reliability strategies across services and environments.
  • Define and enforce SLOs, SLIs, and error budgets with engineering leadership.
  • Lead incident response and root cause analysis.
  • Implement automation to reduce toil and improve system resilience.
  • Manage capacity planning, traffic forecasting, and cost optimization.
  • Mentor junior and senior SREs in technical and process excellence.
  • Collaborate with MLOps, DevSecOps, and CloudOps teams to enforce best practices.
  • Champion observability, metrics-driven decisions, and platform maturity.

Qualifications:

  • 12+ years of experience in relevant roles.
  • At least 1 year of experience leading a team.
  • Expertise in Kubernetes, CI/CD (e.g., GitLab, Argo), and infrastructure-as-code (Terraform/Helm).
  • Strong experience with cloud platforms (Azure, AWS, or GCP).
  • Proven background in SRE principles, SLIs/SLOs, and reliability-focused engineering.
  • Programming proficiency in Python or Shell (nice to have).
  • Deep understanding of distributed systems, networking, and incident management.