Principal Engineer - HPC Operations Job Details | G Forty Two General Trading LLC


About Us

Core42, a leader in AI-powered cloud and digital infrastructure, is driving transformative technology solutions globally. Leveraging advanced resources and partnerships, Core42 empowers clients to harness sovereign AI infrastructure, especially in sectors with stringent regulatory needs. With a mission to redefine digital transformation, we combine sovereign capabilities with scalable, high-performance compute infrastructure, positioning ourselves at the forefront of AI innovation in the Middle East and beyond.

The opportunity

We are seeking a highly skilled Principal Engineer – HPC Operations to oversee the daily operations and support of high-performance computing clusters designed to power large-scale AI and ML workloads. This role ensures stable, secure, and high-performing infrastructure leveraging technologies such as Slurm, Kubernetes, and modern MLOps platforms. The ideal candidate will bring deep technical expertise in HPC and a strong operational mindset to drive continuous improvement and automation across globally distributed environments. Responsibilities will extend to collaborating with multidisciplinary teams, leading complex projects, implementing cutting-edge technologies, and providing mentorship to operations engineers.

Duties and Responsibilities:

Oversee the daily operational management of HPC infrastructure, including compute, storage, networking, and scheduler components (e.g., Slurm, Kubernetes).
Drive efforts to optimize the efficiency and performance of HPC systems, ensuring maximum resource utilization and minimizing downtime.
Serve as the primary technical contact for planned HPC deployments in scope.
Serve as the primary technical escalation point for L2 support teams, ensuring rapid and effective resolution of incidents and service requests.
Continuously monitor system health, performance, and resource utilization using advanced monitoring tools (e.g., Prometheus, Grafana, DCGM).
Manage user environments for AI/ML workloads, including container orchestration (e.g., Docker, Kubernetes) and workflow tools (e.g., MLflow, Kubeflow).
Define and enforce job scheduling policies, priorities, and partitions within Slurm and/or Kubernetes environments to ensure resource fairness, efficiency, and workload optimization.
Lead root cause analysis (RCA) of operational issues, contributing to post-mortem documentation and driving continuous improvement initiatives.
Provide mentorship and technical guidance to junior engineers, fostering skills development and knowledge sharing across teams.
Participate in on-call rotation as necessary.
Ensure adherence to security and operational policies, assisting in audits and maintaining documentation for change and incident management processes.

Required skills / qualifications

Minimum Experience:

Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
Minimum of 8 years of experience in HPC operations, systems engineering, or DevOps roles, with at least 2 years in a leadership or ownership capacity.
Advanced expertise in configuring, optimizing, and maintaining complex HPC environments, including hardware, software, and storage systems.
Hands-on experience managing Slurm clusters and/or Kubernetes-based environments for AI/ML workloads.
In-depth knowledge of GPU resource management, workload schedulers, and performance tuning for AI/ML workloads.
Proficiency with monitoring and observability frameworks such as Prometheus, Grafana, and DCGM.
Strong scripting and automation skills, including Python, Bash, Ansible, and Terraform.
Solid understanding of Linux (RHEL/CentOS/Ubuntu), networking technologies (RDMA, InfiniBand, RoCE), and storage solutions (NFS, Lustre, Ceph).

The U.S. base salary range for this full-time role is $166,800 to $250,000 with bonus, LTIP and benefits on top. Salary ranges are set according to the role, level, and location. The range listed on each job posting represents the minimum and maximum target salary for new hires across all U.S. locations. Actual pay within this range will depend on factors such as the specific work location, job-related skills, experience and relevant education or training.

What working at Core42 offers

With a diverse team of 1,100+ employees from 68 nationalities, we foster an inclusive, innovative and collaborative environment, with a culture grounded in trust, accountability and high performance. We are united by our values: Grit, where we overcome challenges with resilience and determination; Passion, which drives us to pursue excellence in everything we do; and Impact, as we aim to inspire progress and create meaningful change. Our team members thrive in an environment where each person’s contributions propel us forward, and together, we commit to achieving extraordinary results.

Core42 is committed to building a diverse and inclusive workplace. As an equal opportunity employer, Core42 does not discriminate based on race, national origin, gender, gender identity, sexual orientation, protected veteran status, disability, age or any other legally protected status. In compliance with the Americans with Disabilities Act (ADA), we provide reasonable accommodations to qualified individuals with disabilities throughout the application and employment process. If you need assistance or a reasonable accommodation due to a disability, please contact us at reasonableaccommodations@core42.com, including the role you’re applying for and the accommodation necessary to assist you with the recruiting process.
