Lead Data Engineer
Inception, a G42 company, is the region’s leading innovator of AI-powered products, both domain-specific and industry-agnostic, built on a rich heritage of research and development. Within the G42 ecosystem, Inception functions as the core intelligence layer, transforming data and compute infrastructure into real-world, applied AI solutions. Beyond its commercial endeavors, Inception is committed to creating positive societal impact. For more information, please visit www.inceptionai.ai.
Overview:
Inception is seeking a highly skilled Lead Data Engineer to architect and build scalable, cloud-native data and AI pipelines that power enterprise LLM, RAG, and retrieval systems.
Responsibilities:
• Design, build, and optimize scalable data pipelines for AI/LLM workloads, including vectorization and embedding processing.
• Develop and maintain ETL/ELT workflows for structured, unstructured, and streaming data.
• Create and manage vector database indexing and similarity search pipelines using tools such as FAISS, Pinecone, Weaviate, Qdrant, or Chroma.
• Build retrieval systems for RAG, semantic search, and enterprise knowledge retrieval.
• Develop robust, reusable data orchestration pipelines using Airflow, Spark, or similar tools.
• Architect and manage data pipelines across Azure (primary), AWS, and GCP environments.
• Integrate and optimize storage and processing across SQL, NoSQL, and vector databases.
• Contribute to the design and implementation of event-driven architectures.
• Collaborate with AI teams to enable embedding generation, LLM integration, and model-serving pipelines.
• Ensure end-to-end data quality, monitoring, reliability, and observability.
• Lead or participate in system design for large-scale, distributed data and AI systems.
Required Skills:
Programming & Data
• Strong expertise in Python for data processing, APIs, automation, or distributed workloads.
• Strong proficiency in SQL and knowledge of NoSQL databases (MongoDB, DynamoDB, Cosmos DB, etc.).
• Experience with vector databases such as FAISS, Pinecone, Weaviate, Qdrant, and Chroma.
• Strong knowledge of data modeling, pipeline development, and ETL/ELT frameworks.
AI/LLM Infrastructure
• Solid understanding of vectorization, embeddings, and similarity search techniques.
• Familiarity with LLMs, embedding models, and RAG pipeline concepts.
• Experience integrating embedding-generation pipelines via Hugging Face, OpenAI, or other model providers.
Cloud & Distributed Systems
• Proficiency with Azure (primary) and familiarity with AWS and GCP.
• Experience with Docker and containerized development.
• Understanding of Kubernetes is a strong plus.
Orchestration & Big Data
• Expertise in Apache Airflow for scheduling and orchestration.
• Experience with Apache Spark or equivalent distributed processing frameworks.
Architecture & Engineering Fundamentals
• Strong system design fundamentals for scalable and distributed systems.
• Knowledge of event-driven architecture and modern data platforms.
• Strong understanding of DevOps, CI/CD, version control, and observability best practices.
Qualifications:
• 8+ years of progressive experience in data engineering, distributed systems, or AI/ML data infrastructure.
• Experience building RAG pipelines in production.
• Knowledge of graph databases or hybrid search systems.
• Understanding of model deployment, inference optimization, and caching techniques for LLM workloads.
• Familiarity with data governance, IAM, and security patterns across cloud ecosystems.