Observe.ai - Technical Lead - Infrastructure Engineering (8-12 yrs)
Observe.AI
posted 3+ weeks ago
Flexible timing
Our Infrastructure/DevOps team is a dynamic group of skilled engineers operating in a fast-paced Agile environment. We manage a robust multi-region infrastructure across the globe, leveraging AWS, Kubernetes, and Harness for efficient deployments and seamless application runtime management.
Collaboration is at our core, with daily stand-ups and bi-weekly sprints ensuring alignment and continuous progress. Innovation thrives here; team members are encouraged to experiment with new technologies and share ideas that drive impactful solutions. We foster growth through mentorship programs, regular skill development workshops, and ample career advancement opportunities.
Responsibilities :
- Manage Self-Hosted Tools : Lead the transition from managed services to self-hosted Elasticsearch, Prometheus, and other critical infrastructure components to optimize performance and cost.
- Optimize AI Infrastructure : Work closely with ML engineers and data scientists to efficiently deploy and scale AI/ML models, ensuring high availability and low-latency inference.
- Infrastructure Scalability & Reliability : Design and implement scalable, fault-tolerant systems capable of handling large-scale AI workloads, distributed training, and high-throughput data pipelines.
- Technology Evaluation & Implementation : Continuously assess and introduce new technologies to enhance automation, reliability, and security in AI model deployment and training pipelines.
- CI/CD for AI Workflows : Enhance and automate ML model deployment pipelines using MLOps best practices and tools like Kubeflow, MLflow, and Argo Workflows.
- Observability & Monitoring : Implement and enhance monitoring, logging, and alerting strategies using Prometheus, Grafana, ELK, OpenTelemetry, etc., tailored for AI workloads.
Requirements :
- 8+ years of experience in DevOps, SRE, or Cloud Infrastructure roles, preferably in AI or data-intensive environments.
- Strong expertise in Kubernetes (EKS, AKS preferred) for deploying AI workloads and managing GPU and CPU clusters.
- Experience with self-hosting services like Elasticsearch, Prometheus, Grafana, Kafka, etc.
- Hands-on expertise in Infrastructure as Code (Terraform, CloudFormation).
- Deep understanding of cloud platforms (AWS, Azure, GCP) and AI-focused services like AWS SageMaker, Vertex AI, or Azure ML.
- Strong automation and scripting skills in Python, Bash, or Go.
- Experience with CI/CD tools (Jenkins, GitHub Actions, ArgoCD, etc.) with a focus on AI model deployment.
- Strong leadership and mentorship skills to guide DevOps and ML teams.
- FinOps expertise for optimizing GPU and AI cloud compute costs.
- Familiarity with service meshes (Istio, Linkerd) and API gateways.
- Knowledge of compliance frameworks (SOC 2, ISO 27001, etc.) for AI data pipelines.
Functional Areas: Other