Senior DevOps Engineer - Site Reliability (3-6 yrs)
Galent
posted 3+ weeks ago
Flexible timing
About the Job:
We are looking for a Senior DevOps Engineer with a focus on Site Reliability Engineering (SRE) to join our growing team.
In this role, you will play a key part in ensuring the availability, scalability, and reliability of our infrastructure and services, leveraging modern DevOps and SRE practices.
Desired Skillset and Competencies:
- 5+ years of experience in DevOps, SRE, or similar roles in high-performance, large-scale environments.
- Expertise in managing cloud environments such as AWS, Azure, or GCP.
- Strong experience with automation tools like Terraform, Ansible, Chef, or Puppet.
- Proficiency in scripting languages such as Python, Shell, or Ruby.
- Solid knowledge of containerization (Docker, Kubernetes) and orchestration.
- Experience with CI/CD tools and pipelines (Jenkins, GitLab CI, CircleCI, etc.).
- In-depth understanding of monitoring, alerting, and logging tools such as Prometheus, Grafana, ELK stack, or similar.
- Expertise in managing distributed systems and microservices architecture.
- Experience with infrastructure as code (IaC) and configuration management.
- Strong understanding of networking concepts, load balancing, and firewalls.
- Knowledge of incident management and root cause analysis in production environments.
- Ability to work in an Agile environment and collaborate with cross-functional teams.
- Strong troubleshooting skills and a proactive, problem-solving mindset.
Key Responsibilities:
- Own the availability, scalability, and performance of production systems across multiple environments.
- Develop and manage infrastructure automation for efficient provisioning, scaling, and monitoring of resources.
- Design, implement, and maintain CI/CD pipelines to automate deployment processes.
- Implement Site Reliability Engineering (SRE) best practices, including SLIs, SLOs, and SLAs.
- Collaborate with software development teams to improve system performance, reduce downtime, and optimize system reliability.
- Build and maintain observability solutions (monitoring, logging, alerting) to track application health and troubleshoot production incidents.
- Perform capacity planning, stress testing, and disaster recovery planning to ensure system reliability.
- Manage production incidents and lead post-mortem/root cause analysis to improve system resilience.
- Create documentation and maintain knowledge sharing practices to ensure continuity and consistency across teams.
- Continuously improve systems and processes for performance, security, and scalability.
Functional Areas: Software/Testing/Networking