Site Reliability Engineer (7-20 yrs)
Core Technologies & Solutions
posted 2 weeks ago
Flexible timing
Job Description:
- Engage with our product teams to understand requirements, then design and implement resilient and scalable infrastructure solutions
- Operate, monitor, and triage all aspects of our production and non-production environments
- Collaborate with other engineers on code, infrastructure, design reviews, and process enhancements.
- Evaluate and integrate new technologies to improve system reliability, security, and performance
- Develop and implement automation to provision, configure, deploy, and monitor Apple services
- Participate in an on-call rotation providing hands-on technical expertise during service-impacting events
- Design, build, and maintain highly available and scalable infrastructure
- Implement and improve monitoring, alerting, and incident response systems
- Automate operations tasks and develop efficient workflows
- Conduct system performance analysis and optimization
- Collaborate with development teams to ensure smooth deployment and release processes
- Implement and maintain security best practices and compliance standards
- Troubleshoot and resolve system and application issues
- Participate in capacity planning and scaling efforts
- Stay up-to-date with the latest trends, technologies, and advancements in SRE practices
- Contribute to capacity planning, scale testing, and disaster recovery exercises.
- Approach operational problems with a software engineering mindset
- BS degree in computer science or equivalent field with 5+ years of experience
- 5+ years in an Infrastructure Ops, Site Reliability Engineering, or DevOps-focused role.
- Knowledge of Linux operating system principles, networking fundamentals, and systems management.
- Demonstrable fluency in at least one of the following languages: Java, Python, or Go
- Experience managing and scaling distributed systems in a public, private, or hybrid cloud environment
- Develop and implement automation tools and apply best practices for system reliability.
- Own the availability and scalability of our services, and manage disaster recovery and other operational tasks.
- Collaborate with the development team to improve logging, metrics, and traces in the application codebase for better observability.
- Collaborate with data science teams and other business units to design, build, and maintain the infrastructure that runs machine learning and generative AI workloads.
- Influence architectural decisions with a focus on security, scalability, and performance.
- Find and fix problems in production, and work to prevent them from recurring
Preferred Qualifications:
- Familiarity with microservices architecture and container orchestration with Kubernetes.
- Awareness of key security principles, including encryption and keys (types and exchange protocols).
- Understanding of SRE principles, including monitoring, alerting, error budgets, fault analysis, and automation.
- Strong sense of ownership, with a desire to communicate and collaborate with other engineers and teams.
- Ability to identify and communicate technical and architectural problems, while working with partners and their teams to iteratively find solutions.
Functional Areas: Software/Testing/Networking