II689 - SRE ARCHITECT FOR GENAI INTEGRATION PLATFORM

EPAM Systems


We are seeking a highly skilled Site Reliability Engineer/Architect (SRE) to join our innovative and fast-paced team. In this role, you will be responsible for architecting and implementing state-of-the-art SRE practices to ensure the reliability and scalability of our enterprise-grade Generative AI (GenAI) integration platform. You will play a critical role in driving operational excellence by adopting cutting-edge methodologies and tools while collaborating with key stakeholders across technical and business units.

Responsibilities

- Define and implement Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to establish reliability standards and ensure systemic health
- Architect and maintain resilient production systems that leverage canary deployments, shadow traffic, and testing-in-production methodologies
- Develop strategies for incident management and automate on-call operations to reduce downtime and improve system stability
- Create and enhance observability frameworks, including logging, tracing, and monitoring, for real-time system visibility and proactive troubleshooting
- Automate scalability, performance optimization, and routine operational tasks to improve efficiency and reliability
- Lead collaboration sessions with engineering teams to embed SRE principles in system design and development
- Provide strategic leadership in implementing site reliability solutions across multi-cloud, multi-tenant environments for enterprise customers
- Act as a trusted advisor to executive stakeholders, offering insights and recommendations to align SRE strategies with business and technical objectives
- Champion a culture of innovation and operational reliability by mentoring teams and driving adoption of industry-leading best practices
- Partner with architecture and DevOps teams to ensure the platform's infrastructure supports high availability and scalability
- Advocate for continuous improvement in operational processes while identifying opportunities for innovation and optimization

Requirements

- 8+ years of professional experience in SRE, DevOps, or related fields, including direct involvement with production systems
- Expertise in SRE methodologies such as SLOs, SLIs, canary testing, and incident management
- Proficiency with cloud technologies including AWS, Google Cloud Platform, or Azure, with hands-on experience in multi-cloud environments
- Background in observability tools such as Prometheus, Grafana, or ELK Stack, coupled with monitoring practices for distributed systems
- Skills in automation platforms like Terraform, Ansible, or Kubernetes, driving infrastructure-as-code practices
- Familiarity with programming languages for automation solutions, such as Python, Go, or Bash
- Strong understanding of CI/CD pipelines, containerization technologies, and orchestration frameworks
- Competency in architecting systems for fault tolerance, redundancy, and performance optimization
- Showcase of effective collaboration with stakeholders from technical teams to executive-level managers
- Background in handling enterprise-scale systems and multi-tenant platform deployments

Nice to have

- Knowledge of Generative AI technologies and frameworks, including their integration processes
- Understanding of managed database services such as Amazon RDS, Google Spanner, or Azure SQL
- Familiarity with security best practices specific to multi-cloud environments and enterprise platforms
- Experience influencing technical roadmaps for large-scale distributed systems
- Capability to lead initiatives around Chaos Engineering or disaster recovery strategies

We offer/Benefits

- International projects with top brands
- Work with global teams of highly skilled, diverse peers
- Healthcare benefits
- Employee financial programs
- Paid time off and sick leave
- Upskilling, reskilling and certification courses
- Unlimited access to the LinkedIn Learning library and 22,000+ courses
- Global career opportunities
- Volunteer and community involvement opportunities
- EPAM Employee Groups
- Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
