SENIOR SITE RELIABILITY ENGINEER (SRE)


Marvik

What’s the opportunity? We’re looking for a Site Reliability Engineer (SRE) to join our team! As an SRE, you're expected to ask key questions like:
What data do we need to understand how our systems are performing?
How do we collect that data?
What patterns are we looking for, and what do they mean?
Who needs to be alerted when something isn’t working?
Are there any systems where we need more or better data?

An SRE designs systems and processes to answer these questions and automate support and response wherever possible.

🧑🏻‍💻 Responsibilities:
Own OpenTelemetry Pipelines: Design, implement, and maintain observability pipelines across logs, metrics, and traces, ensuring standardized, scalable, and efficient data ingestion. Optimize ingestion strategies for cost, performance, and usability.
Empower Engineering Teams: Build self-service automation and tooling that lets development teams implement observability without needing manual SRE support. Drive best practices and ensure teams take ownership of their telemetry.
Support Incident Management: Act as the engineering arm of the Incident Management Team, designing playbooks, processes, checklists, and automations to support teams during incidents.
Collaborate Across Teams: Work with teams across the business to understand their monitoring, alerting, and SLO/SLA needs. Design solutions that meet or exceed these requirements and influence architectural decisions from the start to ensure scalability and resilience.
Automate Observability Infrastructure: Use Infrastructure-as-Code (IaC) to manage monitoring tools, alert rules, and observability configurations across OTEL pipelines.
Define Baseline Observability Standards: Create base-level requirements to ensure all infrastructure and code are monitored consistently and accurately.
Own Technical and Security Health: Take full ownership of infrastructure reliability and ensure alignment with key availability and security KPIs.
Optimize Alerting Systems: Continuously fine-tune alerting to reduce noise, ensure alerts are actionable, and improve response efficiency.

🤝 If you have:
4+ years of experience as an SRE or in a similar observability-focused role.
Strong Kubernetes expertise, including components, deployment practices, and monitoring.
Familiarity with OpenTelemetry, including setting up collectors, instrumentation, and pipeline optimization.
Experience with tools like Grafana, Prometheus, Loki, New Relic, or Datadog.
Hands-on experience with Infrastructure-as-Code (Terraform) and GitOps CI/CD (e.g., ArgoCD, GitHub Actions).
Experience integrating incident platforms (PagerDuty, Jira) into alerting workflows.
Strong scripting skills (Python, Go, etc.) to automate observability tasks.
A problem-solving mindset and ability to collaborate across teams to improve reliability.

🦾 It’s a plus:
Cloud experience, especially with AWS and ECS workloads.
Experience managing observability pipelines at scale in high-throughput environments.
Familiarity with Configuration-as-Code tools (Ansible, Chef, or SaltStack).
Experience with database performance monitoring in large-scale distributed systems.


Job posted 7 hours ago