SITE RELIABILITY ENGINEER

Remote

Odisea Cultura | Cultsure

About the Role

Locations: Colombia only (remote)

Come join us at Odisea and work with some of the most exciting start-ups in the US!

Role: Are you a seasoned Site Reliability Engineer looking to make a real impact? We're seeking a high-caliber technical expert to join our team and help shape the future of our infrastructure and product delivery. This role is ideal for someone who thrives on ownership, values deep technical challenges, and is eager to lead initiatives that drive meaningful improvements across the stack. You'll contribute to architecture decisions, mentor peers, and take the lead on projects that are critical to system reliability and scalability.

As a Site Reliability Engineer, you'll be part of a collaborative team responsible for building, automating, and maintaining our multi-region infrastructure. You'll work hands-on with the observability stack, Kubernetes, Infrastructure as Code, CI/CD systems, and GitOps workflows. Your work will directly support infrastructure availability, performance, security, and efficiency across the organization.

Responsibilities:

- Manage and maintain the observability stack across all environments using tools such as OpenTelemetry, Datadog, Prometheus, and Grafana to ensure system visibility and performance.
- Develop and manage Infrastructure as Code (IaC) using Terraform, OpenTofu, Terragrunt, Atlantis, Spacelift, and related tools to provision and maintain cloud infrastructure.
- Contribute to the implementation and improvement of SRE practices such as SLOs, error budgets, PRRs, and problem management.
- Administer and support CI/CD pipelines, including TeamCity and GitHub Actions, ensuring reliable and efficient software delivery.
- Own and resolve Jira tickets related to infrastructure projects, support requests, and ongoing operational tasks.
- Create and maintain operational documentation, including runbooks, playbooks, architecture diagrams, and SOPs, to support knowledge sharing and incident response.
- Respond to production incidents, diagnose and triage issues, and follow established escalation protocols and standard operating procedures.
- Identify and drive down toil through creative innovation and automation.
- Participate in a shared on-call rotation, helping ensure the reliability and uptime of critical production systems.

What We Are Looking For:

- Proficiency with infrastructure monitoring and observability tools such as Datadog and Prometheus.
- 3+ years of hands-on experience in AWS administration, Site Reliability Engineering, DevOps, or build and release roles.
- Deep knowledge of the AWS ecosystem, including but not limited to EC2, EKS/ECS, RDS, IAM, KMS, SQS, CloudWatch, Lambda, Config, and Glue.
- 3+ years of experience with Infrastructure as Code (IaC) tools, including Terraform, Atlantis, Spacelift, OpenTofu, and/or Terragrunt.
- Deep understanding of SRE practices such as SLOs, error budgets, PRRs, and problem management.
- A process improvement mindset, especially around DevOps/SRE areas such as deployment workflows, automation, security, and developer productivity.
- Scripting experience in Python, Bash, or other shell languages.
Bonus points:

- Kubernetes and GCP experience.
- Production-level expertise with container orchestration and tooling such as EKS, ECS/Fargate, Argo CD, Helm, and Istio.
- Hands-on experience deploying, configuring, and automating CI/CD pipelines (TeamCity and GitHub Actions).
- Multi-region (>3) production support.
- Prior work in FedRAMP and/or SOC 2 environments.
- Familiarity with GitOps workflows (Argo CD, Akuity, GitHub Actions).
- Familiarity with security workflows and tooling (Wiz, Orca, AWS Security Hub).


Job posted 1 day ago