Site Reliability Engineer

Remotive

Remote

2 weeks ago

About

The Role

We are seeking an experienced and proactive Site Reliability Engineer to join our technology team. The role combines two sets of responsibilities: building and maintaining a scalable, resilient cloud infrastructure, and leading our response to operational and security incidents.

You will be responsible for the entire lifecycle of our production environment, from managing CI/CD pipelines and infrastructure-as-code development to real-time threat monitoring and crisis management. The ideal candidate is a hands-on engineer who thrives in a fast-paced environment, possesses a deep understanding of cloud-native technologies, and has a proven track record in incident response and management.

Key Responsibilities

DevOps & Infrastructure Management:

Manage, automate, and maintain our production infrastructure hosted on Amazon Web Services (AWS), including our multi-AZ Amazon EKS cluster, RDS databases, and ElastiCache instances.
Develop, manage, and improve our CI/CD pipelines using GitHub Actions to ensure smooth and reliable deployments.
Own and advance our Infrastructure as Code (IaC) practices using Terraform to ensure our infrastructure is reproducible, scalable, and secure.
Collaborate with development teams to support the deployment and operation of backend microservices (.NET, Go) and frontend applications (React, hosted on Vercel).
Monitor and manage system capacity and performance, ensuring high availability and low latency for our users.
Implement and enforce security best practices across the infrastructure, including network segmentation, secret management, and access controls.

Incident Response & Security:

Serve as the primary lead for responding to, managing, and resolving production incidents, from initial detection to post-mortem analysis.
Develop, maintain, and test incident response playbooks, disaster recovery plans, and business continuity procedures.
Utilize our monitoring stack (AWS GuardDuty, CloudTrail, Inspector, Security Hub) to proactively detect, triage, and respond to security threats and system anomalies.
Conduct thorough root cause analysis (RCA) for all major incidents and drive the implementation of corrective and preventative actions.
Support and participate in regular security and resilience testing, including vulnerability scanning and software supply chain security checks using tools like Trivy.
Ensure all operational and incident management activities are documented and executed in alignment with our DORA and MiCA compliance obligations.

Security Operations & Compliance:

Define, implement, and maintain a Product Security Incident Response Team (PSIRT) process covering both infrastructure-related and blockchain/on-chain incidents.
Design and execute incident response processes, including tooling, documentation, and post-incident reviews.
Lead digital forensics efforts: define tools, processes, and playbooks.
Roll out and manage EDR (Endpoint Detection and Response) tools for both infrastructure and employee endpoints.
Implement and manage MDM (Mobile Device Management) for laptops and phones to ensure secure key storage and prevent compromise.
Define and enforce security rules and guardrails aligned with business risk.
Harden Kubernetes (EKS) clusters and containers, and implement admission control policies.
Maintain and test Disaster Recovery (DR) and backup plans regularly.
Manage Cloudflare WAF rules, vulnerability management (SAST/SCA/DAST), and AWS/Kubernetes event-based security tooling.

REQUIREMENTS

Required Skills & Experience

10+ years of experience in the field.
Proven experience in a Site Reliability Engineering (SRE), DevOps or similar role.
Deep, hands-on expertise with Amazon Web Services (AWS), particularly EKS, RDS, VPC, IAM, and security services like GuardDuty and Security Hub.
Strong proficiency with containerization (Docker) and Kubernetes orchestration in a production environment.
Expert-level knowledge of Infrastructure as Code, with extensive experience using Terraform.
Demonstrable experience building and managing CI/CD pipelines, preferably with GitHub Actions.
Solid experience in leading incident response efforts, including incident command, diagnostics, and post-incident review.
A strong understanding of networking principles, including VPCs, subnets, load balancing (NLB), and edge security (WAF, DDoS protection) with platforms like Cloudflare.
Familiarity with modern monitoring, logging, and observability principles and tools.

Desired Skills & Experience

Experience working in a highly regulated environment, such as FinTech, banking, or crypto services.
Familiarity with our wider tech stack, including Vercel, Fireblocks, and NGINX.
Experience with security scanning tools for containers and dependencies (e.g., Trivy).
Knowledge of token-based authentication mechanisms such as JWE (JSON Web Encryption) and of best practices for secrets management (e.g., credential stores, AWS KMS).
Scripting skills in languages such as Python or Bash for automation tasks.

BENEFITS

A competitive salary and benefits package.
The opportunity to work with a modern, cutting-edge technology stack.
A key role in a fast-growing company at the intersection of finance and technology.
A collaborative and dynamic work environment with a strong focus on security and resilience.
Flexible working arrangements, including the option to work fully remotely.
