Lead Site Reliability Engineer, Observability – Remote

Full time @CISCO Meraki posted 6 days ago in Information Technology (IT)

Remote within US; San Francisco, CA
Post Date : May 30, 2025
Apply Before : June 17, 2025

Job Detail

Job ID 24026
Experience 2 Years
Qualifications Degree Bachelor

Job Description

“Examples of projects our team works on:
Design, deploy and scale our Prometheus architecture to handle 100+ million active series and beyond.
Deploy and operate large, high-performance Elasticsearch clusters holding 2000+ TB of data.
Deploy and grow high-throughput data pipelines built on Kafka, handling hundreds of thousands of events per second.
Design and build an alerting system that allows engineering teams to construct alerts from multiple data sources and alerting workflows.
Write libraries and APIs that give engineers self-service access to our monitoring, logging, and other observability systems.
Use Terraform to deploy public and private cloud infrastructure.
You are an ideal candidate if you:
Have 5+ years of experience designing, deploying, and operating mid- to large-size distributed systems on VMs or bare-metal machines running Linux (we run Debian and Ubuntu).
Have 2+ years of experience developing with languages like Ruby, Python, Go, or Bash.
Are excited by the challenge of solving difficult problems in large distributed systems that deal with huge amounts of data.
Want to work on a highly autonomous team that cares deeply about quality and customer experience.
Are curious, learn fast and feel comfortable diving into unfamiliar code and systems to solve problems.
Understand the value of observability and can work with other teams to help them better monitor their services.
Are willing to be part of a production on-call rotation.
Have direct experience with the following technologies (or similar): the Elasticsearch, Logstash, Kibana (ELK) stack, Kafka, Prometheus/Thanos/Cortex, Graphite, Ansible, Terraform, Consul.
Have strong experience building solutions based on software engineering best practices.
Keywords: Observability, Monitoring, SRE, Site Reliability Engineering, DevOps, Elasticsearch, Logstash, Kibana, ELK, Grafana, Graphite, Prometheus, Kafka, Snowflake, Ansible, Ruby, Terraform, Consul.
“
