Site Reliability Engineer

Remote / Eden Prairie, Minnesota, United States of America (Full Time, Regular)

Primary Responsibilities:

  • Leverage Chef to manage a large deployment of Telegraf agents
  • Manage several RabbitMQ clusters
  • Scale and manage a large distributed InfluxDB cluster
  • Operate and maintain the Nomad container platform
  • Create dashboards and alerts
  • Participate in on-call duties (support hours are 8 AM to 11 PM Central Time; the rotation changes weekly)
  • Troubleshoot and debug application and infrastructure issues

You'll be rewarded and recognized for your performance in an environment that will challenge you, give you clear direction on what it takes to succeed in your role, and provide development for other roles you may be interested in.

Required Qualifications:

  • Undergraduate degree or equivalent experience
  • Coding experience using a high-level programming language (Ruby, Sinatra, Rails a plus)
  • Experience using Git (GitHub a plus)
  • Working knowledge of one or more container technologies (Docker, Nomad, Kubernetes, OpenShift, etc.)
  • Familiarity with SRE concepts
  • Solid Linux skills (Red Hat or CentOS a plus)
  • Data visualization skills (Grafana or Kibana)
  • Solid troubleshooting skills

Preferred Qualifications:

  • Experience with one or more time-series platforms (Telegraf/InfluxDB, Prometheus/Thanos/Mimir, TimescaleDB, Graphite, OpenTSDB, etc.)
  • Experience writing and maintaining Chef cookbooks
  • Experience deploying and managing a message queueing system (RabbitMQ, Kafka, etc.)
  • Experience with HashiCorp tools (Nomad, Consul, Vault, Terraform, Vagrant, etc.)
  • Experience with APM and logging tools (Elastic, Splunk, etc.)
  • Solid understanding of networking and proxies
  • Technical writing skills (creating flow diagrams, end-user documentation, etc.)