Senior Engineer

Remote / IL 60604, Full Time regular
This job has more than 30 days. You can find more up-to-date jobs using the search box.
Added 2mo ago

The health and safety of Enova’s employees is our number one priority. Proof of vaccination will be required regardless of work location, unless prohibited by applicable state law. Employees may request an exemption to the vaccination policy due to medical reasons, sincerely-held religious beliefs, or as otherwise permitted by applicable state law.

Enova is currently accepting candidates for remote positions in the following eligible states: AL, AK, AR, AZ, CT, GA, IA, ID, IL, IN, KY, LA, MA, ME, MD, MI, MN, MO, MS, NC, ND, NE, NH, NV, NJ, NM, OH, OK, OR, PA, RI, SC, SD, TN, UT, VT, WI, WV, WY.

What you’ll be doing:

In this role, you will help improve the resiliency of our services through technology, incident analysis, and process refinement.

You will work on optimizing how we deal with unexpected complex failures, including facilitating our incident response process, running post-incident blameless retrospectives, analyzing for and learning from consistent high-level trends, and integrating technology to reduce the effort needed to maintain these functions.

You will be responsible for learning how our systems and applications relate holistically in order to appropriately react during outages and work alongside Subject Matter Experts to drive resolution. You will develop improvements to how we collect and analyze data around failures, adjusting to the ever-advancing environment as progress is made.

You will collaborate with IT, Software Engineering, and product teams to foster a culture of quality where resilience is woven into our technology stack. You will show what different failure modes look like by running experiments (Mock Incidents, Disaster Recovery) and share learnings across the organization.

Your core priorities will be to:

  • Own Enova’s Production Incident Process end-to-end.
  • Develop processes and technology to sustainably test and improve the resiliency of our services on an ongoing basis, balancing tech and business needs.
  • Manage process refactoring initiatives to ensure risk mitigation is considered, improving customer experience.
  • Collect data, perform trend analysis, and identify patterns of risks and vulnerabilities.
  • Work with leading teams, particularly principal engineers and production managers, to address vulnerabilities.
  • Socialize lessons learned among all teams to bolster the culture of operational ownership.
  • Join our PI PIC (Incident Commander) rotation after training, leading incidents to completion and driving post-incident analysis (including interviews, contributing factor analysis, incident response analysis, and remediation plans).

What you should have:

  • 3+ years of professional work experience in a technology role, such as Software Engineering, Systems, Ops, SRE, or Product Management.
  • Interest in complex distributed systems - how they work, how they can work better, how to know if they are working correctly.
  • Superior analytical, problem solving, and critical thinking skills.
  • Understanding of infrastructure as code (Terraform, Chef, etc.).
  • Experience with query languages (Postgres, SQL, Kafka, etc.).
  • Ability to handle, analyze, and present data.
  • Comfortable with ambiguity; able to translate ambiguous problems into strong solutions.
  • Demonstrates maturity, good judgment, negotiation, leadership and project management skills.
  • Excellent written and verbal communication skills, including the ability to communicate to different levels of an organization (i.e. on a technical vs. non-technical level).

Nice to have:

  • Experience with full stack development.
  • Experience with handling and leading resolution of major failures of critical systems.
  • Experience driving large-scale changes.

About Resilience Engineering:

Resilience Engineering is a subset of the Site Reliability Engineering team that strives to drive a culture of continuous resiliency improvement in our systems. We do this by focusing on our incident response process, on incident analysis and learnings, and on creatively solving systemic hurdles to resiliency. We work closely with other Tech, Operations, and Business teams to resolve complex failures and to continuously learn.

Our goal at Enova is to recruit, hire, develop and maintain a diverse workforce. It is our policy to provide equal employment opportunity for all persons and not to discriminate in employment decisions, placing the most qualified person in each job without regard to any classification protected by federal, state, or local law.
