Senior SRE Engineer
About Kohl's - What's Our Inspiration?
Many people think of Kohl's as just a brick-and-mortar retail chain. The truth is, we have developed an omni-channel approach to reaching our customers through our stores, online and mobile experiences. Our main source of inspiration has always been our customers, our associates and our drive to be the most engaging retailer in America.
Why Kohl's Information Technology?
At Kohl's, our mission is "To inspire and empower families to lead more fulfilled lives." That statement is also true of the culture of our 1,000+ person technology team. We want to be the most engaging retailer out there while offering you opportunities to have a flexible work schedule, work with some of the newest technologies, follow clear career paths and make an impact in the work that you do every day (not to mention great employee discounts). With a $1 billion investment in technology over the next 3 years, innovation is at the heart of everything that we do.
What Will You Be Doing?
Site Reliability Engineers (SREs) are responsible for keeping business-critical Stores systems across 1100+ stores running reliably, whether they run on-prem, in the cloud or in a hybrid environment. SRE treats operations as a software problem, bringing software and systems engineering principles, skills, best practices and robust automation to run services reliably in production. This enables a high velocity of change while maintaining a high level of availability, by embracing risk and shared ownership between Product and SRE.
As a Senior SRE you will:
- Develop and manage real-time production monitoring, instrumentation and observability.
- Identify metrics and build visualizations using Grafana/InfluxDB/ELK to drive quality and efficiency.
- Identify toil and repetitive issues and automate them away.
- Participate in the on-call rotation to respond to production incidents and provide support for Ops engineers as needed.
- Perform root cause analysis (RCA) and blameless postmortems to drive a culture of continuous improvement.
- Document every action so it can be turned into repeatable SOPs and then into automation.
- Proactively identify failures before they become outages, using chaos engineering techniques and testing of edge cases, failure modes, DR, etc.
- Design, build, maintain and run core stores infrastructure.
- Perform capacity planning and analysis, and infrastructure change management (including tuning, reshaping, resizing, and migrating infrastructure), for services and their immediate downstreams. You will have a comprehensive view of system interactions in the local ecosystem, providing valuable feedback and insight to broader capacity planning and event preparation efforts.
- Collaborate with SWE service owners to productionize new services and features, as well as improve the production landscape for existing services, providing SRE expertise and implementing best practices, including setting up SLIs/SLOs and Error Budgets to use for services on an ongoing basis.
Skills & Experience
- Bachelor's Degree or equivalent in MIS, Computer Science or a related field
- 6+ years of experience in SWE, SRE and/or Systems Engineering roles
- Strong programming skills in one or more languages or frameworks - Spring, Python, Go or Node.js
- Experience in at least one PaaS or container platform - OpenShift, Cloud Foundry, Kubernetes or equivalent
- Experience with one or more configuration management systems like Chef, Ansible, Puppet
- Experience in one or more observability platforms - Prometheus, InfluxDB, Grafana, ELK or APM
- Deep understanding of systems architecture, UNIX internals, networking topologies, multi-cluster, multi-tenancy and security implications.
- You are passionate about reliability as a feature and have experience implementing reliability features in application code and solving reliability challenges across the organization - when you see something broken, you can't help but fix it!
- Experience working with engineering teams to adopt tooling, help build better resiliency, guardrails, and proven SRE practices into our applications and process including
- A drive to deliver quickly and iterate fast
- Ability to diagnose technical problems, debug code, and automate routine tasks
- Analytical approach coupled with solid communication skills and a sense of ownership
- Ability to mentor and coach junior engineers and improve team collaboration
- Retail experience
- Experience with WebLogic, Oracle, SQL