DESCRIPTION
Job summary
Blink's Site Reliability Engineers (SREs) are responsible for keeping all of our user-facing services running smoothly. As an SRE you will be part of the team supporting our production infrastructure and ensuring customers have a great experience with our products.
Key job responsibilities
- Monitor and support the AWS Cloud components of the Blink home security system
- Work with the latest AWS technology and tools, with access to all of Amazon's internal resources
- Identify and resolve service issues while diving deep into understanding root causes
- Develop, maintain, and improve monitoring solutions
- Automate repetitive processes
- Perform infrastructure maintenance and configuration
- Work closely with the software development teams to ensure that platforms are designed with operability in mind
- Ensure our systems are resilient and fault-tolerant
- Assist with initiatives for upgrading and scaling our systems to improve availability and performance
- Create and maintain operational runbooks and other documentation
- Participate in a 24/7 on-call rotation; each engineer is on call about one week per month
BASIC QUALIFICATIONS
- 2+ years of hands-on troubleshooting in highly available, highly scalable, mission-critical environments
- 2+ years of experience using and configuring system health and application performance monitoring tools
- 2+ years of experience in UNIX/Linux operating system administration
- 2+ years of experience managing cloud infrastructure and developing operational processes
- Proficiency in Python, Ruby, Perl, Bash, or any popular programming language
- Clear written and verbal communication skills
PREFERRED QUALIFICATIONS
- Experience writing Ansible playbooks and modules
- Experience performance-tuning SQL queries and schemas
- In-depth understanding of HTTP, plus experience debugging REST applications and implementing security
- In-depth understanding of networking and tools used to debug networking issues (tcpdump, netstat, lsof, etc.)
- Experience using Jenkins for builds and deployments, and for automating job creation with DSL
- Experience with AWS EC2, ELB, ECS, Auto Scaling, IAM, S3, RDS, and DynamoDB
- Working knowledge of DNS, SSH, HTTP, TCP/IP, and other common network protocols
- Understanding of Continuous Integration / Continuous Delivery (CI/CD) and Agile software engineering practices