Provide application support, implement SRE best practices, and handle incidents, escalations, and problem management for NFRT Production plants. This includes application server administration and technical troubleshooting of infrastructure and user incidents. Incorporate System Reliability Engineering implementation into the day-to-day role by developing automated solutions to long-standing problems to ensure minimal downtime and manual effort. Execute web architecture, including performance, availability, scalability, and disaster recovery planning. Monitor alerts and configure application monitors using industry-standard monitoring tools, and develop customized monitoring solutions. Revisit SRE metrics and confirm them against the firm and department goals. Implement tooling and create automations to help with toil elimination (manual or repetitive work). Identify areas for improvement, including automation, toil reduction, resiliency, and observability across the platform, and help build up the knowledge and documentation for the team. Produce reusable infrastructure design patterns for future reference and periodically review and refresh the patterns. Apply technical skills to automate daily support functions, improve system stability, support hygiene initiatives, and deliver innovation that creates efficiency and consistency.

The role requires availability for weekend and on-call work on a rotation basis.

4 years of advanced, hands-on experience with the Linux 7.x operating system. Experience with Service Oriented Architecture, distributed systems, scripting such as Python and shell, and relational databases (e.g., Sybase, DB2, SQL, Postgres). Hands-on experience with web servers (Apache / Nginx) and application servers (Tomcat / JBoss), including application integration, configuration, and troubleshooting. Hands-on experience with Docker containers, Kubernetes, and SaaS platform integration.
Clear understanding of load balancers, web proxies, and storage platforms such as NAS / SAN, from an implementation perspective only. Familiarity with basic security policies for secure hosting solutions, Kerberos, and standard encryption methodologies including SSL and TLS. Strong knowledge of SRE principles, with a grasp of the tools and approaches used to apply them. Experience in troubleshooting application issues and managing incidents. Exposure to tools like OpenTelemetry, Prometheus, Grafana, Splunk, Ansible, and Kafka. Excellent verbal and written communication skills.

Build a career with impact. Visit morganstanley.com for more information. Our values - putting clients first, doing the right thing, leading with exceptional ideas, committing to diversity and inclusion, and giving back - aren't just beliefs; they guide the decisions we make every day to do what's best for our clients, communities, and more than 80,000 employees in 1,200 offices across 42 countries. Our teams are relentless collaborators and creative thinkers, fueled by their diverse backgrounds and experiences. We are proud to support our employees and their families at every point along their work-life journey, offering some of the most attractive and comprehensive employee benefits and perks in the industry. There's also ample opportunity to move about the business for those who show passion and grit in their work. To learn more about our offices across the globe, please copy and paste https://www.morganstanley.com/about-us/global-offices into your browser. We work to provide a supportive and inclusive environment where all individuals can maximize their full potential.