LLM AIOps Development Engineer - Data Center Networking
Responsibilities
About the team
Networking brings together innovative ideas and technologies from network architecture, software-defined networking (SDN), network virtualization, switch software and hardware co-design, and high-speed networking to create hyperscale data center networking solutions that power some of the world's most popular apps, such as Douyin and TikTok, which serve hundreds of millions of users around the globe.
The Network Observation team is committed to building a world-leading hyperscale data center network infrastructure that supports real-time access for hundreds of millions of users and the explosive growth of data volumes. We believe that the next generation of network operations will be fundamentally powered by artificial intelligence technologies, particularly Large Language Models (LLMs).
We are seeking a passionate development engineer who combines deep networking expertise with innovative AIOps capabilities to join us in defining and building "autonomous" data center networks. Together, we will transform network operations from a reactive "firefighting" mode into a proactive, data-driven intelligent ecosystem with predictive and self-healing capabilities.
Responsibilities:
As a core member of our team, you will collaborate closely with our NetOps, SRE, and platform engineering teams to tackle the complexities of one of the world's largest data center networks. You will design and implement a closed-loop AIOps platform for the network. Specifically, you will:
- Build a Panoramic Network Observability Platform: Develop a streaming telemetry data pipeline for both physical and virtual networks, integrating multi-source data from gNMI, NETCONF, IPFIX/NetFlow, and SNMP to provide a high-quality, real-time data foundation for AIOps.
- Develop an Intelligent Diagnostics and Root Cause Analysis System: Apply machine learning and deep learning algorithms to perform anomaly detection, correlation analysis, and intelligent noise reduction on massive volumes of network metrics, logs, and events. Swiftly pinpoint root causes of failures across the entire stack, from optical transceivers and switch hardware to protocol adjacencies and application traffic.
- Explore Innovative Applications of LLMs and Agents:
  - Intelligent Operations Assistant: Build a conversational chatbot powered by Retrieval-Augmented Generation (RAG) that understands natural language queries, automatically queries knowledge bases and monitoring data, and provides precise troubleshooting guidance and network status reports.
  - Automated Remediation and Smart Runbooks: Train operational Agents to safely and controllably invoke network change tools and APIs. Empower them to autonomously generate, recommend, or even execute remediation plans and emergency runbooks based on their understanding of failure scenarios.
- Establish Capacity and Risk Prediction Capabilities: Forecast network capacity bottlenecks, high-risk links, and "sub-healthy" devices based on historical data and business growth models, enabling proactive scaling and preventative maintenance.
- Forge a Rock-Solid Engineering System: Adhere to engineering best practices to design and develop a highly available and scalable AIOps platform. Guarantee the stability and performance of the entire pipeline, from data collection and model training to online inference and automated closed-loop actions.
Qualifications
Minimum Qualifications:
- Solid Fundamentals in Computer Science and Networking: A deep understanding of data center network architectures (e.g., Spine-Leaf Fabric), and proficiency in key protocols such as EVPN/VXLAN and BGP/OSPF. In-depth knowledge of the Linux network stack is essential.
- Excellent Software Engineering Skills: Mastery of Golang or Python with outstanding coding and system design abilities. Familiarity with modern software development workflows, including microservices, containerization (Docker/Kubernetes), and CI/CD.
- Rich Platform Development Experience: Practical experience in one or more of the following areas is highly desirable:
  - Big Data Processing: Familiarity with Kafka, Flink, ClickHouse/TSDB, and experience building real-time data pipelines and analytics systems.
  - Observability Technologies: Experience with Prometheus/OpenTelemetry, graph databases (e.g., Neo4j), and developing alert and event platforms.
- A Passion for AIOps/ML/LLM Practices:
  - A keen interest in the latest advancements in Large Models and Agent technologies, with thoughtful insights or hands-on experience in their application to operations (e.g., RAG, tool use, safety evaluation).
Preferred Qualifications:
- Experience in operating or developing for hyperscale (100,000+ servers) data center networks.
- Proven experience leading or making significant contributions to an LLM/Agent-based intelligent operations project with measurable business impact.
- Active contributions to open-source communities such as SONiC, P4/PINS, eBPF, Prometheus, or OpenTelemetry.
- In-depth research or practical experience in high-performance networking (RDMA/RoCE), SmartNICs (NIC Offload), or DPDK/eBPF.
- Experience building network configuration and control systems (e.g., based on SONiC, gNMI, NETCONF).
Job Information
[For Pay Transparency] Compensation Description (annually)
The base salary range for this position in the selected city is $177,688 - $341,734 annually.
Compensation may vary outside of this range depending on a number of factors, including a candidate's qualifications, skills, competencies and experience, and location. Base pay is one part of the Total Package that is provided to compensate and recognize employees for their work, and this role may be eligible for additional discretionary bonuses/incentives, and restricted stock units.
Benefits may vary depending on the nature of employment and the country work location. Employees have day one access to medical, dental, and vision insurance, a 401(k) savings plan with company match, paid parental leave, short-term and long-term disability coverage, life insurance, wellbeing benefits, among others. Employees also receive 10 paid holidays per year, 10 paid sick days per year and 17 days of Paid Personal Time (prorated upon hire with increasing accruals by tenure).
The Company reserves the right to modify or change these benefits programs at any time, with or without notice.
For Los Angeles County (unincorporated) Candidates:
Qualified applicants with arrest or conviction records will be considered for employment in accordance with all federal, state, and local laws including the Los Angeles County Fair Chance Ordinance for Employers and the California Fair Chance Act. Our company believes that criminal history may have a direct, adverse and negative relationship on the following job duties, potentially resulting in the withdrawal of the conditional offer of employment:
1. Interacting and occasionally having unsupervised contact with internal/external clients and/or colleagues;
2. Appropriately handling and managing confidential information including proprietary and trade secret information and access to information technology systems; and
3. Exercising sound judgment.
Perks and Benefits
Health and Wellness
- Health Insurance
- Dental Insurance
- Vision Insurance
- HSA
- Life Insurance
- Fitness Subsidies
- Short-Term Disability
- Long-Term Disability
- On-Site Gym
- Mental Health Benefits
- Virtual Fitness Classes
Parental Benefits
- Fertility Benefits
- Adoption Assistance Program
- Family Support Resources
Work Flexibility
- Flexible Work Hours
- Hybrid Work Opportunities
Office Life and Perks
- Casual Dress
- Snacks
- Pet-friendly Office
- Happy Hours
- Some Meals Provided
- Company Outings
- On-Site Cafeteria
- Holiday Events
Vacation and Time Off
- Paid Vacation
- Paid Holidays
- Personal/Sick Days
- Leave of Absence
Financial and Retirement
- 401(K) With Company Matching
- Performance Bonus
- Company Equity
Professional Development
- Promote From Within
- Access to Online Courses
- Leadership Training Program
- Associate or Rotational Training Program
- Mentor Program
Diversity and Inclusion
- Diversity, Equity, and Inclusion Program
- Employee Resource Groups (ERG)