
Research Scientist - Privacy-Preserving Large-Scale Model Training & Architecture Optimization

3+ months ago San Jose, CA

Responsibilities

At TikTok, we treat privacy as a top priority in product design and implementation. Privacy is not just about regulatory compliance; it is also a more trusted way to enable technological innovation by respecting users' privacy choices.

About the Team

The Privacy Innovation (PI) Lab was established to explore the next frontier of privacy technology and theory in the digital world. We provide key insights and technical solutions for privacy-related innovation across all of TikTok's products, and we collaborate with technical and academic communities worldwide to build an open ecosystem that promotes a privacy-friendly digital experience.

About the Role
We are building next-generation generative foundation models, with a strong focus on diffusion-based and unified generation-understanding architectures, deployed in privacy-sensitive production environments.
This role sits at the intersection of:
- Large-scale model training systems
- GPU-first architecture and kernel-level optimization

- Diffusion / DiT / unified multimodal foundation models
- Privacy-preserving and compliant training pipelines
You will work on end-to-end training architecture design, from model-parallel execution and GPU efficiency to robust, fault-tolerant, privacy-aware training infrastructure.

Responsibilities
Model Training Architecture & Systems
- Design and optimize large-scale training architectures for diffusion-based and unified generative models (e.g., DiT, Rectified Flow, hybrid AR + diffusion systems).
- Lead GPU-centric performance optimization, including memory layout, communication overlap, kernel fusion, and throughput scaling across thousands of accelerators.
- Develop and evolve distributed training strategies (DP / TP / PP / ZeRO / FSDP-style sharding) tailored to long-running, multi-stage foundation model training.
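For context, the core idea behind the ZeRO/FSDP-style sharding mentioned above is that each data-parallel rank owns only a slice of the flattened parameters and optimizer state, rather than a full replica. A minimal sketch of that sharding arithmetic in plain Python (the function name `shard_bounds` is illustrative, not any framework's actual API):

```python
# Sketch of ZeRO/FSDP-style parameter sharding: each data-parallel rank
# owns an even slice of a flat parameter buffer, so optimizer state scales
# as numel / world_size instead of being replicated on every rank.
def shard_bounds(numel: int, world_size: int, rank: int) -> tuple[int, int]:
    """Return the [start, end) slice of a flat buffer owned by `rank`.

    The buffer is conceptually padded up to a multiple of world_size so
    that every rank's shard has the same size (the last rank's real slice
    may be shorter once padding is clipped off).
    """
    padded = -(-numel // world_size) * world_size  # round up to a multiple
    shard = padded // world_size
    start = min(rank * shard, numel)
    end = min(start + shard, numel)
    return start, end

# Example: 10 parameters sharded across 4 ranks -> slices of size 3,3,3,1.
bounds = [shard_bounds(10, 4, r) for r in range(4)]
```

Together the slices are disjoint and cover the whole buffer, which is the invariant that lets each rank run the optimizer step on its shard alone and all-gather full parameters only when needed.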

Robustness, Reliability & Production Readiness
- Build fault-tolerant, self-healing training systems that can sustain long-running jobs under frequent hardware, network, and software failures.
- Design mechanisms for fast failure detection, recovery, and minimal training interruption, including checkpointing strategies, restart policies, and controlled rollouts.
- Improve training-efficiency metrics such as ETTR (effective training time ratio) and MFU (model FLOPs utilization) under real-world production constraints.
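The checkpointing-and-restart mechanisms described above reduce, at minimum, to two pieces: writes that can never be observed half-finished, and recovery that resumes from the newest complete checkpoint. A minimal sketch under those assumptions (the file layout and names here are illustrative, not TikTok's actual infrastructure):

```python
import json
import os
import tempfile

# Sketch of an atomic checkpoint write plus latest-checkpoint recovery,
# the core of a restart policy for long-running training jobs.

def save_checkpoint(ckpt_dir: str, step: int, state: dict) -> None:
    """Atomically persist `state` for training step `step`."""
    os.makedirs(ckpt_dir, exist_ok=True)
    # Write to a temp file first, then rename into place: a crash mid-write
    # can never leave a truncated checkpoint for recovery to pick up.
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir)
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, os.path.join(ckpt_dir, f"ckpt_{step:08d}.json"))

def latest_step(ckpt_dir: str):
    """Return the newest checkpointed step, or None if none exist."""
    steps = [
        int(name[5:13])
        for name in os.listdir(ckpt_dir)
        if name.startswith("ckpt_") and name.endswith(".json")
    ]
    return max(steps, default=None)
```

On restart, a supervisor would call `latest_step`, reload that checkpoint, and resume; real systems layer asynchronous/sharded writes and cross-rank consistency on top of this same atomic-rename primitive.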

Diffusion & Unified Model Optimization
- Optimize Diffusion Transformer training pipelines, including noise schedules, timestep strategies, and memory-efficient attention mechanisms.
- Support unified generation-and-understanding models, enabling shared context, long-sequence multimodal reasoning, and scalable training without architectural bottlenecks.
- Collaborate with research teams on architecture-level tradeoffs between quality, compute efficiency, and training stability.
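As background on the Rectified Flow formulation named earlier: it trains a velocity field on the straight-line path between a data sample and noise, with the constant displacement as the regression target. A minimal scalar sketch of that interpolation (illustrative only, not the production training pipeline):

```python
# Rectified flow interpolates along the straight line
#     x_t = (1 - t) * x0 + t * x1
# between a data sample x0 and a noise sample x1; the model regresses
# the constant velocity x1 - x0 at each timestep t in [0, 1].

def interpolate(x0: list[float], t: float, x1: list[float]) -> list[float]:
    """Point at time t on the straight path from x0 (t=0) to x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0: list[float], x1: list[float]) -> list[float]:
    """Regression target: the path's constant velocity, x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]

# Example: halfway between data [0, 2] and noise [4, 6].
midpoint = interpolate([0.0, 2.0], 0.5, [4.0, 6.0])
```

The straight-line path is what makes timestep scheduling simple in this family of models; the engineering work in the bullets above is largely about making that training loop memory- and throughput-efficient at scale.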

Qualifications

Minimum Qualifications
- Strong background in large-scale deep learning systems and distributed training.
- Hands-on experience with GPU optimization, including memory management, communication/computation overlap, and performance profiling.
- Experience training diffusion models, DiT-style architectures, or large foundation models at scale.
- Proficiency in PyTorch and modern distributed training stacks.
- Solid understanding of parallelism strategies (DP / TP / PP / ZeRO / FSDP or equivalents).
- Ability to reason about training stability, numerical issues, and long-running job robustness.

Preferred Qualifications
- Experience with privacy-preserving ML, sensitive data training, or regulated environments.
- Familiarity with fault-tolerant training systems, checkpointing strategies, or production GPU orchestration.
- Experience with unified multimodal models (generation + understanding) or hybrid AR/diffusion systems.
- Low-level performance work (CUDA kernels, custom ops, fused attention, or communication libraries).
- Background in production ML infrastructure supporting thousands of GPUs.

Job Information

[For Pay Transparency] Compensation Description (annually)

The base salary range for this position in the selected city is $136,800 - $259,200 annually.

Compensation may vary outside of this range depending on a number of factors, including a candidate's qualifications, skills, competencies and experience, and location. Base pay is one part of the Total Package that is provided to compensate and recognize employees for their work, and this role may be eligible for additional discretionary bonuses/incentives, and restricted stock units.

Benefits may vary depending on the nature of employment and the country of the work location. Employees have day-one access to medical, dental, and vision insurance, a 401(k) savings plan with company match, paid parental leave, short-term and long-term disability coverage, life insurance, and wellbeing benefits, among others. Employees also receive 10 paid holidays per year, 10 paid sick days per year, and 17 days of Paid Personal Time (prorated upon hire, with accruals increasing by tenure).

The Company reserves the right to modify or change these benefits programs at any time, with or without notice.

For Los Angeles County (unincorporated) Candidates:

Qualified applicants with arrest or conviction records will be considered for employment in accordance with all federal, state, and local laws, including the Los Angeles County Fair Chance Ordinance for Employers and the California Fair Chance Act. Our company believes that criminal history may have a direct, adverse, and negative relationship with the following job duties, potentially resulting in the withdrawal of the conditional offer of employment:

1. Interacting and occasionally having unsupervised contact with internal/external clients and/or colleagues;

2. Appropriately handling and managing confidential information including proprietary and trade secret information and access to information technology systems; and

3. Exercising sound judgment.

Client-provided location(s): San Jose, CA
Job ID: TikTok-7072585766320212254
Employment Type: OTHER
Posted: 2025-02-13T13:25:09

Perks and Benefits

  • Health and Wellness

    • Health Insurance
    • Dental Insurance
    • Vision Insurance
    • HSA
    • Life Insurance
    • Fitness Subsidies
    • Short-Term Disability
    • Long-Term Disability
    • On-Site Gym
    • Mental Health Benefits
    • Virtual Fitness Classes
  • Parental Benefits

    • Fertility Benefits
    • Adoption Assistance Program
    • Family Support Resources
  • Work Flexibility

    • Flexible Work Hours
    • Hybrid Work Opportunities
  • Office Life and Perks

    • Casual Dress
    • Snacks
    • Pet-friendly Office
    • Happy Hours
    • Some Meals Provided
    • Company Outings
    • On-Site Cafeteria
    • Holiday Events
  • Vacation and Time Off

    • Paid Vacation
    • Paid Holidays
    • Personal/Sick Days
    • Leave of Absence
  • Financial and Retirement

    • 401(K) With Company Matching
    • Performance Bonus
    • Company Equity
  • Professional Development

    • Promote From Within
    • Access to Online Courses
    • Leadership Training Program
    • Associate or Rotational Training Program
    • Mentor Program
  • Diversity and Inclusion

    • Diversity, Equity, and Inclusion Program
    • Employee Resource Groups (ERG)

Company Videos

Hear directly from employees about what it is like to work at TikTok.