Machine Learning Safety: Evaluation Research Engineer
Do you want to help shape the future of AI at Apple? Our team, part of Apple Services Engineering's (ASE) Human Centered AI Research organization, pioneers new methods and tools for AI evaluation. You will help build tools that accelerate our team's research and empower the entire organization to build and evaluate AI more effectively. As a technical leader, you'll help set engineering standards for evaluation systems across ASE and mentor researchers and engineers on best practices for AI tooling.
Description
This role is for a Machine Learning expert to drive the operational setup, execution, and quality assurance of safety evaluations across languages and markets. You will play a crucial role in the collaborative development of canonical evaluation guidelines, working with subject matter experts and partners on evaluation task configuration, running pilots, monitoring live evaluations, and ensuring data quality throughout the evaluation lifecycle.
An ideal candidate possesses strong data science fundamentals and experience managing complex annotation or evaluation tasks.
This role will involve designing evaluations that scale across diverse linguistic contexts by partnering with subject matter experts and cross-functional partners.
You will build upon product safety requirements to create taxonomies, compose and curate exemplar safety evaluation datasets, and ensure that evaluation frameworks are culturally and linguistically grounded.
An ideal candidate possesses a strong understanding of sociotechnical evaluation design principles and practices, along with experience designing evaluations that support policies and/or product requirements, classification systems, and annotation and/or study participant guidelines.
","responsibilities":"Taxonomy Development: Design, refine, and maintain safety-relevant taxonomies that capture media-specific risk categories, content types, and policy distinctions, achieved through collaborations with subject matter experts who bring knowledge across languages and cultural contexts. You will work collaboratively to ensure taxonomies that represent AI safety risks are comprehensive, internally consistent, and actionable for downstream policy development and evaluation work.
Exemplar Curation: Develop and validate exemplar sets that illustrate taxonomy categories, edge cases, and boundary conditions. Collaborate with language and cultural experts to ensure exemplars are culturally appropriate and representative across target markets.
Policy-to-Data Translation: Partner with policy, product, and engineering teams to translate policies and guidelines into concrete dataset requirements, annotation templates, and evaluation criteria that can be operationalized across distributed annotators.
Documentation & Communication: Produce clear, detailed documentation of taxonomies and risk assessments, data sampling methodologies, and evaluation design rationale. Present evaluation findings and recommendations to cross-functional stakeholders including engineering, product, and legal teams.
Canonical Guideline Development: Author and maintain canonical evaluation guidelines that standardize task definitions, rating criteria, and edge-case handling. These assets will be adapted to scale across languages and markets, with the support of multi-lingual and operations experts. You will ensure guidelines are clear, complete, and adaptable.
Task Setup & Configuration: Collaborate with partners to configure evaluation tasks, including platform setup, workflow design, annotator assignment, and quality control mechanisms. Ensure task configurations align with research design specifications.
Pilot Design & Execution: Design and run pilot evaluations to validate task setups, identify guideline ambiguities, calibrate annotator understanding, and surface issues before full-scale deployment. Analyze pilot results and iterate on guidelines and configurations accordingly.
Monitoring & Data Quality: Develop and implement monitoring frameworks to track evaluation progress, annotator performance, inter-rater agreement, and data quality in real time. Flag anomalies and implement corrective actions to maintain data integrity across markets.
Data Pipeline & Delivery: Manage the end-to-end data pipeline from raw annotations to clean, analysis-ready datasets. Ensure data is properly structured, documented, and delivered to downstream research and engineering consumers.
Automated Judge Development: Train, fine-tune, and validate automated judge models that can reliably score AI system outputs for safety, quality, and policy compliance across languages. Develop calibration and agreement metrics to ensure judges meet human-parity benchmarks.
Validation Techniques: Design and implement validation frameworks to assess the accuracy, reliability, and cross-linguistic consistency of automated evaluation systems. Develop methods to detect drift, bias, and failure modes in automated judges across markets.
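As a rough illustration of the kind of cross-linguistic validation this bullet describes, the sketch below compares judge-model labels against human labels per language and flags markets falling below an agreement threshold. All record fields, names, and the 0.85 threshold are hypothetical, not a description of Apple's actual tooling.

```python
from collections import defaultdict

def judge_agreement_by_language(records, min_agreement=0.85):
    """records: iterable of (language, human_label, judge_label) tuples.

    Returns per-language raw agreement between judge and human labels,
    plus the set of languages whose agreement falls below the threshold
    (candidates for drift or failure-mode investigation).
    """
    totals = defaultdict(int)
    matches = defaultdict(int)
    for lang, human, judge in records:
        totals[lang] += 1
        matches[lang] += int(human == judge)
    agreement = {lang: matches[lang] / totals[lang] for lang in totals}
    flagged = {lang for lang, score in agreement.items() if score < min_agreement}
    return agreement, flagged

records = [
    ("en", "safe", "safe"),
    ("en", "unsafe", "safe"),   # judge misses an unsafe case in English
    ("fr", "safe", "safe"),
    ("fr", "unsafe", "unsafe"),
]
agreement, flagged = judge_agreement_by_language(records)
print(agreement, flagged)  # {'en': 0.5, 'fr': 1.0} {'en'}
```

In practice raw agreement would be supplemented with chance-corrected metrics and confidence intervals, but the per-language breakdown is the core of detecting judge drift across markets.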
Automated Performance Monitoring: Build automated performance check systems that continuously monitor AI safety metrics, flag regressions, and generate alerts. Integrate these checks into CI/CD and model release workflows.
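A minimal sketch of the regression-flagging step described above, assuming metrics are reported as name-to-score dictionaries; the metric names and tolerance are illustrative, not actual release criteria.

```python
def check_regressions(current, baseline, tolerance=0.02):
    """Compare current safety metrics against a baseline run and return
    alert messages for any metric that dropped by more than `tolerance`.

    A CI/CD hook could fail the release when this list is non-empty.
    """
    alerts = []
    for name, value in current.items():
        base = baseline.get(name)
        if base is not None and base - value > tolerance:
            alerts.append(f"{name}: {base:.3f} -> {value:.3f} (regression)")
    return alerts

baseline = {"harm_refusal_rate": 0.97, "policy_compliance": 0.95}
current = {"harm_refusal_rate": 0.90, "policy_compliance": 0.95}
print(check_regressions(current, baseline))
```

Real monitoring would add statistical significance checks so that sampling noise on small evaluation sets does not trigger spurious alerts.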
Synthetic Data Generation: Develop and maintain synthetic data generation pipelines to augment evaluation coverage, stress-test safety boundaries, and support evaluation in low-resource languages. Ensure synthetic data is diverse, representative, and validated against human-generated benchmarks.
Scalable Analysis & Reporting Automation: Create automated pipelines for analysis and reporting that reduce manual effort, increase reproducibility, and enable rapid cross-market safety assessments. Build tooling that integrates with existing dashboards and reporting workflows.
Preferred Qualifications
Advanced degree (MS/PhD) in Data Science, Statistics, Computational Linguistics, Information Science, or a related field.
Experience operating evaluation or annotation pipelines across multiple languages or markets.
Familiarity with annotation platforms and task management tools (e.g., Label Studio, Scale AI, or similar).
Experience with SQL and large-scale data infrastructure (e.g., Spark, Hadoop, or cloud-based analytics platforms).
Prior experience in AI safety, responsible AI, content moderation, or trust and safety domains.
Experience designing quality assurance frameworks for crowdsourced or distributed annotation work.
General familiarity with localization workflows or working with language service providers.
Minimum Qualifications
3+ years of experience in a data science, applied research, or evaluation operations role, with hands-on experience managing annotation or evaluation pipelines.
Proficiency in Python and experience with data processing, statistical analysis, and visualization libraries (e.g., pandas, NumPy, scipy, matplotlib, seaborn).
Experience developing and maintaining annotation guidelines or evaluation protocols for human labeling tasks.
Comfortable computing and interpreting inter-rater reliability metrics (e.g., Cohen's kappa, Krippendorff's alpha) and other data quality indicators.
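As an example of the first metric named above, Cohen's kappa corrects raw agreement between two raters for the agreement expected by chance given each rater's label frequencies. A self-contained sketch (the labels shown are hypothetical):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items (nominal labels)."""
    assert len(rater_a) == len(rater_b), "raters must label the same items"
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independence: sum over labels of the
    # product of each rater's marginal frequency for that label.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b.get(label, 0) for label in freq_a) / n**2
    return (observed - expected) / (1 - expected)

a = ["safe", "safe", "unsafe", "safe", "unsafe", "safe"]
b = ["safe", "unsafe", "unsafe", "safe", "unsafe", "safe"]
print(round(cohens_kappa(a, b), 3))  # 0.667
```

Krippendorff's alpha generalizes this idea to more than two raters, missing labels, and non-nominal data, which is why it is the usual choice for distributed annotation pools.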
Demonstrated ability to collaborate with annotation operations services, vendor teams, or distributed study participants.
Able to work independently as well as collaboratively with minimal direction.
Organized, highly attentive to detail, and effective at managing time.
1+ year of experience working in industry.
Apple is an equal opportunity employer that is committed to inclusion and diversity. We seek to promote equal opportunity for all applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or other legally protected characteristics. Learn more about your EEO rights as an applicant.
Pay & Benefits
At Apple, base pay is one part of our total compensation package and is determined within a range. This provides the opportunity to progress as you grow and develop within a role. The base pay range for this role is between $201,300 and $302,200, and your base pay will depend on your skills, qualifications, experience, and location.
Apple employees also have the opportunity to become an Apple shareholder through participation in Apple's discretionary employee stock programs. Apple employees are eligible for discretionary restricted stock unit awards, and can purchase Apple stock at a discount if voluntarily participating in Apple's Employee Stock Purchase Plan. You'll also receive benefits including: Comprehensive medical and dental coverage, retirement benefits, a range of discounted products and free services, and for formal education related to advancing your career at Apple, reimbursement for certain educational expenses - including tuition. Additionally, this role might be eligible for discretionary bonuses or commission payments as well as relocation. Learn more about Apple Benefits.
Note: Apple benefit, compensation and employee stock programs are subject to eligibility requirements and other terms of the applicable plan or program.