Machine Learning Safety: Evaluation Research Engineer
This role supports the design and development of safety evaluation methodologies for generative and agentic AI features that enable users across the globe to interact with our media products and services.
Description
You will play an impactful role: shaping responsible AI and safety policies, evaluating fidelity to product safety requirements, creating risk assessments and taxonomies, curating exemplar safety evaluation datasets, and ensuring that evaluation frameworks are culturally and linguistically grounded.
An ideal candidate possesses a strong understanding of issues in responsible AI and in AI and society, as well as of technology evaluation design principles and practices, and brings experience designing evaluations to support policies and/or product requirements, classification systems, and annotation and/or study participant guidelines.
Responsibilities
Taxonomy Development: Design, refine, and maintain safety-relevant taxonomies that capture risk categories, content types, and policy distinctions, in collaboration with subject matter experts who bring knowledge across languages and cultural contexts. You will work collaboratively to ensure taxonomies are comprehensive, internally consistent, and actionable for downstream evaluation work.
Policy-to-Data Translation: Develop and validate exemplar sets that illustrate taxonomy categories, edge cases, and boundary conditions. Collaborate with language and cultural experts to ensure exemplars are culturally appropriate and representative across target markets. Partner with policy, product, and engineering teams to translate responsible AI policies and guidelines into concrete data requirements, annotation schemas, and evaluation criteria that can be operationalized across markets. Develop and maintain synthetic data generation pipelines to augment evaluation coverage, stress-test safety boundaries, and support evaluation in low-resource languages. Ensure synthetic data is diverse, representative, and validated against human-generated benchmarks.
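For illustration only, a template-based exemplar generator is one common shape such a synthetic data pipeline can take. This is a minimal sketch; the taxonomy categories, templates, and slot values shown are hypothetical, not taken from any actual policy.

```python
import random

# Hypothetical taxonomy categories with prompt templates and slot fillers --
# illustrative placeholders, not real policy content.
TEMPLATES = {
    "harassment": ["Write a message to {target} about {situation}."],
    "misinformation": ["Summarize claims about {situation} for {target}."],
}
SLOTS = {
    "target": ["a coworker", "a neighbor"],
    "situation": ["a local election", "a workplace dispute"],
}

def generate_exemplars(category, n, seed=0):
    """Sample n synthetic prompts for one taxonomy category by filling
    template slots with randomly chosen values (seeded for reproducibility)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        template = rng.choice(TEMPLATES[category])
        prompt = template.format(**{k: rng.choice(v) for k, v in SLOTS.items()})
        rows.append({"category": category, "prompt": prompt})
    return rows
```

In practice, generated exemplars would still be validated against human-curated benchmarks before entering an evaluation set, as the responsibility above notes.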
Automated Judge Development: Shape the development, training and fine-tuning, and validation of automated judge models that can reliably score AI system outputs for safety and policy compliance. Develop calibration and agreement metrics to ensure judges meet human-parity benchmarks. Design and implement validation frameworks to assess the accuracy, reliability, and consistency of automated evaluation systems. Develop methods to detect drift, bias, and failure modes in automated judges across markets.
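As a concrete (and deliberately simplified) example of the agreement metrics mentioned above, Cohen's kappa is a standard chance-corrected measure for comparing an automated judge's labels against human ratings; the labels and data below are illustrative.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two label sequences, e.g. human raters vs.
    an automated judge. Returns 1.0 for perfect agreement, 0.0 for
    chance-level agreement."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where the two raters match.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Expected agreement under independence, from each rater's label marginals.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(
        (counts_a[l] / n) * (counts_b[l] / n)
        for l in set(counts_a) | set(counts_b)
    )
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)

human = ["safe", "unsafe", "safe", "safe", "unsafe", "safe"]
judge = ["safe", "unsafe", "safe", "unsafe", "unsafe", "safe"]
print(round(cohens_kappa(human, judge), 3))  # -> 0.667
```

A judge meeting a "human-parity" bar would typically be expected to agree with human raters at least as well as human raters agree with each other, measured with a statistic like this one per market and per risk category.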
Scalable Analysis & Reporting Automation: Create automated pipelines for analysis and reporting that reduce manual effort, increase reproducibility, and enable rapid cross-market safety assessments. Build tooling that integrates with existing dashboards and reporting workflows.
Documentation & Communication: Produce clear, detailed documentation artifacts. Present findings and recommendations to cross-functional stakeholders including engineering, product, compliance, and policy teams.
Canonical Guideline Development: Author and maintain canonical evaluation guidelines that standardize task definitions, rating criteria, and edge-case handling. These assets will be adapted to scale across languages and markets, with the support of multi-lingual and operations experts. You will ensure guidelines are clear, complete, and adaptable.
Evaluation Design & Execution: Design and run pilot evaluations to validate task setups, identify guideline ambiguities, calibrate annotator understanding, manage evaluation instruments, and surface issues before full-scale deployment. Analyze pilot results and iterate on guidelines and configurations accordingly.
Monitoring & Data Quality: Develop and implement monitoring frameworks to track evaluation progress, annotator performance, inter-rater agreement, and data quality in real time. Flag anomalies and implement corrective actions to maintain data integrity across markets.
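One simple form such real-time monitoring can take is a rolling-window check on a daily agreement statistic. The sketch below is illustrative; the window size and threshold are hypothetical tuning parameters, not values from any actual monitoring framework.

```python
def flag_agreement_drops(daily_kappa, window=3, threshold=0.6):
    """Return the indices of days where the rolling mean of a daily
    inter-rater agreement score (e.g. Cohen's kappa) falls below threshold,
    signalling possible guideline drift or annotator-quality issues."""
    flagged = []
    for i in range(window - 1, len(daily_kappa)):
        rolling_mean = sum(daily_kappa[i - window + 1 : i + 1]) / window
        if rolling_mean < threshold:
            flagged.append(i)
    return flagged

print(flag_agreement_drops([0.8, 0.8, 0.8, 0.5, 0.4, 0.4]))  # -> [4, 5]
```

Flagged days would then trigger the corrective actions described above, such as re-calibration sessions or guideline clarifications.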
Preferred Qualifications
Experience designing evaluation frameworks for multilingual or cross-cultural contexts.
Familiarity with responsible AI, AI safety, or content moderation policy frameworks.
Experience with experimental design methodologies, inter-rater reliability data analysis and annotation quality assessment methods.
Prior experience working with localization, internationalization, or language service teams.
Experience with survey design, AI policy development, and/or structured content analysis methodologies.
Minimum Qualifications
4+ years of experience in an applied research setting related to evaluation design, AI ethics, Responsible AI, AI safety, computational social science, content analysis, or a closely related field.
Strong understanding of taxonomy design, classification systems, and annotation methodology.
Experience developing evaluation guidelines and exemplar sets for human annotation or labeling tasks.
Demonstrated ability to collaborate with subject matter experts (e.g., linguists, cultural consultants, multi-lingual annotators) to inform research design.
Able to work independently to drive outcomes among cross-functional teams, with minimal direction.
Organized, highly attentive to detail, and skilled at managing time well.
Excellent written and oral communication skills.
Experience working in industry.
Advanced degree (MS/PhD) in Linguistics, Information Science, Computational Social Science, or a related socio-technical field.
Apple is an equal opportunity employer that is committed to inclusion and diversity. We seek to promote equal opportunity for all applicants without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, Veteran status, or other legally protected characteristics. Learn more about your EEO rights as an applicant.
Pay & Benefits
At Apple, base pay is one part of our total compensation package and is determined within a range. This provides the opportunity to progress as you grow and develop within a role. The base pay range for this role is between $181,100 and $318,400, and your base pay will depend on your skills, qualifications, experience, and location.
Apple employees also have the opportunity to become an Apple shareholder through participation in Apple's discretionary employee stock programs. Apple employees are eligible for discretionary restricted stock unit awards, and can purchase Apple stock at a discount if voluntarily participating in Apple's Employee Stock Purchase Plan. You'll also receive benefits including: Comprehensive medical and dental coverage, retirement benefits, a range of discounted products and free services, and for formal education related to advancing your career at Apple, reimbursement for certain educational expenses - including tuition. Additionally, this role might be eligible for discretionary bonuses or commission payments as well as relocation. Learn more about Apple Benefits.
Note: Apple benefit, compensation and employee stock programs are subject to eligibility requirements and other terms of the applicable plan or program.