Posted: Aug 25, 2020
Role Number: 200188037
Siri's universal search engine powers search features across a variety of Apple products, including Siri Assistant, Spotlight, Safari, Messages, and News. The Siri Data organization seeks to improve Siri by using data as the voice of our customers. Within this organization, the Search Data Engineering team builds systems that process data reliably at scale, generating high-quality datasets that support confident, data-driven decision making for Siri Search. We're looking for exceptional data engineers who are passionate about our product and values; who love working with data at scale; and who are committed to the hard work necessary to continuously improve. As part of this group, you will work with petabytes of data daily using diverse technologies such as Spark, Flink, Kafka, and Hadoop. You will be expected to partner effectively with upstream engineering teams and downstream consumers, including analysts and product engineers. In this role you will build datasets to support analytics, experimentation, and machine learning. Specifically, you will build out stream-processing applications powering real-time metrics, and you will help drive our self-serve reporting strategy on behalf of data scientists and product engineers as we collectively make Siri better.
- You have excellent written and verbal communication skills
- You are curious and have excellent analytical and problem-solving skills
- You are excited about digging into massive petabyte-scale semi-structured datasets
- 1+ years of industry experience working with distributed data technologies (e.g. Hadoop, MapReduce, Spark, etc.)
- Proficiency in at least one high-level programming language (Python, Go, Java, Scala, or equivalent)
- Experience with large, complex, highly dimensional data sets; hands-on experience with SQL
- You are pragmatic, not letting "the perfect" be the enemy of "the good"
- You are self-directed and capable of operating amidst ambiguity
- You are humble, continually growing in self-awareness, and possessing a growth mindset
Extras we'd be excited about:
- Experience building stream-processing applications using Apache Flink, Spark-Streaming, Apache Storm, Kafka Streams or others
- Experience with data engineering in support of ML: anomaly detection in time-series data, engineering work to productionize models developed by data scientists, etc.
- Developing data pipelines and/or software libraries to process, transform, and analyze data to identify signals from the billions of events we collect every day
- Designing and building abstractions that hide the complexity of the underlying big data stack (HDFS, Hadoop, Hive, Impala, Spark, Kafka, Parquet, etc.) and allow partners to focus on their strengths: product, data modeling, data analysis, search, information retrieval, and machine learning
- Defining and implementing the "source of truth" for our most fundamental data, such as search activity and content, as well as our core metrics across a variety of products
- Optimizing end-to-end workflows of data users (crafting libraries, providing abstractions to define jobs, scheduling data pipelines, managing access to datasets, etc.)
- Building internal services and tools to help in-house partners implement, deploy, and analyze datasets with a high level of autonomy and limited friction
- Surfacing datasets in near real time to mission-critical products and business applications throughout the company, providing the signal that feeds our machine learning algorithms as well as our daily product-defining decisions
- Automating and handling the lifecycle of datasets (schema evolution, metadata store, backfill management, deprecation, migration)
- Improving the quality and reliability of our pipelines (monitoring, retries, failure detection)
Education & Experience
Surprise us! Many will have an MS or BS in CS, Engineering, Math, Statistics, or a related field or equivalent practical experience in data engineering.