Speech Processing Intern - Multi-modal emotion

Affectiva is an MIT Media Lab spin-off focused on understanding human emotion. Our vision is that technology needs the ability to sense, adapt and respond to not just commands but also non-verbal signals. We are building artificial emotional intelligence (Emotion AI).

As you can imagine, such an ambitious vision takes a great team with a strong desire to explore and innovate. We are growing our team to improve and expand our core technologies and help solve many unique and interesting problems focused around sensing, understanding and adapting to human emotion.

Our first technology measures human emotion through sensing and analyzing facial expressions. This technology is already being used commercially in a number of different verticals and use cases, and has been released to the public in the form of SDKs so that developers around the world can begin to use it to create a new breed of apps, websites and experiences. Currently, we are extending our emotion sensing technology beyond the face to analyze human speech. Our goal is to build out our technology to perform emotion sensing unimodally from speech, as well as multi-modally from speech and facial expressions when both channels are present.

This position is on the Science team, the team tasked with creating and refining Affectiva’s technology. We are a group of individuals with backgrounds in machine learning, computer vision, speech processing and affective computing.

We’re looking for a summer intern to work on multi-modal emotion estimation that leverages both face and speech features to build robust and easily generalizable classifiers. The candidate will work closely with members of the Science team to implement classic strategies (such as decision-level fusion) in these areas, as well as explore novel feature-level fusion strategies.


Responsibilities

  • Implement unimodal face and speech classifiers for selected emotional states from publicly available data sources.
  • Implement decision-level fusion classification to make stronger emotional inferences than a unimodal (face or speech) emotion classifier.
  • Explore feature-level fusion methodologies and implement a subset of the viable feature-level fusion classification approaches.
  • Compare the performance of the feature-level fusion classifiers to that of the unimodal classifiers and the decision-level fusion classifiers, and evaluate their technical feasibility.
  • Clearly communicate your implementations, experiments, and conclusions.


Qualifications

  • Pursuing a graduate degree (MS or PhD) in Electrical Engineering or Computer Science, with a specialization in speech processing or computer vision.
  • Hands-on experience with multi-modal classification of any of the following: speaker emotion, speaker state (e.g., cognitive load), speaker traits (gender, age, personality, pathology, etc.), communicational signals (entrainment, rapport, power structure, etc.).
  • Strong publication record in journals/proceedings such as ICASSP, NIPS, PAMI, InterSpeech.
  • Expertise in Python or C/C++.
  • Experience working with deep learning models (RNN, LSTM, CNN) is a plus.

Meet Some of Affectiva's Employees

Abdelrahman M.

SDK Technical Lead

Abdelrahman builds machines that sense emotions and expressions for Affectiva’s software. He also helps create SDKs so any developer can integrate emotion recognition into their applications.

Brett R.

HR Manager

Brett dabbles in all areas of administrative management and personnel to keep the current team supported while she hires new employees to add to the mix.
