Our question: How can we affordably integrate knowledge of emotions and context into technology to provide a more comprehensive translating experience? We hope to be able to address specific situations such as the legal and medical arenas, which can involve people in highly charged emotional states and cultural contexts, yet where accurate communication is paramount.

Our solution: incorporating physiological, emotional, and contextual states alongside the spoken or typed input to help the AI build a more comprehensive translation, or to rank candidate translations. The idea involves collecting multi-modal data such as facial expression and tone of voice to gauge emotional state. Using these data, we hope to create a context around the translation input and output on which to train the AI. We also wish to work with existing technology where possible, improving current translation apps, or at most creating a new device, as we feel this will keep translation accessible and affordable.
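To make this more concrete, here is a minimal sketch of the re-ranking half of the idea, assuming an existing translation model that already returns several candidate translations with scores (an assumption on our part, not a description of any particular system). The `emotion_fit` scoring and the weights are hypothetical placeholders; the point is only that the detected emotional state acts as a second signal next to the usual translation score.

```python
# A minimal sketch of re-ranking candidate translations with an emotion signal.
# All feature names and weights are hypothetical placeholders, not a real system.

from dataclasses import dataclass

@dataclass
class Candidate:
    text: str           # candidate translation
    mt_score: float     # score from the underlying translation model

def emotion_fit(candidate: Candidate, emotion: str) -> float:
    """Toy compatibility score between a candidate and a detected emotion.

    A real system would learn this from data; here we only illustrate
    that the emotional state contributes a second signal.
    """
    charged_words = {"urgent", "sorry", "please"}
    overlap = sum(w in candidate.text.lower() for w in charged_words)
    return float(overlap) if emotion in {"fear", "sadness", "anger"} else 0.0

def rerank(candidates, emotion, alpha=0.7):
    """Combine the MT score with the emotion-fit score and re-rank."""
    return sorted(
        candidates,
        key=lambda c: alpha * c.mt_score + (1 - alpha) * emotion_fit(c, emotion),
        reverse=True,
    )

candidates = [
    Candidate("I need help.", mt_score=0.9),
    Candidate("Please, I urgently need help.", mt_score=0.85),
]
print(rerank(candidates, emotion="fear")[0].text)
```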

Since none of us is fluent in how current translation AI and its algorithms work, we wanted to consult Computer Science professors to discuss the gaps in current translation technology. A few of the questions we asked were:

1. One of the ways we are considering incorporating this emotional and physiological information is by using facial recognition/voice inflection software (such as that adopted by some marketing companies) to detect moods, and then using those moods to predict which translation is most likely intended. Is emotion-detecting software at that level of sophistication, and are translation AIs actually able to integrate different types of information to produce a translation outcome?

2. Since we are not very well versed in the AI arena, what challenges has translation technology been facing in producing accurate translations?

The professors essentially reaffirmed our suspicion that current AI technologies are inadequate for providing accurate translations, saying that “current translation technologies are purely statistical and they are built on looking at a lot of text, basically text that is aligned between two languages, and it makes use of observed regularities in this alignment, over a lot of data. Therefore, it captures ‘simple’ and common regularities and isn’t capable of the subtleties that are needed in the scenarios that you describe. Perhaps these scenarios will require that we develop approaches that involve a human in the loop.”

Therefore, we are even more convinced that providing more “human”-like data is the way to go in refining translation technology.

We will continue correspondence with the professors and perhaps even contact others in the development of our idea!

Nevertheless, some other issues that we anticipate are:

  1. How might we collect emotional and contextual data?

Facial recognition and tone-of-voice software: these are generally based on Paul Ekman’s six universal expressions (happiness, surprise, sadness, fear, disgust, anger), which are thought to be recognisable across cultures and contexts. Unfortunately, there are several limitations here, namely that people’s emotions are not consistently shown on the face, and that we can experience a myriad of emotions at once that may be difficult to detect (1).

Nevertheless, we will continue looking into this, especially with regard to the company Affectiva, which measures facial expressions to design advertisements and is said to incorporate “culturally specific benchmarks” for facial movement. The company currently trains its systems on more than 8 million faces from 87 countries, suggesting that it may have a large repertoire of culturally and emotionally specific data.

We are also considering incorporating physiological signals, such as sweat or heart-rate levels, perhaps by building Fitbit-style or smart-tracking technology into the contextual understanding of a word or phrase. However, this might mean that either the smartphone becomes more advanced in its measurement capabilities or we design an entirely separate device. We would also need to do more research on whether words can trigger physiological reactions.
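As a rough sketch, under our own assumptions, the facial-expression probabilities and physiological readings could be fused into a single “context vector” attached to each utterance. The emotion categories below follow Ekman’s six; the normalisation constants are made-up placeholders, not calibrated values.

```python
# A rough sketch of fusing multi-modal signals (facial-expression probabilities,
# heart rate, skin conductance) into one context vector. Constants are placeholders.

EKMAN = ["happiness", "surprise", "sadness", "fear", "disgust", "anger"]

def context_vector(face_probs: dict, heart_rate_bpm: float, skin_conductance: float) -> list:
    """Concatenate emotion probabilities with roughly normalised physiology."""
    emotions = [face_probs.get(e, 0.0) for e in EKMAN]
    # Crude normalisation: resting heart rate ~60-100 bpm; conductance in microsiemens.
    physiology = [
        min(max((heart_rate_bpm - 60.0) / 60.0, 0.0), 1.0),
        min(skin_conductance / 20.0, 1.0),
    ]
    return emotions + physiology

# Example: an anxious speaker in a medical consultation.
vec = context_vector({"fear": 0.6, "sadness": 0.3}, heart_rate_bpm=110, skin_conductance=12.0)
print(vec)  # eight numbers a translation model could take as extra input
```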

Another question is: What would it actually mean to build a “context” around the word-to-be-translated, in terms of training an AI?

One way to approach it would be to involve the participation of many people around the world, who could provide the various ways that a particular word is used. Google Translate currently has a similar method, crowdsourcing bilingual speakers from all over the world to translate (and validate) translations across various languages. However, one thing we noticed is that these contributions still look like random phrases; they do not really capture the context of the situation, but merely the word in isolation. That is why we thought that having concurrent data inputs like facial expression and tone of voice at the time of speaking could help to elucidate the particular meaning of a word.
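To illustrate what we mean, here is a sketch of what one crowdsourced training example might look like if the concurrent signals were recorded alongside the utterance. The field names are our own invention; the point is simply that each phrase would carry its situation with it rather than being stored in isolation.

```python
# A hypothetical schema for a context-rich training example. Field names are
# illustrative only, not an existing dataset format.

from dataclasses import dataclass, field
from typing import List

@dataclass
class TranslationExample:
    source_text: str                   # what was said or typed
    target_text: str                   # the validated human translation
    source_lang: str
    target_lang: str
    setting: str                       # e.g. "medical", "legal", "casual"
    detected_emotion: str              # from face/voice software
    context_vector: List[float] = field(default_factory=list)  # fused signals

example = TranslationExample(
    source_text="Me duele mucho",
    target_text="It hurts a lot",
    source_lang="es",
    target_lang="en",
    setting="medical",
    detected_emotion="fear",
    context_vector=[0.0, 0.0, 0.3, 0.6, 0.0, 0.0, 0.83, 0.6],
)
```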

Another question: it does seem that our idea simply refines the current method of translation by altering the training set for the AI. Is there any way to incorporate a more “human-like” way of translating?

We are reading studies on how AI currently does translation, and the diagram below is a good summary of what happens.

(Diagram taken from: Waibel, A., & Fugen, C. (2008). Spoken language translation. IEEE Signal Processing Magazine, 25(3), 70-79.)
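As we understand it, the flow is a cascade: speech recognition, then machine translation, then speech synthesis. The sketch below is our reading of that pipeline (the functions are stubs, not the paper’s code), with comments marking where we imagine the emotion and context signals could be injected.

```python
# A simplified cascade like the one in the diagram: ASR -> MT -> TTS.
# Stub functions only; comments mark where context signals might plug in.

def recognise_speech(audio) -> str:
    """Speech -> text (ASR). Tone of voice could also be extracted at this stage."""
    ...

def translate(text: str, context_vector=None) -> str:
    """Text -> text (MT). The context vector could bias word choice here."""
    ...

def synthesise_speech(text: str, emotion: str = "neutral"):
    """Text -> speech (TTS). The detected emotion could shape prosody here."""
    ...

def spoken_language_translation(audio, context_vector, emotion):
    text = recognise_speech(audio)
    translated = translate(text, context_vector=context_vector)
    return synthesise_speech(translated, emotion=emotion)
```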

We will consider at which points in this flow we might inject more human aspects of translation into the process. For example, we could incorporate how babies learn semantics (such as via association between a noun and its meaning), as well as how preceding verbs can be used to predict subsequent nouns. Professor Schuler also suggested that parents speak to their children in a special way, simplifying word classes (e.g. dog → doggie), and that we might apply child-directed speech to teaching AI. For example, one might train on subsets of reduced forms to help the system learn the entire paradigm, and then slowly add exceptions.
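A toy sketch of that “child-directed” training order might look like the following, where `train_on` stands in for whatever training step the actual model would use and the word lists are illustrative only.

```python
# A toy curriculum: learn a small, regular subset first, then add exceptions.
# `train_on` is a placeholder for a real training step.

regular_forms = [("dog", "doggie"), ("cat", "kitty"), ("walk", "walked")]
exceptions = [("go", "went"), ("child", "children")]

def train_on(model_state, examples):
    """Placeholder for one pass of training on a batch of examples."""
    return model_state + list(examples)

model_state = []
# Stage 1: learn the regular paradigm from simplified/reduced forms.
model_state = train_on(model_state, regular_forms)
# Stage 2: slowly introduce exceptions once the regular pattern is in place.
for exception in exceptions:
    model_state = train_on(model_state, [exception])
```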

Clearly, we still have a way to go to refine the solution, but we are excited to see where it goes!

Resources:

  1. Barrett, L. F., Adolphs, R., Marsella, S., Martinez, A. M., & Pollak, S. D. (2019). Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological Science in the Public Interest, 20(1), 1-68.