The Big Question: How can we affordably integrate knowledge of emotions and context into technology to provide a more comprehensive translating experience?

Our solution involves developing an alternative to current AI algorithms by incorporating more “human” information into the training data. This way, the algorithm is more sensitive to emotion, context, and cultural differences during translation. Bilingual people commonly struggle to describe the same experience in both of their languages, and they are often forced to present it differently in each because word meanings in the two languages do not match (Wierzbicka, 2009).

First thing to consider:
How do humans understand meaning?

Classical abstract symbol theories suggest that meaning arises from a syntactic combination of abstract, amodal (non-perceptual) symbols that have arbitrary relations to real-world entities. Collins and Loftus (1975) suggest that meaning is built in a nodal network, where each word is represented by a node whose connections lead to semantically related words or concepts. For example, the word “lion” might link to words like “large cat”, “Africa”, “mane”, “roar”, and so on.

To our knowledge, AI may take a similar approach to modeling language, treating words as related when they have a high likelihood of occurring together.
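As a toy illustration of this nodal-network idea (a sketch of our own, not a description of any specific system), a small spreading-activation graph for “lion” might look like this:

```python
# Toy spreading-activation network, loosely in the spirit of Collins & Loftus (1975).
# The words and link strengths below are invented for illustration.
from collections import defaultdict

class SemanticNetwork:
    def __init__(self):
        # Each node maps to its semantically related neighbors and link strengths.
        self.links = defaultdict(dict)

    def relate(self, a, b, strength):
        self.links[a][b] = strength
        self.links[b][a] = strength

    def activate(self, word, depth=2, decay=0.5):
        """Spread activation outward from `word`, weakening with each hop."""
        activation = {word: 1.0}
        frontier = [word]
        for _ in range(depth):
            next_frontier = []
            for node in frontier:
                for neighbor, strength in self.links[node].items():
                    spread = activation[node] * strength * decay
                    if spread > activation.get(neighbor, 0.0):
                        activation[neighbor] = spread
                        next_frontier.append(neighbor)
            frontier = next_frontier
        return activation

net = SemanticNetwork()
net.relate("lion", "large cat", 0.9)
net.relate("lion", "Africa", 0.7)
net.relate("lion", "mane", 0.8)
net.relate("lion", "roar", 0.8)
net.relate("large cat", "tiger", 0.9)

print(net.activate("lion"))  # "tiger" receives weaker, second-hop activation
```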

From a different perspective, however, word meanings could also be grounded in perception, action, and context. The assumption here is that various mental processes, including those related to language, are based on physical or imagined interactions individuals have had with their environment. Thus, in interpreting a sentence, each word is indexed to a set of perceptual symbols, and all of this information is combined to produce a simulation of what the sentence is thought to describe (Vigliocco, 2009). Barsalou et al. (2008) likewise proposed the Language and Situated Simulation (LASS) framework, in which both linguistic and situated information contribute crucially to semantics and overall conceptual representation.

We believe this is an interesting framework to consider, as we anticipate that translating semantic information from one language to another will be more accurate when emotional context is included.

The work of Vigliocco et al. (2009) is an example of this idea coming to life. To test semantic representation, they trained computer models with experiential data, language data, or a combination of the two. The combined model outperformed the others, providing evidence that semantics are grounded in both mechanisms rather than one or the other. Because modern translation tools do not include experiential input, their accuracy is correspondingly limited.
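To make that intuition concrete, here is a minimal sketch of a “combined” representation: a word is described both by distributional (language) features and by experiential features, and the two are joined into one vector. The feature values are invented, and this is not the actual model used by Vigliocco et al. (2009).

```python
import numpy as np

def combined_representation(distributional_vec, experiential_vec,
                            w_language=0.5, w_experience=0.5):
    """Join language-based and experience-based features into one semantic vector."""
    return np.concatenate([w_language * np.asarray(distributional_vec, dtype=float),
                           w_experience * np.asarray(experiential_vec, dtype=float)])

def similarity(a, b):
    """Cosine similarity between two combined representations."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical feature values: co-occurrence statistics on the left,
# experiential ratings (e.g. valence, arousal, imageability) on the right.
joy   = combined_representation([0.8, 0.1, 0.3], [0.9, 0.6, 0.7])
grief = combined_representation([0.7, 0.2, 0.4], [0.1, 0.5, 0.6])

print(similarity(joy, grief))  # the experiential half pulls the two words apart
```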

In addition to context, emotion also plays an important role in semantic processing, especially when a novel word form is presented and does not yet have any conditioned associations (Keuper et al., 2014). These researchers found that subcortical systems responsible for processing perceived emotion (i.e., the limbic system) are active when the emotional valence of words is processed. Thus, there is likely an overlap between language and emotion that is important to consider.

Moving forward:
How can we make use of this knowledge?

With all this in mind, we propose training the AI with emotional information gathered, for example, through facial expression analysis. Current technologies, such as those used by the marketing company Affectiva, use front-facing cameras on users’ devices to measure facial expressions of emotion.

Via Affectiva.com: “Computer vision algorithms identify key landmarks on the face – for example, the corners of your eyebrows, the tip of your nose, the corners of your mouth. Deep learning algorithms then analyze pixels in those regions to classify facial expressions. Combinations of these facial expressions are then mapped to emotions.”
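To make the quoted pipeline concrete, a rough sketch of its three stages might look like the following. The helper functions here are simplified placeholders of our own, not Affectiva’s actual API.

```python
# Sketch of the landmark -> expression -> emotion pipeline described above.

def detect_landmarks(frame):
    """Placeholder for a computer-vision landmark detector (eyebrow corners,
    nose tip, mouth corners, ...)."""
    return {"brow_corners": [(110, 82), (190, 82)],
            "nose_tip": (150, 140),
            "mouth_corners": [(128, 182), (172, 182)]}

def classify_expressions(frame, landmarks):
    """Placeholder for a deep-learning classifier that analyzes the pixels
    around each landmark and returns the facial expressions it detects."""
    return frozenset({"brow_furrow", "lip_corner_depressor"})

# Simplified mapping from combinations of facial expressions to emotion labels.
EXPRESSION_TO_EMOTION = {
    frozenset({"brow_furrow", "lip_corner_depressor"}): "sadness",
    frozenset({"nose_wrinkle", "upper_lip_raise"}): "disgust",
    frozenset({"brow_raise", "jaw_drop"}): "surprise",
}

def estimate_emotion(frame):
    landmarks = detect_landmarks(frame)                        # 1. find facial landmarks
    expressions = classify_expressions(frame, landmarks)       # 2. classify expressions
    return EXPRESSION_TO_EMOTION.get(expressions, "neutral")   # 3. map to an emotion

print(estimate_emotion(frame=None))  # -> "sadness" with the placeholder classifier
```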

Currently, the seven universal emotions proposed by Paul Ekman (anger, contempt, disgust, fear, joy, sadness, and surprise) are most easily identified by attending to cues such as brow furrowing, depression of the lip corners, and wrinkling of the nose. We propose that as a person inputs the communication to be translated, the device simultaneously tracks that person’s facial expression and tone of voice, so as to incorporate emotional and situational data into the translation.

The advice we received from experts, who stated that achieving perfect, purely computational translation would be nearly impossible at this time, combined with the research cited above, is what compelled us to bring human input into the translation process.

The crowdsourcing platform Amazon Mechanical Turk shares several aspects with our proposed solution. On it, data is compiled from the results of human intelligence tasks, small jobs that humans can still perform better than computers. These tasks are posted to the website and, for small compensation, users complete the ones of their choosing. Each day, thousands of submissions are received and factored into algorithms that help improve computer performance.

We hope to implement a similar computer-human feedback cycle to aid our translation algorithm. Participants would complete tasks such as matching visual depictions of emotion to a textual counterpart, or listening to voice recordings and describing the emotion they heard. For this model, we would mainly be concerned with identifying shifts in pleasantness (e.g. positive to negative) and energy (e.g. moderately agitated to very agitated) rather than distinguishing between similar emotions (see diagram below). Over time, the algorithm would become increasingly better at identifying emotional cues and predicting their implications, providing a more comprehensive translating experience.

Table showing how emotions can be graded on a dimensional scale, varying in pleasantness and energy. Our algorithm would make less use of small shifts, such as from peaceful to comfortable, and more use of larger ones, such as from calm to sad or from sad to worried.
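As a minimal sketch of how this dimensional tracking could work (the coordinates below are made up for illustration, not measured values):

```python
# Place each detected emotion on a (pleasantness, energy) plane and flag large
# shifts, rather than trying to distinguish between similar emotions.
EMOTION_COORDS = {
    "peaceful": (0.7, 0.1),
    "calm":     (0.6, 0.2),
    "sad":      (-0.6, 0.2),
    "worried":  (-0.5, 0.6),
    "angry":    (-0.8, 0.9),
    "joyful":   (0.8, 0.8),
}

def emotional_shift(previous, current):
    """Change in pleasantness and energy between two detected emotions."""
    p0, e0 = EMOTION_COORDS[previous]
    p1, e1 = EMOTION_COORDS[current]
    return p1 - p0, e1 - e0

def is_significant(shift, threshold=0.5):
    """Only large moves on either dimension would influence the translation."""
    d_pleasantness, d_energy = shift
    return abs(d_pleasantness) >= threshold or abs(d_energy) >= threshold

print(is_significant(emotional_shift("peaceful", "calm")))  # False: minor shift
print(is_significant(emotional_shift("calm", "sad")))       # True: pleasantness drops sharply
```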

Making progress:
Acknowledging the Limitations

With crowdsourcing comes vulnerability to deliberately incorrect or random input from users. We could counter this by incorporating a small team of moderators to review translations and flag suspicious input. Though this would probably be a relatively costly and daunting task, we would not expect to rely on it indefinitely: as more information flows in and the algorithm becomes more accurate, these erroneous inputs become less influential.
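One way to keep this moderation workload manageable (a sketch under our own assumptions, with hypothetical labels) is to collect several crowd labels per item and only send low-agreement items to the moderators:

```python
from collections import Counter

def aggregate_labels(labels, agreement_threshold=0.7):
    """Return (majority_label, needs_review) for one crowdsourced item."""
    counts = Counter(labels)
    label, votes = counts.most_common(1)[0]
    agreement = votes / len(labels)
    return label, agreement < agreement_threshold

# High agreement: accept automatically. Low agreement: flag for a moderator.
print(aggregate_labels(["sad", "sad", "sad", "angry", "sad"]))      # ('sad', False)
print(aggregate_labels(["sad", "joyful", "angry", "sad", "calm"]))  # ('sad', True)
```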

Furthermore, it is also important to consider that in the medical and legal contexts we anticipate, participants are subject to emotional reactions independent of the language barrier itself. For example, someone who receives difficult medical news would probably show a negative change in emotion, which would be picked up by our translation device and could potentially trigger an alternate output. For this reason, it is very important that the users of the application also contribute to the feedback cycle that trains the AI. We show how we implement this in the description of the prototype below.


Application Overview

We expect our app to rely primarily on vocal translation, similar to voice-activated devices such as Amazon Echo and Google Home. However, instead of having to continuously prompt the device for a translation (e.g. "Alexa, how do you say [x] in [language]? ... Alexa, how do you say [y] in [language]? ..." and so on), the application will continuously parse streams of sound.

Launching the App: When the app is first launched (Home Screen and Language Select in the above diagram), the users select the languages that will be spoken and have the option to import baseline emotional data (e.g. taking a picture to have their facial expression analyzed). Once all of the necessary information is taken care of, a user prompts the app to begin listening (Begin).

Speech Input: The app begins listening for a stream of speech (Listening). Once the speaker stops talking, the app processes the input (Working).

Translation: (Ongoing Output) After the input speech has been processed, the application produces audio and visual output translations for the other user. Based on emotional context and our AI algorithm for developing translations, the theorized best output will be displayed. Signal Words, meaning words the AI would be trained to recognize as either a) subject to vastly different interpretations or b) complex or highly technical in meaning and requiring further clarification, will be highlighted in the text so that users can explore different translations. (Final Output) Once the audio has finished playing, the user has the option to click on different words for further clarification. (Clarification) If a word is selected, its definition will be displayed, along with other words or phrases that were considered for the translation, ranked by how well they match the given context. The user will also be able to select one of the alternate translations and replace the originally selected word, producing a more customized translation.
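To illustrate the clarification step, here is a hypothetical data structure for a highlighted Signal Word, its candidate translations ranked by how well they fit the detected context, and the user's ability to swap in an alternate. The names and scores are our own placeholders, not a finished design.

```python
from dataclasses import dataclass, field

@dataclass
class Candidate:
    text: str
    definition: str
    context_score: float  # higher = better fit to the detected emotional/situational context

@dataclass
class SignalWord:
    original: str
    chosen: Candidate
    alternates: list = field(default_factory=list)

    def ranked_alternates(self):
        """Alternate translations, ranked by how well they match the given context."""
        return sorted(self.alternates, key=lambda c: c.context_score, reverse=True)

    def replace_with(self, candidate):
        """User selects an alternate, producing a more customized translation."""
        self.alternates.append(self.chosen)
        self.chosen = candidate

# Hypothetical example in a medical setting:
word = SignalWord(
    original="discharge",
    chosen=Candidate("alta médica", "release from hospital care", 0.82),
    alternates=[Candidate("descarga", "flow or release of a substance", 0.41)],
)
print([c.text for c in word.ranked_alternates()])
```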

From here, the app can be prompted to begin listening again, and this process will loop until the conversation is over. Once there is a marked end to the conversation, the user will be able to see a transcript of the conversation in both languages.
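Putting the screens together, the conversation flow can be sketched as a simple loop. The comments follow the state labels in the diagram, and the helper callables are hypothetical stand-ins for real components.

```python
def run_conversation(listen, translate, speak, conversation_over):
    """Listening -> Working -> Output, repeated until a marked end to the conversation."""
    transcript = []
    while not conversation_over():
        speech = listen()                          # Listening: capture a stream of speech
        translation = translate(speech)            # Working: AI translation with emotional context
        speak(translation)                         # Ongoing Output: audio and on-screen text
        transcript.append((speech, translation))
    return transcript                              # shown in both languages at the end

# Hypothetical usage with stand-in components:
remaining = ["Hola, ¿cómo estás?", "Me duele la cabeza."]
transcript = run_conversation(
    listen=lambda: remaining.pop(0),
    translate=lambda s: f"[translated] {s}",       # stand-in for the real translation model
    speak=print,
    conversation_over=lambda: not remaining,       # a real app would detect the marked end
)
print(transcript)
```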

Hopefully, this gives you a better idea of our solution, and we look forward to hearing your feedback!


Resources:

Wierzbicka, A. (2009). Language and metalanguage: Key issues in emotion research. Emotion Review, 1(1), 3–14.

Collins, A. M., & Loftus, E. F. (1975). A spreading-activation theory of semantic processing. Psychological Review, 82, 407–428.

Barsalou, L. W., Santos, A., Simmons, W. K., & Wilson, C. D. (2008). Language and simulation in conceptual processing. Symbols, Embodiment, and Meaning, 245–283.

Vigliocco, G., Meteyard, L., Andrews, M., & Kousta, S. (2009). Toward a theory of semantic representation. Language and Cognition, 1(2), 219–247.

Keuper, K., Zwanzger, P., Nordt, M., Eden, A., Laeger, I., Zwitserlood, P., ... & Dobel, C. (2014). How ‘love’ and ‘hate’ differ from ‘sleep’: Using combined electro/magnetoencephalographic data to reveal the sources of early cortical responses to emotional words. Human Brain Mapping, 35(3), 875–888.