Our task this week was to get in touch with an expert regarding our problem and solution. As a reminder, the problem we wanted to solve was:
How can we better convey information from a target language in hopes of modulating the tension between translation and interpretation?
The solution we decided to pursue after our peers’ feedback was a speech translator that incorporates prosodic elements of speech (i.e., rhythm, loudness, stress, speed, pitch, and intonation) in order to clarify communication.
We were interested in reaching out to Dr. Duane Watson, a professor of psychology and human development at Vanderbilt University, because his work mainly focuses on how gesture, pitch, rhythm, and emphasis in speech are used in communication. We were able to connect via email; he expressed an interest in our project and even asked us to send him a follow-up email with our questions! Unfortunately, he was not able to respond to our questions before the deadline to submit this post, since his own graduate students were entering a similar revision stage for their semester projects. In the meantime, we ideated on potential problems with our idea and came up with the following four:
- Difficulty in recording prosody (i.e., making the computer understand it): An article in the MIT Technology Review shed some light on why computers have difficulty recognizing prosody in speech. For the most part, computer scientists focused on having the computer recognize words before even thinking about conveying emotional intent. Word recognition is much easier for computers than extracting meaning from the basic components of sound waves that describe prosodic elements of speech, such as the duration of the pauses between words and sentences, changes in pitch within words and sentences, and amplitude changes that indicate emphasis. Moreover, words are a linear series of phonetic events, whereas prosodic features occur across words and sentences. Another hindrance is that different kinds of prosodic patterns can overlap with one another.
- Lack of data on how prosody “translates” from one language to another: Because prosody research relies heavily on machine learning, it depends on labeled data. Data labeling refers to the process of identifying raw data and adding meaningful and informative labels that give the machine learning model context to learn from. The main issue is that labeling data is quite expensive, which limits its availability.
- Applicable Technology: Although there is research explaining the importance of prosody analysis, the adoption of prosodic analysis technologies has faced challenges due to high prices and limited availability. In other words, although many prosodic analysis technologies, such as openSMILE or AuToBI, are effective at extracting acoustic features, they are not yet compatible with multiple programming languages.
- Domain issues (value of prosody in speech recognition): Due to the lack of experimental technologies regarding prosody, it is difficult to determine how reliable translations are when they are produced under challenging acoustic conditions, such as noisy environments. Furthermore, incorporating prosodic features into translators seems like a bit of a leap from their current capabilities. That is, translators cannot yet fully recognize speech or transcribe it into punctuated text.
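To make the first challenge a little more concrete, here is a minimal sketch in Python of how two of the acoustic features mentioned above, amplitude (loudness) and pitch, could be measured from a raw waveform. It is only an illustration under our own assumptions: the signal is two synthetic sine tones standing in for a quiet, low-pitched syllable and a loud, high-pitched (stressed) one, the sample rate and frame length are arbitrary, and all function names are ours. It is not how openSMILE or AuToBI work, and real speech is far messier than a pure tone.

```python
import math

SAMPLE_RATE = 8000  # Hz; an assumed value chosen for this illustration


def synth_tone(freq_hz, amplitude, duration_s, sample_rate=SAMPLE_RATE):
    """Generate a pure sine tone standing in for one spoken syllable."""
    n = int(round(duration_s * sample_rate))
    return [amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
            for i in range(n)]


def rms(frame):
    """Root-mean-square amplitude, a rough proxy for loudness."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def estimate_pitch(frame, sample_rate=SAMPLE_RATE, fmin=75.0, fmax=400.0):
    """Estimate pitch by finding the lag at which the frame best matches itself.

    Uses plain (unnormalized) autocorrelation, searched only over lags that
    correspond to frequencies between fmin and fmax.
    """
    min_lag = int(sample_rate / fmax)
    max_lag = min(int(sample_rate / fmin), len(frame) - 1)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(frame[i] * frame[i + lag] for i in range(len(frame) - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return sample_rate / best_lag


def prosody_features(signal, frame_len=400):
    """Split the signal into frames and return (loudness, pitch) per frame."""
    features = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        features.append((rms(frame), estimate_pitch(frame)))
    return features


# A quiet, low syllable followed by a loud, high (stressed) syllable.
signal = synth_tone(100, 0.3, 0.1) + synth_tone(200, 0.9, 0.1)
for loudness, pitch in prosody_features(signal):
    print(f"loudness={loudness:.2f}  pitch={pitch:.0f} Hz")
```

Even in this toy case, the features only exist per frame, not per word. Deciding which word a rise in pitch or loudness belongs to, and what it means, is the part that remains hard.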
Rosenberg, A. (2018). Speech, Prosody, and Machines: Nine Challenges for Prosody Research. 9th International Conference on Speech Prosody 2018. doi:10.21437/speechprosody.2018-159
Talbot, D. (2020, April 02). Prosody. Retrieved from https://www.technologyreview.com/2002/07/01/40873/prosody/