The big picture: U of T experts develop storytelling tool for family photos
As the saying goes, a picture is worth a thousand words. A tool under development in the University of Toronto’s Technologies for Aging Gracefully lab is synchronizing memories by attaching audio stories to digital family photos.
“Pictures are one of the best ways to bring memories to mind,” says Benett Axtell, a PhD student in the department of computer science. “As older adults move into smaller living spaces, having a digital way to browse through photos is really important, especially if they’re separated from their children or their grandchildren. It also means that staff working in assisted living homes can interact with older adults through this tool to help them share their memories.”
The tabletop app allows users to swipe and select pictures spread across their tablet while they talk about the memories associated with those pictures. The tool records the audio, attaches it to the grouping of photos, and uses natural language processing to cluster the photos.
Axtell and the lab’s co-director Cosmin Munteanu will demo the current prototype, Frame of Mind: Using Storytelling for Speech-Based Clustering of Family Pictures, at the Association for Computing Machinery’s international conference on intelligent user interfaces, being held this week in Tokyo. Munteanu is an assistant professor at U of T Mississauga’s Institute for Communication, Culture, Information and Technology, the Faculty of Information and department of computer science.
Axtell says there’s currently no digital replacement for how people interact with albums or boxes of printed photos. The digital tool’s tabletop design of pictures spread across a table, or in this case, tablet, was inspired by visits to homes of older adults in Toronto. Participants were asked to show Axtell their family pictures, whether in print or in digital format, and to share memories from their photo collection.
“We have this idea that our memories are stored like our photos. We get a roll of film and then we put them in the album: This happened, this happened – turn the page – this happened,” says Axtell. “That’s not how our memories work. And that’s not how people go through [pictures]. They’ll go page by page and then, ‘Oh yeah, that reminds me,’ and flip back five pages.”
The tool in development will allow users to talk about their pictures, with the audio attached to the grouping of photos (photo by Ryan Perez)
Axtell says they’re focused, at least initially, on a very naïve approach to clustering the pictures, using just the language shared in the app. If “cat” was said for eight photos, then those eight photos would be close together. But family descriptions tend to be a lot more casual. There can be a “cat” and a person named “Cat” and so Axtell says their current method of organizing speech is being used as a stepping stone until they can delve into semantic variances.
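To make the idea of word-based grouping concrete, here is a minimal sketch in Python. It is not the lab's implementation: the function names, the tiny stop-word list, the photo ids and the sample stories are all hypothetical, chosen only to show how photos that share a spoken word could end up in the same group.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny stop-word list; a real system would need a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "this", "that", "is", "was",
              "of", "in", "on", "at", "with", "my", "our", "we"}


def tokenize(story):
    """Lowercase a transcript, strip punctuation, and drop stop words."""
    words = (w.strip(".,!?;:\"'()").lower() for w in story.split())
    return {w for w in words if w and w not in STOP_WORDS}


def cluster_by_shared_words(stories):
    """Group photo ids under every word that was said about more than one photo.

    `stories` maps a photo id to the transcript of what was said about it.
    """
    photos_by_word = defaultdict(set)
    for photo_id, story in stories.items():
        for word in tokenize(story):
            photos_by_word[word].add(photo_id)
    # Keep only words that actually link two or more photos.
    return {w: sorted(p) for w, p in photos_by_word.items() if len(p) > 1}


# Made-up sample stories: two mention a pet cat, one mention a person named Cat.
stories = {
    "img_01": "This is the cat at the cottage.",
    "img_02": "Cat and her brother on the dock.",
    "img_03": "Our cat asleep in the garden.",
}
clusters = cluster_by_shared_words(stories)
print(clusters)
```

Because the sketch lowercases every word, the pet and the person named Cat land in the same group, which is exactly the kind of ambiguity a purely word-based approach cannot resolve on its own.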
“One of the things we’re really focusing on is making sure that the whole process of how it makes these clusters is really transparent – keeping the human in the loop. You don’t want to give your family photos to a computer and have it go: This is how you should group them into different groups.”
Axtell and Munteanu will also present a paper published with co-authors Carrie Demmans Epp, Yomna Aly and Frank Rudzicz, an assistant professor of computer science and a rehabilitation scientist at the University Health Network, on touch-supported voice recording to facilitate forced alignment of text and speech in e-readers. Forced alignment determines where each part of a text transcription falls within the corresponding audio recording.
Axtell says one of the goals of this interactive project is to have a younger family member read a book aloud for an older adult with low vision, a common problem among the elderly. The older adult can later listen to their family member read while following the large-print, highlighted text.
Forced alignment, as Axtell explains, is built upon existing machine learning tools and works very well when people follow a script, such as closed captioning for newscasts, but a family member reading aloud is much more informal, from skipping words to mispronouncing them entirely.
“The text we gave [readers] was Anne of Green Gables. It has really long sentences and phrases [such as] ‘The beautiful, capricious, reluctant Canadian spring.’”
The current tool for forced alignment was too strict and didn’t account for the mistakes and pauses the reader would make, or the reader going entirely off-script, adding their own comments to the story. Although, as Axtell says, the project was inspired by testing forced alignment for their dataset, it has practical applications too, from reducing pronunciation errors in language learning to accurately aligning the e-reader’s highlighted text with the audio playback.
For their doctoral studies, Axtell’s human-computer interaction research will look closely at speech interactions that are implicit to the task at hand but don’t feel like speech interactions to the user.
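The difference between strict and tolerant matching can be illustrated at the level of words alone. The sketch below is only an analogy, not forced alignment itself (which works on audio) and not the researchers' method: it assumes the spoken words have already been transcribed, and it uses Python's standard `difflib.SequenceMatcher` to line up a script with what a reader actually said, skipping over omitted words, fillers and side comments. The function names and the sample reading are hypothetical.

```python
from difflib import SequenceMatcher

PUNCTUATION = ".,!?;:\"'"


def normalize(text):
    """Split text into lowercase words with surrounding punctuation removed."""
    words = (w.strip(PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def align_words(script, spoken):
    """Return (script_index, spoken_index) pairs for words that line up.

    Words the reader skipped, and words the reader added, simply get no pair.
    """
    script_words = normalize(script)
    spoken_words = normalize(spoken)
    matcher = SequenceMatcher(a=script_words, b=spoken_words, autojunk=False)
    pairs = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            pairs.append((block.a + offset, block.b + offset))
    return pairs


script = "The beautiful, capricious, reluctant Canadian spring"
# Made-up reading: the reader skips "capricious", says "um", then adds a comment.
spoken = "the beautiful um reluctant Canadian spring I love this part"
pairs = align_words(script, spoken)
print(pairs)
```

In this example the skipped word, the filler and the trailing comment are left unpaired, while the rest of the phrase still lines up, which is the kind of leniency a read-aloud recording calls for.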
The tabletop photo prototype needs more external testing, especially when introducing additional storytellers to the memory-storing process.
“How is a computer going to deal with that? How are conversations handled? That would be really interesting to see where that goes.”