Should Researchers Be Allowed to Use YouTube Videos and Tweets?
This article is part of Future Tense, a collaboration among Arizona State University, New America, and Slate.
There's a lot you might guess about a person based on their voice: their gender, their age, perhaps even their race. That's your brain making an educated guess about the identity of a speaker based on what you've experienced, but sometimes, those guesses are wrong. (People I talk to on the phone who don't know my name often assume I'm white because I speak English without an accent. They frequently express surprise to learn I'm Asian.) In the Speech2Face paper, a group of MIT researchers set out to investigate what a computer can guess about a person's appearance from their voice.
To do that, the researchers trained their model using a dataset of YouTube videos originally compiled by Google researchers for a different project. The model was fed face and voice data from hundreds of thousands of YouTube examples. Then, the researchers fed the voices to the model and asked it to create a face it thought matched each voice. In the end, the model was decent at predicting what a person looked like but struggled with people of certain identities. For instance, while the model renders an Asian American man speaking Chinese as an Asian man, it draws up a white man when that same person speaks English instead. It also appeared to have issues with voice pitch (it assumed people with high-pitched voices were women and those with low-pitched voices were men) and with age. In short, it appears that the model learned some basic stereotypes about a person's face and voice.
Unbeknownst to him, Nick Sullivan, head of cryptography at Cloudflare, contributed to this model's "education." He said a friend sent him the paper and that he was "quite surprised and confused" to see his face among the "successful" renderings. "I saw a photo of me, a computer construction of my face, and a computer-generated image that didn't resemble me but had similar nose and jaw dimensions," says Sullivan. (In my opinion, he's being quite generous about that computer-generated image; it's unrecognizable as him.)
Part of his confusion was that he hadn't signed any waivers to be part of a machine learning study. He had signed waivers for appearances in YouTube videos, so he figured one of those videos might have found its way into the dataset the researchers used. After some digging, however, he discovered the video used in the dataset and didn't recall signing any kind of waiver for that one.
Whether Sullivan signed a waiver likely doesn't matter, though. Most research using data from human participants does require scientists to obtain informed consent (most often in the form of waivers). But YouTube videos are considered publicly available information and are not classified as "human subjects research," even if researchers are studying the intricacies of your face and voice. And while YouTube users own the copyright to their own videos, researchers using clips could make the argument that their work qualifies as "fair use" of copyrighted materials, since the end result is "transformative" of the original work. (In the case of the Speech2Face data, the model quite literally transforms your voice and face data into something else entirely.) Casey Fiesler, assistant professor of information science at the University of Colorado Boulder, says she's never seen a copyright holder challenge researchers who used their internet posts as data. "There probably aren't legal issues with it," she says.
But just because something is legal doesn't mean it's ethical. That doesn't mean it's necessarily unethical, either, but it's worth asking questions about how and why researchers use social media posts, and whether those uses could be harmful. I was once a researcher who had to obtain human-subjects approval from a university institutional review board, and I know it can be a painstaking application process with long wait times. Collecting data from individuals takes a long time, too. If you could just sub in YouTube videos in place of collecting your own data, that saves time, money, and effort. But that could be at the expense of the people whose data you're scraping.
But, you might say, if people don't want to be studied online, then they shouldn't post anything. Yet most people don't fully understand what "publicly available" really means or what its ramifications are. "You might know intellectually that technically anyone can see a tweet, but you still conceptualize your audience as being your 200 Twitter followers," says Fiesler. In her research, she's found that the majority of people she's polled have no clue that researchers study public tweets.
Some may disagree that it's researchers' responsibility to work around social media users' ignorance, but Fiesler and others are calling for their colleagues to be more mindful about any work that uses publicly available data. For instance, Ashley Patterson, an assistant professor of language and literacy at Penn State University, ultimately decided to use YouTube videos in her dissertation work on biracial individuals' educational experiences. That's a decision she arrived at only after weighing the ethics at each step of the way. "I had to set my own levels of ethical standards and hold myself to it, because I knew no one else would," she says. One of Patterson's first steps was to ask herself what YouTube videos would add to her work, and whether there were any other ways to collect her data. "It's not a matter of whether it makes my life easier, or whether it's 'just data out there' that would otherwise go to waste. The nature of my question and the response I was looking for made this an appropriate piece [of my work]," she says.
Researchers may also want to consider qualitative, hard-to-quantify contextual cues when weighing ethical decisions. What kind of data is being used? Fiesler points out that tweets about, say, a TV show are far less personal than ones about a sensitive medical condition. Anonymized written materials, like Facebook posts, could be less invasive than someone's face and voice pulled from a YouTube video. The potential consequences of the research project are worth considering, too. For instance, critics have pointed out that researchers who used YouTube videos of people documenting their experience undergoing hormone replacement therapy to train facial recognition A.I. could be putting their unwitting participants in danger. It's not obvious how the results of Speech2Face will be used, and when asked for comment, the paper's researchers said they'd prefer to quote from their paper, which pointed to a helpful purpose: providing a "representative face" based on the speaker's voice on a phone call. But one can also imagine dangerous applications, like doxing anonymous YouTubers.
One way to get ahead of this, perhaps, is to take steps to explicitly inform participants that their data is being used. Fiesler says that when her team asked people how they'd feel after learning their tweets had been used for research, "not everyone was necessarily super upset, but most people were surprised." They also seemed curious; some said that if their tweet were included in research, they'd want to read the resulting paper. "In human-subjects research, the ethical standard is informed consent, but inform and consent can be pulled apart; you could potentially inform people without getting their consent," Fiesler suggests.
Sullivan says it would have been nice to have been notified that his voice and face were in a research database, but he also acknowledges that given the size of the corpus, that would have been a difficult task. And in the case of Speech2Face, the researchers were using a dataset originally collected for a different project. Even if the original researchers had notified participants that their videos were being used, would the Speech2Face researchers then also have a responsibility to renotify those participants with details about their own work? In any case, it seems like researchers could at least notify people whose personal details are published in a paper. "Since my image and voice were singled out as an example in the Speech2Face paper, rather than just used as a data point in a statistical study, it would have been polite to reach out to inform me or ask for my permission," says Sullivan.
But even informing YouTubers might not be the best decision in all cases. Patterson, for instance, considered doing so but decided against it for two reasons. First, some of the YouTubers were under 18, which meant that reaching out to them would have required her to first contact their parents. Based on the videos' candid content about their families and school experiences, Patterson said, it seemed like the YouTubers' imagined audiences were definitely not parents. "It seemed like a violation of the way they envisioned this platform," she says, though she also acknowledges that researchers' eyes could similarly be seen as a violation. Additionally, Patterson said that the IRB officials she talked with had no precedent for contacting creators of publicly available content like YouTube videos, and that ironing all that out would have taken months. In Patterson's case, it just didn't seem practical.
In the end, there's no one-size-fits-all answer for researchers trying to determine whether using publicly available data is appropriate, but there is certainly room for more discussion. "It would be nice to see more reflection from researchers about why this is OK," says Fiesler, suggesting that researchers' published papers could discuss the ethical considerations they made. (The Speech2Face paper did include an "ethics" section, but it did not include this type of discussion, and when asked for comment, the researchers pointed me back to that section.) Patterson agrees: "I think there are going to be more conversations for sure, and in the not too distant future, you might not even be able to do this kind of work."