Cursing in English on Twitter
Cursing is not uncommon during conversations in the physical world. On social media, people can instantly chat with friends without face-to-face interaction, usually in a more public fashion, and their messages are broadly disseminated through highly connected social networks. Will these distinctive features of social media lead to a change in people’s cursing behavior? In this paper, we examine the characteristics of cursing activity on a popular social media platform, Twitter, through the analysis of about 51 million tweets and about 14 million users. In particular, we explore a set of questions that prior studies have recognized as crucial for understanding cursing in offline communications, including the ubiquity, utility, and contextual dependencies of cursing.
The work presented here is part of a broader agenda of analyzing citizen sensing to understand, inform policy or decision makers and develop tools to help manage important social and human development issues/challenges, including:
- Coordination during disasters: we investigate massive social media communities during disasters via psycholinguistic theories to assist coordination functions of demand, supply and engagement while answering: with whom to coordinate, why and how to. [Highlight: ICCM-13 ignite talk at UN Nairobi, ICWSM-2013 tutorial, SDM-2014 tutorial, First Monday journal highlight for Jan-2014 issue, TechPresident article ]
- Harassment on social media: we analyze social media conversations to identify and monitor harassment by understanding language of source and target users for mining intention, sentiment tone and emotions evoked.
- Prescription drug abuse: we mine social media to capture the knowledge, attitudes and behaviors of prescription drug abusers through the automatic extraction of semantic information (including entities, relationships, triples and other intelligible constructs such as sentiments, emotions, intervals, frequency, dosage, etc.) [Highlight: SemanticWeb.com article on this research]
- Depressive disorders: we leverage social sharing behaviors to mine depression and other mental health issues in this area. [Highlight: Research on Suicide Notes and new project initiated with Mayo Clinic]
- Gender-based violence: we model gender-based dynamics in the social data stream across the world to inform policy decision making of development agencies, in collaboration with the Subject Matter Experts at UN. [Highlight: Joint research with UNFPA experts to directly impact the policy actions]
In particular, the cursing analysis discussed here is one of several aspects of language analysis used in the above projects, along with our research on spatio-temporal-thematic, people-content-network, and sentiment-emotion analysis, discussed as part of our research on citizen sensing and social media analysis.
- 1 Introduction
- 2 Method and Analysis
- 3 Conclusion
- 4 Acknowledgments
- 5 References
- 6 Citation/Errata
Do you curse? Do you curse on social media? How often do you see people cursing on social media (e.g., Twitter)? Cursing, also called swearing, profanity, or bad language, is the use of certain words and phrases that are considered by some to be rude, impolite, offensive, obscene, or insulting <ref> "Profanity - Wikipedia, the free encyclopedia", Wikipedia, March 2013</ref>. In this paper, we use cursing, profanity and swearing interchangeably. As Jay <ref name="The utility and ubiquity of taboo words">Jay, T. The utility and ubiquity of taboo words. Perspectives on Psychological Science 4, 2 (2009), 153–161.</ref> pointed out, cursing is a “rich emotional, psychological and sociocultural phenomenon”, which has attracted many researchers from related fields such as psychology, sociology, and linguistics <ref>Jay, T. Do offensive words harm people? Psychology, public policy, and law 15, 2 (2009), 81.</ref> <ref name="The pragmatics of swearing">Jay, T., and Janschewitz, K. The pragmatics of swearing. Journal of Politeness Research. Language, Behaviour, Culture 4, 2 (2008), 267–288.</ref>.
Over the last decade, social media has become an integral part of our daily lives. According to the 2012 Pew Internet & American Life Project report <ref> "Pew Internet: Social Networking (full detail)", PewResearch Internet Project, February 2013</ref>, 69% of online adults use social media sites and the number is steadily increasing. Another Pew study in 2011 <ref> "How American teens navigate the new world of “digital citizenship”", PewResearch Internet Project, November 2011.</ref> shows that 95% of all teens aged 12-17 are now online and 80% of those online teens are users of social media sites. People post on these sites to share their daily activities, happenings, thoughts and feelings with their contacts, and to keep up with close social ties, which makes social media both a valuable data source and a great target for various areas of research and practice, including the study of cursing. While the CSCW community has made great efforts to study various aspects (e.g., credibility <ref>Morris, M. R., Counts, S., Roseway, A., Hoff, A., and Schwarz, J. Tweeting is believing?: understanding microblog credibility perceptions. In Proceedings of the ACM 2012 conference on Computer Supported Cooperative Work, ACM (2012), 441–450.</ref>, privacy <ref>Almuhimedi, H., Wilson, S., Liu, B., Sadeh, N., and Acquisti, A. Tweets are forever: a large-scale quantitative analysis of deleted tweets. In Proceedings of the 2013 conference on Computer supported cooperative work, ACM (2013), 897–908.</ref>) of social networking and social media, our understanding of cursing on social media remains very limited.
Communication on social media has its own characteristics that differentiate it from offline interaction in the physical world. Take Twitter as an example. The messages posted on Twitter (i.e., tweets) are usually public and can spread rapidly and widely through the highly connected user network, while offline conversations usually remain private among the persons involved. In addition, most of our actual exchange of words in the physical world happens through face-to-face oral communication, while on Twitter we mostly communicate by writing/typing without seeing each other. Will such differences lead to a change in people’s cursing behavior? Will the existing theories on swearing in offline communication in the physical world still be supported if tested on social media?
To address these questions, this paper examines the use of English curse words on the micro-blogging platform Twitter. We collected a random sample of all public tweets and the data of relevant user accounts every day for four weeks. We first identified English cursing tweets in the collection, and extracted numerous attributes that characterize users and users’ tweeting behaviors. We then evaluated the effect of these attributes with respect to the cursing behaviors on Twitter. This exploratory study aims to improve our understanding of cursing on social media by exploring a set of questions that have been identified as crucial in previous cursing research on offline communication. The answers to these questions may also have valuable implications for the studies of language acquisition, emotion, mental health, verbal abuse, harassment, and gender difference <ref name="The utility and ubiquity of taboo words"/>.
Specifically, we examine four research questions:
- Q1 (Ubiquity): How often do people use curse words on Twitter? What are the most frequently used curse words?
- Q2 (Utility): Why do people use curse words on Twitter? Previous studies <ref name="The utility and ubiquity of taboo words"/> found that the main purpose of cursing is to express emotions. Do people curse to express emotions on Twitter? What are the emotions that people express using curse words?
- Q3 (Contextual Variables): Does the use of curse words depend on various contextual variables such as time (when to curse), location (where to curse), or communication type (how to curse)?
- Q4: Who says curse words to whom on Twitter? Previous research <ref>Jay, T. Why we curse: A neuro-psycho-social theory of speech. John Benjamins Publishing, 2000.</ref> <ref name="We feel fine and searching the emotional web">Kamvar, S. D., and Harris, J. We feel fine and searching the emotional web. In Proceedings of the fourth ACM international conference on Web search and data mining, ACM (2011), 117–126.</ref> suggested that gender plays an important role in cursing; does it also affect who uses or hears curse words on Twitter?
Method and Analysis
Twitter provides a small random sample of all public tweets via its sample API in real time <ref>https://dev.twitter.com/docs/api/1.1/get/statuses/sample</ref>. Using this API, we continuously collected tweets for four weeks from March 11th 2013 to April 7th 2013. We kept only the users who specified ‘en’ as their language in profiles. Further, we utilized Google Chrome Browser’s embedded language detection library to remove non-English tweets <ref>https://pypi.python.org/pypi/chromium_compact_language_detector/0.2</ref>. In total, we gathered about 51M tweets from 14M distinct user accounts.
Cursing Lexicon Coding
We asked two college students who are native English speakers to independently annotate potential curse words collected from the Internet. In the end, we kept only the 788 words that both annotators considered to be curse words in most cases. Besides correctly spelled words (e.g., fuck, ass), the lexicon also included different variations of curse words, e.g., a55, @$$, $h1t, b!tch, bi+ch, c0ck, f*ck, l3itch, p*ssy, and dik.
We call a tweet a cursing tweet if it contains at least one curse word. Twitter users may use different variations of the same word, so we first simply compare the words in a tweet against all the curse words in the lexicon. If there is no match, we remove repeating letters in the words of the tweet (e.g., fuckk → fuck) and repeat the matching process. We also convert digits or symbols in a word to their original letters: e.g., 0 → o, 9 → g, ! → i. Moreover, based on our observations, the following symbols, ' ', '%', '-', '.', '#', '\', '’', are frequently used to mask curse words: f ck, f%ck, f.ck, f#ck, f’ck → fuck. We apply an edit distance approach similar to <ref>Sood, S., Antin, J., and Churchill, E. Profanity use in online communities. In Proceedings of the 2012 ACM annual conference on Human Factors in Computing Systems, ACM (2012), 1481–1490.</ref> to spot curse words with mask symbols. Namely, if the edit distance between a candidate word (f ck) and a curse word (fuck) equals the number of mask symbols in the candidate word (1 in this case), then it is a match.
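The matching steps above can be sketched in code. This is a minimal illustration, not the authors' implementation: the four-word `LEXICON`, the function names, and the choice to try collapsing letter runs to both one and two letters (so that "asss" can still reach "ass") are assumptions made for the example. Only the substitutions and mask symbols named in the text are included.

```python
import re

# Hypothetical mini-lexicon for illustration; the paper's lexicon has 788 entries.
LEXICON = {"fuck", "shit", "ass", "bitch"}

# Digit/symbol substitutions named in the text (0 -> o, 9 -> g, ! -> i).
SUBSTITUTIONS = str.maketrans({"0": "o", "9": "g", "!": "i"})

# Mask symbols listed in the text: space, %, -, ., #, backslash, ’
MASK_SYMBOLS = set(" %-.#\\’")


def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def collapse_repeats(word, keep):
    """Shrink every run of a repeated character to `keep` characters."""
    return re.sub(r"(.)\1+", lambda m: m.group(1) * keep, word)


def is_curse_word(candidate, lexicon=LEXICON):
    """Apply the matching steps in order; stop at the first match."""
    word = candidate.lower()
    if word in lexicon:                                   # 1. direct match
        return True
    # 2. remove repeating letters (assumption: try runs of 1 and of 2)
    if {collapse_repeats(word, 1), collapse_repeats(word, 2)} & lexicon:
        return True
    if word.translate(SUBSTITUTIONS) in lexicon:          # 3. digits/symbols
        return True
    n_masks = sum(ch in MASK_SYMBOLS for ch in word)      # 4. masked forms
    if n_masks:
        return any(edit_distance(word, curse) == n_masks for curse in lexicon)
    return False
```

For instance, `is_curse_word("f%ck")` succeeds at step 4 because the candidate contains one mask symbol and is at edit distance 1 from "fuck". Space-masked forms such as "f ck" must be passed in as a single candidate string, so a tweet-level wrapper would need to look at adjacent tokens as well as single ones.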
To evaluate the accuracy of this lexicon-based method to spot cursing tweets, we drew a random sample of 1000 tweets, and asked two annotators to manually label them as cursing or non-cursing independently. Finally, there were 118 tweets labeled as cursing tweets for which both annotators agreed on their labels, and the other 882 tweets were labeled as non-cursing ones. We then tested the lexicon-based spotting approach on this labeled dataset, and the results showed that this lexicon-based method achieved a precision of 98.84%, a recall of 72.03% and F1 score of 83.33%. As expected, this lexicon-based approach for profanity detection provides high precision but lower recall, which is mainly due to the variations in curse words (e.g., due to misspellings and abbreviations) and context sensitivity of cursing. Though we believe that, for this work, high-precision is preferred and recall of 72.03% is considered reasonable, more sophisticated classification methods that can further improve the recall remain an interesting topic for future work.
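The reported scores can be checked with a short calculation. The confusion counts used below (85 true positives, 1 false positive, 33 false negatives) are not stated in the text; they are back-derived here as the counts consistent with 118 cursing tweets and the reported precision and recall, so treat them as an assumption.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard precision, recall and F1 from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Assumed counts: 85 of the 118 cursing tweets spotted, plus 1 false alarm.
precision, recall, f1 = precision_recall_f1(tp=85, fp=1, fn=33)
print(f"precision={precision:.2%} recall={recall:.2%} f1={f1:.2%}")
```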
Cursing Frequency and Choice of Curse Words
Prior studies have found that 0.5% to 0.7% of all the words we speak in our daily lives are curse words <ref>Jay, T. Cursing in America: A Psycholinguistic Study of Dirty Language in the Courts, in the Movies, in the Schoolyards, and on the Streets. John Benjamins Publishing Co, 1992.</ref> <ref name="The sounds of social life">Mehl, M. R., and Pennebaker, J. W. The sounds of social life: a psychometric analysis of students’ daily social environments and natural conversations. Journal of personality and social psychology 84, 4 (2003), 857.</ref>. Turning to Internet chatrooms, Subrahmanyam et al. <ref name="Connecting developmental constructions to the internet">Subrahmanyam, K., Smahel, D., and Greenfield, P. Connecting developmental constructions to the internet: identity presentation and sexual exploration in online teen chat rooms. Developmental psychology 42, 3 (2006), 395.</ref> reported that 3% of utterances contain curse words. Our comparison of cursing frequencies from different studies is shown in the following Table. Compared with existing studies, our estimate of cursing frequency is based on a significantly larger population: 14 million Twitter users and 51 million tweets. After removing punctuation marks and emoticons, we find that curse words occurred at the rate of 0.80% on Twitter, which is higher than the rate (0.5%) in <ref name="The sounds of social life"/>. About 7.73% of all the tweets in our collection contain curse words, that is, about one out of 13 tweets. If we consider one tweet as roughly one utterance, this rate is more than twice the rate (3%) in <ref name="Connecting developmental constructions to the internet"/>.
| ||Mehl et al. 2003 <ref name="The sounds of social life"/>||Subrahmanyam et al. 2006 <ref name="Connecting developmental constructions to the internet"/>||Our work|
|Subjects||52 undergraduates||1,150 chatroom users||14 million Twitter users|
|Sample||4 days of tape recordings||12,258 utterances||51 million tweets|
|Cursing Frequency||0.5% of all words||3% of all utterances||0.80% of all words, 7.73% of all tweets|
Besides the cursing frequency, we are also interested in the question: which curse words are most popular? We manually grouped different variations of curse words into their root forms, e.g., @$$, a$$ → ass. If a curse word is a combination of two or more words, and one of its component words is also a curse word, then it is grouped under that component word, e.g., dumbass, dumbasses, @sshole, a$$h0!e, a55hole → ass. All the 788 curse words are grouped into 89 distinct groups based on the root curse words, and the frequencies of the top 20 words are shown in the following Figure. The most popular curse word is fuck, which covers 33.57% of all the curse word occurrences, followed by shit (15.45%), ass (14.66%), bitch (10.67%), nigga (10.30%), hell (3.91%), whore (1.84%), dick (1.74%), piss (1.55%), and pussy (1.24%).
Realizing that only a small subset of curse words occurs very frequently, we also draw the cumulative distribution of top 20 curse words. We find that the top seven curse words – fuck, shit, ass, bitch, nigga, hell and whore cover 90.40% of all the curse word occurrences.
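The cumulative coverage can be reproduced from the per-word shares quoted above. This is a small sketch using only the ten percentages given in the text; the function name `words_needed` is introduced here for illustration.

```python
from itertools import accumulate

# Share of all curse word occurrences per root word, as quoted in the text (%).
SHARES = [
    ("fuck", 33.57), ("shit", 15.45), ("ass", 14.66), ("bitch", 10.67),
    ("nigga", 10.30), ("hell", 3.91), ("whore", 1.84), ("dick", 1.74),
    ("piss", 1.55), ("pussy", 1.24),
]

# Running total of the shares, in descending order of popularity.
CUMULATIVE = list(accumulate(share for _, share in SHARES))


def words_needed(threshold):
    """Number of top words whose combined share first reaches `threshold` (%)."""
    for count, total in enumerate(CUMULATIVE, start=1):
        if total >= threshold:
            return count
    return None  # threshold not reached by the listed words


print(round(CUMULATIVE[6], 2))  # combined share of the top seven words
print(words_needed(90.0))
```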
Cursing vs. Emotion
Psychology studies <ref name="The pragmatics of swearing"/> suggest that “the main purpose of cursing is to express emotions, especially anger and frustration.” Thus, we aim to explore the emotions expressed in cursing tweets and compare them with those in non-cursing tweets. We apply Machine Learning classifiers to the 51 million tweets and obtain the emotion distributions for both cursing and non-cursing tweets, which are shown in the following Figure. Not surprisingly, cursing is associated with negative emotions: 21.83% and 16.79% of the cursing tweets express sadness and anger, respectively. In contrast, 11.31% and 4.50% of the non-cursing tweets express sadness and anger, respectively. This can be explained by the fact that curse words are usually used for venting negative emotions, especially anger and sadness.
Cursing vs. Time
A previous study <ref name="We feel fine and searching the emotional web"/> has shown marked differences in emotions (e.g., stress, happiness) expressed between weekdays and weekends, or between morning and night. Similarly, we investigate the relationship between cursing and two types of time periods: times during a day and days of a week. For each tweet, Twitter provides a timestamp in the UTC timezone, indicating when the tweet was posted. However, it makes more sense to use the local time when the tweet was posted, so we calculate the corresponding local timestamp for every tweet whose sender has specified a timezone in his/her profile. In the following Figure, the lines with triangles and crosses stand for the volumes of overall tweets and cursing tweets, and the line with circles stands for the ratio of cursing tweets to overall tweets. A flat segment of the line with circles suggests the cursing ratio is stable – the increment of cursing tweets keeps pace with that of overall tweets. A rising line segment with circles suggests that the increment of cursing tweets outpaces that of overall tweets. A falling line segment with circles suggests that the increment of cursing tweets is outpaced by that of overall tweets.
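The local-time conversion and the per-hour cursing ratio can be sketched as follows. The input layout (a UTC timestamp, the sender's UTC offset in seconds, and a cursing flag per tweet) and both function names are assumptions for illustration; the text only states that the profile timezone is used.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def local_hour(created_at_utc, utc_offset_seconds):
    """Hour of day in the sender's local time.

    The offset is simply added to the UTC timestamp; the result keeps its UTC
    tzinfo label, which is harmless because only the hour is read.
    """
    return (created_at_utc + timedelta(seconds=utc_offset_seconds)).hour


def hourly_cursing_ratio(tweets):
    """Map local hour -> (cursing tweets / all tweets) for that hour.

    `tweets` is an iterable of (created_at_utc, utc_offset_seconds, is_cursing).
    """
    total, cursing = Counter(), Counter()
    for created_at_utc, utc_offset_seconds, is_cursing in tweets:
        hour = local_hour(created_at_utc, utc_offset_seconds)
        total[hour] += 1
        cursing[hour] += bool(is_cursing)
    return {hour: cursing[hour] / total[hour] for hour in total}


# Made-up sample: two tweets from a UTC-5 user, one from a UTC+0 user.
sample = [
    (datetime(2013, 3, 11, 2, 30, tzinfo=timezone.utc), -5 * 3600, True),
    (datetime(2013, 3, 11, 2, 45, tzinfo=timezone.utc), -5 * 3600, False),
    (datetime(2013, 3, 11, 14, 0, tzinfo=timezone.utc), 0, False),
]
print(hourly_cursing_ratio(sample))
```

A fixed offset ignores daylight saving changes inside the collection window; a named timezone would handle that at the cost of depending on a timezone database.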
We make the following observations from the above Figure. First, the pattern of overall tweet volume fits humans’ diurnal activity schedule: it starts rising at 5 am, when people get up at the beginning of the day. From then on, it keeps rising and reaches a small peak around lunch time. It then keeps rising until it reaches the peak of the day around 9 pm, after which people start preparing to go to sleep. Second, cursing is ever-present: the black cursing ratio line with circles always stays above 0, suggesting that people curse throughout the day. Third, the increment of cursing outpaces the increment of overall tweet volume during most of the daytime: people curse more and more as they go through the day! In particular, there are two sharply rising slopes: 6 am - 11 am and 3 pm - 1:30 am. We speculate that Twitter users being in a good mood during lunch contributes to the flat ratio line segment between 11 am and 2 pm (lunch time). It seems that midnight to 1:30 am is the peak time for cursing. After that, the volume of cursing tweets decreases faster than that of overall tweets.
We now explore the popularity changes of the top seven curse words at different times of a day to gain more insights. We define the relative frequency of a curse word as its total number of occurrences in any tweet divided by the total number of tweets in a predefined time window. Three representative time windows are selected: 12 am - 2 am, 5 am - 7 am and 12 pm - 2 pm. We observe that the relative frequencies for almost all of the top seven curse words keep increasing from 5 am - 7 am to 12 pm - 2 pm and from 12 pm - 2 pm to 12 am - 2 am. On average, from 5 am - 7 am to 12 am - 2 am, the relative frequencies of the top seven curse words have increased by 59.60%. In descending order of the relative increase of their relative frequencies, the top seven curse words rank as follows: ass (86.33%), nigga (78.17%), bitch (61.03%), shit (56.90%), fuck (50.85%), whore (34.54%) and hell (23.69%).
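The relative frequency defined above, and the relative increase between two time windows, amount to two short formulas. The sketch below uses made-up counts purely to show the arithmetic; the function names and the numbers are not from the paper.

```python
def relative_frequency(occurrences, total_tweets):
    """Occurrences of a curse word divided by all tweets in the time window."""
    return occurrences / total_tweets


def relative_increase(before, after):
    """Percentage change from the earlier window's value to the later one's."""
    return (after - before) / before * 100


# Hypothetical counts for one word in two windows (not the paper's data).
morning = relative_frequency(occurrences=1_000, total_tweets=100_000)
midnight = relative_frequency(occurrences=3_000, total_tweets=200_000)
print(f"{relative_increase(morning, midnight):.2f}%")
```

Dividing by the tweet volume of each window is what keeps a busy window from looking more profane merely because it contains more tweets.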
To explore how people curse during different days of a week, we plot the ratio of cursing tweets to total tweets each day for four weeks, separately, in the following Figure. The general trend is that users start with relatively high cursing ratios on Mondays, Tuesdays and Wednesdays, then the ratios keep decreasing over the following three days and reach the lowest point on Saturdays. They then start rising again on Sundays. To see the general trend clearly, readers are referred to the four-week average ratio in the plot. Although we observe this general pattern across four weeks, we are still unclear about the reason. We are also interested in the popularity changes of the top seven words during different days of a week, similar to those at different times of a day. We select the following two time windows: Monday-Tuesday and Friday-Saturday. On average, from Friday-Saturday to Monday-Tuesday, the relative frequencies of the top seven curse words have increased by 10.36%. In descending order of the relative increase of their relative frequencies, the top seven words rank as follows: bitch (15.15%), shit (13.55%), nigga (12.41%), ass (10.37%), whore (10.30%), hell (7.53%) and fuck (7.16%).
Cursing vs. Message Type
Tweets can be grouped into different message types and we are curious whether users curse differently in different types of tweets. Specifically, retweet refers to the tweet that is simply a re-posting of a tweet from another user. If a user receives a tweet from another user, and this user clicks on reply button to write a new tweet to reply to this tweet, then this newly posted tweet is called a reply. If a user starts sending a tweet to another user, and this tweet is not a reply to any other tweet, we name it a starter. If a tweet mentions another user, and it is neither a reply nor a starter, we call it a mention. If a tweet does not belong to any of the above categories, it is an update.
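The five message types can be expressed as an ordered set of checks. This is one possible reading of the definitions above, not the authors' code: the dictionary fields `is_retweet` and `in_reply_to_status_id` are assumed inputs, and a starter is interpreted here as a non-reply tweet that begins with an @username.

```python
import re

# An @username token; used both for starters and for mentions.
MENTION = re.compile(r"@\w+")


def message_type(tweet):
    """Classify a tweet dict as retweet, reply, starter, mention or update.

    Assumed fields (hypothetical layout):
      text                  - the tweet text
      is_retweet            - True if the tweet re-posts another user's tweet
      in_reply_to_status_id - id of the tweet being replied to, or None
    """
    text = tweet.get("text", "")
    if tweet.get("is_retweet"):
        return "retweet"
    if tweet.get("in_reply_to_status_id") is not None:
        return "reply"
    if MENTION.match(text.lstrip()):
        return "starter"   # addressed to a user, but not a reply
    if MENTION.search(text):
        return "mention"   # names a user somewhere in the text
    return "update"
```

The order of the checks matters: a reply usually also starts with an @username, so the reply test has to run before the starter test.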
We plot the ratio of cursing tweets in each message category in following Figure, where the black horizontal line stands for the average ratio of cursing tweets to all the tweets. It is interesting to note that although we see quite a bit of cursing messages on Twitter in general, when the messages are sent to other users, the cursing ratios are below average. The ratio of cursing tweets in starters is 3.93%, which is only 51.01% of the average cursing ratio. This suggests that users perform self-censorship to some extent when they directly talk to other users. When they post updates about themselves or simply mention other users’ names, they do not pay as much attention to the use of curse words.
Cursing vs. Location
We leverage latitudes and longitudes embedded within tweets to infer the venues, where tweets were posted. The following Table shows the different categories of venues, the raw number of cursing tweets and the ratio of cursing tweets to all the tweets sent from venues of the same category.
We make the following observations: a) The pattern of more swearing in more relaxed environments still holds, e.g., cursing ratios in descending order are: Residence (7.08%) > Shop & Service (6.41%) > Nightlife Spot (6.37%) > Entertainment & Recreation (5.71%) > Professional Places (5.64%) > Travel & Transport (5.34%). However, the gaps are much smaller than those in the physical world, partly because the communication happens in the digital world. b) Two exceptions, College Academic Place and High School, have very high cursing rates. This suggests that high school and college students tend to use more curse words, even in educational places. c) We speculate that users are usually in a good mood while out in nature, and that is why its cursing ratio is the lowest (4.97%) among all the venues.
Cursing vs. Gender
We applied a gender-detection algorithm based on users' first names and identified 4,639,204 females and 3,826,701 males in our Twitter user collection. Recall that previously we grouped tweets into five categories: mention, reply, retweet, starter and update. Here we consider only reply and starter, since they represent targeted messages between Twitter users with an explicit message sender (who) and recipient (to whom) specified. These messages are further divided into four groups based on gender – female to female, male to female, female to male and male to male. To make the results comparable, we randomly sampled 100K tweets from each of these four groups, and the statistics are shown in the following Table.
Comparing the same-gender contexts (F to F and M to M) with the mixed-gender contexts (F to M and M to F), we observe that people are more likely to use curse words within the same-gender context, and this tendency is more obvious when the message senders are males (5.48% vs. 4.19%). This is consistent with the findings in prior studies <ref name="The pragmatics of swearing"/> <ref>Pilotti, M., Almand, J., Mahamane, S., and Martinez, M. Taboo words in expressive language: Do sex and primary language matter? American International Journal of Contemporary Research 2, 2 (2012), 17–26.</ref> on offline communications. Moreover, Male-to-Male communication has the highest cursing ratio (5.48%), while Female-to-Male has the lowest cursing ratio (3.81%).
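The per-group cursing ratios compared above come from a simple grouped count over sender and recipient gender. A minimal sketch, assuming each targeted message has already been reduced to a (sender gender, recipient gender, cursing flag) triple; that layout and the function name are illustrative, not the authors' pipeline.

```python
from collections import defaultdict


def cursing_ratio_by_gender_pair(messages):
    """Map 'X to Y' -> share of cursing messages sent from gender X to gender Y.

    `messages` is an iterable of (sender_gender, recipient_gender, is_cursing).
    """
    totals = defaultdict(int)
    cursing = defaultdict(int)
    for sender, recipient, is_cursing in messages:
        group = f"{sender} to {recipient}"
        totals[group] += 1
        cursing[group] += bool(is_cursing)
    return {group: cursing[group] / totals[group] for group in totals}


# Made-up messages, only to show the shape of the result.
demo = [("M", "M", True), ("M", "M", False), ("F", "M", False), ("F", "F", True)]
print(cursing_ratio_by_gender_pair(demo))
```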
Regarding the preference of curse words, out of the randomly sampled 100K tweets for each of the four groups (see Table below), we also find a clear difference between females and males. There is a set of words that are used significantly more often by males than by females, for example, fuck, shit, and nigga. Some other words are used significantly more often by females, such as bitch and slut. It is also interesting to observe that such differences are more apparent between the two same-gender contexts – F to F vs. M to M. This suggests that the genders of both “who” and “whom” matter in the choice of curse words.
In this paper, we investigated the use of curse words in the context of Twitter based on the analysis of randomly collected 51 million tweets and about 14 million users. In particular, we explored four questions that have been identified as important by the prior swearing studies in the areas of psychology, sociology, and linguistics.
Regarding the question of ubiquity of cursing on Twitter, we examined the frequency of cursing and people’s preference in the use of specific curse words. We found that the curse words occurred at the rate of 0.80% on Twitter, and 7.73% of all the tweets in our dataset contained curse words. We also found that seven most frequently used curse words accounted for more than 90% of all the cursing occurrences. The second question we studied is the utility of cursing, especially the use of cursing to express emotions. We built a classifier which identified five different emotions from tweets – anger, joy, sadness, love, and thankfulness. Based on the classification results, we found that cursing on Twitter was most closely associated with two negative emotions: sadness and anger.
Prior studies suggest that cursing is sensitive to various contextual variables. We focused on examining three contextual variables regarding when, where and how the cursing occurs. We found that the pattern of overall tweet volume matches people’s diurnal activity schedule, and that people curse more and more from the time they get up in the morning until the sleep hours of the night. Our study of the relation between cursing and message types suggests that users perform self-censorship when they talk directly to other users. We find that users do curse more in relaxed environments, but the differences across environments are very small, partly because Twitter messages are posted in the virtual digital world.
The last question we investigated is who says curse words to whom. We examined gender and how it might affect people’s cursing behaviors on Twitter. Our results support the findings from prior studies that gender relates to people’s propensity to curse and their choice of curse words. Specifically, men curse more than women, men overuse some curse words that differ from those overused by women, and both men and women are more likely to curse in same-gender contexts.
This research was supported by US National Science Foundation grant IIS-1111182. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
This work is an updated version of the following work:
Wenbo Wang, Lu Chen, Krishnaprasad Thirunarayan and Amit P. Sheth. Cursing in English on Twitter. In ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW 2014), 2014.
It primarily corrects two programming errors. The main change is that the cursing word frequency on Twitter drops from 1.15% to 0.80%, while the cursing tweet frequency (7.73%) stays the same. Overall, most findings of the paper remain unchanged. Detailed errata information can be found in: Errata. The paper with corrections in review mode can be found in: Cursing in English on Twitter (with corrections)