Stony Brook-Led Team Using Twitter to Measure and Forecast Changes in COVID-19 Symptoms and Mental Health

A team of graduate student researchers led by Stony Brook's Andrew Schwartz (Assistant Professor in the College of Engineering and Applied Science’s Department of Computer Science) and Stanford University's Johannes Eichstaedt is using Twitter to track and analyze COVID-19 symptoms and mental health in U.S. communities. Large-scale analysis of linguistic patterns in social media offers one of the few (if not the only) instruments for measuring the physical and psychological health of populations down to the county level, daily, across most of the U.S.

COVID-19 has disrupted the world economically, politically and socially to a level never seen before, and its disruption is still evolving. The United States’ slow start in testing means there is still no population-level surveillance in place, and extended management of the virus spread may need near real-time population-level surveillance to track the results of containment measures being relaxed and/or reinstated for given areas.

At the same time, the response to COVID-19 (social distancing, sheltering in place, etc.) is the largest psychological disruption of society since World War II, and the economic impact of unemployment and economic precarity potentially creates additional distress. Due to these factors, the entire spectrum of risk will likely be affected, ranging from the well-adjusted (a lowering of subjective well-being) to the vulnerable (an increase in mental illness).

The social networking site Twitter has been used in the past to track both communicable and non-communicable diseases (e.g., the flu and heart disease, respectively). Because Twitter is a dynamic, continuously updated data set, its unique advantage is providing retrospective baselines from which changes can be detected. In addition, the latest advances in this area allow researchers to actively track changes across time in psychological and medical variables from social media data. Post-stratification, which adjusts for the demographic biases of Twitter samples, yields more representative estimates. By combining big social media data with improved methods for inferring psychological and health information from it, a Twitter-based surveillance architecture can be a valuable tool to inform COVID-19-related public health decisions.
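
As a rough illustration of the post-stratification idea mentioned above, the following Python sketch reweights per-user scores so that each demographic group counts in proportion to its share of the county population. It is a minimal example under stated assumptions, not the team's actual method: the function name `poststratify`, the age bins, the scores, and the population shares are all hypothetical.

```python
from collections import defaultdict


def poststratify(sample, population_shares):
    """Return a population-weighted mean of per-user scores.

    sample: list of (bin_label, score) pairs, one per sampled user.
    population_shares: dict mapping bin_label -> share of the population.
    Both structures are hypothetical stand-ins for real county data.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for bin_label, score in sample:
        totals[bin_label] += score
        counts[bin_label] += 1

    # Only bins that actually appear in the sample can contribute,
    # so their population shares are renormalized to sum to 1.
    observed = [b for b in population_shares if counts[b] > 0]
    share_sum = sum(population_shares[b] for b in observed)
    if share_sum == 0:
        raise ValueError("no sampled users fall in any population bin")

    return sum(
        (population_shares[b] / share_sum) * (totals[b] / counts[b])
        for b in observed
    )


# Hypothetical county in which the sample over-represents younger users.
sample = [("18-29", 0.8), ("18-29", 0.6), ("18-29", 0.7), ("65+", 0.2)]
shares = {"18-29": 0.25, "65+": 0.75}

raw_mean = sum(score for _, score in sample) / len(sample)
adjusted = poststratify(sample, shares)
```

In this made-up example the unweighted mean is dominated by the over-sampled younger group, while the adjusted estimate gives the older group the weight its population share implies.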

Creating a COVID-19 Sociolinguistic Base Rate and Measuring Distress Changes for Counties Across the US (and at the Township or Neighborhood Level Across Long Island and New York City)
The research team is using AI-based language assessment and statistical techniques to isolate dependable signals of active COVID-19 infections (such as discussion of symptoms or of seeking testing). A “sociolinguistic COVID-19 base rate” will be formed from the rate of these linguistic patterns in social media, controlling for general trends in coronavirus discussion.
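
One simple way to picture "controlling for" general coronavirus discussion is to regress the weekly rate of symptom-related language on the weekly rate of overall coronavirus discussion and keep the residual. The Python sketch below does this with ordinary least squares. It is only an illustrative assumption about how such a control could work; the article does not specify the team's statistical model, and the function name `residualize` and the weekly rates are hypothetical.

```python
def residualize(signal, control):
    """Remove the linear effect of `control` from `signal`.

    Fits signal = intercept + slope * control by ordinary least squares
    and returns the residuals, i.e. the part of the signal that the
    control variable does not explain.
    """
    if len(signal) != len(control):
        raise ValueError("signal and control must have the same length")
    n = len(signal)
    mean_x = sum(control) / n
    mean_y = sum(signal) / n
    var_x = sum((x - mean_x) ** 2 for x in control)
    if var_x == 0:
        raise ValueError("control must vary across observations")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(control, signal))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(control, signal)]


# Hypothetical weekly rates for one county (fraction of posts).
general_covid_rate = [0.10, 0.20, 0.30, 0.40]
symptom_rate = [0.01, 0.02, 0.03, 0.08]

adjusted_rate = residualize(symptom_rate, general_covid_rate)
```

A positive residual in a given week means symptom-related language was more common than the overall volume of coronavirus discussion would predict on its own.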

Building on recently validated methods, the researchers are measuring the impact of the virus and of social distancing and shelter-in-place orders on mental health (including depression, anxiety and loneliness) and subjective well-being (SWB) across counties on a weekly basis, and using retrospective base rates to detect relevant changes.
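
To make the idea of detecting changes against a retrospective base rate concrete, the Python sketch below compares a current weekly estimate with the mean and standard deviation of earlier weeks. This is a simplified illustration, not the researchers' procedure: the function name `change_from_baseline`, the threshold of 2.0 and the weekly scores are hypothetical.

```python
from statistics import mean, stdev


def change_from_baseline(baseline_weeks, current, threshold=2.0):
    """Return (z_score, flagged) for a current weekly estimate.

    baseline_weeks: estimates from weeks before the period of interest.
    current: the estimate for the week being evaluated.
    threshold: hypothetical cutoff on |z| for calling a change notable.
    """
    mu = mean(baseline_weeks)
    sigma = stdev(baseline_weeks)  # sample standard deviation
    if sigma == 0:
        raise ValueError("baseline has no variation; z-score is undefined")
    z = (current - mu) / sigma
    return z, abs(z) >= threshold


# Hypothetical weekly anxiety scores for one county before the pandemic.
baseline = [0.50, 0.52, 0.48, 0.51, 0.49]

z_spike, spike_flagged = change_from_baseline(baseline, 0.56)
z_normal, normal_flagged = change_from_baseline(baseline, 0.51)
```

With these made-up numbers, a weekly score of 0.56 sits well outside the historical range and is flagged, while 0.51 is not.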

Discovering Reliable Early Signs and Symptoms of an Individual COVID-19 Infection Using Current, Temporal AI Models
The researchers are using individuals’ self-disclosures of a positive COVID-19 test on public social media to perform longitudinal analyses, applying modern deep learning to represent patterns in language and accurately detect discussions that match known linguistic patterns. They are also applying open-vocabulary techniques to automatically identify changes in language that reliably precede a positive diagnosis.

The team’s goal for this project is to provide ongoing, weekly, county-level Twitter-based estimates of sociolinguistic COVID-19 base rates and of psychological functioning (covering symptoms/physical connection/isolation and depression, anxiety, loneliness and subjective well-being, respectively). The bias-corrected county estimates will be quickly circulated to the larger epidemiological and psychological research communities. Following on the success of the open-source Differential Language Analysis Toolkit (now used in more than 50 peer-reviewed studies), the team will also make the toolkits from this project freely available to the research community.

The team’s multidisciplinary background includes more than eight years of experience in conducting cross-cutting research, which has produced dozens of peer-reviewed papers, numerous software toolkits and many young researchers educated in both social science and computer science (AI). The researchers’ bases span Psychology (Johannes C. Eichstaedt), Computer Science (Andrew Schwartz), and Public Health Epidemiology (Collaborators: Sean Clouston, Stony Brook Medicine; Rebecca Hasdell, Stanford, Public Health). Their recent approaches to predicting health from social media ranked at the top in competitive evaluations with many other entrants spanning academia and industry (including IBM). They are working closely with stakeholders in the Stony Brook Hospital system (serving most of Long Island) as well as in the Einstein/Montefiore Hospital (serving the Bronx and New Rochelle, which has a low-income, aging population heavily burdened by COVID-19).

Key Collaborators
Johannes C. Eichstaedt, Psychology, Human-Centered AI, Stanford University
Sean Clouston, Public Health Epidemiology, Stony Brook Medicine
Stacey B. Scott, Psychology, Aging Populations/Mobile Assessment, Stony Brook University

Additional Collaborators
Mary Saltz, Department of Biomedical Informatics, Stony Brook University
Lyle Ungar, University of Pennsylvania
I.V. Ramakrishnan, Department of Computer Science, Stony Brook University
Niranjan Balasubramanian, Department of Computer Science, Stony Brook University