Research

Peer-reviewed Articles

Cao, Jian, and Thomas Chadefaux. 2024. “Dynamic Synthetic Controls: Accounting for Varying Speeds in Comparative Case Studies.” Political Analysis. doi: 10.1017/pan.2024.14.
Li, Zhuofang, Jian Cao, Nicholas Adams-Cohen, and R. Michael Alvarez. 2023. “The Effect of Misinformation Intervention: Evidence from Trump’s Tweets and the 2020 Election.” Multidisciplinary International Symposium on Disinformation in Open Online Media 2023 Proceedings. doi: 10.1007/978-3-031-47896-3_7.
Cao, Jian, Seo-young Silvia Kim, and R. Michael Alvarez. 2022. “Bayesian Analysis of State Voter Registration Database Integrity.” Statistics, Politics and Policy. doi: 10.1515/spp-2021-0016.
Cao, Jian, Christina M. Ramirez, and R. Michael Alvarez. 2021. “The Politics of Vaccine Hesitancy in the United States.” Social Science Quarterly. doi: 10.1111/ssqu.13106.
Alvarez, R. Michael, Jian Cao, and Yimeng Li. 2021. “Voting Experiences, Perceptions of Fraud, and Voter Confidence.” Social Science Quarterly. doi: 10.1111/ssqu.12940.
Srikanth, Maya, Anqi Liu, Nicholas Adams-Cohen, Jian Cao, R. Michael Alvarez, and Anima Anandkumar. 2021. “Dynamic Social Media Monitoring for Fast-Evolving Online Discussions.” Knowledge Discovery and Data Mining 2021 Proceedings. doi: 10.1145/3447548.3467171.
Cao, Jian, Nicholas Adams-Cohen, and R. Michael Alvarez. 2021. “Reliable and Efficient Long-Term Social Media Monitoring.” Journal of Computer and Communications. doi: 10.4236/jcc.2021.910006.

Working Papers

Multiple Imputation for Large Multi-Scale Data With Linear Constraints. (with Paul Beaumont)

Abstract: The use of multiple imputation of missing data in empirical studies has become increasingly popular in recent years. However, currently available multiple imputation methods face significant challenges when applied to large hierarchical, multidimensional data sets that are subject to linear aggregation constraints. In this paper we introduce a novel multiple imputation method designed to address these challenges. Our method leverages singular multivariate normal distributions within an Expectation Maximization algorithm combined with a Parallel-Sequential Imputation scheme to handle large and complex data sets that include linear aggregation constraints. Testing on real data sets demonstrates that the new method obtains up to twice the accuracy and is as much as an order of magnitude faster than leading alternative methods. We apply our method to estimate a panel data model of average weekly wages and show that it produces estimates that are unbiased and as efficient as estimates based on the data set with no missing values.

Ballot Rejections and Ballot Curing in Washington State. (with Canyon Foot, Jay Lee, R. Michael Alvarez, Paul Manson, and Paul Gronke)

Abstract: November 2020 was the first time in US history that a plurality of voters cast absentee or mail ballots. The dramatic rise of mail voting in response to the COVID-19 pandemic has led to increased attention on the potential benefits and limitations of conducting elections by mail. One of the main drawbacks to vote-by-mail policies is that states usually reject a much larger percentage of mail ballots than they do ballots cast in person. This paper uses 27 ballot “matchback” files from the state of Washington to examine, for the first time, the patterns in a state’s challenged and cured ballots. We find that younger voters, voters of color, inexperienced voters, and male voters all have substantially elevated rates of ballot rejections. These patterns are driven by disparities in signature-based ballot challenges, rather than differences in rates of ballot curing or any other part of the process. Additionally, we examine the amount of time between ballot challenges and ballot cures and the geographic variation in rejection rates, and we discuss potential policy interventions to reduce disparities and lower rejection rates overall.

Work in Progress

“Dynamic Interaction Panel Estimation: Accounting for Complex Interdependence in Panel Data.”
(with Thomas Chadefaux)
“Enhancing Regression Analysis through Self-Aligned DTW-Derived Speed Profiles in Time Series Data.”
(with Thomas Chadefaux)
“The Parallel Quasi-Monte Carlo Bayesian Multi-Scale Multiple Imputation Method.”
(with Paul Beaumont)
“Mailing It In: Voter Confidence in Vote-By-Mail In the 2020 Presidential Election.”
(with R. Michael Alvarez and Seo-young Silvia Kim)

Research Experience

Trinity College Dublin
Research Fellow (January 2022 – Present)
Project: Patterns of Conflict Emergence

Identify patterns in pre-conflict actions using data on conflict events, and in perceptions of those actions using data from financial markets, news articles, and diplomatic documents.

Evaluate the utility of these patterns to improve forecasts of conflict with both historical and live out-of-sample predictions.

Summarize the core features of dangerous patterns into motifs that can help build new theories of conflict emergence and escalation.

California Institute of Technology
Visitor (January 2022 – Present)
Postdoctoral Scholar in Data Science and Election Integrity (July 2019 – December 2021)
Project: Election Auditing

Developed probabilistic matching and Bayesian multivariate models using GCP for auditing large election databases in California and Florida.

Implemented entity resolution and anomaly detection on daily snapshots of voter registration databases that contain more than 20 million records and detected 10x more true anomalies than the existing methods did.

Project: Twitter Monitoring

Developed serverless architectures using GCP, AWS, and Oracle for long-term Twitter monitoring. These architectures ingest, process, and store more than 4.5 billion tweets (30 TB in size) related to COVID-19, primary/general elections, and protests.

Worked closely with the Computer Science team and implemented topic, spatial, network, and sentiment analyses on the collected tweets, identifying COVID-19 misinformation and voting issues in the 2020 election cycle.

Florida State University
Senior Researcher (August 2018 – June 2019)
Project: Large Missing Data Multiple Imputation

Developed the fastest and most accurate Bayesian inference method for missing data multiple imputation.

Developed a parallel-sequential imputation method that can impute large multi-scale data sets with 1.5 billion observations (500 GB in size).

Project: Economic Impact Modeling

Analyzed the economic impact of Florida’s housing and small business policies.

Developed a NETS-based impact analysis tool that has 1000 times finer resolution than the existing methods.