240204

Preliminary Eligibility Check

Brief summary of data

As of Feb 2 (9:30 AM), I downloaded the mercury_user (n=832) and following_result (n=721) tables from the DB.

I first filtered out participants meeting either of the following criteria: (1) those who failed to follow our study account, as recorded in the following_result table; and (2) those missing the vsid field in the mercury_user table, which indicates that they did not finish the Wave 1 survey. This initial filtering left 705 participants in total, the sample to which the eligibility checks below are applied.
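For reference, a minimal sketch of this initial filter, assuming both tables are loaded as data frames named after their DB tables, that following_result holds one row per successful follow, and that they share a user_id key (the key name and table semantics are assumptions; only vsid is confirmed above):

library(dplyr)

# Keep participants who finished the Wave 1 survey (vsid present) and whose
# follow of our study account is recorded in following_result; user_id as
# the join key is an assumption for illustration.
participants <- mercury_user |>
  filter(!is.na(vsid)) |>
  semi_join(following_result, by = "user_id")

nrow(participants)  # 705 in this snapshot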

Employed eligibility criteria

To be eligible in this analysis:

  1. The account should not be too new. If a participant’s Twitter/X account was created before Oct 1st, 2023, they are eligible; otherwise, they are not eligible and should be filtered out.

    save_user_info() does this job. It retrieves and saves the public Twitter/X information of an authorized user. I store user_id and the user.fields parameters, including followers_count, following_count, tweet_count, listed_count, and like_count, along with a timestamp (see the first sketch after this list).

  2. Participants should follow at least one low quality account on our list. For this analysis, I used the current version of the list, which contains 316 low quality accounts.

  3. Participants should also follow our study account. There were a few cases where participants requested to follow in the survey and deleted the request before I accepted it. Some participants requested to follow and I accepted them, but they were later removed from our study account’s list of followers (because they unfollowed us, their accounts were suspended or deleted, etc.). Hence, before inviting these participants to Wave 2, we have to check that they actually still follow us.

    relationship_check() does #2 and #3 above. It checks the dyad relationships (=connection_status) between participants and the low quality accounts, as well as our study account (see the second sketch after this list).
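For reference, here is a minimal sketch of what save_user_info() might look like, assuming the Twitter API v2 user-lookup endpoint called via httr2 with a stored bearer token. The user.fields values follow the API’s public_metrics object; the function body, token handling, and CSV storage are illustrative assumptions, not the project’s actual implementation.

library(httr2)

# Sketch: fetch an authorized user's public profile metrics and append
# them, with a timestamp, to a local CSV.
save_user_info <- function(user_id, token) {
  resp <- request("https://api.twitter.com/2/users") |>
    req_url_path_append(user_id) |>
    req_url_query(`user.fields` = "created_at,public_metrics") |>
    req_auth_bearer_token(token) |>
    req_perform() |>
    resp_body_json()

  u <- resp$data
  row <- data.frame(
    user_id         = u$id,
    created_at      = u$created_at,
    followers_count = u$public_metrics$followers_count,
    following_count = u$public_metrics$following_count,
    tweet_count     = u$public_metrics$tweet_count,
    listed_count    = u$public_metrics$listed_count,
    like_count      = u$public_metrics$like_count,
    timestamp       = Sys.time()
  )
  write.table(row, "user_info.csv", append = TRUE, sep = ",",
              col.names = !file.exists("user_info.csv"), row.names = FALSE)
}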
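And a sketch of relationship_check(), assuming the same user-lookup endpoint with the connection_status user field (which requires a user-context token per the API v2 docs). The batching and return format are assumptions; note that user lookup accepts at most 100 ids per request, so the full 316-account list would need batching.

# Sketch: for one participant's token, look up a batch of target accounts
# (low quality accounts plus our study account) and flag which of them the
# participant follows, based on the connection_status user field.
relationship_check <- function(user_token, target_ids) {
  resp <- request("https://api.twitter.com/2/users") |>
    req_url_query(ids = paste(target_ids, collapse = ","),
                  `user.fields` = "connection_status") |>
    req_auth_bearer_token(user_token) |>
    req_perform() |>
    resp_body_json()

  # connection_status contains "following" when the authorizing user
  # follows the looked-up account
  vapply(resp$data,
         function(u) "following" %in% unlist(u$connection_status),
         logical(1))
}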

The results are stored in the eligibility_results.csv file, which has four columns:

  • user_id : the participant’s Twitter/X user id

  • criteria :

    • account_created (whether the account was created before Oct 1st, 2023),

    • following_us (whether the participant is following our study account), and

    • following_NG (whether the participant is following the low quality accounts from our NewsGuard list).

  • eligible : True or False

  • count : for the following_NG criterion, I also counted the number of low quality accounts the participant follows, to get a sense of the distribution.
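Once the file is loaded (as in the Analysis section below), a quick way to peek at that distribution, assuming dplyr is attached, would be:

df |>
  filter(criteria == "following_NG") |>
  count(count, name = "n_participants")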

Analysis

Load the eligibility_results.csv.

library(readr)
library(DT)

# Load the eligibility results and drop duplicated rows
df <- read_csv("eligibility_results.csv", col_names = TRUE)
df <- unique(df)

# Treat Twitter ids as character strings rather than numbers
df$user_id <- as.character(df$user_id)
datatable(df)

Since the data is in long format, let’s reshape it into wide format to ease the analysis.

library(tidyverse)

# One row per participant, one column per eligibility criterion
df_wide <- df |>
  pivot_wider(id_cols = "user_id", id_expand = TRUE,
              names_from = "criteria", values_from = "eligible")

datatable(df_wide)

Starting N = 705

  • Accounts created before Oct 1, 2023: n=540 (=76.6%)

  • Among these n=540, those that follow our study account: n=530 (=98.1% of 540; 75.2% of the starting N)

  • Among these n=530, those that follow at least one low quality account: n=97 (=18.3% of 530; 13.76% of the starting N)

df_wide |> 
  filter(account_created == "TRUE") |>
  count() 
## # A tibble: 1 × 1
##       n
##   <int>
## 1   540
df_wide |> 
  filter(account_created == "TRUE" & following_us == "TRUE") |>
  count() 
## # A tibble: 1 × 1
##       n
##   <int>
## 1   530
df_wide |> 
  filter(account_created == "TRUE" & following_us == "TRUE" & following_NG == "TRUE") |>
  count() 
## # A tibble: 1 × 1
##       n
##   <int>
## 1    97

So, if we trust this estimate (13.76% of those who finished Wave 1 end up eligible), then to invite 375 participants to Wave 2 we need:

0.1376 × (participants who finish the Wave 1 survey without dropping out) = 375

Solving for the parenthetical term gives 375 / 0.1376 ≈ 2725, so we need about 2,725 participants to finish Wave 1 in order to invite 375 to Wave 2.
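A one-line sanity check of that arithmetic, using the rounded 13.76% rate:

round(375 / 0.1376)  # = 2725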

Reverse chronological home timeline

I randomly selected 24 participants out of the 705 and collected their reverse chronological home timeline data, to check whether any low quality accounts appear in these 24 participants’ home timelines.
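For context, here is a minimal sketch of that collection step, assuming the API v2 reverse chronological home timeline endpoint called via httr2 with a user-context token; pagination and storage are omitted, and the function name is illustrative.

library(httr2)

# Sketch: pull up to 100 recent home-timeline tweets for one participant,
# keeping author_id so tweet sources can be matched against the low
# quality account list.
get_home_timeline <- function(user_id, user_token) {
  request("https://api.twitter.com/2/users") |>
    req_url_path_append(user_id, "timelines", "reverse_chronological") |>
    req_url_query(max_results = 100, `tweet.fields` = "author_id") |>
    req_auth_bearer_token(user_token) |>
    req_perform() |>
    resp_body_json()
}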

# Keep the 24 participants for whom home timeline data was collected
df_subset <- df_wide |> filter(!is.na(hometimeline))

# After pivoting, participants with no following_NG record are NA;
# recode NA to "FALSE" and everything else to "TRUE"
df_subset2 <- df_subset |>
  mutate(following_NG = ifelse(is.na(following_NG), "FALSE", "TRUE"))

datatable(df_subset2)

Let’s make a confusion matrix here.

# Restrict to accounts created before Oct 1, 2023
df_subset2 <- df_subset2 |> filter(account_created == TRUE)

# Recode each measure as a 0/1 factor: 1 = flagged eligible
df_conf <- df_subset2 |> mutate(
  eligible_following = as.factor(ifelse(following_NG == TRUE, 1, 0)),
  eligible_hometimeline = as.factor(ifelse(hometimeline == TRUE, 1, 0))
)

datatable(df_conf)
library(caret)
# prediction: hometimeline, reference: following 
confusionMatrix(df_conf$eligible_hometimeline, df_conf$eligible_following)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 10  2
##          1  1  4
##                                          
##                Accuracy : 0.8235         
##                  95% CI : (0.5657, 0.962)
##     No Information Rate : 0.6471         
##     P-Value [Acc > NIR] : 0.09843        
##                                          
##                   Kappa : 0.5984         
##                                          
##  Mcnemar's Test P-Value : 1.00000        
##                                          
##             Sensitivity : 0.9091         
##             Specificity : 0.6667         
##          Pos Pred Value : 0.8333         
##          Neg Pred Value : 0.8000         
##              Prevalence : 0.6471         
##          Detection Rate : 0.5882         
##    Detection Prevalence : 0.7059         
##       Balanced Accuracy : 0.7879         
##                                          
##        'Positive' Class : 0              
##