20240207

Updated Preliminary Eligibility Check

Brief summary of data

As of Feb 2 (9:30 AM), I downloaded the entries in the mercury_user (n = 832) and following_result (n = 721) tables from the DB.

I first filtered out participants using the following criteria: (1) those who failed to follow our study account, as recorded in the following_result table; and (2) those without a vsid field in the mercury_user table, which indicates that they did not finish the Wave 1 survey. This initial filtering left a total of 705 participants, the sample to which the eligibility checks below are applied.
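A minimal sketch of this initial filter, assuming both tables are exported to CSV and that following_result carries a per-participant flag for whether the follow succeeded (the file names and the followed column are hypothetical):

library(tidyverse)

mercury_user     <- read_csv("mercury_user.csv")      # 832 rows
following_result <- read_csv("following_result.csv")  # 721 rows

mercury_user |>
  left_join(following_result, by = "user_id") |>
  # (1) drop participants who failed to follow our study account
  filter(followed) |>
  # (2) drop participants without a vsid, i.e., who did not finish Wave 1
  filter(!is.na(vsid)) -> sample_705    # 705 participants remain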

Employed eligibility criteria

To be eligible for this analysis:

  1. The account should not be too new. If a participant’s Twitter/X account was created before Oct 1, 2023, they are eligible; otherwise they are not eligible and should be filtered out (a sketch of this check appears after this list).

    save_user_info() does this job. It retrieves and saves the public Twitter/X information of an authorized user. I store the user_id and the user.fields parameters, including followers_count, following_count, tweet_count, listed_count, and like_count, along with a timestamp.

  2. Participants should follow at least one low quality account on our list.

    • For this analysis, I used two lists: (1) the current version, which contains 316 low quality accounts, and (2) a longer version with 440 low quality accounts.
  3. Participants should also follow our study account. There were a few cases where participants requested to follow us during the survey but deleted the request before I accepted it. Other participants requested to follow and were accepted, but were later removed from our study account’s list of followers (because they unfollowed us, or their accounts were suspended or deleted, etc.). Hence, before inviting these participants to Wave 2, we have to check that they are still following us.

    relationship_check() handles #2 and #3 above. It checks the dyadic relationship (connection_status) between each participant and the low quality accounts, as well as our study account.
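As noted in #1, the account-age check boils down to comparing each account’s creation timestamp with the Oct 1, 2023 cutoff. A minimal sketch, assuming the account creation date (created_at) is retrieved alongside the other user fields (user_info and account_age_check below are hypothetical names):

library(tidyverse)

# user_info: one row per participant with user_id and the account creation
# timestamp created_at (hypothetical)
user_info |>
  mutate(account_created = as.Date(created_at) < as.Date("2023-10-01")) |>
  select(user_id, account_created) -> account_age_check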

The results are stored in the new_eligibility_results.csv file. This has four columns:

  • user_id : the participant’s Twitter/X user ID

  • criteria :

    • account_created (whether the account was created before Oct 1st, 2023),

    • following_us / following_us_new (whether the participant is following our study account),

    • following_NG / following_NG_new (whether the participant is following at least one low quality account from our NewsGuard list of 316 vs. 440 accounts), and

    • hometimeline / hometimeline_new (for the 24 randomly selected participants described below, whether any low quality account from the respective list appears in their home timeline).

  • eligible : True or False

  • count : for the following_NG criteria, I also counted the number of low quality accounts the participant is following, to get a sense of the distribution.

Analysis

Load the new_eligibility_results.csv.

library(readr)
library(DT)
# Read user_id as character up front: Twitter/X IDs can exceed double precision,
# so parsing them as numbers and converting afterwards can corrupt the IDs.
df <- read_csv("new_eligibility_results.csv", col_names = TRUE,
               col_types = cols(user_id = col_character()))
df |> unique() -> df
datatable(df)
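The count column is only meaningful for the following_NG criteria. As a quick sketch, its distribution can be summarized before reshaping (using the column and criteria names described above):

library(dplyr)
# How many low quality accounts do the flagged participants follow?
df |>
  filter(criteria %in% c("following_NG", "following_NG_new")) |>
  group_by(criteria) |>
  summarise(n = n(),
            median_count = median(count, na.rm = TRUE),
            max_count = max(count, na.rm = TRUE))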

Since the data are in long format, let’s reshape them into wide format to ease the analysis.

library(tidyverse)
# One row per participant, one column per eligibility criterion
df |> 
  pivot_wider(id_cols = "user_id", id_expand = TRUE,
              names_from = "criteria", values_from = "eligible") -> df_wide

datatable(df_wide)

With the 316 list (original):

Starting N = 705

  • Accounts created before Oct 1, 2023: n = 540 (76.6%)

  • Of those, following our study account: 530 (98.1% of 540; 75.2% of the starting N)

  • Of those, following at least one low quality account: 97 (18.3% of 530; 13.76% of the starting N)

With the 440 list (new):

Starting N = 705

  • Accounts created before Oct 1, 2023: n = 540 (76.6%)

  • Of those, following our study account: 528 (97.8% of 540; 74.9% of the starting N)

  • Of those, following at least one low quality account: 117 (22.16% of 528; 16.6% of the starting N)

So the number of eligible participants goes up from 97 to 117 (+20) with the longer list.
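For reference, one way to reproduce this funnel from df_wide, treating NA as not meeting a criterion (the *_new columns give the 440 version):

# Sequential funnel for the original (316) list
df_wide |>
  summarise(starting_n   = n(),
            created_ok   = sum(account_created %in% TRUE),
            following_us = sum(account_created %in% TRUE & following_us %in% TRUE),
            following_NG = sum(account_created %in% TRUE & following_us %in% TRUE &
                               following_NG %in% TRUE))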

Reverse chronological home timeline

I randomly selected 24 of the 705 participants and collected their reverse chronological home timeline data, to check whether any low quality accounts show up in these participants’ home timelines.
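In essence, that check matches the author IDs of the collected timeline tweets against the low quality list. A minimal sketch of the matching step (timeline_tweets and ng_accounts below are hypothetical objects, not the actual collection code):

# timeline_tweets: one row per collected tweet, with the participant's user_id
#                  and the tweet author's author_id (hypothetical)
# ng_accounts:     vector of low quality account IDs from the NewsGuard list
timeline_tweets |>
  group_by(user_id) |>
  summarise(hometimeline = any(author_id %in% ng_accounts)) -> hometimeline_flags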

# Keep only the 24 participants whose home timeline was collected
df_subset = df_wide |> filter(!is.na(hometimeline))
# Recode NA (no following_NG row for this participant) as "FALSE"
df_subset2 = df_subset |> 
  mutate(following_NG = ifelse(is.na(following_NG), "FALSE", "TRUE"),
         following_NG_new = ifelse(is.na(following_NG_new), "FALSE", "TRUE"))

datatable(df_subset2)

Let’s make a confusion matrix comparing following-based eligibility (reference) with timeline-based eligibility (prediction).

With the 316 list (original):

df_subset2 |> filter(account_created == TRUE) |>
  select(user_id, account_created, following_us, following_NG, hometimeline) -> df_subset3
df_subset3 |> mutate(
  eligible_following = as.factor(ifelse(following_NG == TRUE, 1, 0)),
  eligible_hometimeline = as.factor(ifelse(hometimeline == TRUE, 1, 0))
) -> df_conf

datatable(df_conf)
library(caret)
# prediction: hometimeline, reference: following 
confusionMatrix(df_conf$eligible_hometimeline, df_conf$eligible_following)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 10  2
##          1  1  4
##                                          
##                Accuracy : 0.8235         
##                  95% CI : (0.5657, 0.962)
##     No Information Rate : 0.6471         
##     P-Value [Acc > NIR] : 0.09843        
##                                          
##                   Kappa : 0.5984         
##                                          
##  Mcnemar's Test P-Value : 1.00000        
##                                          
##             Sensitivity : 0.9091         
##             Specificity : 0.6667         
##          Pos Pred Value : 0.8333         
##          Neg Pred Value : 0.8000         
##              Prevalence : 0.6471         
##          Detection Rate : 0.5882         
##    Detection Prevalence : 0.7059         
##       Balanced Accuracy : 0.7879         
##                                          
##        'Positive' Class : 0              
## 
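Note that caret treats the first factor level ("0", i.e., not eligible) as the positive class by default, so the sensitivity and positive predictive value above refer to the not-eligible class. To report the metrics with respect to the eligible class instead, the positive level can be set explicitly:

confusionMatrix(df_conf$eligible_hometimeline, df_conf$eligible_following,
                positive = "1")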

With the 440 list (new):

df_subset2 |> filter(account_created == TRUE & following_us_new == TRUE) |> 
    select(user_id, account_created, following_us_new, following_NG_new, hometimeline_new) -> df_subset3_new
df_subset3_new |> mutate(
  eligible_following = as.factor(ifelse(following_NG_new == TRUE, 1, 0)),
  eligible_hometimeline = as.factor(ifelse(hometimeline_new == TRUE, 1, 0))
) -> df_conf_new

datatable(df_conf_new)
library(caret)
# prediction: hometimeline, reference: following 
confusionMatrix(df_conf_new$eligible_hometimeline, df_conf_new$eligible_following)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 8 3
##          1 2 4
##                                           
##                Accuracy : 0.7059          
##                  95% CI : (0.4404, 0.8969)
##     No Information Rate : 0.5882          
##     P-Value [Acc > NIR] : 0.2326          
##                                           
##                   Kappa : 0.3796          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.8000          
##             Specificity : 0.5714          
##          Pos Pred Value : 0.7273          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.5882          
##          Detection Rate : 0.4706          
##    Detection Prevalence : 0.6471          
##       Balanced Accuracy : 0.6857          
##                                           
##        'Positive' Class : 0               
##