Eligibility Check Final
Do Won Kim
2024-02-24
1 Description of Data
I compared Verasight’s raw data with our DB data to filter out those (1) who failed to finish the survey to the end, and (2) who took the survey several times with multiple Verasight accounts but authorized the same Twitter account each time. Please refer to this notebook for the initial step: https://do-won.github.io/design/240208.html. I followed the same steps for the supplementary data, which includes participants who finished the Wave 1 survey after we received the raw data from Verasight.
This process led to identifying a total of 780 (674 old entries + 106 new entries) users who successfully passed attention checks, authorized their Twitter account, followed our study account at the time of the survey, and fully completed the survey.
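For illustration, a minimal sketch of this screening logic is below. It assumes a hypothetical raw_responses data frame with columns twitter_user_id, completed, passed_attention, and submitted_at; these names are placeholders, not the actual Verasight export fields.
library(dplyr)
# Hypothetical sketch: keep completed, attentive responses, then collapse
# multiple Verasight accounts that authorized the same Twitter account,
# retaining only the earliest submission per Twitter account
raw_responses |>
  filter(completed, passed_attention) |>
  arrange(submitted_at) |>
  distinct(twitter_user_id, .keep_all = TRUE) -> screened_responses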
Based on this list of 780 users, I ran the eligibility checks described below.
2 Employed eligibility criteria
To be eligible for Wave 2,
[1] The account should not be too new. If a participant’s Twitter/X account was created before Nov 1st, 2023, they are eligible; otherwise, they are not eligible and should be filtered out.
account_created (whether the account was created before Nov 1st, 2023)
[2] Participants should follow our study account. There were a few cases where participants requested to follow during the survey and deleted the request before I accepted it. Other participants requested to follow and I accepted them, but they were later removed from our study account’s list of followers (because they unfollowed us, or their accounts were suspended or deleted, etc.). Hence, before inviting participants to Wave 2, we have to check that they still follow us.
following_us (whether the participant is following our study account)
[3] (Following) Participants should follow at least one low quality account in our list.
The list of untrustworthy news sources that we use is ~~~~ (ADD description).
following_NG (whether the participant is following at least one of the low quality accounts from our list)
[4] (Home timeline) Participants’ home timelines should contain at least one tweet from a low quality account in our list.
hometimeline (whether the participant’s home timeline contains at least one tweet from the low quality accounts in our list)
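As a rough sketch (not the code actually used), criteria [1] and [2] could be evaluated as below, assuming a hypothetical account_info data frame with one row per participant, a created_at column from the Twitter/X API, and a vector follower_ids holding the IDs currently following our study account.
library(dplyr)
library(lubridate)
# Sketch with assumed inputs: account_info (user_id, created_at) and follower_ids
account_info |>
  mutate(
    account_created = as_date(created_at) < ymd("2023-11-01"),  # criterion [1]
    following_us = user_id %in% follower_ids                    # criterion [2]
  ) -> criteria_check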
3 Summary of the data
The eligibility check results are stored in the eligibility_final.csv file. It has four columns:
user_id: the participant’s Twitter ID
criteria: one of account_created (whether the account was created before Nov 1st, 2023), following_us (whether the participant is following our study account), following_NG (whether the participant is following the low quality accounts from our list), or hometimeline (whether the participant’s home timeline has tweets from the low quality accounts in our list)
passed: TRUE or FALSE (whether the participant meets that criterion)
count: for the following_NG and hometimeline criteria, I also counted the number of low quality accounts (or tweets from these accounts)
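For orientation, each participant can appear on up to four rows of this long-format file, one per criterion. A toy example (values invented; the id column is skipped when the file is read below) looks like this:
library(tibble)
# Toy illustration of the long format; made-up values, not real participants
tribble(
  ~id, ~user_id,    ~criteria,         ~passed, ~count,
    1, "123456789", "account_created", TRUE,        NA,
    2, "123456789", "following_us",    TRUE,        NA,
    3, "123456789", "following_NG",    FALSE,        0,
    4, "123456789", "hometimeline",    TRUE,         3
)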
library(readr)
library(tidyverse)
library(DT)
library(caret)
# Read the eligibility results; keep user_id as character so long Twitter IDs are not mangled by numeric precision
eligibility_final <- read_csv("eligibility_final.csv",
col_types = cols(id = col_skip(), user_id = col_character()))
eligibility_final |> datatable()
Since the data is in long format, let’s reshape it into wide format to ease the analysis.
eligibility_final |>
pivot_wider(id_cols = "user_id", id_expand = TRUE, names_from = "criteria", values_from = "passed") |>
# criteria never evaluated for a user become NA after pivoting; treat them as not passed
mutate(account_created = ifelse(is.na(account_created), FALSE, account_created),
following_us = ifelse(is.na(following_us), FALSE, following_us),
following_NG = ifelse(is.na(following_NG), FALSE, following_NG)) -> df_wide_final
datatable(df_wide_final)
4 Result
4.1 Eligibility rate
Starting N = 780
Accounts created before Nov 1 2023: n=651 (=83.5% out of starting N)
(Among these accounts created before Nov 1 2023) Those that follow our study account: 639 (=98.2% out of 651; 82% out of starting N)
Below is a comparison between the Following vs. Home timeline criteria:
(Among the accounts that are not too new & follow our study account) Those that follow any low quality accounts from our list: 208 (=32.6% out of 639; 26.7% out of starting N)
(Among the accounts that are not too new & follow our study account) Those that have in their home timeline tweets from any low quality accounts from our list: 319 (=49.9% out of 639; 40.9% out of starting N)
(Among the accounts that are not too new & follow our study account) Those that follow OR have in their home timeline tweets from any low quality accounts from our list: 325 (=50.9% out of 639; 41.7% out of starting N)
# Accounts created before Nov 1, 2023
df_wide_final |>
filter(account_created == "TRUE") |>
count()
# + Those that follow our study account
df_wide_final |>
filter(account_created == "TRUE" & following_us == "TRUE") |>
count()
# Those that follow any low quality accounts
df_wide_final |>
filter(account_created == "TRUE" & following_us == "TRUE") |>
filter(following_NG == "TRUE") |>
count()
# Those with home timeline with any low quality tweets
df_wide_final |>
filter(account_created == "TRUE" & following_us == "TRUE") |>
filter(hometimeline == "TRUE") |>
count()
# Those that follow OR have home timeline tweets
df_wide_final |>
filter(account_created == "TRUE" & following_us == "TRUE") |>
filter(following_NG == "TRUE" | hometimeline == "TRUE") |>
count()
4.2 Confusion Matrices
Based on those 639 users who passed criteria #1 and #2 (i.e., accounts that are not too new and that follow our study account), let’s make some confusion matrices. We assume that Following is the ground truth.
df_wide_final |>
filter(account_created == TRUE & following_us == TRUE) |>
select(user_id, following_NG, hometimeline) |>
mutate(
EG_following = as.factor(ifelse(following_NG == TRUE, 1, 0)),
EG_hometimeline = as.factor(ifelse(hometimeline == TRUE, 1, 0))) -> df_conf_final
| | Following (ground truth): Positive (TRUE) | Following (ground truth): Negative (FALSE) |
| Home Timeline: Positive (TRUE) | TP (202) | FP (117) |
| Home Timeline: Negative (FALSE) | FN (6) | TN (314) |
# prediction: hometimeline, reference: following
confusionMatrix(df_conf_final$EG_hometimeline, df_conf_final$EG_following, positive = "1")
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 314 6
## 1 117 202
##
## Accuracy : 0.8075
## 95% CI : (0.7748, 0.8374)
## No Information Rate : 0.6745
## P-Value [Acc > NIR] : 4.165e-14
##
## Kappa : 0.6148
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.9712
## Specificity : 0.7285
## Pos Pred Value : 0.6332
## Neg Pred Value : 0.9812
## Prevalence : 0.3255
## Detection Rate : 0.3161
## Detection Prevalence : 0.4992
## Balanced Accuracy : 0.8498
##
## 'Positive' Class : 1
##
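As a sanity check, the headline statistics above can be recomputed by hand from the four cells of the table:
# Recompute key metrics directly from the confusion matrix cells
TP <- 202; FP <- 117; FN <- 6; TN <- 314
TP / (TP + FN)                    # sensitivity = 202/208 ≈ 0.971
TN / (TN + FP)                    # specificity = 314/431 ≈ 0.729
TP / (TP + FP)                    # positive predictive value = 202/319 ≈ 0.633
(TP + TN) / (TP + FP + FN + TN)   # accuracy = 516/639 ≈ 0.807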
5 Next steps?
We will invite these 325 participants: (1) who passed all in-survey criteria (e.g., attention checks, completing the survey to the end), (2) whose Twitter/X accounts are not too new (created before Nov 1, 2023), (3) who follow our study account, and (4) who follow OR are exposed in their home timelines to untrustworthy sources in our list.
df_wide_final |>
filter(account_created == "TRUE" & following_us == "TRUE") |>
filter(following_NG == "TRUE" | hometimeline == "TRUE") |>
select(user_id) -> to_invite
users_for_elig_test_final <- read_csv("users_for_elig_test_final.csv",
col_types = cols(user_id = col_character()))
users_for_elig_test_final |>
merge(to_invite, by="user_id") -> users_to_invite_wave2
write.csv(users_to_invite_wave2, file="Users_to_invite_Wave2.csv", row.names = FALSE)
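A quick check (a sketch, not part of the original pipeline) that the invite file looks as expected before sending out the Wave 2 invitations:
# Sanity checks: 325 eligible users, each Twitter ID appearing once
nrow(users_to_invite_wave2)                   # expected: 325
anyDuplicated(users_to_invite_wave2$user_id)  # expected: 0 (no duplicate IDs)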