Do Won Kim
2024-02-07
Updated Preliminary Eligibility Check
Brief summary of data
As of Feb 2 (9:30 AM), I downloaded the entries in the mercury_user (n = 832) and following_result (n = 721) tables from the DB.
I first filtered out participants using the following criteria: (1) those who failed to follow our study account, as recorded in the following_result table; and (2) those without a vsid field in the mercury_user table, which indicates that they did not finish the Wave 1 survey. This initial filtering left a total of 705 participants, the sample to which the eligibility checks below are applied.
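The two-step filter can be sketched in Python. This is a minimal reconstruction with made-up rows, not the actual pipeline: it assumes following_result lists the users who successfully followed our study account, and that a missing vsid means Wave 1 was not completed.

```python
import pandas as pd

# Hypothetical miniature versions of the two DB tables (column names assumed).
mercury_user = pd.DataFrame({
    "user_id": ["1", "2", "3", "4"],
    "vsid":    ["a1", None, "c3", None],  # None = did not finish Wave 1
})
following_result = pd.DataFrame({"user_id": ["1", "2", "3"]})

# Keep users who (1) appear in following_result and (2) have a vsid.
eligible_pool = mercury_user[
    mercury_user["user_id"].isin(following_result["user_id"])
    & mercury_user["vsid"].notna()
]
print(sorted(eligible_pool["user_id"]))  # → ['1', '3']
```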
Employed eligibility criteria
To be eligible for this analysis:
1. The account should not be too new. If a participant's Twitter/X account was created before Oct 1st, 2023, they are eligible; otherwise, they are not and should be filtered out.
- save_user_info() does this job. It retrieves and saves the public Twitter/X information of an authorized user. I store user_id and the user.fields parameters, including followers_count, following_count, tweet_count, listed_count, and like_count, along with a timestamp.
2. Participants should follow at least one low-quality account on our list.
- For this analysis, I used two lists: (1) the current version, with 316 low-quality accounts, and (2) a longer version, with 440 low-quality accounts.
3. Participants should also follow our study account. There were a few cases where participants requested to follow us in the survey but deleted the request before I accepted it. Other participants requested to follow and I accepted them, but they were later removed from our study account's list of followers (because they unfollowed us, or their accounts were suspended or deleted, etc.). Hence, before inviting these participants to Wave 2, we have to check that they still actually follow us.
- relationship_check() does #2 and #3 above. It checks the dyad relationships (= connection_status) between participants and the low-quality accounts, as well as between participants and our study account.
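The decision logic behind criteria #2 and #3 reduces to set membership. The sketch below is an illustrative pure function, not the real relationship_check() (which queries the Twitter/X API); all names and values are assumptions.

```python
def check_relationships(followed_ids, low_quality_ids, study_account_id):
    """Return per-criterion eligibility for one participant (illustrative)."""
    followed = set(followed_ids)
    ng_followed = followed & set(low_quality_ids)  # low-quality accounts followed
    return {
        "following_us": study_account_id in followed,  # criterion #3
        "following_NG": len(ng_followed) > 0,          # criterion #2
        "count": len(ng_followed),                     # for the distribution
    }

result = check_relationships(
    followed_ids=["study", "ng1", "ng2", "other"],
    low_quality_ids=["ng1", "ng2", "ng3"],
    study_account_id="study",
)
print(result)  # → {'following_us': True, 'following_NG': True, 'count': 2}
```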
The results are stored in the new_eligibility_results.csv file, which has four columns:
- user_id: the participant's Twitter ID
- criteria: one of account_created (whether the account was created before Oct 1st, 2023), following_us / following_us_new (whether the participant is following our study account), and following_NG / following_NG_new (whether the participant is following low-quality accounts from our NewsGuard list of 316 vs. 440 accounts)
- eligible: True or False
- count: for the following_NG criterion, the number of low-quality accounts the participant is following, to get some sense of the distribution
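To illustrate the long format (one row per participant per criterion) and the long-to-wide reshape done in the analysis below, here is a pandas sketch with made-up values:

```python
import pandas as pd

# Made-up rows in the shape of new_eligibility_results.csv (long format).
long = pd.DataFrame({
    "user_id":  ["42", "42", "42", "7"],
    "criteria": ["account_created", "following_us", "following_NG",
                 "account_created"],
    "eligible": [True, True, True, False],
    "count":    [None, None, 3, None],
})

# One row per participant, one column per criterion (cf. tidyr::pivot_wider).
wide = long.pivot(index="user_id", columns="criteria", values="eligible")
print(wide.loc["42", "following_NG"])  # → True
```

Criteria with no row for a participant become missing values in the wide table, which is why the analysis below treats NA as "not eligible on this criterion".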
Analysis
Load new_eligibility_results.csv.
library(readr)
library(DT)
df <- read_csv("new_eligibility_results.csv", col_names = TRUE)
df <- unique(df)                        # drop duplicate rows
df$user_id <- as.character(df$user_id)  # keep ids as strings, not numerics
datatable(df)
Since the data is in long format, let's reshape it to wide format to ease the analysis.
library(tidyverse)
df |>
  pivot_wider(id_cols = "user_id", id_expand = TRUE, names_from = "criteria", values_from = "eligible") -> df_wide
datatable(df_wide)
With the 316 list (original):
- Starting N = 705
- Accounts created before Oct 1, 2023: n = 540 (76.6%)
- Of those, following our study account: 530 (98.1% of 540; 75.2% of starting N)
- Of those, following at least one low-quality account: 97 (18.3% of 530; 13.76% of starting N)

With the 440 list (new):
- Starting N = 705
- Accounts created before Oct 1, 2023: n = 540 (76.6%)
- Of those, following our study account: 528 (97.8% of 540; 74.9% of starting N)
- Of those, following at least one low-quality account: 117 (22.16% of 528; 16.6% of starting N)

So the eligible count moved up from 97 to 117 (+20) with the longer list.
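As a quick arithmetic sanity check on the funnel percentages above (plain Python, counts copied from the two lists):

```python
# Counts from the eligibility funnel above.
start, created_ok = 705, 540
us_316, ng_316 = 530, 97   # 316-account list
us_440, ng_440 = 528, 117  # 440-account list

assert round(created_ok / start * 100, 1) == 76.6
assert round(us_316 / created_ok * 100, 1) == 98.1
assert round(ng_316 / start * 100, 2) == 13.76
assert round(us_440 / created_ok * 100, 1) == 97.8
assert round(ng_440 / us_440 * 100, 2) == 22.16
print(ng_440 - ng_316)  # → 20
```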
Reverse chronological home timeline
I randomly selected 24 of the 705 participants and collected their reverse-chronological home timeline data, to check whether any low-quality accounts appear in these participants' home timelines.
df_subset = df_wide |> filter(!is.na(hometimeline))
df_subset2 = df_subset |>
  mutate(following_NG = ifelse(is.na(following_NG), FALSE, TRUE),
         following_NG_new = ifelse(is.na(following_NG_new), FALSE, TRUE))
datatable(df_subset2)
Let’s make a confusion matrix here.
With the 316 list (original):
df_subset2 |> filter(account_created == TRUE) |>
  select(user_id, account_created, following_us, following_NG, hometimeline) -> df_subset3
df_subset3 |> mutate(
  eligible_following = as.factor(ifelse(following_NG == TRUE, 1, 0)),
  eligible_hometimeline = as.factor(ifelse(hometimeline == TRUE, 1, 0))
) -> df_conf
datatable(df_conf)
library(caret)
# prediction: hometimeline, reference: following
confusionMatrix(df_conf$eligible_hometimeline, df_conf$eligible_following)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 10 2
## 1 1 4
##
## Accuracy : 0.8235
## 95% CI : (0.5657, 0.962)
## No Information Rate : 0.6471
## P-Value [Acc > NIR] : 0.09843
##
## Kappa : 0.5984
##
## Mcnemar's Test P-Value : 1.00000
##
## Sensitivity : 0.9091
## Specificity : 0.6667
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.8000
## Prevalence : 0.6471
## Detection Rate : 0.5882
## Detection Prevalence : 0.7059
## Balanced Accuracy : 0.7879
##
## 'Positive' Class : 0
##
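The headline statistics can be re-derived by hand from the 2x2 counts above (a cross-check in plain Python; caret reports class 0 as the 'positive' class, so TP here is the prediction-0 / reference-0 cell):

```python
# Cells of the confusion matrix above (prediction rows x reference columns).
tp, fp = 10, 2  # prediction 0: reference 0, reference 1
fn, tn = 1, 4   # prediction 1: reference 0, reference 1

accuracy    = (tp + tn) / (tp + fp + fn + tn)  # 14/17
sensitivity = tp / (tp + fn)                   # 10/11
specificity = tn / (tn + fp)                   # 4/6

print(round(accuracy, 4), round(sensitivity, 4), round(specificity, 4))
# → 0.8235 0.9091 0.6667
```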
With the 440 list (new):
df_subset2 |> filter(account_created == TRUE & following_us_new == TRUE) |>
  select(user_id, account_created, following_us_new, following_NG_new, hometimeline_new) -> df_subset3_new
df_subset3_new |> mutate(
  eligible_following = as.factor(ifelse(following_NG_new == TRUE, 1, 0)),
  eligible_hometimeline = as.factor(ifelse(hometimeline_new == TRUE, 1, 0))
) -> df_conf_new
datatable(df_conf_new)
library(caret)
# prediction: hometimeline, reference: following
confusionMatrix(df_conf_new$eligible_hometimeline, df_conf_new$eligible_following)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 8 3
## 1 2 4
##
## Accuracy : 0.7059
## 95% CI : (0.4404, 0.8969)
## No Information Rate : 0.5882
## P-Value [Acc > NIR] : 0.2326
##
## Kappa : 0.3796
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.8000
## Specificity : 0.5714
## Pos Pred Value : 0.7273
## Neg Pred Value : 0.6667
## Prevalence : 0.5882
## Detection Rate : 0.4706
## Detection Prevalence : 0.6471
## Balanced Accuracy : 0.6857
##
## 'Positive' Class : 0
##