Eligibility Check (Additional Analysis)
Do Won Kim
2024-02-12
1 Checking Assumption
Question: Do people who follow zero qualifying accounts have consistent enough exposure in their home timeline to be worth including?
By adopting Following OR Hometimeline as our eligibility criterion, we assume that people are repeatedly and consistently exposed to indirect low-quality tweets. That is, if we find indirect low-quality tweets in users’ home timelines (i.e., tweets that their friends retweeted or quoted) at the time we collected the reverse-chronological timeline data, we assume that the exposure is not a one-time event.
To check this assumption, let’s compare (1) data we collected by randomly choosing 24 participants for preliminary eligibility analysis (Feb 2) and (2) data collected for the same participants during the main eligibility check (Feb 10).
From the preliminary eligibility check, I retrieved the user_id of the 24 randomly chosen participants and looked them up in our main eligibility results CSV file.
In the preliminary check, the users in the “no following but found in home timeline” condition were user_id = 20416629 and 26845113:
- Home timeline (T), Following (F)
library(readr)
library(tidyverse)
library(DT)
df <- read_csv("new_eligibility_results.csv",
col_types = cols(id = col_skip(), user_id = col_character()))
random24_user_id <- c("1724530394920263680", "1741720467851841536",
"1746407404164517888", "1747176108846354432",
"1747472625889087488", "1752043460171898880",
"1752738731633590272", "1514349532297015296",
"1545708171364384768", "1558101392291729408",
"1645150133233807360", "1649171080760655872",
"1652117495212384256", "1666933608592953344",
"1679674213731319808", "20416629", "263213192",
"26845113", "30014187", "3726853273", "43115340",
"439510861", "55753989", "867793824688390144")
df |>
pivot_wider(id_cols = "user_id", id_expand = TRUE,
names_from = "criteria", values_from = "eligible") |>
filter(user_id %in% random24_user_id) -> df_wide_random24
df_wide_random24 |>
filter(account_created == TRUE & following_us == TRUE) |>
select(user_id, hometimeline, following_NG,
hometimeline_10, following_NG_10,
hometimeline_20, following_NG_20,
hometimeline_30, following_NG_30) |> datatable()
Before: user_id = 20416629, 26845113
- Home timeline (T), Following (F)
After:
- 20416629: Home Timeline (T), Following (F) (regardless of which list we use, this user is still not following but being exposed to low quality tweets in home timeline)
- 26845113: Home Timeline (F), Following (F)
For 26845113, based on the same list (List 1), no low-quality tweets are found in the home timeline anymore. This is evidence against our assumption.
However, when we change to the longer list (List 4), this user becomes: Home Timeline (T), Following (T)
- In List 4, direct tweets from POTUS and CrimeWatchMpls and an indirect retweet from RepStefanik are shown in Home Timeline.
So, of the two cases where users were not following any eligible accounts but had previously been exposed to indirect tweets from these accounts, one supports our assumption while the other does not.
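The comparison above can be sketched as a small join of the two checks. This is only an illustration in base R: the flag values are the ones reported in the Before/After summary for List 1, while the data frame names (`feb2`, `feb10`) and the simplified column names `hometimeline` and `following` are assumptions made for this sketch, not the columns of the results file.

```r
# Flags for the two users, copied from the Before/After summary (List 1).
# Data frame and column names are hypothetical, chosen for this sketch.
feb2 <- data.frame(
  user_id = c("20416629", "26845113"),
  hometimeline = c(TRUE, TRUE),
  following = c(FALSE, FALSE),
  stringsAsFactors = FALSE
)
feb10 <- data.frame(
  user_id = c("20416629", "26845113"),
  hometimeline = c(TRUE, FALSE),
  following = c(FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Join the two checks on user_id; the suffixes mark the collection date.
cmp <- merge(feb2, feb10, by = "user_id", suffixes = c("_feb2", "_feb10"))

# Exposure "persisted" if the home timeline flag is TRUE on both dates.
cmp$exposure_persisted <- cmp$hometimeline_feb2 & cmp$hometimeline_feb10

print(cmp[, c("user_id", "hometimeline_feb2",
              "hometimeline_feb10", "exposure_persisted")])
```

With only two such users, the share of persisted exposure is too small a sample to settle the assumption either way; the same join would apply unchanged to a larger set of users.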
2 List 4
We are likely to choose List 4 (N=1515). The problem is that the list is so long that muting every account on it will take a lot of time, and, given the power-law distribution of followers, almost all of the muting treatment will operate through the accounts with the largest numbers of followers.
Brendan’s suggestion: we reverse-rank the accounts by number of followers in the pilot, select a cutoff on how many accounts to include (which we can show covers X% of all eligible accounts), and then mute 70% of those accounts.
Hence, I used the Twitter API to retrieve each account’s number of followers, sorted the accounts by follower count in descending order, and calculated the cumulative percentage of followers to determine how many accounts fall under a given threshold (e.g., 99% of total followers).
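To make the cutoff rule concrete, here is a minimal sketch on toy data, written in base R so it runs without extra packages. The follower counts and handles are made up for illustration, and `keep_below_cutoff` is a hypothetical helper name, not something defined in this analysis.

```r
# Toy follower counts, made up for illustration (not taken from List 4).
toy <- data.frame(
  twitter_handle = c("acct_a", "acct_b", "acct_c", "acct_d", "acct_e"),
  followers = c(40, 900, 10, 30, 20),
  stringsAsFactors = FALSE
)

# Hypothetical helper: rank accounts by followers (largest first), compute
# the cumulative share of followers, and keep accounts strictly below cutoff.
keep_below_cutoff <- function(df, cutoff) {
  df <- df[order(-df$followers), ]
  df$percentage_cumulative_followers <-
    round(cumsum(df$followers) / sum(df$followers) * 100, 3)
  df[df$percentage_cumulative_followers < cutoff, ]
}

kept_95 <- keep_below_cutoff(toy, 95)
print(kept_95)
```

Because the comparison is strict, the account whose inclusion would reach or cross the cutoff is dropped, so the kept accounts always hold slightly less than the stated share of followers.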
library(readr)
library(tidyverse)
library(DT)
List4_with_Followers <- read_csv("inventory_lists/List4_with_Followers.csv",
col_types = cols(...1 = col_skip(), target_user_id =col_character()))
List4_with_Followers |>
arrange(desc(followers)) |>
mutate(total_followers = sum(followers),
cumulative_followers = cumsum(followers),
percentage_cumulative_followers = cumulative_followers/total_followers*100,
percentage_cumulative_followers = round(percentage_cumulative_followers, 3)) |>
select(target_user_id, twitter_handle, fib_index, Score, followers,
cumulative_followers, percentage_cumulative_followers) |>
datatable(filter = "top", selection = "multiple",
caption = "List 4 | % Cumulative Followers")
If we set the threshold at 99% (= keep only the top-ranked accounts whose cumulative share stays below 99% of the total sum of followers), we have 813 accounts. (70% = ~570 accounts to mute; ~3 hours per user)
List4_with_Followers |>
arrange(desc(followers)) |>
mutate(total_followers = sum(followers),
cumulative_followers = cumsum(followers),
percentage_cumulative_followers = cumulative_followers/total_followers*100,
percentage_cumulative_followers = round(percentage_cumulative_followers, 3)) |>
filter(percentage_cumulative_followers < 99) -> List4_with_Followers_99
nrow(List4_with_Followers_99)
## [1] 813
If we set the threshold at 95% (= keep only the top-ranked accounts whose cumulative share stays below 95% of the total sum of followers), we have 388 accounts. (70% = ~272 accounts to mute; ~1.36 hours per user)
List4_with_Followers |>
arrange(desc(followers)) |>
mutate(total_followers = sum(followers),
cumulative_followers = cumsum(followers),
percentage_cumulative_followers = cumulative_followers/total_followers*100,
percentage_cumulative_followers = round(percentage_cumulative_followers, 3)) |>
filter(percentage_cumulative_followers < 95) -> List4_with_Followers_95
nrow(List4_with_Followers_95)
## [1] 388
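The muting workload quoted for the two cutoffs can be reproduced with a few lines of arithmetic. The per-account time of 18 seconds is an assumption inferred from the 95% figure (~1.36 hours for ~272 accounts); it is not stated anywhere in this analysis, and the variable names are chosen for this sketch.

```r
# Accounts kept under each cutoff (from the nrow() outputs above).
n_accounts <- c(cutoff_99 = 813, cutoff_95 = 388)

# Mute 70% of the kept accounts, rounded to whole accounts.
n_to_mute <- round(0.7 * n_accounts)

# ASSUMPTION: 18 seconds per mute, implied by ~1.36 hours for ~272 accounts.
seconds_per_mute <- 18

# Estimated muting time per user, in hours.
hours_per_user <- n_to_mute * seconds_per_mute / 3600

print(n_to_mute)
print(round(hours_per_user, 2))
```

Under that assumption the 99% cutoff works out to a little under 3 hours per user, consistent with the rough figure given for it.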
# data from folder
new_connection_status <- read_csv("new_connection_status.csv",
col_types = cols(...1 = col_skip(),
user_id = col_character(),
target_user_id = col_character()))
new_home_match_data <- read_csv("new_home_match_data.csv",
col_types = cols(user_id = col_character(),
target_user_id = col_character()))
List1 <- read_csv("inventory_lists/List1.csv",
col_types = cols_only(target_user_id = col_character(),
twitter_handle = col_guess(),
Score = col_guess(),
followers = col_guess()))
List2 <- read_csv("inventory_lists/List2.csv",
col_types = cols_only(target_user_id = col_character(),
twitter_handle = col_guess(),
fib_index = col_guess(),
total_reshares = col_guess(),
Score = col_guess(),
followers = col_guess()))
List3 <- read_csv("inventory_lists/List3.csv",
col_types = cols_only(target_user_id = col_character(),
twitter_handle = col_guess(),
fib_index = col_guess(),
total_reshares = col_guess(),
Score = col_guess(),
followers = col_guess()))
List4 <- read_csv("inventory_lists/List4.csv",
col_types = cols_only(target_user_id = col_character(),
twitter_handle = col_guess(),
fib_index = col_guess(),
total_reshares = col_guess(),
Score = col_guess(),
followers = col_guess()))
new_connection_status |>
merge(List4, by="target_user_id", all.x = TRUE) |>
mutate(List4 = ifelse(!is.na(twitter_handle), TRUE, FALSE)) -> following_tb
List3 |> select(target_user_id) |> mutate(list="List3") -> List3_for_merge
List2 |> select(target_user_id) |> mutate(list="List2") -> List2_for_merge
List1 |> select(target_user_id) |> mutate(list="List1") -> List1_for_merge
following_tb |>
merge(List3_for_merge, by="target_user_id", all.x = TRUE) |>
mutate(List3 = ifelse(!is.na(list), TRUE, FALSE)) |>
select(-list) -> following_tb
following_tb |>
merge(List2_for_merge, by="target_user_id", all.x = TRUE) |>
mutate(List2 = ifelse(!is.na(list), TRUE, FALSE)) |>
select(-list) -> following_tb
following_tb |>
merge(List1_for_merge, by="target_user_id", all.x = TRUE) |>
mutate(List1 = ifelse(!is.na(list), TRUE, FALSE)) |>
select(-list) -> following_tb
new_home_match_data |>
merge(List4, by="target_user_id") -> hometimeline_tb
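The list-membership flags built above with repeated left merges could also be sketched with `%in%`, which avoids adding and dropping a helper column and cannot duplicate rows if a list happens to contain a repeated ID. The data below are toy values, and the object names `connections` and `lists` are assumptions for this sketch only.

```r
# Toy follow relationships (made up): which user follows which target account.
connections <- data.frame(
  user_id = c("u1", "u1", "u2"),
  target_user_id = c("t1", "t2", "t3"),
  stringsAsFactors = FALSE
)

# Toy inventory lists, each reduced to a vector of target account IDs.
lists <- list(
  List1 = c("t1"),
  List4 = c("t1", "t2")
)

# Add one logical column per list: TRUE if the followed account is on it.
for (list_name in names(lists)) {
  connections[[list_name]] <-
    connections$target_user_id %in% lists[[list_name]]
}

print(connections)
```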
We now have a shortened list of low-quality accounts to mute for each cutoff (99% vs. 95%). But how well does each shortened list cover the eligible accounts?
following_tb$Followed_by = TRUE
hometimeline_tb$In_Hometimeline = TRUE
following_tb |>
select(target_user_id, Followed_by) |>
unique() -> following_eligible
hometimeline_tb |>
select(target_user_id, In_Hometimeline) |>
unique() -> hometimeline_eligible
List4_with_Followers |>
arrange(desc(followers)) |>
mutate(total_followers = sum(followers),
cumulative_followers = cumsum(followers),
percentage_cumulative_followers = cumulative_followers/total_followers*100,
percentage_cumulative_followers = round(percentage_cumulative_followers, 3)) -> List4_with_PerCumFollowers
- Followed_by: If TRUE, then the low quality account is included in the eligible set (=there are eligible users who follow that account)
- In_Hometimeline: If TRUE, then the low quality account is included in the eligible set (=these accounts are found in eligible users’ home timeline)
List4_with_PerCumFollowers |>
merge(following_eligible, by="target_user_id", all.x = TRUE) |>
merge(hometimeline_eligible, by="target_user_id", all.x = TRUE) |>
select(target_user_id, twitter_handle, followers, percentage_cumulative_followers, Followed_by, In_Hometimeline) |>
arrange((percentage_cumulative_followers)) |>
filter(percentage_cumulative_followers < 99) |>
group_by(Followed_by) |> count() |>
ungroup() |>
mutate(percentage = round(n/sum(n),3)) |>
datatable(caption = "Cutoff: 99% | What % are from eligible accounts (Followed by eligible users)?", filter="none")
List4_with_PerCumFollowers |>
merge(following_eligible, by="target_user_id", all.x = TRUE) |>
merge(hometimeline_eligible, by="target_user_id", all.x = TRUE) |>
select(target_user_id, twitter_handle, followers, percentage_cumulative_followers, Followed_by, In_Hometimeline) |>
arrange((percentage_cumulative_followers)) |>
filter(percentage_cumulative_followers < 99) |>
group_by(In_Hometimeline) |> count() |>
ungroup() |>
mutate(percentage = round(n/sum(n),3)) |>
datatable(caption = "Cutoff: 99% | What % are from eligible accounts (Found in eligible users' home timelines)?", filter="none")
List4_with_PerCumFollowers |>
merge(following_eligible, by="target_user_id", all.x = TRUE) |>
merge(hometimeline_eligible, by="target_user_id", all.x = TRUE) |>
select(target_user_id, twitter_handle, followers, percentage_cumulative_followers, Followed_by, In_Hometimeline) |>
arrange((percentage_cumulative_followers)) |>
filter(percentage_cumulative_followers < 95) |>
group_by(Followed_by) |> count() |>
ungroup() |>
mutate(percentage = round(n/sum(n),3)) |>
datatable(caption = "Cutoff: 95% | What % are from eligible accounts (Followed by eligible users)?", filter="none")
List4_with_PerCumFollowers |>
merge(following_eligible, by="target_user_id", all.x = TRUE) |>
merge(hometimeline_eligible, by="target_user_id", all.x = TRUE) |>
select(target_user_id, twitter_handle, followers, percentage_cumulative_followers, Followed_by, In_Hometimeline) |>
arrange((percentage_cumulative_followers)) |>
filter(percentage_cumulative_followers < 95) |>
group_by(In_Hometimeline) |> count() |>
ungroup() |>
mutate(percentage = round(n/sum(n),3)) |>
datatable(caption = "Cutoff: 95% | What % are from eligible accounts (Found in eligible users' home timelines)?", filter="none")
It seems that the 95% threshold better captures eligible accounts: a larger share of the accounts it keeps are ones that are actually followed by users and/or found in users’ home timelines.
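The four coverage tables above repeat the same steps with a different cutoff or flag, so the comparison could be wrapped in one helper. The following is a minimal base R sketch on toy data: `coverage_share` is a hypothetical function name, and the IDs, cumulative percentages, and eligible set are made up for illustration rather than taken from List 4.

```r
# Toy ranked list with precomputed cumulative follower percentages (made up).
ranked <- data.frame(
  target_user_id = c("1", "2", "3", "4", "5"),
  percentage_cumulative_followers = c(90, 94, 97, 99.5, 100),
  stringsAsFactors = FALSE
)

# Toy set of eligible accounts (followed by, or seen by, eligible users).
eligible_ids <- c("1", "2", "5")

# Hypothetical helper: among accounts kept under a cutoff, count how many
# are eligible and what share of the kept accounts they make up.
coverage_share <- function(ranked, eligible_ids, cutoff) {
  kept <- ranked[ranked$percentage_cumulative_followers < cutoff, ]
  is_eligible <- kept$target_user_id %in% eligible_ids
  data.frame(
    cutoff = cutoff,
    n_kept = nrow(kept),
    n_eligible = sum(is_eligible),
    share_eligible = round(mean(is_eligible), 3)
  )
}

coverage <- rbind(
  coverage_share(ranked, eligible_ids, 99),
  coverage_share(ranked, eligible_ids, 95)
)
print(coverage)
```

Here accounts missing from the eligible set count as not eligible, which corresponds to the NA group in the merged tables. Note that this share describes how much of the shortened list is eligible, not how many of all eligible accounts the shortened list retains; the second quantity would need the eligible count over the full list as its denominator.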