Eligibility Check (Additional Analysis)

1 Checking Assumption

Question: Do people who follow zero qualifying accounts have consistent enough exposure in their home timeline to be worth including?

By adopting Following OR Hometimeline as our eligibility criterion, we assume that people are repeatedly and consistently exposed to indirect low-quality tweets. That is, if we find indirect low-quality tweets in a user’s home timeline (i.e., tweets that their friends retweeted or quoted) at the time we collected the reverse-chronological timeline data, we assume that the exposure is not a one-time event.

To check this assumption, let’s compare (1) the data collected from 24 randomly chosen participants for the preliminary eligibility analysis (Feb 2) with (2) the data collected for the same participants during the main eligibility check (Feb 10).

From the preliminary eligibility check, I retrieved the user_id values of the 24 randomly chosen participants and looked them up in our main eligibility results CSV file.

In the preliminary check, two users met the “not following but found in home timeline” condition: user_id = 20416629 and 26845113, both with Home timeline (T), Following (F)

library(readr)
library(tidyverse)
library(DT)
df <- read_csv("new_eligibility_results.csv", 
               col_types = cols(id = col_skip(), user_id = col_character()))
random24_user_id <- c("1724530394920263680", "1741720467851841536",
                      "1746407404164517888", "1747176108846354432",
                      "1747472625889087488", "1752043460171898880",
                      "1752738731633590272", "1514349532297015296",
                      "1545708171364384768", "1558101392291729408",
                      "1645150133233807360", "1649171080760655872",
                      "1652117495212384256", "1666933608592953344",
                      "1679674213731319808", "20416629", "263213192", 
                      "26845113", "30014187", "3726853273", "43115340", 
                      "439510861", "55753989", "867793824688390144")

df |> 
  pivot_wider(id_cols = "user_id", id_expand = TRUE, 
              names_from = "criteria", values_from = "eligible") |>
  filter(user_id %in% random24_user_id) -> df_wide_random24
df_wide_random24 |>
  filter(account_created == TRUE & following_us == TRUE) |> 
  select(user_id, hometimeline, following_NG, 
         hometimeline_10, following_NG_10,
         hometimeline_20, following_NG_20,
         hometimeline_30, following_NG_30) |> datatable()

Before: user_id = 20416629, 26845113 - Home timeline (T), Following (F)

After:

  1. 20416629: Home Timeline (T), Following (F) (regardless of which list we use, this user still follows no qualifying accounts but is exposed to low-quality tweets in the home timeline)

  2. 26845113: Home Timeline (F), Following (F)

  • Based on the same list (List 1), no low-quality tweets are found in the home timeline anymore. This is evidence against our assumption.

  • However, when we switch to the longer list (List 4), the result becomes: Home Timeline (T), Following (T)

    • Under List 4, direct tweets from POTUS and CrimeWatchMpls and an indirect retweet from RepStefanik appear in the home timeline.

So, of the two cases where users were not following any eligible accounts but had previously been exposed to indirect tweets from these accounts, one supports our assumption while the other does not.
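
The before/after comparison for these two users can be tabulated in a few lines. This is a minimal sketch that hard-codes the List 1 flags reported in the text; the column names (`home_before`, `home_after`, and so on) are hypothetical and do not come from the eligibility CSV.

```r
library(dplyr)

# Flags copied from the text above (List 1 only); column names are illustrative.
recheck <- tibble(
  user_id       = c("20416629", "26845113"),
  home_before   = c(TRUE, TRUE),    # Feb 2 preliminary check
  follow_before = c(FALSE, FALSE),
  home_after    = c(TRUE, FALSE),   # Feb 10 main check
  follow_after  = c(FALSE, FALSE)
)

# Exposure "persists" if the home timeline flag is TRUE at both time points.
recheck <- recheck |>
  mutate(exposure_persisted = home_before & home_after)

persistence_rate <- mean(recheck$exposure_persisted)
print(recheck)
print(persistence_rate)
```

With only two such users, a one-in-two persistence rate is too small a sample to settle the assumption either way.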

2 List 4

We are likely to choose List 4 (N=1515). The problem is that the list is so long that muting every account will take a lot of time and, given the power-law distribution of followers, almost all of the muting treatment will operate through the accounts with the most followers.

Brendan’s suggestion: rank the accounts from most to fewest followers in the pilot, select a cutoff on how many accounts to include (which we can show covers X% of all eligible accounts), and then mute 70% of those accounts.

Hence, I used the Twitter API to retrieve each account’s number of followers, sorted the accounts by followers in descending order, and calculated the cumulative percentage of followers to determine how many accounts fall under a given threshold (e.g., 99% of total followers).
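
To make the cumulative-percentage step concrete, here is a minimal sketch on toy data. The handles and follower counts are invented for illustration and are not taken from List 4; only the column names mirror the ones used in this report.

```r
library(dplyr)

# Hypothetical follower counts, for illustration only (not from List 4).
toy <- tibble(
  twitter_handle = c("a", "b", "c", "d", "e"),
  followers      = c(50, 5, 900, 40, 5)
)

# Sort by followers (largest first), then express the running total
# as a percentage of all followers.
toy_cum <- toy |>
  arrange(desc(followers)) |>
  mutate(
    cumulative_followers = cumsum(followers),
    percentage_cumulative_followers =
      round(cumulative_followers / sum(followers) * 100, 3)
  )

print(toy_cum)
```

In this toy data a strict `< 99` filter keeps two accounts, while `<= 99` would keep three, so the choice of comparison operator matters when an account lands exactly on the cutoff.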

library(readr)
library(tidyverse)
library(DT)
List4_with_Followers <- read_csv("inventory_lists/List4_with_Followers.csv",
                                 col_types = cols(...1 = col_skip(),
                                                  target_user_id = col_character()))
List4_with_Followers |> 
  arrange(desc(followers)) |> 
  mutate(total_followers = sum(followers),
         cumulative_followers = cumsum(followers),
         percentage_cumulative_followers = cumulative_followers/total_followers*100,
         percentage_cumulative_followers = round(percentage_cumulative_followers, 3)) |> 
  select(target_user_id, twitter_handle, fib_index, Score, followers,
         cumulative_followers, percentage_cumulative_followers) |>
  datatable(filter = "top", selection = "multiple",
            caption = "List 4 | % Cumulative Followers")

If we set the threshold at 99% (i.e., keep only the accounts that together account for 99% of the total follower count), we have 813 accounts. (70% = ~570 accounts to mute; ~3 hours per user)

List4_with_Followers |> 
  arrange(desc(followers)) |> 
  mutate(total_followers = sum(followers),
         cumulative_followers = cumsum(followers),
         percentage_cumulative_followers = cumulative_followers/total_followers*100,
         percentage_cumulative_followers = round(percentage_cumulative_followers, 3)) |> 
  filter(percentage_cumulative_followers < 99) -> List4_with_Followers_99

nrow(List4_with_Followers_99)
## [1] 813

If we set the threshold at 95% (i.e., keep only the accounts that together account for 95% of the total follower count), we have 388 accounts. (70% = ~272 accounts to mute; ~1.36 hours per user)

List4_with_Followers |> 
  arrange(desc(followers)) |> 
  mutate(total_followers = sum(followers),
         cumulative_followers = cumsum(followers),
         percentage_cumulative_followers = cumulative_followers/total_followers*100,
         percentage_cumulative_followers = round(percentage_cumulative_followers, 3)) |> 
  filter(percentage_cumulative_followers < 95) -> List4_with_Followers_95

nrow(List4_with_Followers_95)
## [1] 388
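
The muting workload implied by the two cutoffs can be checked with simple arithmetic. The sketch below uses only the account counts and time estimates quoted above; rounding the 70% share up to whole accounts is an assumption made here for illustration.

```r
# Account counts reported above for each cutoff.
kept <- c(cutoff_99 = 813, cutoff_95 = 388)

# 70% of the kept accounts are muted (rounded up to whole accounts;
# the rounding rule is an assumption).
to_mute <- ceiling(0.7 * kept)

# Time estimates quoted above, in hours per user.
hours <- c(cutoff_99 = 3, cutoff_95 = 1.36)

# Implied muting time per account, in seconds.
seconds_per_account <- hours * 3600 / to_mute

print(to_mute)
print(round(seconds_per_account, 1))
```

Both estimates work out to roughly 18 to 19 seconds per muted account, so the two time figures are consistent with each other.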

# Load connection status, home timeline matches, and the inventory lists
new_connection_status <- read_csv("new_connection_status.csv", 
    col_types = cols(...1 = col_skip(), 
                     user_id = col_character(), 
                     target_user_id = col_character()))

new_home_match_data <- read_csv("new_home_match_data.csv", 
    col_types = cols(user_id = col_character(), 
        target_user_id = col_character()))


List1 <- read_csv("inventory_lists/List1.csv", 
    col_types = cols_only(target_user_id = col_character(), 
                          twitter_handle = col_guess(),
                          Score = col_guess(), 
                          followers = col_guess()))

List2 <- read_csv("inventory_lists/List2.csv", 
    col_types = cols_only(target_user_id = col_character(),
                          twitter_handle = col_guess(), 
                          fib_index = col_guess(), 
                          total_reshares = col_guess(), 
                          Score = col_guess(), 
                          followers = col_guess()))

List3 <- read_csv("inventory_lists/List3.csv", 
    col_types = cols_only(target_user_id = col_character(), 
                          twitter_handle = col_guess(), 
                          fib_index = col_guess(), 
                          total_reshares = col_guess(), 
                          Score = col_guess(), 
                          followers = col_guess()))

List4 <- read_csv("inventory_lists/List4.csv", 
    col_types = cols_only(target_user_id = col_character(), 
                          twitter_handle = col_guess(), 
                          fib_index = col_guess(), 
                          total_reshares = col_guess(), 
                          Score = col_guess(), 
                          followers = col_guess()))

new_connection_status |> 
  merge(List4, by="target_user_id", all.x = TRUE) |> 
  mutate(List4 = !is.na(twitter_handle)) -> following_tb


List3 |> select(target_user_id) |> mutate(list="List3") -> List3_for_merge
List2 |> select(target_user_id) |> mutate(list="List2") -> List2_for_merge
List1 |> select(target_user_id) |> mutate(list="List1") -> List1_for_merge

following_tb |> 
  merge(List3_for_merge, by="target_user_id", all.x = TRUE) |>
  mutate(List3 = !is.na(list)) |>
  select(-list) -> following_tb


following_tb |> 
  merge(List2_for_merge, by="target_user_id", all.x = TRUE) |>
  mutate(List2 = !is.na(list)) |>
  select(-list) -> following_tb


following_tb |> 
  merge(List1_for_merge, by="target_user_id", all.x = TRUE) |>
  mutate(List1 = !is.na(list)) |>
  select(-list) -> following_tb

new_home_match_data |>
  merge(List4, by="target_user_id") -> hometimeline_tb
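
The repeated merge-then-flag pattern above only needs to know whether each `target_user_id` appears in a given list. A possible shorter alternative, sketched here on toy data, is a membership test with `%in%`. The toy IDs and the `connections`, `list1_ids`, and `list4_ids` names are hypothetical stand-ins, not the real files.

```r
library(dplyr)

# Toy stand-ins for new_connection_status and the inventory lists.
connections <- tibble(
  user_id        = c("u1", "u1", "u2"),
  target_user_id = c("t1", "t2", "t3")
)
list1_ids <- c("t1")          # hypothetical IDs in List 1
list4_ids <- c("t1", "t2")    # hypothetical IDs in List 4

# One logical flag per list: TRUE if the followed account is on that list.
flags <- connections |>
  mutate(
    List1 = target_user_id %in% list1_ids,
    List4 = target_user_id %in% list4_ids
  )

print(flags)
```

Unlike `merge()`, this does not bring in the list's other columns (such as `twitter_handle` or `followers`) and cannot duplicate rows when a list contains repeated IDs, so it fits only when the flags are all that is needed.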

We now have a shortened list of low-quality accounts to mute for each cutoff (99% vs. 95%). But how well does each shortened list cover the eligible accounts?

following_tb$Followed_by = TRUE
hometimeline_tb$In_Hometimeline = TRUE

following_tb |>   
  select(target_user_id, Followed_by) |> 
  unique() -> following_eligible

hometimeline_tb |>
  select(target_user_id, In_Hometimeline) |> 
  unique() -> hometimeline_eligible

List4_with_Followers |> 
  arrange(desc(followers)) |> 
  mutate(total_followers = sum(followers),
         cumulative_followers = cumsum(followers),
         percentage_cumulative_followers = cumulative_followers/total_followers*100,
         percentage_cumulative_followers = round(percentage_cumulative_followers, 3)) -> List4_with_PerCumFollowers

  • Followed_by: if TRUE, the low-quality account is in the eligible set (i.e., at least one eligible user follows it)

  • In_Hometimeline: if TRUE, the low-quality account is in the eligible set (i.e., it was found in at least one eligible user’s home timeline)

List4_with_PerCumFollowers |> 
  merge(following_eligible, by="target_user_id", all.x = TRUE) |>
  merge(hometimeline_eligible, by="target_user_id", all.x = TRUE) |>
  select(target_user_id, twitter_handle, followers, percentage_cumulative_followers, Followed_by, In_Hometimeline) |>
  arrange((percentage_cumulative_followers)) |>
  filter(percentage_cumulative_followers < 99) |>
  group_by(Followed_by) |> count() |> 
  ungroup() |> 
  mutate(percentage = round(n/sum(n),3)) |> 
  datatable(caption = "Cutoff: 99% | What % are from eligible accounts (Followed by eligible users)?", filter="none")

List4_with_PerCumFollowers |> 
  merge(following_eligible, by="target_user_id", all.x = TRUE) |>
  merge(hometimeline_eligible, by="target_user_id", all.x = TRUE) |>
  select(target_user_id, twitter_handle, followers, percentage_cumulative_followers, Followed_by, In_Hometimeline) |>
  arrange((percentage_cumulative_followers)) |>
  filter(percentage_cumulative_followers < 99) |>
  group_by(In_Hometimeline) |> count() |> 
  ungroup() |> 
  mutate(percentage = round(n/sum(n),3))  |> 
  datatable(caption = "Cutoff: 99% | What % are from eligible accounts (Found in eligible users' home timelines)?", filter="none")

List4_with_PerCumFollowers |> 
  merge(following_eligible, by="target_user_id", all.x = TRUE) |>
  merge(hometimeline_eligible, by="target_user_id", all.x = TRUE) |>
  select(target_user_id, twitter_handle, followers, percentage_cumulative_followers, Followed_by, In_Hometimeline) |>
  arrange((percentage_cumulative_followers)) |>
  filter(percentage_cumulative_followers < 95) |>
  group_by(Followed_by) |> count() |> 
  ungroup() |> 
  mutate(percentage = round(n/sum(n),3)) |>
  datatable(caption = "Cutoff: 95% | What % are from eligible accounts (Followed by eligible users)?", filter="none")

List4_with_PerCumFollowers |> 
  merge(following_eligible, by="target_user_id", all.x = TRUE) |>
  merge(hometimeline_eligible, by="target_user_id", all.x = TRUE) |>
  select(target_user_id, twitter_handle, followers, percentage_cumulative_followers, Followed_by, In_Hometimeline) |>
  arrange((percentage_cumulative_followers)) |>
  filter(percentage_cumulative_followers < 95) |>
  group_by(In_Hometimeline) |> count() |> 
  ungroup() |> 
  mutate(percentage = round(n/sum(n),3)) |> 
  datatable(caption = "Cutoff: 95% | What % are from eligible accounts (Found in eligible users' home timelines)?", filter="none")
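
The four chunks above differ only in the cutoff and the flag column, so the same share can be computed with one small function. This is a sketch on toy data, not the real merged table; `coverage_share` is a hypothetical helper, and the toy percentages and flags are invented.

```r
library(dplyr)

# Hypothetical helper: among accounts below `cutoff`, the share whose
# flag in `flag_col` is TRUE (NA means "not in the eligible set").
coverage_share <- function(accounts, cutoff, flag_col) {
  kept <- accounts |>
    filter(percentage_cumulative_followers < cutoff)
  mean(!is.na(kept[[flag_col]]) & kept[[flag_col]])
}

# Toy data shaped like the merged table (values are illustrative only).
toy <- tibble(
  percentage_cumulative_followers = c(90, 94, 97, 98.5, 99.5),
  Followed_by     = c(TRUE, TRUE, NA, NA, NA),
  In_Hometimeline = c(TRUE, NA, TRUE, NA, NA)
)

share_95 <- coverage_share(toy, 95, "Followed_by")
share_99 <- coverage_share(toy, 99, "Followed_by")
print(c(share_95 = share_95, share_99 = share_99))
```

Note that the result is a proportion between 0 and 1, like the `percentage` column in the tables above, not a value scaled to 100.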

The 95% threshold seems to capture eligible accounts (those that are actually followed by users and/or found in users’ home timelines) better than the 99% threshold.


3 Notes

  • List 4 with number of followers added: link to the repository (or you can directly download the file with this link).

  • List 4 with number of followers + percentage cumulative followers: link

→ From either of these lists, we can remove elected officials and candidates.