70% stratified random sampling - Option 2

1 Option 2

Randomize within strata defined by home timeline appearances rather than by follower counts, since this better captures the actual treatment.

  1. Cut off at 95%: keep the accounts that together make up the first 95% of the total follower count (486 accounts)

  2. Order by exposure, i.e. the number of times an account appeared on users’ home timelines (not by number of followers)

  3. Do the same 70% stratified random sampling
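The three steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions: the values in `toy_inventory` are made up, the rule `cum_share <= 0.95` is one possible reading of the 95% cutoff, and forming three strata with `ntile()` is a hypothetical choice, since the actual stratum definition is not given here.

```r
library(dplyr)

set.seed(42)  # fixed seed so the illustrative draw is reproducible

# Toy inventory (made-up values): one row per low-quality account
toy_inventory <- tibble(
  target_user_id = as.character(1:20),
  followers = c(5000, 3000, 2000, 1500, 1000, 800, 600, 500, 400, 300,
                200, 150, 100, 80, 60, 40, 30, 20, 10, 5),
  exposure  = c(0, 120, 45, 0, 60, 10, 3, 0, 25, 7,
                1, 0, 2, 0, 0, 4, 0, 0, 1, 0)
)

# Step 1: keep accounts that together hold the first 95% of all followers
# (assumption: the account that crosses the 95% line is excluded)
kept <- toy_inventory |>
  arrange(desc(followers)) |>
  mutate(cum_share = cumsum(followers) / sum(followers)) |>
  filter(cum_share <= 0.95)

# Steps 2-3: order by exposure, form strata on exposure (hypothetical:
# three equal-sized bins), then draw 70% at random within each stratum
sampled <- kept |>
  arrange(desc(exposure)) |>
  mutate(stratum = ntile(exposure, 3)) |>
  group_by(stratum) |>
  slice_sample(prop = 0.7) |>
  ungroup()
```

Because `slice_sample(prop = 0.7)` rounds the per-stratum sample size down, small strata end up with a slightly smaller share than 70%.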

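library(tidyverse)  # loads readr (read_csv) and dplyr

# Home timeline matches: each row is one appearance of a target account
home_match_data <- read_csv("new_home_match_data.csv", 
                            col_types = cols(user_id = col_character(),
                            target_user_id = col_character()))

# Exposure = number of home timeline appearances per target account
home_match_count <- home_match_data |> 
  count(target_user_id, name = "exposure")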
# Load inventory list
updated_inventory <- read_csv("inventory_lists/updated_inventory.csv", 
                              col_types = cols(
                                target_user_id = col_character(), 
                                twitter_handle = col_character()))

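# Attach exposure to every inventory account; accounts that never appeared
# in a home timeline get exposure = 0
updated_inventory |>
  merge(home_match_count, by = "target_user_id", all.x = TRUE) |>
  mutate(exposure = ifelse(is.na(exposure), 0, exposure)) |>
  arrange(desc(followers)) -> updated_inventory_with_exposure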

library(DT)
  • exposure: how many times the low-quality account appeared in users’ home timelines

  • followed_by: number of participants each low-quality account is followed by

1.0.1 Arranged by # of followers (for 95% cutoff)

new_connection_status <- read_csv("new_connection_status.csv", 
    col_types = cols(user_id = col_character(), 
        target_user_id = col_character())) 

new_connection_status |> 
  group_by(target_user_id) |>
  count() |> 
  rename(followed_by = n) -> connection_count
  
updated_inventory_with_exposure |>
   select("target_user_id", "twitter_handle",
         "fib_index", "Score", "followers", 
         "exposure") |> 
  merge(connection_count, by="target_user_id", all.x = TRUE) |>
  mutate(followed_by = ifelse(is.na(followed_by), 0, followed_by)) |>
  arrange(desc(followers)) |>
  datatable(filter = "top", selection = "multiple",
            caption = "Arranged by number of followers")

1.0.2 What if we arrange the list by exposure?

updated_inventory_with_exposure |>
   select("target_user_id", "twitter_handle",
         "fib_index", "Score", "followers", 
         "exposure") |> 
  merge(connection_count, by="target_user_id", all.x = TRUE) |>
  mutate(followed_by = ifelse(is.na(followed_by), 0, followed_by)) |>
  arrange(desc(exposure)) |>
  datatable(filter = "top", selection = "multiple",
            caption = "Arranged by exposure")

2 Question

XHNews and PDChina are among the top 10 accounts by number of followers, but their exposure is 0: they did not appear in the home timeline of any user processed so far. (Since we didn’t get raw data for the new entries, the tables are based on what we have, i.e. the old entries.) Still, they are followed by one participant.

So, my question is: does the 95% cutoff (the accounts that make up the first 95% of the total follower count; 486 accounts) make sense? We would lose the accounts ranked after the 486th that have more than 1 exposure. Should we instead arrange the list by exposure (the number of times an account appeared on users’ home timelines) and set the cutoff at exposure > 0?
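One way to put numbers on this question is to flag each account under both candidate rules and count the disagreements. The sketch below uses made-up toy values; the column names `in_follower_cutoff` and `in_exposure_cutoff` are hypothetical, and `cum_share <= 0.95` is again only one reading of the 95% rule.

```r
library(dplyr)

# Toy inventory (made-up values), not the real account list
toy <- tibble(
  target_user_id = as.character(1:10),
  followers = c(5000, 3000, 2000, 1000, 500, 200, 100, 50, 20, 10),
  exposure  = c(0, 120, 45, 0, 60, 10, 0, 3, 0, 1)
)

# Flag each account under both candidate cutoffs
ranked <- toy |>
  arrange(desc(followers)) |>
  mutate(
    cum_share          = cumsum(followers) / sum(followers),
    in_follower_cutoff = cum_share <= 0.95,  # cumulative 95% of followers
    in_exposure_cutoff = exposure > 0        # appeared at least once
  )

# Accounts the follower cutoff drops although they were seen by users
lost_exposed <- ranked |>
  filter(!in_follower_cutoff, in_exposure_cutoff) |>
  nrow()

# Accounts the follower cutoff keeps although nobody was exposed to them
kept_unexposed <- ranked |>
  filter(in_follower_cutoff, !in_exposure_cutoff) |>
  nrow()

# Full cross-tabulation of the two rules
count(ranked, in_follower_cutoff, in_exposure_cutoff)
```

Running the same cross-tabulation on `updated_inventory_with_exposure` would show how many accounts each rule gains or loses relative to the other.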