Gap between Verasight’s raw data vs. server/DB data

Problem

  • The gap between eligible survey sample filtered from raw data (698) vs. what we have in our DB/server (866).

Some background information:

  • In the end block of the survey, I embedded codes to send vsid to the DB. Hence, those users with vsid in the DB have finished to the end of the Wave 1 survey. In other words, I assumed that having vsid indicates having survived all the requirements during the survey (e.g., attention checks, authorization, following, etc.).

  • Now we are seeing this gap (698 from survey data vs. 866 from DB), it might be the case that my assumption is wrong.

library(tidyverse)
library(DT)
library(haven)
library(readr)

# load raw data 
raw_df <- readRDS("~/Downloads/2023-095a_files/verasight_wave1/2023-095a_client_wave-1.rds")

# Eligible sample filtered from raw data 
eligible_survey_sample <- raw_df |> 
  filter(Progress==100) |> # who finished wave 1 till the end
  filter(attn_5 > 3) |> # who passed attention check
  filter(userid!="") |> # who authorized their Twitter account
  filter(Q7_0==1) # who agreed to follow our study account 

# load server data (`mercury_user` table in DB)
db_df <- read_csv("mercury_user.csv", 
    col_types = cols(user_id = col_character()))

# Eligible sample filtered from DB data
eligible_db_sample <- db_df |> 
  mutate(userid = as.character(user_id),
         screename = screen_name) |> 
  select(userid, screename, vsid) |>
  filter(!is.na(vsid)) # who finished wave 1 till the end 
nrow(eligible_survey_sample) # 698
## [1] 698
nrow(eligible_db_sample) # 866
## [1] 866

Warning: Something weird is going on here

While merging the eligible_survey_sample and eligible_db_sample, I found that several duplicated accounts with different vsid. That means, in eligible_db_sample, we don’t have duplicates (if userid is same, vsid is updated for that userid). But in the raw data, there are participants with different vsids (=Verasight panel accounts) but actually they are the same person with a specific Twitter account (userid).

eligible_survey_sample |> 
  merge(eligible_db_sample, by=c("userid","screename"), all = TRUE) -> test

test |> 
  head(20) |> 
  datatable()

So let’s forget about the eligible_survey_sample and eligible_db_sample for now.

My strategy is as follows: To compare the data from server/DB (db_df) with Verasight’s raw data (raw_df), merge these two data sets, by userid and screenname (we know that they are unique).

# change the column names to merge
db_df <- db_df |> 
  mutate(userid = as.character(user_id),
         screename = screen_name) |> 
  select(userid, screename, vsid)

raw_df$userid = as.character(raw_df$userid)

# merge datasets
db_df |> # x(baseline data) from DB, # y(merged data) from Verasgiht
  merge(raw_df, by=c("userid","screename"), all=TRUE) |> 
  filter(userid!="") |>
  select(userid, screename, vsid.x, vsid.y, Progress) -> test
  • vsid.x : vsid recorded in the DB (if a participant tried multiple attempts using different Verasight accounts, the last attempt is recorded)

  • vsid.y : vsid in the raw data from Verasight

datatable(test)

Case 1.

If vsid.x = NA (no vsid in DB) but we have vsid.y and Progress==100,

  • Updated vsid in the DB for the first two
test |> 
  filter(is.na(vsid.x) & !is.na(vsid.y) & Progress==100) |> datatable()
  • Removed TedAndrew31499 (this user changed its username to verawliam223)
test |>
  filter(screename!="TedAndrew31499") -> test

Case 2.

Remove cases with vsid.x = NA, vsid.y = NA, Progress=NA (=124 cases).

  • 1030 - 124 cases removed = 906 left
test |> 
  filter(is.na(vsid.x) & is.na(vsid.y) & is.na(Progress)) |> datatable()
test = test[-c(which(is.na(test$vsid.x) & is.na(test$vsid.y) & is.na(test$Progres))),] 

Case 3.

If vsid.x = NA but we have vsid.y, and Progress < 100, we remove those cases, since they didn’t go to the end of the survey (that’s why we don’t have vsid in the DB).

  • 906 - 14 cases removed = 892 left
test |> 
  filter(is.na(vsid.x) & !is.na(vsid.y) & Progress < 100)|> datatable()
test = test[-c(which(is.na(test$vsid.x) & !is.na(test$vsid.y) & test$Progress < 100)),] 

Case 4.

(Mysterious,,,, I don’t know why we have these cases, but….) We have 190 cases where vsid.x values exist in DB (which means that users have went through till the end of survey) but vsid.y = NA (and Progress=NA).

  • If I remove those 190 from 892, we have 702 users left (which is the same with Verasight’s calculation)! These 702 users have authorized and finished Wave 1 survey.
test |> 
  filter(!is.na(vsid.x) & is.na(vsid.y) & is.na(Progress))|> datatable()
test = test[-c(which(!is.na(test$vsid.x) & is.na(test$vsid.y) & is.na(test$Progress))),]
datatable(test)

Now as the last step, let’s remove repetitive cases (userids appearing several times)

  • Fizzdbrain99231 (10)

  • Stephen17072 (7)

  • verawliam223 (8)

  • Vhilz1 (3)

test[which(test$screename=="ItzGibbyyy"),3] = "648271c8b83a37de9f32f488"
test[which(test$screename=="azheng284"),3] = "64017b4029ef81c28a2b65a8"
test |> 
  filter(screename!="Fizzdbrain99231") |>
  filter(screename!="Stephen17072") |>
  filter(screename!="verawliam223") |>
  filter(screename!="Vhilz1") -> test 
test |> select(userid, screename, vsid.y, Progress) |>
  unique() |> datatable()

Next steps

Based on the list of 674 users, I will update the DB and run eligibility scripts!

test |> select(userid, screename, vsid.y, Progress) |>
  unique() -> final_db_df

names(final_db_df) = c("user_id","screen_name","vsid","Progress")

# load db data
original_db_df <- read_csv("mercury_user.csv", 
    col_types = cols(user_id = col_character()))

original_db_df = original_db_df[,-8]
# id,user_id,screen_name,access_token,access_token_secret,session_start,oauth_token,vsid 

final_db_df |> 
  select("user_id","screen_name","vsid") |> 
  merge(original_db_df, by=c("user_id","screen_name"))  -> users_for_eligibility_test

users_for_eligibility_test |> select(-"id") -> users_for_eligibility_test

write.csv(users_for_eligibility_test, file = "users_for_elig_test.csv", row.names=FALSE)