Gap between Verasight’s raw data vs. server/DB data
Do Won Kim
2024-02-08
Problem
- The gap between
eligible survey sample
filtered from raw data (698) vs. what we have in our DB/server (866).
Some background information:
In the end block of the survey, I embedded codes to send
vsid
to the DB. Hence, those users withvsid
in the DB have finished to the end of the Wave 1 survey. In other words, I assumed that havingvsid
indicates having survived all the requirements during the survey (e.g., attention checks, authorization, following, etc.).Now we are seeing this gap (698 from survey data vs. 866 from DB), it might be the case that my assumption is wrong.
library(tidyverse)
library(DT)
library(haven)
library(readr)
# load raw data
raw_df <- readRDS("~/Downloads/2023-095a_files/verasight_wave1/2023-095a_client_wave-1.rds")
# Eligible sample filtered from raw data
eligible_survey_sample <- raw_df |>
filter(Progress==100) |> # who finished wave 1 till the end
filter(attn_5 > 3) |> # who passed attention check
filter(userid!="") |> # who authorized their Twitter account
filter(Q7_0==1) # who agreed to follow our study account
# load server data (`mercury_user` table in DB)
db_df <- read_csv("mercury_user.csv",
col_types = cols(user_id = col_character()))
# Eligible sample filtered from DB data
eligible_db_sample <- db_df |>
mutate(userid = as.character(user_id),
screename = screen_name) |>
select(userid, screename, vsid) |>
filter(!is.na(vsid)) # who finished wave 1 till the end
nrow(eligible_survey_sample) # 698
## [1] 698
nrow(eligible_db_sample) # 866
## [1] 866
Warning: Something weird is going on here
While merging the eligible_survey_sample
and
eligible_db_sample
, I found that several duplicated
accounts with different vsid
. That means, in
eligible_db_sample
, we don’t have duplicates (if
userid
is same, vsid
is updated for that
userid
). But in the raw data, there are participants with
different vsid
s (=Verasight panel accounts) but actually
they are the same person with a specific Twitter account
(userid
).
eligible_survey_sample |>
merge(eligible_db_sample, by=c("userid","screename"), all = TRUE) -> test
test |>
head(20) |>
datatable()
So let’s forget about the eligible_survey_sample
and
eligible_db_sample
for now.
My strategy is as follows: To compare the data from server/DB
(db_df
) with Verasight’s raw data (raw_df
),
merge these two data sets, by userid
and
screenname
(we know that they are unique).
# change the column names to merge
db_df <- db_df |>
mutate(userid = as.character(user_id),
screename = screen_name) |>
select(userid, screename, vsid)
raw_df$userid = as.character(raw_df$userid)
# merge datasets
db_df |> # x(baseline data) from DB, # y(merged data) from Verasgiht
merge(raw_df, by=c("userid","screename"), all=TRUE) |>
filter(userid!="") |>
select(userid, screename, vsid.x, vsid.y, Progress) -> test
vsid.x
: vsid recorded in the DB (if a participant tried multiple attempts using different Verasight accounts, the last attempt is recorded)vsid.y
: vsid in the raw data from Verasight
datatable(test)
Case 1.
If vsid.x
= NA (no vsid in DB) but we have
vsid.y
and Progress==100,
- Updated
vsid
in the DB for the first two
test |>
filter(is.na(vsid.x) & !is.na(vsid.y) & Progress==100) |> datatable()
- Removed
TedAndrew31499
(this user changed its username toverawliam223
)
test |>
filter(screename!="TedAndrew31499") -> test
Case 2.
Remove cases with vsid.x
= NA, vsid.y
= NA,
Progress
=NA (=124 cases).
- 1030 - 124 cases removed = 906 left
test |>
filter(is.na(vsid.x) & is.na(vsid.y) & is.na(Progress)) |> datatable()
test = test[-c(which(is.na(test$vsid.x) & is.na(test$vsid.y) & is.na(test$Progres))),]
Case 3.
If vsid.x
= NA but we have vsid.y
, and
Progress
< 100, we remove those cases, since they didn’t
go to the end of the survey (that’s why we don’t have vsid
in the DB).
- 906 - 14 cases removed = 892 left
test |>
filter(is.na(vsid.x) & !is.na(vsid.y) & Progress < 100)|> datatable()
test = test[-c(which(is.na(test$vsid.x) & !is.na(test$vsid.y) & test$Progress < 100)),]
Case 4.
(Mysterious,,,, I don’t know why we have these cases, but….) We have
190 cases where vsid.x
values exist in DB (which means that
users have went through till the end of survey) but vsid.y
= NA (and Progress
=NA).
- If I remove those 190 from 892, we have 702 users left (which is the same with Verasight’s calculation)! These 702 users have authorized and finished Wave 1 survey.
test |>
filter(!is.na(vsid.x) & is.na(vsid.y) & is.na(Progress))|> datatable()
test = test[-c(which(!is.na(test$vsid.x) & is.na(test$vsid.y) & is.na(test$Progress))),]
datatable(test)
Now as the last step, let’s remove repetitive cases (userids appearing several times)
Fizzdbrain99231 (10)
Stephen17072 (7)
verawliam223 (8)
Vhilz1 (3)
test[which(test$screename=="ItzGibbyyy"),3] = "648271c8b83a37de9f32f488"
test[which(test$screename=="azheng284"),3] = "64017b4029ef81c28a2b65a8"
test |>
filter(screename!="Fizzdbrain99231") |>
filter(screename!="Stephen17072") |>
filter(screename!="verawliam223") |>
filter(screename!="Vhilz1") -> test
test |> select(userid, screename, vsid.y, Progress) |>
unique() |> datatable()
Next steps
Based on the list of 674 users, I will update the DB and run eligibility scripts!
test |> select(userid, screename, vsid.y, Progress) |>
unique() -> final_db_df
names(final_db_df) = c("user_id","screen_name","vsid","Progress")
# load db data
original_db_df <- read_csv("mercury_user.csv",
col_types = cols(user_id = col_character()))
original_db_df = original_db_df[,-8]
# id,user_id,screen_name,access_token,access_token_secret,session_start,oauth_token,vsid
final_db_df |>
select("user_id","screen_name","vsid") |>
merge(original_db_df, by=c("user_id","screen_name")) -> users_for_eligibility_test
users_for_eligibility_test |> select(-"id") -> users_for_eligibility_test
write.csv(users_for_eligibility_test, file = "users_for_elig_test.csv", row.names=FALSE)