wave1_0207
Do Won Kim
2024-02-08
[Wave 1] Raw Data
A. Comparison of Verasight’s summary vs. raw data
The raw data files contain:
- 4,032 survey starts, including completes and incompletes (anyone who got to the consent page)
library(tidyverse)
library(DT)
library(haven)
df <- readRDS("~/Downloads/2023-095a_files/verasight_wave1/2023-095a_client_wave-1.rds")
nrow(df)
## [1] 4032
- 3,517 completes including attention check failures (per Verasight)
df |> filter(Progress==100) |> count()
## # A tibble: 1 × 1
## n
## <int>
## 1 3517
- 3,064 completes excluding attention check failures (per Verasight)
- 3,517 - 453 = 3,064
# attn_5 < 4 counts those who failed the attention check: 3,517 - 453 = 3,064
df |> filter(Progress==100) |> filter(attn_5 < 4) |> count()
## # A tibble: 1 × 1
## n
## <int>
## 1 453
But looking further into the data: only 878 completes actually passed the attention check (failures and NAs excluded).
Those who passed attention checks: 878
Those who failed attention checks: 453
NAs: 2,186
df |> filter(Progress==100) |> group_by(attn_5) |> count()
## # A tibble: 6 × 2
## # Groups: attn_5 [6]
## attn_5 n
## <dbl+lbl> <int>
## 1 1 [Strongly agree] 262
## 2 2 [Somewhat agree] 69
## 3 3 [Neither agree nor disagree] 122
## 4 4 [Somewhat disagree] 31
## 5 5 [Strongly disagree] 847
## 6 NA 2186
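For a compact view, here is a sketch that collapses the table above into passed / failed / missing (passing = a response of 4 or 5, matching the attn_5 > 3 rule used below):
# A sketch: collapse attn_5 into passed / failed / missing
df |> filter(Progress==100) |>
  mutate(attn_status = case_when(
    attn_5 > 3 ~ "passed",
    attn_5 <= 3 ~ "failed",
    TRUE ~ "missing"
  )) |>
  count(attn_status)
# expect: failed 453, missing 2186, passed 878 (per the table above)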
- 701 respondents who shared their Twitter accounts and completed the survey (per Verasight)
- But I see 703 completes who shared Twitter accounts
df |> filter(Progress==100) |> filter(userid!="") |> count()
## # A tibble: 1 × 1
## n
## <int>
## 1 703
It seems that what Verasight denotes as “completes” (cases with Progress==100) includes many respondents with NAs. In other words, even if a participant’s Progress==100, that does not necessarily mean they met all the survey requirements.
With this in mind, we are looking for participants who (1) passed the attention check, (2) consented and authorized their Twitter info, (3) agreed to follow our study account, and (4) finished the survey (Progress==100).
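Putting the four criteria together in one pipeline (a sketch; the operationalizations attn_5 > 3, userid != "", and Q7_0 == 1 are the ones used step by step below):
# A sketch of the combined filter:
#   (1) attn_5 > 3      : passed the attention check
#   (2) userid != ""    : consented and authorized Twitter info
#   (3) Q7_0 == 1       : agreed to follow our study account
#   (4) Progress == 100 : finished the survey
df |>
  filter(Progress==100, attn_5 > 3, userid!="", Q7_0==1) |>
  count()
# expect n = 698, per the step-by-step filtering below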
The baseline subsample we start the analysis from is N=4,032: everyone who reached the consent page (i.e., started the survey). Let’s break it down by recruitment mode.
Twitter recruitment: 80
New panelist recruited after January 1, 2024: 749
Existing panelist: 3203
df |> group_by(vs_recruitment) |> count()
## # A tibble: 3 × 2
## # Groups: vs_recruitment [3]
## vs_recruitment n
## <fct> <int>
## 1 Twitter recruitment 80
## 2 New panelist recruited after January 1, 2024 749
## 3 Existing panelist 3203
First, we filter only on “completes” with Progress==100 (N=3,517).
Twitter recruitment: 58/80 (72.5%)
New panelist recruited after January 1, 2024: 651/749 (86.92%)
Existing panelist: 2808/3203 (87.67%)
df |> filter(Progress==100) |> group_by(vs_recruitment) |> count()
## # A tibble: 3 × 2
## # Groups: vs_recruitment [3]
## vs_recruitment n
## <fct> <int>
## 1 Twitter recruitment 58
## 2 New panelist recruited after January 1, 2024 651
## 3 Existing panelist 2808
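The completion rates above can also be computed in one pass (a sketch):
# A sketch: completion rate by recruitment mode (denominator = all starts)
df |>
  group_by(vs_recruitment) |>
  summarise(starts = n(),
            completes = sum(Progress == 100, na.rm = TRUE),
            completion_rate = completes / starts)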
But as we now know, these cases include respondents who should be dropped for our purposes.
1) Passed attention check?
Among what Verasight calls “completes” (Progress == 100), I kept only those who passed the attention check. This resulted in N=878.
df |> filter(Progress==100) |> filter(attn_5 > 3) |> count()
## # A tibble: 1 × 1
## n
## <int>
## 1 878
(Among those who are complete) Those who passed attention checks:
Twitter recruitment: 42/58 (72.4%)
New panelist recruited after January 1, 2024: 125/651 (19.2%)
Existing panelist: 711/2808 (25.3%)
(Note that the denominator is moving!)
df |> filter(Progress==100) |> filter(attn_5 > 3) |>
group_by(vs_recruitment) |> count()
## # A tibble: 3 × 2
## # Groups: vs_recruitment [3]
## vs_recruitment n
## <fct> <int>
## 1 Twitter recruitment 42
## 2 New panelist recruited after January 1, 2024 125
## 3 Existing panelist 711
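(Note that the numbering below skips to 3): step 2), authorizing a Twitter account, i.e., userid != "", is implicit in the denominators reported there. A sketch of that intermediate filter:
# Step 2) (implicit below): among completes who passed the attention check,
# keep those who shared/authorized a Twitter account (userid != "").
# The step-3 denominators imply per-group counts of 35, 92, and 576 here.
df |> filter(Progress==100) |> filter(attn_5 > 3) |>
  filter(userid!="") |>
  group_by(vs_recruitment) |> count()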
3) Agreed to follow our study account?
(Among those who are complete, passed the attention check, and authorized Twitter accounts) Those who agreed to follow our study account (N=698):
Twitter recruitment: 35/35 (100%)
New panelist recruited after January 1, 2024: 92/92 (100%)
Existing panelist: 571/576 (99.13%)
df |> filter(Progress==100) |> filter(attn_5 > 3) |>
filter(userid!="") |>
filter(Q7_0==1) |>
group_by(vs_recruitment) |>
count()
## # A tibble: 3 × 2
## # Groups: vs_recruitment [3]
## vs_recruitment n
## <fct> <int>
## 1 Twitter recruitment 35
## 2 New panelist recruited after January 1, 2024 92
## 3 Existing panelist 571
B. WTA distribution
So let’s take only those who passed our in-survey criteria (N=698):
df |> filter(Progress==100) |> filter(attn_5 > 3) |>
filter(userid!="") |>
filter(Q7_0==1) -> wta_df
We randomly assigned participants into two groups: one with a scale version vs. the other with an open-ended version.
For those 698 people, I checked the distributions of WTA answers. The user below gave an anomalous response and is therefore removed (the user was assigned to the scale version, but we only see NAs).
wta_df |> select(userid, WTA_1_18:WTA_2_confirm) |>
filter(is.na(WTA_1_18) & is.na(WTA_1_2) & is.na(WTA_2))
## # A tibble: 1 × 11
## userid WTA_1_18 WTA_1_confirm WTA_1_2 WTA_1_2_confirm T_WTA2_First_Click
## <chr> <dbl> <dbl+lbl> <dbl> <dbl+lbl> <dbl>
## 1 16360894752… NA 1 [Continue … NA NA NA
## # ℹ 5 more variables: T_WTA2_Last_Click <dbl>, T_WTA2_Page_Submit <dbl>,
## # T_WTA2_Click_Count <dbl>, WTA_2 <dbl>, WTA_2_confirm <dbl+lbl>
The data looks like this:
wta_df |>
filter(userid!="1636089475200700421") |>
select(userid, WTA_1_18:WTA_2_confirm) |>
mutate(assigned_to = ifelse(is.na(WTA_2), "scale", "open_ended")) -> tb1
datatable(tb1)
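As a quick sanity check on the inferred assignment (a sketch):
# Group sizes by inferred assignment; with the anomalous user dropped,
# expect 345 in the scale arm and 352 in the open-ended arm (697 total)
tb1 |> count(assigned_to)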
1) scale ver.
Let’s check the distribution of WTA for those who were assigned to the scale ver.
285 users out of 345 assigned to the scale ver. chose their WTA values within the 0~30 scale. 60 users out of 345 assigned to the scale ver. chose ‘over 30’ and wrote a number in the follow-up question.
tb1 |>
filter(!is.na(WTA_1_18)) -> wta_scale_within
tb1 |>
filter(!is.na(WTA_1_2)) -> wta_scale_over30
datatable(wta_scale_within)
datatable(wta_scale_over30)
Summary statistics:
summary(wta_scale_within$WTA_1_18)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 8.00 15.00 14.56 20.00 30.00
summary(wta_scale_over30$WTA_1_2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30 50 50 16866 100 1000000
Among those who chose within the scale:
WTA <= $15: 168
- 58.95% out of 285 (within)
- 48.7% out of 345 assigned to scale (= within + over 30)
Among those assigned to scale:
WTA >= $30: 60 (over 30) + 20 (among within) = 80
- 23.19% out of 345 assigned to scale (= within + over 30)
wta_scale_within |> filter(WTA_1_18 <= 15) |> count()
## # A tibble: 1 × 1
## n
## <int>
## 1 168
wta_scale_within |> filter(WTA_1_18 == 30) |> count()
## # A tibble: 1 × 1
## n
## <int>
## 1 20
wta_scale_over30 |> count()
## # A tibble: 1 × 1
## n
## <int>
## 1 60
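The piecewise counts above can also be computed in one pass by unifying the scale arm’s two WTA columns (a sketch, assuming WTA_1_18 and WTA_1_2 are mutually exclusive, as the 285 + 60 = 345 split suggests):
# A sketch: coalesce the within-scale value (WTA_1_18) and the
# 'over 30' follow-up (WTA_1_2) into a single WTA column
tb1 |>
  filter(assigned_to == "scale") |>
  mutate(wta = coalesce(WTA_1_18, WTA_1_2)) |>
  summarise(n = n(),                              # expect 345
            le_15 = sum(wta <= 15, na.rm = TRUE), # expect 168
            ge_30 = sum(wta >= 30, na.rm = TRUE)) # expect 20 + 60 = 80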
2) open-ended ver.
Let’s check the distribution of WTA for those who were assigned to the open-ended ver. (N=352)
tb1 |>
filter(assigned_to=="open_ended") |>
filter(!is.na(WTA_2)) |>
mutate(WTA_2 = as.double(WTA_2)) -> wta_op
datatable(wta_op)
Summary statistics:
summary(wta_op$WTA_2)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000e+00 1.000e+01 1.000e+01 2.844e+06 2.000e+01 1.000e+09
table(wta_op$WTA_2)
##
## 0 0.01 1 2 3 4 5 6 7 8 9 9.99 10
## 5 1 11 5 5 1 43 1 3 6 5 1 97
## 12 14 15 20 23 25 30 35 40 50 75 100 120
## 2 2 36 57 1 12 4 1 2 19 1 13 1
## 150 200 399 500 600 650 700 1000 2500 4000 1e+06 1e+09
## 1 2 1 4 1 1 1 2 1 1 1 1
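The raw mean above is driven by the two extreme responses (1e+06 and 1e+09); a sketch of outlier-robust summaries:
# A sketch: robust summaries for the open-ended arm
wta_op |>
  summarise(median = median(WTA_2),
            trimmed_mean = mean(WTA_2, trim = 0.05))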
WTA <= $15: 224/352 (63.64%)
WTA >= $30: 58/352 (16.48%)
wta_op |> filter(WTA_2 <= 15) |> count()
## # A tibble: 1 × 1
## n
## <int>
## 1 224
wta_op |> filter(WTA_2 >= 30) |> count()
## # A tibble: 1 × 1
## n
## <int>
## 1 58
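Putting the two arms side by side (a sketch using the counts reported above):
# A sketch: cross-arm comparison of the WTA shares reported above
tribble(
  ~arm,         ~n,  ~wta_le_15, ~wta_ge_30,
  "scale",      345, 168,        80,
  "open_ended", 352, 224,        58
) |>
  mutate(pct_le_15 = wta_le_15 / n,
         pct_ge_30 = wta_ge_30 / n)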
Given that we will drop cases where WTA > $15 in Wave 2, the open-ended ver. seems to be the better option?
Brendan’s note: