wave1_0207

[Wave 1] Raw Data

A. Comparison of Verasight’s summary vs. raw data

The raw data files contain:

  • 4,032 survey starts, including completes and incompletes (anyone who got to the consent page)
library(tidyverse)
library(DT)
library(haven)
df <- readRDS("~/Downloads/2023-095a_files/verasight_wave1/2023-095a_client_wave-1.rds")
nrow(df)
## [1] 4032
  • 3,517 completes including attention check failures (per Verasight)
df |> filter(Progress==100) |> count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1  3517
  • 3,064 completes excluding attention check failures (per Verasight)

    • 3,517 - 453 = 3,064
# attn_5 < 4 counts the attention-check failures
df |> filter(Progress==100) |> filter(attn_5 < 4) |> count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1   453
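As a cross-check, the 3,064 figure is simply the completes who did not fail the check, i.e., passes plus NAs (878 + 2,186, per the grouped counts below):

# completes who passed (attn_5 > 3) or have NA on the check
df |> filter(Progress==100) |> filter(is.na(attn_5) | attn_5 > 3) |> count()
# expected: 878 + 2186 = 3064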
  • But looking further into the data: only 878 completes actually passed the attention check (excluding NAs)

    • Those who passed attention checks: 878

    • Those who failed attention checks: 453

    • NAs: 2186

df |> filter(Progress==100) |> group_by(attn_5) |> count()
## # A tibble: 6 × 2
## # Groups:   attn_5 [6]
##   attn_5                              n
##   <dbl+lbl>                       <int>
## 1  1 [Strongly agree]               262
## 2  2 [Somewhat agree]                69
## 3  3 [Neither agree nor disagree]   122
## 4  4 [Somewhat disagree]             31
## 5  5 [Strongly disagree]            847
## 6 NA                               2186
  • 701 completes who shared their Twitter accounts (per Verasight)

    • But I count 703 completes who shared Twitter accounts:
df |> filter(Progress==100) |> filter(userid!="") |> count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1   703

It seems that what Verasight denotes as “completes” (cases with Progress==100) includes many respondents with NAs. In other words, Progress==100 does not necessarily mean a participant passed all the survey requirements.

With this in mind, we are looking for participants who (1) passed the attention checks, (2) consented and authorized their Twitter info, (3) agreed to follow our study account, and (4) completed the survey (Progress==100). The rest of this section applies these filters one at a time; a one-pass version is sketched just below.
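In one pass, the eligibility filter looks like this (the same filters as the step-by-step code below; attn_5 > 3 encodes a pass, a non-empty userid a shared Twitter account, and Q7_0 == 1 agreement to follow):

# all four eligibility criteria at once
df |>
  filter(Progress == 100) |>   # (4) completed the survey
  filter(attn_5 > 3) |>        # (1) passed the attention check
  filter(userid != "") |>      # (2) consented and authorized Twitter info
  filter(Q7_0 == 1)            # (3) agreed to follow the study account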

The baseline subsample for the analysis is N=4,032: everyone who reached the consent page (i.e., started the survey). Let’s break it down by recruitment mode.

  • Twitter recruitment: 80

  • New panelist recruited after January 1, 2024: 749

  • Existing panelist: 3203

df |> group_by(vs_recruitment) |> count()
## # A tibble: 3 × 2
## # Groups:   vs_recruitment [3]
##   vs_recruitment                                   n
##   <fct>                                        <int>
## 1 Twitter recruitment                             80
## 2 New panelist recruited after January 1, 2024   749
## 3 Existing panelist                             3203

Filtering to “completes” (Progress==100) leaves N=3,517.

  • Twitter recruitment: 58/80 (72.5%)

  • New panelist recruited after January 1, 2024: 651/749 (86.92%)

  • Existing panelist: 2808/3203 (87.67%)

df |> filter(Progress==100) |> group_by(vs_recruitment) |> count()
## # A tibble: 3 × 2
## # Groups:   vs_recruitment [3]
##   vs_recruitment                                   n
##   <fct>                                        <int>
## 1 Twitter recruitment                             58
## 2 New panelist recruited after January 1, 2024   651
## 3 Existing panelist                             2808
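Equivalently, the completion rates above can be computed directly instead of by hand:

# completion rate (Progress == 100) by recruitment mode
df |>
  group_by(vs_recruitment) |>
  summarise(started = n(),
            completes = sum(Progress == 100),
            rate = completes / started)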

But as we now know, these cases include respondents who should be dropped for our purposes.

1) Passed attention check?

Among what Verasight calls “completes” (Progress == 100), I kept only those who passed the attention checks. This leaves N=878.

# pass = attn_5 > 3 (Somewhat disagree or Strongly disagree)
df |> filter(Progress==100) |> filter(attn_5 > 3) |> count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1   878

(Among the completes) Those who passed the attention checks:

  • Twitter recruitment: 42/58 (72.4%)

  • New panelist recruited after January 1, 2024: 125/651 (19.2%)

  • Existing panelist: 711/2808 (25.3%)

(Note that the denominator is moving!)

df |> filter(Progress==100) |> filter(attn_5 > 3) |> 
  group_by(vs_recruitment) |> count()
## # A tibble: 3 × 2
## # Groups:   vs_recruitment [3]
##   vs_recruitment                                   n
##   <fct>                                        <int>
## 1 Twitter recruitment                             42
## 2 New panelist recruited after January 1, 2024   125
## 3 Existing panelist                              711

2) Consented and authorized Twitter info?

(Among completes who passed the attention checks) Those who consented and authorized their Twitter accounts (N=703):

  • Twitter recruitment: 35/42 (83.3%)

  • New panelist recruited after January 1, 2024: 92/125 (73.6%)

  • Existing panelist: 576/711 (81.0%)

df |> filter(Progress==100) |> filter(attn_5 > 3) |> 
  filter(userid!="") |> group_by(vs_recruitment) |> count()
## # A tibble: 3 × 2
## # Groups:   vs_recruitment [3]
##   vs_recruitment                                   n
##   <fct>                                        <int>
## 1 Twitter recruitment                             35
## 2 New panelist recruited after January 1, 2024    92
## 3 Existing panelist                              576

3) Agreed to follow our study account?

(Among completes who passed the attention checks and authorized their Twitter accounts) Those who agreed to follow our study account (N=698):

  • Twitter recruitment: 35/35 (100%)

  • New panelist recruited after January 1, 2024: 92/92 (100%)

  • Existing panelist: 571/576 (99.13%)

df |> filter(Progress==100) |> filter(attn_5 > 3) |> 
  filter(userid!="") |> 
  filter(Q7_0==1) |> 
  group_by(vs_recruitment) |> 
  count()
## # A tibble: 3 × 2
## # Groups:   vs_recruitment [3]
##   vs_recruitment                                   n
##   <fct>                                        <int>
## 1 Twitter recruitment                             35
## 2 New panelist recruited after January 1, 2024    92
## 3 Existing panelist                              571
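To keep the moving denominators straight, the whole funnel can also be tabulated in one pass (a convenience sketch; each stage is nested in the previous one):

# eligibility funnel by recruitment mode
df |>
  mutate(
    complete = Progress == 100,
    passed   = complete & coalesce(attn_5 > 3, FALSE),
    shared   = passed & userid != "",
    follows  = shared & coalesce(Q7_0 == 1, FALSE)
  ) |>
  group_by(vs_recruitment) |>
  summarise(started = n(),
            across(c(complete, passed, shared, follows), sum))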

B. WTA distribution

So let’s keep only those who passed our in-survey criteria (N=698):

# complete + passed attention check + authorized Twitter + agreed to follow
df |> filter(Progress==100) |> filter(attn_5 > 3) |> 
  filter(userid!="") |> 
  filter(Q7_0==1) -> wta_df

We randomly assigned participants to one of two versions of the WTA question: a scale version and an open-ended version.

For those 698 people, I checked the distributions of the WTA answers. The user below has an anomalous response and is removed (they were assigned to the scale ver., but all of their WTA fields are NA).

wta_df |> select(userid, WTA_1_18:WTA_2_confirm) |>
  filter(is.na(WTA_1_18) & is.na(WTA_1_2) & is.na(WTA_2)) 
## # A tibble: 1 × 11
##   userid       WTA_1_18 WTA_1_confirm WTA_1_2 WTA_1_2_confirm T_WTA2_First_Click
##   <chr>           <dbl> <dbl+lbl>       <dbl> <dbl+lbl>                    <dbl>
## 1 16360894752…       NA 1 [Continue …      NA NA                              NA
## # ℹ 5 more variables: T_WTA2_Last_Click <dbl>, T_WTA2_Page_Submit <dbl>,
## #   T_WTA2_Click_Count <dbl>, WTA_2 <dbl>, WTA_2_confirm <dbl+lbl>

The data looks like this:

wta_df |> 
  filter(userid!="1636089475200700421") |>   # drop the anomalous respondent
  select(userid, WTA_1_18:WTA_2_confirm) |> 
  # WTA_2 is only asked in the open-ended ver., so NA there means scale ver.
  mutate(assigned_to = ifelse(is.na(WTA_2), "scale", "open_ended")) -> tb1

datatable(tb1)

1) Scale ver.

Let’s check the distribution of WTA for those assigned to the scale ver.

  • 285 of the 345 users assigned to the scale ver. chose a WTA value within the 0–30 scale.

  • 60 of the 345 chose “over 30” and wrote in a number in the follow-up question.

tb1 |>
  filter(!is.na(WTA_1_18)) -> wta_scale_within

tb1 |>
  filter(!is.na(WTA_1_2)) -> wta_scale_over30

datatable(wta_scale_within)
datatable(wta_scale_over30)

Summary statistics:

summary(wta_scale_within$WTA_1_18)  
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    8.00   15.00   14.56   20.00   30.00
summary(wta_scale_over30$WTA_1_2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      30      50      50   16866     100 1000000

Among those who chose within the scale:

  • WTA <= $15: 168

    • 58.95% of the 285 within-scale responses

    • 48.7% of all 345 assigned to the scale ver. (within + over 30)

Among all those assigned to the scale ver.:

  • WTA >= $30: 60 (over 30) + 20 (exactly 30 within the scale) = 80

    • 23.19% of the 345
wta_scale_within |> filter(WTA_1_18 <= 15) |> count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1   168
wta_scale_within |> filter(WTA_1_18 == 30) |> count() 
## # A tibble: 1 × 1
##       n
##   <int>
## 1    20
wta_scale_over30 |> count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1    60
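Combining the two pieces of the >= $30 group in a single check:

# WTA >= $30 among all 345 assigned to the scale ver.:
# 20 who chose exactly 30 on the scale + 60 who chose "over 30"
tb1 |>
  filter(assigned_to == "scale") |>
  summarise(n = n(),
            n_30plus = sum(coalesce(WTA_1_18 == 30, FALSE) | !is.na(WTA_1_2)),
            share = n_30plus / n)   # expect 80/345 = 23.19%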

2) Open-ended ver.

Let’s check the distribution of WTA for those assigned to the open-ended ver. (N=352).

tb1 |> 
  filter(assigned_to=="open_ended") |> 
  filter(!is.na(WTA_2)) |>
  mutate(WTA_2 = as.double(WTA_2)) -> wta_op

datatable(wta_op)

Summary statistics:

summary(wta_op$WTA_2)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.000e+00 1.000e+01 1.000e+01 2.844e+06 2.000e+01 1.000e+09
table(wta_op$WTA_2)
## 
##     0  0.01     1     2     3     4     5     6     7     8     9  9.99    10 
##     5     1    11     5     5     1    43     1     3     6     5     1    97 
##    12    14    15    20    23    25    30    35    40    50    75   100   120 
##     2     2    36    57     1    12     4     1     2    19     1    13     1 
##   150   200   399   500   600   650   700  1000  2500  4000 1e+06 1e+09 
##     1     2     1     4     1     1     1     2     1     1     1     1
  • WTA <= $15: 224/352 (63.64%)

  • WTA >= $30: 58/352 (16.48%)

wta_op |> filter(WTA_2 <= 15) |> count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1   224
wta_op |> filter(WTA_2 >= 30) |> count() 
## # A tibble: 1 × 1
##       n
##   <int>
## 1    58

Given that we will drop cases with WTA > $15 in Wave 2, the open-ended ver. seems to be the better option?

Brendan’s note:

(generally agree on open-end given that we’re trying to maximize eligibility, though we’ll have to think about how to handle people who write 1000000000 etc. for stats)
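One possible way to handle those extreme open-ended values, as a sketch rather than a decision: report the median (already robust) and winsorize before taking means. The 95th-percentile cap below is an arbitrary choice for illustration.

# hypothetical: cap open-ended WTA at its 95th percentile before averaging
wta_op |>
  mutate(WTA_2_capped = pmin(WTA_2, quantile(WTA_2, 0.95))) |>
  summarise(median_raw = median(WTA_2),
            mean_raw = mean(WTA_2),
            mean_capped = mean(WTA_2_capped))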