Title: | Generic Machine Learning Inference |
---|---|
Description: | Generic Machine Learning Inference on heterogeneous treatment effects in randomized experiments as proposed in Chernozhukov, Demirer, Duflo and Fernández-Val (2020) <arXiv:1712.04802>. This package's workhorse is the 'mlr3' framework of Lang et al. (2019) <doi:10.21105/joss.01903>, which enables the specification of a wide variety of machine learners. The main functionality, GenericML(), runs Algorithm 1 in Chernozhukov, Demirer, Duflo and Fernández-Val (2020) <arXiv:1712.04802> for a suite of user-specified machine learners. All steps in the algorithm are customizable via setup functions. Methods for printing and plotting are available for objects returned by GenericML(). Parallel computing is supported. |
Authors: | Max Welz [aut, cre] |
Maintainer: | Max Welz <[email protected]> |
License: | GPL (>= 3) |
Version: | 0.2.3 |
Built: | 2025-02-21 06:02:26 UTC |
Source: | https://github.com/mwelz/genericml |
Performs the linear regression for the Best Linear Predictor (BLP) procedure.
    BLP(
      Y,
      D,
      propensity_scores,
      proxy_BCA,
      proxy_CATE,
      HT = FALSE,
      X1_control = setup_X1(),
      vcov_control = setup_vcov(),
      external_weights = NULL,
      significance_level = 0.05
    )
Y | A numeric vector containing the response variable. |
D | A binary vector of treatment assignment. Value one denotes assignment to the treatment group and value zero assignment to the control group. |
propensity_scores | A numeric vector of propensity scores. We recommend to use the estimates of a |
proxy_BCA | A numeric vector of proxy baseline conditional average (BCA) estimates. We recommend to use the estimates of a |
proxy_CATE | A numeric vector of proxy conditional average treatment effect (CATE) estimates. We recommend to use the estimates of a |
HT | Logical. If |
X1_control | Specifies the design matrix |
vcov_control | Specifies the covariance matrix estimator. Must be an object of class |
external_weights | Optional vector of external numeric weights for weighted regression (in addition to the standard weights used when |
significance_level | Significance level. Default is 0.05. |
An object of class "BLP", consisting of the following components:
generic_targets
A matrix of the inferential results on the BLP generic targets.
coefficients
An object of class "coeftest" that contains the coefficients of the BLP regression.
lm
An object of class "lm" used to fit the linear regression model.
Chernozhukov V., Demirer M., Duflo E., Fernández-Val I. (2020). “Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments.” arXiv preprint arXiv:1712.04802. URL: https://arxiv.org/abs/1712.04802.
setup_X1(), setup_diff(), setup_vcov(), propensity_score(), proxy_BCA(), proxy_CATE()
    ## generate data
    set.seed(1)
    n <- 150                              # number of observations
    p <- 5                                # number of covariates
    D <- rbinom(n, 1, 0.5)                # random treatment assignment
    Y <- runif(n)                         # outcome variable
    propensity_scores <- rep(0.5, n)      # propensity scores
    proxy_BCA <- runif(n)                 # proxy BCA estimates
    proxy_CATE <- runif(n)                # proxy CATE estimates

    ## perform BLP
    BLP(Y, D, propensity_scores, proxy_BCA, proxy_CATE)
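For orientation, the following base R sketch fits the kind of weighted linear regression that the BLP strategy of Chernozhukov et al. (2020) is built on. It is a hand-rolled illustration and not the internals of BLP(): the simulated data, the helper names D_centered and S_centered, and the choice of an intercept plus the proxy BCA as controls are assumptions made for this example.

```r
## simulate an experiment with a known treatment effect of 1 + S (illustrative data)
set.seed(1)
n <- 500
D <- rbinom(n, 1, 0.5)        # random treatment assignment
p <- rep(0.5, n)              # known propensity scores
B <- runif(n)                 # stand-in for proxy BCA estimates
S <- runif(n)                 # stand-in for proxy CATE estimates
Y <- B + D * (1 + S) + rnorm(n, sd = 0.1)

## regressors and weights of the weighted regression (names are hypothetical)
w          <- 1 / (p * (1 - p))   # weights based on the propensity scores
D_centered <- D - p               # treatment centered at the propensity score
S_centered <- S - mean(S)         # demeaned proxy CATE

fit  <- lm(Y ~ B + D_centered + I(D_centered * S_centered), weights = w)
beta <- coef(fit)

beta[3]  # coefficient on D_centered: average treatment effect
beta[4]  # coefficient on the interaction: loading on the proxy CATE
```

In this simulation the true average treatment effect is 1.5 and the proxy CATE tracks the true heterogeneity one-to-one, so both coefficients should land near 1.5 and 1, respectively.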
Performs Classification Analysis (CLAN) on all variables in a design matrix.
    CLAN(
      Z_CLAN,
      membership,
      equal_variances = FALSE,
      diff = setup_diff(),
      external_weights = NULL,
      significance_level = 0.05
    )
Z_CLAN | A numeric matrix holding variables on which classification analysis (CLAN) shall be performed. CLAN will be performed on each column of the matrix. |
membership | A logical matrix that indicates the group membership of each observation in |
equal_variances | (deprecated and will be removed in a future release) If |
diff | Specifies the generic targets of CLAN. Must be an object of class |
external_weights | Optional vector of external numeric weights for weighted means. |
significance_level | Significance level. Default is 0.05. |
An object of the class "CLAN", consisting of the following components:
generic_targets
A list of result matrices for each variable in Z_CLAN. Each matrix contains inferential results on the CLAN generic targets.
coefficients
A matrix of point estimates of each CLAN generic target parameter.
Chernozhukov V., Demirer M., Duflo E., Fernández-Val I. (2020). “Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments.” arXiv preprint arXiv:1712.04802. URL: https://arxiv.org/abs/1712.04802.
quantile_group(), setup_diff()
    ## generate data
    set.seed(1)
    n <- 150                                # number of observations
    p <- 5                                  # number of covariates
    Z_CLAN <- matrix(runif(n*p), n, p)      # design matrix to perform CLAN on
    membership <- quantile_group(rnorm(n))  # group membership

    ## perform CLAN
    CLAN(Z_CLAN, membership)
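The idea behind CLAN can be sketched in base R: sort observations into quantile groups, then compare the mean of a covariate between the most and the least affected group. The snippet is an illustration under assumptions of our own. The grouping is built by hand with cut() rather than with quantile_group(), the score vector is simulated, and a Welch t-test stands in for whatever inference CLAN() performs internally.

```r
## simulated covariate and a score used for grouping (illustrative data)
set.seed(1)
n     <- 200
z     <- runif(n)   # one column of a hypothetical Z_CLAN
score <- rnorm(n)   # stand-in for proxy CATE estimates

## logical membership matrix with one column per quartile group
breaks <- quantile(score, probs = c(0, 0.25, 0.5, 0.75, 1))
group  <- cut(score, breaks = breaks, include.lowest = TRUE,
              labels = paste0("G", 1:4))
membership <- sapply(levels(group), function(g) group == g)

## mean of z in the least (G1) and most (G4) affected group
mean_least <- mean(z[membership[, "G1"]])
mean_most  <- mean(z[membership[, "G4"]])

## difference "most minus least" with a Welch two-sample test
welch <- t.test(z[membership[, "G4"]], z[membership[, "G1"]])
c(difference = mean_most - mean_least, p_value = welch$p.value)
```

Each row of the membership matrix has exactly one TRUE entry, so every observation belongs to one and only one group.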
Performs the linear regression for the Group Average Treatment Effects (GATES) procedure.
    GATES(
      Y,
      D,
      propensity_scores,
      proxy_BCA,
      proxy_CATE,
      membership,
      HT = FALSE,
      X1_control = setup_X1(),
      vcov_control = setup_vcov(),
      diff = setup_diff(),
      monotonize = TRUE,
      external_weights = NULL,
      significance_level = 0.05
    )
Y | A numeric vector containing the response variable. |
D | A binary vector of treatment assignment. Value one denotes assignment to the treatment group and value zero assignment to the control group. |
propensity_scores | A numeric vector of propensity scores. We recommend to use the estimates of a |
proxy_BCA | A numeric vector of proxy baseline conditional average (BCA) estimates. We recommend to use the estimates of a |
proxy_CATE | A numeric vector of proxy conditional average treatment effect (CATE) estimates. We recommend to use the estimates of a |
membership | A logical matrix that indicates the group membership of each observation in |
HT | Logical. If |
X1_control | Specifies the design matrix |
vcov_control | Specifies the covariance matrix estimator. Must be an object of class |
diff | Specifies the generic targets of GATES. Must be an object of class |
monotonize | Logical. Should GATES point estimates and confidence bounds be rearranged to be monotonically increasing following the monotonization method of Chernozhukov et al. (2009, Biometrika)? Default is |
external_weights | Optional vector of external numeric weights for weighted regression (in addition to the standard weights used when |
significance_level | Significance level. Default is 0.05. |
An object of class "GATES", consisting of the following components:
generic_targets
A matrix of the inferential results on the GATES generic targets.
coefficients
An object of class "coeftest" that contains the coefficients of the GATES regression.
lm
An object of class "lm" used to fit the linear regression model.
Chernozhukov V., Demirer M., Duflo E., Fernández-Val I. (2020). “Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments.” arXiv preprint arXiv:1712.04802. URL: https://arxiv.org/abs/1712.04802.
Chernozhukov V., Fernández-Val I., Galichon A. (2009). “Improving Point and Interval Estimators of Monotone Functions by Rearrangement.” Biometrika, 96(3), 559–575. doi:10.1093/biomet/asp030.
setup_X1(), setup_diff(), setup_vcov(), propensity_score(), proxy_BCA(), proxy_CATE()
    ## generate data
    set.seed(1)
    n <- 150                                  # number of observations
    p <- 5                                    # number of covariates
    D <- rbinom(n, 1, 0.5)                    # random treatment assignment
    Y <- runif(n)                             # outcome variable
    propensity_scores <- rep(0.5, n)          # propensity scores
    proxy_BCA <- runif(n)                     # proxy BCA estimates
    proxy_CATE <- runif(n)                    # proxy CATE estimates
    membership <- quantile_group(proxy_CATE)  # group membership

    ## perform GATES
    GATES(Y, D, propensity_scores, proxy_BCA, proxy_CATE, membership)
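As a rough base R sketch of the GATES idea, one can regress the outcome on one term (D - p) * 1{G_k} per quantile group and read off the group-wise coefficients. The code below is an assumption-laden illustration rather than the implementation of GATES(): the data are simulated, the groups are built with cut(), the controls are an intercept plus the proxy BCA, and monotonization is shown only for point estimates, where rearranging values on a finite set of groups amounts to sorting them.

```r
## simulated experiment in which the treatment effect 2 * S grows with S (illustrative data)
set.seed(1)
n <- 800
D <- rbinom(n, 1, 0.5)        # random treatment assignment
p <- rep(0.5, n)              # known propensity scores
B <- runif(n)                 # stand-in for proxy BCA estimates
S <- runif(n)                 # stand-in for proxy CATE estimates
Y <- B + D * 2 * S + rnorm(n, sd = 0.1)

## quartile groups of the proxy CATE: G1 = least affected, G4 = most affected
group <- cut(S, breaks = quantile(S, probs = c(0, 0.25, 0.5, 0.75, 1)),
             include.lowest = TRUE, labels = paste0("G", 1:4))

## one regressor (D - p) * 1{G_k} per group, with propensity-score weights
G_terms <- model.matrix(~ group - 1) * (D - p)
colnames(G_terms) <- paste0("gamma_", levels(group))
fit <- lm(Y ~ B + G_terms, weights = 1 / (p * (1 - p)))

## group-wise estimates and their monotonized (sorted) version
gates             <- coef(fit)[-(1:2)]
gates_monotonized <- sort(gates)
gates_monotonized
```

With this data-generating process the group averages of the true effect are roughly 0.25, 0.75, 1.25, and 1.75, so the sorted estimates should be close to those values.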
Performs generic machine learning inference on heterogeneous treatment effects as in Chernozhukov, Demirer, Duflo and Fernández-Val (2020) with user-specified machine learning methods. Intended for randomized experiments.
    GenericML(
      Z,
      D,
      Y,
      learners_GenericML,
      learner_propensity_score = "constant",
      num_splits = 100,
      Z_CLAN = NULL,
      HT = FALSE,
      quantile_cutoffs = c(0.25, 0.5, 0.75),
      X1_BLP = setup_X1(),
      X1_GATES = setup_X1(),
      diff_GATES = setup_diff(),
      diff_CLAN = setup_diff(),
      vcov_BLP = setup_vcov(),
      vcov_GATES = setup_vcov(),
      monotonize = TRUE,
      external_weights = NULL,
      equal_variances_CLAN = FALSE,
      prop_aux = 0.5,
      stratify = setup_stratify(),
      significance_level = 0.05,
      min_variation = 1e-05,
      parallel = FALSE,
      num_cores = parallel::detectCores(),
      seed = NULL,
      store_learners = FALSE,
      store_splits = TRUE
    )
Z | A numeric design matrix that holds the covariates in its columns. |
D | A binary vector of treatment assignment. Value one denotes assignment to the treatment group and value zero assignment to the control group. |
Y | A numeric vector containing the response variable. |
learners_GenericML | A character vector specifying the machine learners to be used for estimating the baseline conditional average (BCA) and conditional average treatment effect (CATE). Either |
learner_propensity_score | The estimator of the propensity scores. Either a numeric vector (which is then taken as estimates of the propensity scores) or a string specifying the estimator. In the latter case, the string must either be equal to |
num_splits | Number of sample splits. Default is 100. Must be larger than one. If you want to run |
Z_CLAN | A numeric matrix holding variables on which classification analysis (CLAN) shall be performed. CLAN will be performed on each column of the matrix. If |
HT | Logical. If |
quantile_cutoffs | The cutoff points of the quantiles that shall be used for GATES grouping. Default is |
X1_BLP | Specifies the design matrix |
X1_GATES | Same as |
diff_GATES | Specifies the generic targets of GATES. Must be an object of class |
diff_CLAN | Same as |
vcov_BLP | Specifies the covariance matrix estimator in the BLP regression. Must be an object of class |
vcov_GATES | Same as |
monotonize | Logical. Should GATES point estimates and confidence bounds be rearranged to be monotonically increasing following the monotonization method of Chernozhukov et al. (2009, Biometrika)? Default is |
external_weights | Optional vector of external numeric weights for weighted means in CLAN and weighted regression in BLP and GATES (in addition to the standard weights used when |
equal_variances_CLAN | (deprecated and will be removed in a future release) Logical. If |
prop_aux | Proportion of samples that shall be in the auxiliary set in case of random sample splitting. Default is 0.5. The number of samples in the auxiliary set will be equal to |
stratify | A list that specifies whether or not stratified sample splitting shall be performed. It is recommended to use the returned object of |
significance_level | Significance level for VEIN. Default is 0.05. |
min_variation | Specifies a threshold for the minimum variation of the BCA/CATE predictions. If the variation of a BCA/CATE prediction falls below this threshold, random noise with distribution |
parallel | Logical. If |
num_cores | Number of cores to be used in parallelization (if applicable). Default is the number of cores of the user's machine. |
seed | Random seed. Default is |
store_learners | Logical. If |
store_splits | Logical. If |
The specifications "lasso", "random_forest", and "tree" in learners_GenericML and learner_propensity_score correspond to the following mlr3 specifications (we omit the keywords classif. and regr.). "lasso" is a cross-validated Lasso estimator, which corresponds to 'mlr3::lrn("cv_glmnet", s = "lambda.min", alpha = 1)'. "random_forest" is a random forest with 500 trees, which corresponds to 'mlr3::lrn("ranger", num.trees = 500)'. "tree" is a tree learner, which corresponds to 'mlr3::lrn("rpart")'.
Warning: GenericML() can be quite memory-intensive, in particular when the data set is large. To alleviate memory usage, consider setting store_learners = FALSE, choosing a low number of cores via num_cores (at the expense of longer computing time), setting prop_aux to a value smaller than the default of 0.5, or using GenericML_combine().
An object of class "GenericML". On this object, we recommend using the accessor functions get_BLP(), get_GATES(), and get_CLAN() to extract the results of the analyses of BLP, GATES, and CLAN, respectively. An object of class "GenericML" contains the following components:
VEIN
A list containing two sub-lists called best_learners and all_learners, respectively. Each of these two sub-lists contains the inferential VEIN results on the generic targets of the BLP, GATES, and CLAN analyses. all_learners does this for all learners specified in the argument learners_GenericML, best_learners only for the corresponding best learners. Which learner is best for which analysis is assessed by the criteria discussed in Sections 5.2 and 5.3 of the paper.
best
A list containing information on the evaluation of which learner is the best for which analysis. Contains four components. The first three contain the name of the best learner for BLP, GATES, and CLAN, respectively. The fourth component, overview, contains the two criteria used to determine the best learners (discussed in Sections 5.2 and 5.3 of the paper).
propensity_scores
The propensity score estimates as well as the "mlr3" objects used to estimate them (if mlr3 was used for estimation).
GenericML_single
Only nonempty if store_learners = TRUE. Contains all intermediate results of each learner for each split. That is, for a given learner (first level of the list) and split (second level), objects of classes "BLP", "GATES", "CLAN", "proxy_BCA", "proxy_CATE" as well as the criteria ("best") are listed, which were computed with the given learner and split.
splits
Only nonempty if store_splits = TRUE. Contains a character matrix of dimension length(Y) by num_splits. Contains the group membership (main or auxiliary) of each observation (rows) in each split (columns). "M" denotes the main set, "A" the auxiliary set.
generic_targets
A list of generic target estimates for each learner. More specifically, each component is a list of the generic target estimates pertaining to the BLP, GATES, and CLAN analyses. Each of those lists contains a three-dimensional array containing the generic targets of a single learner for all sample splits (except CLAN where there is one more layer of lists).
arguments
A list of arguments used in the function call.
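To make the layout of the splits component concrete, here is a small base R sketch that builds a character matrix of the same shape by hand. It only mimics the documented structure: the settings, the simple random sampling, and in particular the rounding rule floor(prop_aux * n) for the size of the auxiliary set are assumptions of this example, not a statement about how GenericML() draws its splits.

```r
## illustrative settings (all values are assumptions for this sketch)
set.seed(1)
n          <- 10     # number of observations, i.e. length(Y)
num_splits <- 3
prop_aux   <- 0.5

## one column per split: "A" = auxiliary set, "M" = main set
splits <- sapply(seq_len(num_splits), function(s) {
  A_set <- sample(n, size = floor(prop_aux * n))   # assumed rounding rule
  ifelse(seq_len(n) %in% A_set, "A", "M")
})

dim(splits)             # n rows, num_splits columns
colSums(splits == "A")  # size of the auxiliary set in each split

## recover the index sets of the first split
M_set <- which(splits[, 1] == "M")
A_set <- setdiff(seq_len(n), M_set)
```

Index sets recovered this way have the same form as the M_set and A_set arguments of GenericML_single().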
In an earlier development version, Lucas Kitzmueller alerted us to several minor bugs and proposed fixes. Many thanks to him!
Chernozhukov V., Demirer M., Duflo E., Fernández-Val I. (2020). “Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments.” arXiv preprint arXiv:1712.04802. URL: https://arxiv.org/abs/1712.04802.
Lang M., Binder M., Richter J., Schratz P., Pfisterer F., Coors S., Au Q., Casalicchio G., Kotthoff L., Bischl B. (2019). “mlr3: A Modern Object-Oriented Machine Learning Framework in R.” Journal of Open Source Software, 4(44), 1903. doi:10.21105/joss.01903.
Chernozhukov V., Fernández-Val I., Galichon A. (2009). “Improving Point and Interval Estimators of Monotone Functions by Rearrangement.” Biometrika, 96(3), 559–575. doi:10.1093/biomet/asp030.
plot.GenericML(), print.GenericML(), get_BLP(), get_GATES(), get_CLAN(), setup_X1(), setup_diff(), setup_vcov(), setup_stratify(), GenericML_single(), GenericML_combine()
    if (require("glmnet") && require("ranger")) {

      ## generate data
      set.seed(1)
      n  <- 150                                  # number of observations
      p  <- 5                                    # number of covariates
      D  <- rbinom(n, 1, 0.5)                    # random treatment assignment
      Z  <- matrix(runif(n*p), n, p)             # design matrix
      Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n)) # potential outcome without treatment
      Y1 <- 2 + Y0                               # potential outcome under treatment
      Y  <- ifelse(D == 1, Y1, Y0)               # observed outcome

      ## column names of Z
      colnames(Z) <- paste0("V", 1:p)

      ## specify learners
      learners <- c("lasso", "mlr3::lrn('ranger', num.trees = 10)")

      ## glmnet v4.1.3 isn't supported on Solaris, so skip Lasso in this case
      if(Sys.info()["sysname"] == "SunOS") learners <- learners[-1]

      ## specify quantile cutoffs (the 4 quartile groups here)
      quantile_cutoffs <- c(0.25, 0.5, 0.75)

      ## specify the differenced generic targets of GATES and CLAN
      # use G4-G1, G4-G2, G4-G3 as differenced generic targets in GATES
      diff_GATES <- setup_diff(subtract_from = "most",
                               subtracted = c(1,2,3))
      # use G1-G3, G1-G2 as differenced generic targets in CLAN
      diff_CLAN <- setup_diff(subtract_from = "least",
                              subtracted = c(3,2))

      ## perform generic ML inference
      # small number of splits to keep computation time low
      x <- GenericML(Z, D, Y, learners, num_splits = 2,
                     quantile_cutoffs = quantile_cutoffs,
                     diff_GATES = diff_GATES,
                     diff_CLAN = diff_CLAN,
                     parallel = FALSE)

      ## access BLP generic targets for best learner and make plot
      get_BLP(x, plot = TRUE)

      ## access GATES generic targets for best learner and make plot
      get_GATES(x, plot = TRUE)

      ## access CLAN generic targets for "V1" & best learner and make plot
      get_CLAN(x, variable = "V1", plot = TRUE)

    }
This function combines multiple "GenericML" objects into one "GenericML" object. Combining several "GenericML" objects can be useful when you cannot run GenericML() for sufficiently many splits due to memory constraints. In this case, you may run GenericML() multiple times with only a small number of sample splits each and combine the returned "GenericML" objects into one GenericML object with this function.
    GenericML_combine(x)
x | A list of |
To ensure consistency of the estimates, all "GenericML" objects in the list x must have the exact same parameter specifications in their original call to GenericML(), except for the parameters num_splits, parallel, num_cores, seed, and store_learners (i.e. these arguments may vary between the "GenericML" objects in the list x). An error will be thrown if this is not satisfied.
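The consistency requirement can be pictured with a small base R check on two argument lists. This is a hypothetical sketch: the lists args_1 and args_2 and the helper same_specification() are invented for illustration and contain only a handful of arguments; GenericML_combine() performs its own check and may do so differently.

```r
## hypothetical 'arguments' components of two "GenericML" objects
args_1 <- list(num_splits = 2, HT = FALSE, prop_aux = 0.5,
               parallel = FALSE, num_cores = 1, seed = 1, store_learners = FALSE)
args_2 <- list(num_splits = 5, HT = FALSE, prop_aux = 0.5,
               parallel = TRUE, num_cores = 4, seed = 2, store_learners = TRUE)

## arguments that are allowed to differ between the objects
may_differ <- c("num_splits", "parallel", "num_cores", "seed", "store_learners")

## TRUE if all remaining arguments coincide (helper name is hypothetical)
same_specification <- function(a, b, exempt = may_differ) {
  a <- a[setdiff(names(a), exempt)]
  b <- b[setdiff(names(b), exempt)]
  identical(a[order(names(a))], b[order(names(b))])
}

res_same <- same_specification(args_1, args_2)
res_diff <- same_specification(args_1, modifyList(args_2, list(HT = TRUE)))
res_same  # TRUE: only exempt arguments differ
res_diff  # FALSE: HT differs between the two lists
```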
A "GenericML" object as returned by GenericML(). In the arguments component of this object, the objects parallel, num_cores, seed, and store_learners are set to NULL as these might differ between the individual GenericML objects in x. Moreover, the propensity_scores component of the returned object is taken from the first "GenericML" object in x.
    if (require("glmnet") && require("ranger")) {

      ## generate data
      set.seed(1)
      n  <- 150                                  # number of observations
      p  <- 5                                    # number of covariates
      D  <- rbinom(n, 1, 0.5)                    # random treatment assignment
      Z  <- matrix(runif(n*p), n, p)             # design matrix
      Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n)) # potential outcome without treatment
      Y1 <- 2 + Y0                               # potential outcome under treatment
      Y  <- ifelse(D == 1, Y1, Y0)               # observed outcome

      ## column names of Z
      colnames(Z) <- paste0("V", 1:p)

      ## specify learners
      learners <- c("lasso", "mlr3::lrn('ranger', num.trees = 10)")

      ## glmnet v4.1.3 isn't supported on Solaris, so skip Lasso in this case
      if(Sys.info()["sysname"] == "SunOS") learners <- learners[-1]

      ## call GenericML three times and store the returned objects in a list x
      x <- lapply(1:3, function(...) GenericML(Z, D, Y,
                                               learners, num_splits = 2,
                                               parallel = FALSE))

      ## combine the objects in x into one GenericML object
      genML <- GenericML_combine(x)

      ## you can use all methods of GenericML objects on the combined object, for instance accessors:
      get_BLP(genML, plot = TRUE)

    }
Performs generic ML inference for a single learning technique and a given split of the data. Can be seen as a single iteration of Algorithm 1 in the paper.
    GenericML_single(
      Z,
      D,
      Y,
      learner,
      propensity_scores,
      M_set,
      A_set = setdiff(1:length(Y), M_set),
      Z_CLAN = NULL,
      HT = FALSE,
      quantile_cutoffs = c(0.25, 0.5, 0.75),
      X1_BLP = setup_X1(),
      X1_GATES = setup_X1(),
      diff_GATES = setup_diff(),
      diff_CLAN = setup_diff(),
      vcov_BLP = setup_vcov(),
      vcov_GATES = setup_vcov(),
      monotonize = TRUE,
      equal_variances_CLAN = FALSE,
      external_weights = NULL,
      significance_level = 0.05,
      min_variation = 1e-05
    )
Z | A numeric design matrix that holds the covariates in its columns. |
D | A binary vector of treatment assignment. Value one denotes assignment to the treatment group and value zero assignment to the control group. |
Y | A numeric vector containing the response variable. |
learner | A character specifying the machine learner to be used for estimating the baseline conditional average (BCA) and conditional average treatment effect (CATE). Either |
propensity_scores | A numeric vector of propensity score estimates. |
M_set | A numerical vector of indices of observations in the main sample. |
A_set | A numerical vector of indices of observations in the auxiliary sample. Default is complementary set to |
Z_CLAN | A numeric matrix holding variables on which classification analysis (CLAN) shall be performed. CLAN will be performed on each column of the matrix. If |
HT | Logical. If |
quantile_cutoffs | The cutoff points of the quantiles that shall be used for GATES grouping. Default is |
X1_BLP | Specifies the design matrix |
X1_GATES | Same as |
diff_GATES | Specifies the generic targets of GATES. Must be an object of class |
diff_CLAN | Same as |
vcov_BLP | Specifies the covariance matrix estimator in the BLP regression. Must be an object of class |
vcov_GATES | Same as |
monotonize | Logical. Should GATES point estimates and confidence bounds be rearranged to be monotonically increasing following the monotonization method of Chernozhukov et al. (2009, Biometrika)? Default is |
equal_variances_CLAN | (deprecated and will be removed in a future release) Logical. If |
external_weights | Optional vector of external numeric weights for weighted means in CLAN and weighted regression in BLP and GATES (in addition to the standard weights used when |
significance_level | Significance level for VEIN. Default is 0.05. |
min_variation | Specifies a threshold for the minimum variation of the BCA/CATE predictions. If the variation of a BCA/CATE prediction falls below this threshold, random noise with distribution |
The specifications "lasso", "random_forest", and "tree" in learner correspond to the following mlr3 specifications (we omit the keywords classif. and regr.). "lasso" is a cross-validated Lasso estimator, which corresponds to 'mlr3::lrn("cv_glmnet", s = "lambda.min", alpha = 1)'. "random_forest" is a random forest with 500 trees, which corresponds to 'mlr3::lrn("ranger", num.trees = 500)'. "tree" is a tree learner, which corresponds to 'mlr3::lrn("rpart")'.
A list with the following components:
BLP
An object of class "BLP".
GATES
An object of class "GATES".
CLAN
An object of class "CLAN".
proxy_BCA
An object of class "proxy_BCA".
proxy_CATE
An object of class "proxy_CATE".
best
Estimates of the Λ parameters for finding the best learner. Returned by lambda_parameters().
Chernozhukov V., Demirer M., Duflo E., Fernández-Val I. (2020). “Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments.” arXiv preprint arXiv:1712.04802. URL: https://arxiv.org/abs/1712.04802.
Lang M., Binder M., Richter J., Schratz P., Pfisterer F., Coors S., Au Q., Casalicchio G., Kotthoff L., Bischl B. (2019). “mlr3: A Modern Object-Oriented Machine Learning Framework in R.” Journal of Open Source Software, 4(44), 1903. doi:10.21105/joss.01903.
Chernozhukov V., Fernández-Val I., Galichon A. (2009). “Improving Point and Interval Estimators of Monotone Functions by Rearrangement.” Biometrika, 96(3), 559–575. doi:10.1093/biomet/asp030.
    if(require("ranger")){

      ## generate data
      set.seed(1)
      n <- 150                              # number of observations
      p <- 5                                # number of covariates
      Z <- matrix(runif(n*p), n, p)         # design matrix
      D <- rbinom(n, 1, 0.5)                # random treatment assignment
      Y <- runif(n)                         # outcome variable
      propensity_scores <- rep(0.5, n)      # propensity scores
      M_set <- sample(1:n, size = n/2)      # main set

      ## specify learner
      learner <- "mlr3::lrn('ranger', num.trees = 10)"

      ## run single GenericML iteration
      GenericML_single(Z, D, Y, learner, propensity_scores, M_set)

    }
The best learner is determined by maximizing the criteria Λ and Λ̄, see Sections 5.2 and 5.3 of the paper. This function accesses the estimates of these two criteria.
    get_best(x)
x | An object of the class |
An object of class "best", which consists of the following components:
BLP
A string holding the name of the best learner for a BLP analysis.
GATES
A string holding the name of the best learner for a GATES analysis.
CLAN
A string holding the name of the best learner for a CLAN analysis (same learner as in GATES).
overview
A numeric matrix of the estimates of the performance measures Λ and Λ̄ for each learner.
GenericML(), get_BLP(), get_GATES(), get_CLAN()
    if(require("rpart") && require("ranger")){

      ## generate data
      set.seed(1)
      n  <- 150                                  # number of observations
      p  <- 5                                    # number of covariates
      D  <- rbinom(n, 1, 0.5)                    # random treatment assignment
      Z  <- matrix(runif(n*p), n, p)             # design matrix
      Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n)) # potential outcome without treatment
      Y1 <- 2 + Y0                               # potential outcome under treatment
      Y  <- ifelse(D == 1, Y1, Y0)               # observed outcome

      ## column names of Z
      colnames(Z) <- paste0("V", 1:p)

      ## specify learners
      learners <- c("tree", "mlr3::lrn('ranger', num.trees = 10)")

      ## perform generic ML inference
      # small number of splits to keep computation time low
      x <- GenericML(Z, D, Y, learners, num_splits = 2,
                     parallel = FALSE)

      ## access best learner
      get_best(x)

      ## access BLP generic targets for best learner w/o plot
      get_BLP(x, learner = "best", plot = FALSE)

      ## access BLP generic targets for ranger learner w/o plot
      get_BLP(x, learner = "mlr3::lrn('ranger', num.trees = 10)", plot = FALSE)

      ## access GATES generic targets for best learner w/o plot
      get_GATES(x, learner = "best", plot = FALSE)

      ## access GATES generic targets for ranger learner w/o plot
      get_GATES(x, learner = "mlr3::lrn('ranger', num.trees = 10)", plot = FALSE)

      ## access CLAN generic targets for "V1" & best learner, w/o plot
      get_CLAN(x, learner = "best", variable = "V1", plot = FALSE)

      ## access CLAN generic targets for "V1" & ranger learner, w/o plot
      get_CLAN(x, learner = "mlr3::lrn('ranger', num.trees = 10)",
               variable = "V1", plot = FALSE)

    }
Accessor function for the BLP generic target estimates
get_BLP(x, learner = "best", plot = TRUE)
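x |
An object of the class "GenericML", as returned by GenericML(). |
learner |
A character string of the learner whose BLP generic target estimates shall be accessed. Default is "best". |
plot |
Logical. If TRUE (default), an object of class "ggplot" is returned as well. |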
An object of class "BLP_info", which consists of the following components:
estimate
A numeric vector of point estimates of the BLP generic targets.
confidence_interval
A numeric matrix of the lower and upper confidence bounds for each generic target. The confidence level of the implied confidence interval is equal to 1 - 2 * significance_level.
confidence_level
The confidence level of the confidence intervals. Equals 1 - 2 * significance_level.
learner
The argument learner.
plot
An object of class "ggplot". Only returned if the argument plot = TRUE.
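The relation between significance_level and the reported confidence level can be checked by hand. The following base R lines are a minimal sketch that does not call the package; the value 0.05 is the default significance_level that appears in the usage of BLP() and heterogeneity_CLAN().

```r
# significance level (0.05 is the default shown in the package's usage sections)
significance_level <- 0.05

# confidence level of the reported confidence intervals
confidence_level <- 1 - 2 * significance_level
confidence_level
```

With the default level, the reported intervals are therefore 90% confidence intervals.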
GenericML(), get_GATES(), get_CLAN(), get_best(), print.BLP_info(), print.GATES_info(), print.CLAN_info()
if(require("rpart") && require("ranger")){
  ## generate data
  set.seed(1)
  n <- 150 # number of observations
  p <- 5 # number of covariates
  D <- rbinom(n, 1, 0.5) # random treatment assignment
  Z <- matrix(runif(n*p), n, p) # design matrix
  Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n)) # potential outcome without treatment
  Y1 <- 2 + Y0 # potential outcome under treatment
  Y <- ifelse(D == 1, Y1, Y0) # observed outcome
  ## column names of Z
  colnames(Z) <- paste0("V", 1:p)
  ## specify learners
  learners <- c("tree", "mlr3::lrn('ranger', num.trees = 10)")
  ## perform generic ML inference
  # small number of splits to keep computation time low
  x <- GenericML(Z, D, Y, learners, num_splits = 2, parallel = FALSE)
  ## access best learner
  get_best(x)
  ## access BLP generic targets for best learner w/o plot
  get_BLP(x, learner = "best", plot = FALSE)
  ## access BLP generic targets for ranger learner w/o plot
  get_BLP(x, learner = "mlr3::lrn('ranger', num.trees = 10)", plot = FALSE)
  ## access GATES generic targets for best learner w/o plot
  get_GATES(x, learner = "best", plot = FALSE)
  ## access GATES generic targets for ranger learner w/o plot
  get_GATES(x, learner = "mlr3::lrn('ranger', num.trees = 10)", plot = FALSE)
  ## access CLAN generic targets for "V1" & best learner, w/o plot
  get_CLAN(x, learner = "best", variable = "V1", plot = FALSE)
  ## access CLAN generic targets for "V1" & ranger learner, w/o plot
  get_CLAN(x, learner = "mlr3::lrn('ranger', num.trees = 10)", variable = "V1", plot = FALSE)
}
Accessor function for the CLAN generic target estimates
get_CLAN(x, variable, learner = "best", plot = TRUE)
x |
An object of the class "GenericML", as returned by GenericML(). |
variable |
The (character) name of a variable on which CLAN was performed. |
learner |
A character string of the learner whose CLAN generic target estimates shall be accessed. Default is "best". |
plot |
Logical. If TRUE (default), an object of class "ggplot" is returned as well. |
An object of class "CLAN_info", which consists of the following components:
estimate
A numeric vector of point estimates of the CLAN generic targets.
confidence_interval
A numeric matrix of the lower and upper confidence bounds for each generic target. The confidence level of the implied confidence interval is equal to 1 - 2 * significance_level.
confidence_level
The confidence level of the confidence intervals. Equals 1 - 2 * significance_level.
learner
The argument learner.
plot
An object of class "ggplot". Only returned if the argument plot = TRUE.
CLAN_variable
The name of the CLAN variable of interest.
GenericML(), get_BLP(), get_GATES(), get_best(), print.BLP_info(), print.GATES_info(), print.CLAN_info()
if(require("rpart") && require("ranger")){
  ## generate data
  set.seed(1)
  n <- 150 # number of observations
  p <- 5 # number of covariates
  D <- rbinom(n, 1, 0.5) # random treatment assignment
  Z <- matrix(runif(n*p), n, p) # design matrix
  Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n)) # potential outcome without treatment
  Y1 <- 2 + Y0 # potential outcome under treatment
  Y <- ifelse(D == 1, Y1, Y0) # observed outcome
  ## column names of Z
  colnames(Z) <- paste0("V", 1:p)
  ## specify learners
  learners <- c("tree", "mlr3::lrn('ranger', num.trees = 10)")
  ## perform generic ML inference
  # small number of splits to keep computation time low
  x <- GenericML(Z, D, Y, learners, num_splits = 2, parallel = FALSE)
  ## access best learner
  get_best(x)
  ## access BLP generic targets for best learner w/o plot
  get_BLP(x, learner = "best", plot = FALSE)
  ## access BLP generic targets for ranger learner w/o plot
  get_BLP(x, learner = "mlr3::lrn('ranger', num.trees = 10)", plot = FALSE)
  ## access GATES generic targets for best learner w/o plot
  get_GATES(x, learner = "best", plot = FALSE)
  ## access GATES generic targets for ranger learner w/o plot
  get_GATES(x, learner = "mlr3::lrn('ranger', num.trees = 10)", plot = FALSE)
  ## access CLAN generic targets for "V1" & best learner, w/o plot
  get_CLAN(x, learner = "best", variable = "V1", plot = FALSE)
  ## access CLAN generic targets for "V1" & ranger learner, w/o plot
  get_CLAN(x, learner = "mlr3::lrn('ranger', num.trees = 10)", variable = "V1", plot = FALSE)
}
Accessor function for the GATES generic target estimates
get_GATES(x, learner = "best", plot = TRUE)
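x |
An object of the class "GenericML", as returned by GenericML(). |
learner |
A character string of the learner whose GATES generic target estimates shall be accessed. Default is "best". |
plot |
Logical. If TRUE (default), an object of class "ggplot" is returned as well. |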
An object of class "GATES_info", which consists of the following components:
estimate
A numeric vector of point estimates of the GATES generic targets.
confidence_interval
A numeric matrix of the lower and upper confidence bounds for each generic target. The confidence level of the implied confidence interval is equal to 1 - 2 * significance_level.
confidence_level
The confidence level of the confidence intervals. Equals 1 - 2 * significance_level.
learner
The argument learner.
plot
An object of class "ggplot". Only returned if the argument plot = TRUE.
GenericML(), get_BLP(), get_CLAN(), get_best(), print.BLP_info(), print.GATES_info(), print.CLAN_info()
if(require("rpart") && require("ranger")){
  ## generate data
  set.seed(1)
  n <- 150 # number of observations
  p <- 5 # number of covariates
  D <- rbinom(n, 1, 0.5) # random treatment assignment
  Z <- matrix(runif(n*p), n, p) # design matrix
  Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n)) # potential outcome without treatment
  Y1 <- 2 + Y0 # potential outcome under treatment
  Y <- ifelse(D == 1, Y1, Y0) # observed outcome
  ## column names of Z
  colnames(Z) <- paste0("V", 1:p)
  ## specify learners
  learners <- c("tree", "mlr3::lrn('ranger', num.trees = 10)")
  ## perform generic ML inference
  # small number of splits to keep computation time low
  x <- GenericML(Z, D, Y, learners, num_splits = 2, parallel = FALSE)
  ## access best learner
  get_best(x)
  ## access BLP generic targets for best learner w/o plot
  get_BLP(x, learner = "best", plot = FALSE)
  ## access BLP generic targets for ranger learner w/o plot
  get_BLP(x, learner = "mlr3::lrn('ranger', num.trees = 10)", plot = FALSE)
  ## access GATES generic targets for best learner w/o plot
  get_GATES(x, learner = "best", plot = FALSE)
  ## access GATES generic targets for ranger learner w/o plot
  get_GATES(x, learner = "mlr3::lrn('ranger', num.trees = 10)", plot = FALSE)
  ## access CLAN generic targets for "V1" & best learner, w/o plot
  get_CLAN(x, learner = "best", variable = "V1", plot = FALSE)
  ## access CLAN generic targets for "V1" & ranger learner, w/o plot
  get_CLAN(x, learner = "mlr3::lrn('ranger', num.trees = 10)", variable = "V1", plot = FALSE)
}
This function tests for statistical significance of all CLAN difference parameters that were specified in the function setup_diff(). It reports all CLAN variables along which there are significant difference parameters, which corresponds to evidence for treatment effect heterogeneity along these variables, at the specified significance level.
heterogeneity_CLAN(x, learner = "best", significance_level = 0.05)
x |
An object of class "GenericML", as returned by GenericML(). |
learner |
A character string of the learner whose CLAN generic target estimates are of interest. Default is "best". |
significance_level |
Level for the significance tests. Default is 0.05. |
An object of class "heterogeneity_CLAN", consisting of the following components:
p_values
A matrix of p values of all CLAN difference parameters for all CLAN variables.
significant
The names of variables with at least one significant CLAN difference parameter ("variables"), their number ("num_variables"), and the total number of significant CLAN difference parameters ("num_params"). All significance tests were performed at level significance_level.
min_pval
Information on the smallest p value: its value ("value"), the variable in which it was estimated ("variable"), the CLAN difference parameter it belongs to ("parameter"), and whether or not it is significant at level significance_level ("significant").
learner
Name of the learner whose median estimates were used for the listed results.
significance_level
The level of the significance tests.
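The components listed above can be mimicked on a small matrix of p values. The following base R sketch does not call the package: the matrix layout (difference parameters in rows, CLAN variables in columns), the labels, and the numbers are assumptions made for illustration only.

```r
# hypothetical p values: rows = CLAN difference parameters, columns = CLAN variables
# (layout, labels, and numbers are assumptions, not package output)
p_values <- matrix(c(0.01, 0.20,
                     0.30, 0.04,
                     0.60, 0.70),
                   nrow = 2,
                   dimnames = list(c("G4-G1", "G4-G2"), c("V1", "V2", "V3")))
significance_level <- 0.05

# which difference parameters are significant at the chosen level?
is_significant <- p_values < significance_level

# variables with at least one significant difference parameter
variables     <- colnames(p_values)[colSums(is_significant) > 0]
num_variables <- length(variables)
num_params    <- sum(is_significant)

# information on the smallest p value
idx <- which(p_values == min(p_values), arr.ind = TRUE)[1, ]
min_pval <- list(value       = min(p_values),
                 variable    = colnames(p_values)[idx[["col"]]],
                 parameter   = rownames(p_values)[idx[["row"]]],
                 significant = min(p_values) < significance_level)
```

In this toy input, "V1" and "V2" each have one significant difference parameter, and the smallest p value belongs to parameter "G4-G1" of variable "V1".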
Estimates the lambda parameters $\Lambda$ and $\bar{\Lambda}$, whose medians are used to find the best ML method.
lambda_parameters(BLP, GATES, proxy_CATE, membership)
BLP |
An object of class "BLP", as returned by BLP(). |
GATES |
An object of class "GATES", as returned by GATES(). |
proxy_CATE |
Proxy estimates of the CATE. |
membership |
A logical matrix that indicates the group membership of each observation in |
A list containing the estimates of $\Lambda$ and $\bar{\Lambda}$, denoted lambda and lambda.bar, respectively.
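For orientation, the two targets can be written out in the notation of Chernozhukov et al. (2020). This is a restatement under assumed symbols that do not appear in this manual: $S(Z)$ is the proxy CATE, $\beta_2$ the heterogeneity coefficient of the BLP regression, $\gamma_k$ the GATES of group $G_k$, and $K$ the number of groups.

$$
\Lambda = |\beta_2|^2 \, \mathrm{Var}\big(S(Z)\big),
\qquad
\bar{\Lambda} = \sum_{k=1}^{K} \gamma_k^2 \, \Pr\big(S(Z) \in G_k\big).
$$

Larger values of either quantity indicate a learner whose proxy captures more of the treatment effect heterogeneity, which is why their medians over the sample splits are used to rank learners.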
Chernozhukov V., Demirer M., Duflo E., Fernández-Val I. (2020). “Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments.” arXiv preprint arXiv:1712.04802. URL: https://arxiv.org/abs/1712.04802.
## generate data
set.seed(1)
n <- 200 # number of observations
p <- 5 # number of covariates
D <- rbinom(n, 1, 0.5) # random treatment assignment
Y <- runif(n) # outcome variable
propensity_scores <- rep(0.5, n) # propensity scores
proxy_BCA <- runif(n) # proxy BCA estimates
proxy_CATE <- runif(n) # proxy CATE estimates
membership <- quantile_group(proxy_CATE) # group membership
## perform BLP
BLP <- BLP(Y, D, propensity_scores, proxy_BCA, proxy_CATE)
## perform GATES
GATES <- GATES(Y, D, propensity_scores, proxy_BCA, proxy_CATE, membership)
## get estimates of the lambda parameters
lambda_parameters(BLP, GATES, proxy_CATE, membership)
Calculates the lower and upper median of a vector, as proposed in Comment 4.2 in the paper.
Med(x)
x |
A numeric vector. |
A list with the upper, lower, and usual median (where the latter is the average of the former two).
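The idea can be sketched in base R with order statistics. This is an illustration under the usual definition of lower and upper medians, not the package's implementation of Med(); the function name lower_upper_median and the component names are hypothetical.

```r
# lower and upper median via order statistics (illustrative sketch, not Med() itself)
lower_upper_median <- function(x) {
  s <- sort(x)
  n <- length(s)
  lower <- s[ceiling(n / 2)]    # smallest value with at least half the data at or below it
  upper <- s[floor(n / 2) + 1]  # largest value with at least half the data at or above it
  list(lower_median = lower,
       upper_median = upper,
       Med = (lower + upper) / 2) # usual median: average of the two
}

m <- lower_upper_median(c(4, 1, 3, 2))
m
```

For a vector of odd length the lower and upper median coincide, so all three components are equal.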
Chernozhukov V., Demirer M., Duflo E., Fernández-Val I. (2020). “Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments.” arXiv preprint arXiv:1712.04802. URL: https://arxiv.org/abs/1712.04802.
set.seed(1)
x <- runif(100)
Med(x)
Plot method for a "GenericML" object
Visualizes the estimates of the generic targets of interest: plots the point estimates as well as the corresponding confidence intervals. The generic targets of interest can be (subsets of) the parameters of the BLP, GATES, or CLAN analysis.
## S3 method for class 'GenericML'
plot(x, type = "GATES", learner = "best", CLAN_variable = NULL, groups = "all", ATE = TRUE, limits = NULL, title = NULL, ...)
x |
An object of the class "GenericML", as returned by GenericML(). |
type |
The analysis whose parameters shall be plotted. Either "GATES", "BLP", or "CLAN". Default is "GATES". |
learner |
The learner whose results are to be returned. Default is "best". |
CLAN_variable |
Name of the CLAN variable to be plotted. Only applicable if type = "CLAN". |
groups |
Character vector indicating the per-group parameter estimates that shall be plotted in GATES and CLAN analyses. Default is "all". |
ATE |
Logical. If |
limits |
A numeric vector of length two holding the limits of the y-axis of the plot. |
title |
The title of the plot. |
... |
Additional arguments to be passed down. |
If you wish to retrieve the data frame that this plot method visualizes, please use setup_plot().
An object of class "ggplot".
setup_plot(), GenericML(), get_BLP(), get_GATES(), get_CLAN(), setup_diff()
if(require("ranger")) {
  ## generate data
  set.seed(1)
  n <- 150 # number of observations
  p <- 5 # number of covariates
  D <- rbinom(n, 1, 0.5) # random treatment assignment
  Z <- matrix(runif(n*p), n, p) # design matrix
  Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n)) # potential outcome without treatment
  Y1 <- 2 + Y0 # potential outcome under treatment
  Y <- ifelse(D == 1, Y1, Y0) # observed outcome
  ## name the columns of Z
  colnames(Z) <- paste0("V", 1:p)
  ## specify learners
  learners <- c("random_forest")
  ## specify quantile cutoffs (the 4 quartile groups here)
  quantile_cutoffs <- c(0.25, 0.5, 0.75)
  ## specify the differenced generic targets of GATES and CLAN
  diff_GATES <- setup_diff(subtract_from = "most", subtracted = c(1,2,3))
  diff_CLAN <- setup_diff(subtract_from = "least", subtracted = c(3,2))
  ## perform generic ML inference
  # small number of splits to keep computation time low
  x <- GenericML(Z, D, Y, learners, num_splits = 2, quantile_cutoffs = quantile_cutoffs, diff_GATES = diff_GATES, diff_CLAN = diff_CLAN, parallel = FALSE)
  ## plot BLP parameters
  plot(x, type = "BLP")
  ## plot GATES parameters "G1", "G4", "G4-G1"
  plot(x, type = "GATES", groups = c("G1", "G4", "G4-G1"))
  ## plot CLAN parameters "G1", "G2", "G1-G3" of variable "V1":
  plot(x, type = "CLAN", CLAN_variable = "V1", groups = c("G1", "G2", "G1-G3"))
}
Print method for a "BLP_info" object
## S3 method for class 'BLP_info'
print(x, digits = max(3L, getOption("digits") - 3L), ...)
x |
An object of the class "BLP_info". |
digits |
Number of digits to print. |
... |
Additional arguments to be passed down. |
A print to the console.
Print method for a "CLAN_info" object
## S3 method for class 'CLAN_info'
print(x, digits = max(3L, getOption("digits") - 3L), ...)
x |
An object of the class "CLAN_info". |
digits |
Number of digits to print. |
... |
Additional arguments to be passed down. |
A print to the console.
Print method for a "GATES_info" object
## S3 method for class 'GATES_info'
print(x, digits = max(3L, getOption("digits") - 3L), ...)
x |
An object of the class "GATES_info". |
digits |
Number of digits to print. |
... |
Additional arguments to be passed down. |
A print to the console.
Print method for a GenericML object
Prints key results of the analyses conducted in GenericML().
## S3 method for class 'GenericML'
print(x, digits = max(3L, getOption("digits") - 3L), ...)
x |
An object of the class "GenericML", as returned by GenericML(). |
digits |
Number of digits to print. |
... |
Additional arguments to be passed down. |
A print to the console.
if(require("ranger")){
  ## generate data
  set.seed(1)
  n <- 150 # number of observations
  p <- 5 # number of covariates
  D <- rbinom(n, 1, 0.5) # random treatment assignment
  Z <- matrix(runif(n*p), n, p) # design matrix
  Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n)) # potential outcome without treatment
  Y1 <- 2 + Y0 # potential outcome under treatment
  Y <- ifelse(D == 1, Y1, Y0) # observed outcome
  ## specify learners
  learners <- c("random_forest")
  ## perform generic ML inference
  # small number of splits to keep computation time low
  x <- GenericML(Z, D, Y, learners, num_splits = 2, parallel = FALSE)
  ## print
  print(x)
}
Print method for a "heterogeneity_CLAN" object
## S3 method for class 'heterogeneity_CLAN'
print(x, ...)
x |
An object of class "heterogeneity_CLAN". |
... |
Additional arguments to be passed down. |
A print to the console.
Estimates the propensity scores for binary treatment assignment $D$ and covariates $Z$. This is done either by taking the empirical mean of $D$ (which should equal roughly 0.5, since we assume a randomized experiment), or by direct machine learning estimation.
propensity_score(Z, D, estimator = "constant")
Z |
A numeric design matrix that holds the covariates in its columns. |
D |
A binary vector of treatment assignment. Value one denotes assignment to the treatment group and value zero assignment to the control group. |
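estimator |
Character specifying the estimator. Must either be equal to "constant" (estimates the propensity scores by the empirical mean of D), "lasso", "random_forest", "tree", or an mlr3 learner specification passed as a string, e.g. 'mlr3::lrn("svm", cachesize = 40)'. Default is "constant". |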
The specifications "lasso", "random_forest", and "tree" in estimator correspond to the following mlr3 specifications (we omit the keywords classif. and regr.). "lasso" is a cross-validated Lasso estimator, which corresponds to 'mlr3::lrn("cv_glmnet", s = "lambda.min", alpha = 1)'. "random_forest" is a random forest with 500 trees, which corresponds to 'mlr3::lrn("ranger", num.trees = 500)'. "tree" is a tree learner, which corresponds to 'mlr3::lrn("rpart")'.
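The "constant" option is described above as taking the empirical mean of the treatment indicator. A minimal base R sketch of that idea follows; it does not call propensity_score(), and the object name ps_constant is hypothetical.

```r
# simulated randomized treatment assignment
set.seed(1)
n <- 100
D <- rbinom(n, 1, 0.5)

# constant propensity score: the share of treated units, repeated for every observation
ps_constant <- rep(mean(D), n)
head(ps_constant)
```

Under randomization with equal assignment probabilities, this share should be close to 0.5, which matches the expectation stated in the description.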
An object of class "propensity_score", consisting of the following components:
estimates
A numeric vector of propensity score estimates.
mlr3_objects
"mlr3" objects used for estimation. Only non-empty if mlr3 was used.
Rosenbaum P.R., Rubin D.B. (1983). “The Central Role of the Propensity Score in Observational Studies for Causal Effects.” Biometrika, 70(1), 41–55. doi:10.1093/biomet/70.1.41.
Lang M., Binder M., Richter J., Schratz P., Pfisterer F., Coors S., Au Q., Casalicchio G., Kotthoff L., Bischl B. (2019). “mlr3: A Modern Object-Oriented Machine Learning Framework in R.” Journal of Open Source Software, 4(44), 1903. doi:10.21105/joss.01903.
## generate data
set.seed(1)
n <- 100 # number of observations
p <- 5 # number of covariates
D <- rbinom(n, 1, 0.5) # random treatment assignment
Z <- matrix(runif(n*p), n, p) # design matrix
## estimate propensity scores via mean(D)...
propensity_score(Z, D, estimator = "constant")
## ... and via SVM with cache size 40
if(require("e1071")){
  propensity_score(Z, D, estimator = 'mlr3::lrn("svm", cachesize = 40)')
}
Proxy estimation of the Baseline Conditional Average (BCA), defined by $E[Y \mid D = 0, Z]$. Estimation is done on the auxiliary sample, but BCA predictions are made for all observations.
proxy_BCA(Z, D, Y, A_set, learner, min_variation = 1e-05)
Z |
A numeric design matrix that holds the covariates in its columns. |
D |
A binary vector of treatment assignment. Value one denotes assignment to the treatment group and value zero assignment to the control group. |
Y |
A numeric vector containing the response variable. |
A_set |
A numerical vector of the indices of the observations in the auxiliary sample. |
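learner |
A string specifying the machine learner for the estimation. Either "lasso", "random_forest", "tree", or an mlr3 learner specification passed as a string, e.g. "mlr3::lrn('ranger', num.trees = 10)". |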
min_variation |
Specifies a threshold for the minimum variation of the predictions. If the variation of a BCA prediction falls below this threshold, random noise with distribution |
The specifications "lasso", "random_forest", and "tree" in learner correspond to the following mlr3 specifications (we omit the keywords classif. and regr.). "lasso" is a cross-validated Lasso estimator, which corresponds to 'mlr3::lrn("cv_glmnet", s = "lambda.min", alpha = 1)'. "random_forest" is a random forest with 500 trees, which corresponds to 'mlr3::lrn("ranger", num.trees = 500)'. "tree" is a tree learner, which corresponds to 'mlr3::lrn("rpart")'.
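The train-on-auxiliary, predict-for-all pattern can be sketched in base R. This illustration swaps in a plain linear model (lm) for the machine learner and assumes that the baseline is fitted on the control units of the auxiliary sample; it is not the package's implementation, and the min_variation safeguard is left out.

```r
# simulated experiment
set.seed(1)
n <- 150
p <- 5
D <- rbinom(n, 1, 0.5)
Z <- matrix(runif(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n))
Y  <- ifelse(D == 1, 2 + Y0, Y0)
A_set <- sample(1:n, size = n / 2)   # indices of the auxiliary sample

dat <- data.frame(Y = Y, Z)

# fit the baseline on control units of the auxiliary sample only (assumption)
train_idx <- intersect(A_set, which(D == 0))
fit0 <- lm(Y ~ ., data = dat[train_idx, ])   # lm is a stand-in for the ML learner

# BCA predictions are made for all observations
bca_hat <- as.numeric(predict(fit0, newdata = dat))
```

The result has one prediction per observation, including those outside the auxiliary sample.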
An object of class "proxy_BCA", consisting of the following components:
estimates
A numeric vector of BCA estimates of each observation.
mlr3_objects
"mlr3" objects used for estimation.
Chernozhukov V., Demirer M., Duflo E., Fernández-Val I. (2020). “Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments.” arXiv preprint arXiv:1712.04802. URL: https://arxiv.org/abs/1712.04802.
Lang M., Binder M., Richter J., Schratz P., Pfisterer F., Coors S., Au Q., Casalicchio G., Kotthoff L., Bischl B. (2019). “mlr3: A Modern Object-Oriented Machine Learning Framework in R.” Journal of Open Source Software, 4(44), 1903. doi:10.21105/joss.01903.
if(require("ranger")){
  ## generate data
  set.seed(1)
  n <- 150 # number of observations
  p <- 5 # number of covariates
  D <- rbinom(n, 1, 0.5) # random treatment assignment
  Z <- matrix(runif(n*p), n, p) # design matrix
  Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n)) # potential outcome without treatment
  Y1 <- 2 + Y0 # potential outcome under treatment
  Y <- ifelse(D == 1, Y1, Y0) # observed outcome
  A_set <- sample(1:n, size = n/2) # auxiliary set
  ## BCA predictions via random forest
  proxy_BCA(Z, D, Y, A_set, learner = "mlr3::lrn('ranger', num.trees = 10)")
}
Proxy estimation of the Conditional Average Treatment Effect (CATE), defined by $E[Y \mid D = 1, Z] - E[Y \mid D = 0, Z]$. Estimation is done on the auxiliary sample, but CATE predictions are made for all observations.
proxy_CATE(Z, D, Y, A_set, learner, proxy_BCA = NULL, min_variation = 1e-05)
Z |
A numeric design matrix that holds the covariates in its columns. |
D |
A binary vector of treatment assignment. Value one denotes assignment to the treatment group and value zero assignment to the control group. |
Y |
A numeric vector containing the response variable. |
A_set |
A numerical vector of the indices of the observations in the auxiliary sample. |
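learner |
A string specifying the machine learner for the estimation. Either "lasso", "random_forest", "tree", or an mlr3 learner specification passed as a string, e.g. "mlr3::lrn('ranger', num.trees = 10)". |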
proxy_BCA |
A vector of proxy estimates of the baseline conditional average, BCA, $E[Y \mid D = 0, Z]$. Default is NULL. |
min_variation |
Minimum variation of the predictions before random noise with distribution |
The specifications "lasso", "random_forest", and "tree" in learner correspond to the following mlr3 specifications (we omit the keywords classif. and regr.). "lasso" is a cross-validated Lasso estimator, which corresponds to 'mlr3::lrn("cv_glmnet", s = "lambda.min", alpha = 1)'. "random_forest" is a random forest with 500 trees, which corresponds to 'mlr3::lrn("ranger", num.trees = 500)'. "tree" is a tree learner, which corresponds to 'mlr3::lrn("rpart")'.
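A CATE proxy of this kind can be sketched in base R as the difference between two fitted regressions. The illustration below uses a plain linear model (lm) in place of the machine learner and assumes that both regressions are fitted on the auxiliary sample, one on treated and one on control units; it is not the package's implementation, and the min_variation safeguard is left out.

```r
# simulated experiment
set.seed(1)
n <- 150
p <- 5
D <- rbinom(n, 1, 0.5)
Z <- matrix(runif(n * p), n, p, dimnames = list(NULL, paste0("V", 1:p)))
Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n))
Y  <- ifelse(D == 1, 2 + Y0, Y0)
A_set <- sample(1:n, size = n / 2)   # indices of the auxiliary sample

dat  <- data.frame(Y = Y, Z)
in_A <- seq_len(n) %in% A_set

# one regression per treatment arm, each fitted on the auxiliary sample (assumption)
fit1 <- lm(Y ~ ., data = dat[in_A & D == 1, ])   # stand-in for the learner of E[Y | D = 1, Z]
fit0 <- lm(Y ~ ., data = dat[in_A & D == 0, ])   # stand-in for the learner of E[Y | D = 0, Z]

# CATE predictions for all observations: difference of the two predictions
cate_hat <- as.numeric(predict(fit1, newdata = dat) - predict(fit0, newdata = dat))
```

When BCA estimates are already available, they can take the place of the second regression's predictions, which is the role of the proxy_BCA argument.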
An object of class "proxy_CATE", consisting of the following components:
estimates
A numeric vector of CATE estimates of each observation.
mlr3_objects
"mlr3" objects used for estimation of $E[Y \mid D = 1, Z]$ (Y1_learner) and $E[Y \mid D = 0, Z]$ (Y0_learner). The latter is not available if proxy_BCA = NULL.
Chernozhukov V., Demirer M., Duflo E., Fernández-Val I. (2020). “Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments.” arXiv preprint arXiv:1712.04802. URL: https://arxiv.org/abs/1712.04802.
Lang M., Binder M., Richter J., Schratz P., Pfisterer F., Coors S., Au Q., Casalicchio G., Kotthoff L., Bischl B. (2019). “mlr3: A Modern Object-Oriented Machine Learning Framework in R.” Journal of Open Source Software, 4(44), 1903. doi:10.21105/joss.01903.
if(require("ranger")){
  ## generate data
  set.seed(1)
  n <- 150 # number of observations
  p <- 5 # number of covariates
  D <- rbinom(n, 1, 0.5) # random treatment assignment
  Z <- matrix(runif(n*p), n, p) # design matrix
  Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n)) # potential outcome without treatment
  Y1 <- 2 + Y0 # potential outcome under treatment
  Y <- ifelse(D == 1, Y1, Y0) # observed outcome
  A_set <- sample(1:n, size = n/2) # auxiliary set
  ## CATE predictions via random forest
  proxy_CATE(Z, D, Y, A_set, learner = "mlr3::lrn('ranger', num.trees = 10)")
}
Partitions a vector into quantile groups and returns a logical matrix indicating group membership.
quantile_group(x, cutoffs = c(0.25, 0.5, 0.75))
x |
A numeric vector to be partitioned. |
cutoffs |
A numeric vector denoting the quantile cutoffs for the partition. The default is the quartiles: c(0.25, 0.5, 0.75). |
An object of type "quantile_group", which is a logical matrix indicating group membership.
set.seed(1)
x <- runif(100)
cutoffs <- c(0.25, 0.5, 0.75)
quantile_group(x, cutoffs)
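To make the shape of the returned membership matrix concrete, here is a minimal base-R sketch of a quantile partition. It is an illustration under stated assumptions rather than the package's code: the helper quantile_membership() is hypothetical, intervals are taken as closed on the left, and ties in x (which would make the breaks non-unique) are not handled.

```r
## hypothetical helper, not part of GenericML
quantile_membership <- function(x, cutoffs = c(0.25, 0.5, 0.75)) {
  K <- length(cutoffs) + 1  # number of groups induced by the cutoffs
  breaks <- c(-Inf, quantile(x, probs = cutoffs, names = FALSE), Inf)
  groups <- cut(x, breaks = breaks, labels = FALSE, right = FALSE)
  ## one logical column per group; each row has exactly one TRUE
  membership <- outer(groups, seq_len(K), FUN = "==")
  colnames(membership) <- paste0("G", seq_len(K))
  membership
}

set.seed(1)
x <- runif(100)
membership <- quantile_membership(x)
head(membership)
colSums(membership)  # group sizes
```

Each observation belongs to exactly one group, so the rows of the matrix sum to one and the column sums add up to the sample size.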
diff arguments
This setup function controls how differences of generic target parameters are taken. It returns a list with two components, called subtract_from and subtracted. The first component (subtract_from) specifies the base group to subtract from in the generic targets of interest (GATES or CLAN); it is either "most" or "least". The second component (subtracted) specifies the groups to be subtracted from subtract_from and must be a subset of $\{1, 2, \dots, K\}$, where $K$ equals the number of groups. The number of groups should be consistent with the number of groups induced by the argument quantile_cutoffs, which is the cardinality of quantile_cutoffs plus one.
setup_diff(subtract_from = "most", subtracted = 1)
subtract_from |
String indicating the base group to subtract from, either "most" or "least". Default is "most". |
subtracted |
Vector indicating the groups to be subtracted from the group specified in subtract_from. Default is 1. |
The output of this setup function is intended to be used as an argument in the functions GenericML() and GenericML_single() (arguments diff_GATES, diff_CLAN), as well as GATES() and CLAN() (argument diff).
An object of class "setup_diff", consisting of the following components:
subtract_from
A character equal to "most" or "least".
subtracted
A numeric vector of group indices.
See the description above for details.
Chernozhukov V., Demirer M., Duflo E., Fernández-Val I. (2020). “Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments.” arXiv preprint arXiv:1712.04802. URL: https://arxiv.org/abs/1712.04802.
GenericML(), GenericML_single(), CLAN(), GATES(), setup_X1(), setup_vcov()
## specify quantile cutoffs (the 4 quartile groups here)
quantile_cutoffs <- c(0.25, 0.5, 0.75)

## Use group difference GK-G1 as generic targets in GATES and CLAN
## Gx is the x-th group
setup_diff(subtract_from = "most", subtracted = 1)

## Use GK-G1, GK-G2, GK-G3 as differenced generic targets
setup_diff(subtract_from = "most", subtracted = c(1,2,3))

## Use G1-G2, G1-G3 as differenced generic targets
setup_diff(subtract_from = "least", subtracted = c(3,2))
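The relation between quantile_cutoffs, the number of groups K, and the resulting differences can be spelled out in a few lines of base R. This is a sketch only: the helper describe_differences() is hypothetical and merely mirrors the naming used in the example comments above, where "most" refers to group GK and "least" to group G1.

```r
quantile_cutoffs <- c(0.25, 0.5, 0.75)
K <- length(quantile_cutoffs) + 1  # cardinality of the cutoffs, plus one

## hypothetical helper, not part of GenericML
describe_differences <- function(subtract_from, subtracted, K) {
  stopifnot(subtract_from %in% c("most", "least"))
  stopifnot(all(subtracted %in% seq_len(K)))  # must be a subset of 1, ..., K
  base <- if (subtract_from == "most") K else 1
  paste0("G", base, "-G", subtracted)
}

diffs_most  <- describe_differences("most",  c(1, 2, 3), K)
diffs_least <- describe_differences("least", c(3, 2),    K)
diffs_most   # "G4-G1" "G4-G2" "G4-G3"
diffs_least  # "G1-G3" "G1-G2"
```

A group index outside 1, ..., K, such as subtracted = 5 with quartile cutoffs, fails the subset check in this sketch.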
GenericML() plot
Extract the relevant information for visualizing the point and interval estimates of the generic targets of interest. The generic targets of interest can be (subsets of) the parameters of the BLP, GATES, or CLAN analysis.
setup_plot( x, type = "GATES", learner = "best", CLAN_variable = NULL, groups = "all" )
x |
An object of the class "GenericML", as returned by the function GenericML(). |
type |
The analysis whose parameters shall be plotted. Either "BLP", "GATES", or "CLAN". Default is "GATES". |
learner |
The learner whose results are to be returned. Default is "best". |
CLAN_variable |
Name of the CLAN variable to be plotted. Only applicable if type = "CLAN". |
groups |
Character vector indicating the per-group parameter estimates that shall be plotted in GATES and CLAN analyses. Default is "all". |
This function is used internally by plot.GenericML(). It may also be useful for users who want to produce a similar plot but need more control over what information is displayed and how it is displayed.
An object of class "setup_plot", which is a list with the following elements.
data_plot
A data frame containing point and interval estimates of the generic target specified in the argument type.
data_BLP
A data frame containing point and interval estimates of the BLP analysis.
confidence_level
The confidence level of the confidence intervals. The confidence level is equal to 1 - 2 * significance_level, which is the adjustment proposed in the paper.
if(require("ranger") && require("ggplot2")) {

  ## generate data
  set.seed(1)
  n  <- 150                                  # number of observations
  p  <- 5                                    # number of covariates
  D  <- rbinom(n, 1, 0.5)                    # random treatment assignment
  Z  <- matrix(runif(n*p), n, p)             # design matrix
  Y0 <- as.numeric(Z %*% rexp(p) + rnorm(n)) # potential outcome without treatment
  Y1 <- 2 + Y0                               # potential outcome under treatment
  Y  <- ifelse(D == 1, Y1, Y0)               # observed outcome

  ## name the columns of Z
  colnames(Z) <- paste0("V", 1:p)

  ## specify learners
  learners <- c("random_forest")

  ## perform generic ML inference
  # small number of splits to keep computation time low
  x <- GenericML(Z, D, Y, learners, num_splits = 2, parallel = FALSE)

  ## the plot we wish to replicate
  plot(x = x, type = "GATES")

  ## get the data to plot the GATES estimates
  data <- setup_plot(x = x, type = "GATES")

  ## define variables to appease the R CMD check
  group <- estimate <- ci_lower <- ci_upper <- NULL

  ## replicate the plot(x, type = "GATES")
  # for simplicity, we skip aligning the colors
  ggplot(mapping = aes(x = group, y = estimate), data = data$data_plot) +
    geom_hline(aes(yintercept = 0),
               color = "black", linetype = "dotted") +
    geom_hline(aes(yintercept = data$data_BLP["beta.1", "estimate"],
                   color = "ATE"),
               linetype = "dashed") +
    geom_hline(aes(yintercept = data$data_BLP["beta.1", "ci_lower"],
                   color = paste0(100*data$confidence_level, "% CI (ATE)")),
               linetype = "dashed") +
    geom_hline(yintercept = data$data_BLP["beta.1", "ci_upper"],
               linetype = "dashed", color = "red") +
    geom_point(aes(color = paste0("GATES with ", 100*data$confidence_level, "% CI")),
               size = 3) +
    geom_errorbar(mapping = aes(ymin = ci_lower, ymax = ci_upper))
}
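The confidence_level element follows directly from the significance level. As a small worked example, assuming the default significance_level = 0.05 used elsewhere in this manual, the adjusted confidence level and the kind of legend label built in the example above can be computed as follows (the variable names are illustrative):

```r
significance_level <- 0.05

## adjustment described above: 1 - 2 * significance_level
confidence_level <- 1 - 2 * significance_level

## label in the style of the legend entries of the example plot
ci_label <- paste0(100 * confidence_level, "% CI")

confidence_level
ci_label
```

With a significance level of 0.05, the reported intervals are therefore 90% confidence intervals rather than 95% ones.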
This function controls whether or not stratified sample splitting shall be performed. If no stratified sampling shall be performed, do not pass any arguments to this function (this is the default). If stratified sampling shall be performed, use this function to pass arguments to stratified() in the package "splitstackshape". In this case, the specification for prop_aux in GenericML() does not have an effect because the number of samples in the auxiliary set is specified with the size argument in stratified().
setup_stratify(...)
... |
Named objects that shall be used as arguments in stratified(). |
The output of this setup function is intended to be used as argument stratify in the function GenericML(). If arguments are passed to stratified() via this function, make sure to pass the necessary objects that stratified() in the "splitstackshape" package requires. The necessary objects are called indt, group, and size (see the documentation of stratified() for details). If any of these objects is missing, an error is thrown.
A list of named objects (possibly empty) specifying the stratified sampling strategy. If empty, no stratified sampling will be performed and instead ordinary random sampling will be performed.
stratified(), GenericML()
## sample data of group membership (with two groups)
set.seed(1)
n <- 500
groups <- data.frame(group1 = rbinom(n, 1, 0.2),
                     group2 = rbinom(n, 1, 0.3))

## suppose we want both groups to be present in a strata...
group <- c("group1", "group2")

## ... and that the size of the strata equals half of the observations per group
size <- 0.5

## obtain a list of arguments that will be passed to splitstackshape::stratified()
setup_stratify(indt = groups, group = group, size = size)

## if no stratified sampling shall be used, do not pass anything
setup_stratify()
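To illustrate what stratified sample splitting means here, the following base-R sketch draws half of the observations from every combination of the two group indicators. It does not call splitstackshape::stratified() and may differ from it in details such as rounding; the names strata, idx_by_stratum, and A_set are assumptions made for this illustration.

```r
set.seed(1)
n <- 500
groups <- data.frame(group1 = rbinom(n, 1, 0.2),
                     group2 = rbinom(n, 1, 0.3))
size <- 0.5  # proportion to sample within each stratum

## one stratum per observed combination of group1 and group2
strata <- interaction(groups$group1, groups$group2, drop = TRUE)
idx_by_stratum <- split(seq_len(n), strata)

## sample the requested proportion of indices within each stratum
A_set <- unlist(lapply(idx_by_stratum, function(idx) {
  idx[sample.int(length(idx), size = round(size * length(idx)))]
}), use.names = FALSE)

length(A_set)                 # size of the auxiliary set
table(strata[A_set])          # every stratum is represented
```

Because the auxiliary set is built stratum by stratum, its size is determined by size and the stratum counts, which is why prop_aux has no effect when stratification is used.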
vcov_control arguments
Returns a list with two elements called estimator and arguments. The element estimator is a string specifying the covariance matrix estimator to be used in the linear regression of interest and needs to be a covariance estimator function in the "sandwich" package. The second element, arguments, is a list of arguments that shall be passed to the function specified in the first element, estimator.
setup_vcov(estimator = "vcovHC", arguments = list(type = "const"))
estimator |
Character specifying a covariance matrix estimator in the "sandwich" package. Default is "vcovHC". |
arguments |
A list of arguments that are to be passed to the function in the "sandwich" package that is specified in estimator. Default is list(type = "const"). |
The output of this setup function is intended to be used as an argument in the functions GenericML() and GenericML_single() (arguments vcov_BLP, vcov_GATES), as well as BLP() and GATES() (argument vcov_control).
An object of class "setup_vcov", consisting of the following components:
estimator
A character string naming a covariance estimation function in the "sandwich" package.
arguments
A list of arguments that shall be passed to the function specified in the estimator argument.
See the description above for details.
Zeileis A. (2004). “Econometric Computing with HC and HAC Covariance Matrix Estimators.” Journal of Statistical Software, 11(10), 1–17. doi:10.18637/jss.v011.i10
Zeileis A. (2006). “Object-Oriented Computation of Sandwich Estimators.” Journal of Statistical Software, 16(9), 1–16. doi:10.18637/jss.v016.i09
GenericML(), GenericML_single(), BLP(), GATES(), setup_X1(), setup_diff()
# use standard homoskedastic OLS covariance matrix estimate
setup_vcov(estimator = "vcovHC", arguments = list(type = "const"))

# use White's heteroskedasticity-robust estimator
setup_vcov(estimator = "vcovHC", arguments = list(type = "HC0"))

if (require("sandwich")){

  # use HAC-robust estimator with prewhitening and Andrews' (Econometrica, 1991) weights
  # since weightsAndrews() is a function in 'sandwich', require this package
  setup_vcov(estimator = "vcovHAC", arguments = list(prewhite = TRUE, weights = weightsAndrews))
}
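The difference between the first two choices in the example can be made explicit by computing both covariance matrices by hand. The sketch below uses only base R on simulated heteroskedastic data; it writes out the textbook formulas for the homoskedastic estimator and White's HC0 estimator and is meant as an illustration, not as a substitute for the "sandwich" functions. All variable names are hypothetical.

```r
set.seed(1)
n <- 200
x <- runif(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 + x)  # error variance grows with x

fit <- lm(y ~ x)
X   <- model.matrix(fit)  # design matrix including the intercept
e   <- residuals(fit)
XtX_inv <- solve(crossprod(X))

## homoskedastic estimator: sigma^2 * (X'X)^{-1}
vcov_const <- sum(e^2) / df.residual(fit) * XtX_inv

## White's HC0 estimator: (X'X)^{-1} X' diag(e^2) X (X'X)^{-1}
vcov_HC0 <- XtX_inv %*% crossprod(X * e) %*% XtX_inv

## compare the implied standard errors
cbind(const = sqrt(diag(vcov_const)), HC0 = sqrt(diag(vcov_HC0)))
```

The homoskedastic version coincides with what vcov() returns for an lm fit, whereas the HC0 version remains valid when the error variance depends on the regressors.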
$X_1$ in the BLP or GATES regression
Returns a list with three elements. The first element of the list, funs_Z, controls which functions of matrix Z are used as regressors in $X_1$. The second element, covariates, is an optional matrix of custom covariates that shall be included in $X_1$. The third element, fixed_effects, controls the inclusion of fixed effects.
setup_X1(funs_Z = c("B"), covariates = NULL, fixed_effects = NULL)
funs_Z |
Character vector controlling the functions of Z to be included in $X_1$. Must be a subset of c("S", "B", "p"). Default is "B". |
covariates |
Optional numeric matrix containing additional covariates to be included in $X_1$. Default is NULL. |
fixed_effects |
Numeric vector of integers that indicates cluster membership of the observations: For each cluster, a fixed effect will be added. Default is NULL. |
The output of this setup function is intended to be used as an argument in the functions GenericML() and GenericML_single() (arguments X1_BLP, X1_GATES), as well as BLP() and GATES() (argument X1_control).
An object of class "setup_X1", consisting of the following components:
funs_Z
A character vector, being a subset of c("S", "B", "p").
covariates
Either NULL or a numeric matrix.
fixed_effects
Either NULL or an integer vector indicating cluster membership.
See the description above for details.
Chernozhukov V., Demirer M., Duflo E., Fernández-Val I. (2020). “Generic Machine Learning Inference on Heterogenous Treatment Effects in Randomized Experiments.” arXiv preprint arXiv:1712.04802. URL: https://arxiv.org/abs/1712.04802.
GenericML(), GenericML_single(), BLP(), GATES(), setup_vcov(), setup_diff()
set.seed(1)
n <- 100 # sample size
p <- 5   # number of covariates
covariates <- matrix(runif(n*p), n, p) # sample matrix of covariates

# let there be three clusters; assign membership randomly
fixed_effects <- sample(c(1,2,3), size = n, replace = TRUE)

# use BCA estimates in matrix X1
setup_X1(funs_Z = "B", covariates = NULL, fixed_effects = NULL)

# use BCA and propensity score estimates in matrix X1
# uses uniform covariates and fixed effects
setup_X1(funs_Z = c("B", "p"), covariates = covariates, fixed_effects = fixed_effects)
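How a vector of cluster memberships can translate into fixed effects is sketched below in base R. This is one common construction, assumed here for illustration and not taken from the package source: the memberships are treated as a factor and expanded into dummy columns, with the first cluster serving as the reference category. The vector B is a random placeholder standing in for proxy BCA estimates.

```r
set.seed(1)
n <- 100
fixed_effects <- sample(c(1, 2, 3), size = n, replace = TRUE)

## placeholder for the proxy BCA estimates (random numbers, illustration only)
B <- rnorm(n)

## one dummy column per cluster, dropping the intercept column so that
## the first cluster acts as the reference category
fe_dummies <- model.matrix(~ factor(fixed_effects))[, -1, drop = FALSE]

## a regressor matrix in the spirit of X1: BCA proxy plus cluster dummies
X1 <- cbind(B = B, fe_dummies)
head(X1)
```

With three clusters this yields two dummy columns next to B, since a third dummy would be collinear with the regression intercept.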
Check if user's OS is a Unix system
TrueIfUnix()
A Boolean that is TRUE if the user's operating system is a Unix system and FALSE otherwise.
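A check of this kind can be written in one line of base R. The sketch below relies on .Platform$OS.type, which is "unix" on Unix-alikes and "windows" on Windows; whether TrueIfUnix() is implemented this way is an assumption, and the helper name is_unix() is hypothetical.

```r
## hypothetical helper, not necessarily how TrueIfUnix() is implemented
is_unix <- function() {
  .Platform$OS.type == "unix"
}

is_unix()
```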