---
output: github_document
keep_md: TRUE
---
<!-- README.md is generated from README.Rmd. Please edit README.Rmd file -->
```{r, echo = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
[License: GPL-3](http://www.gnu.org/licenses/gpl-3.0.html)
[SamplingStrata on CRAN](https://cran.r-project.org/package=SamplingStrata)
[SamplingStrata on r-pkg.org](http://www.r-pkg.org/pkg/SamplingStrata)
[Awesome official statistics software](http://www.awesomeofficialstatistics.org)
# SamplingStrata <img src="man/figures/apple-touch-icon-152x152.png" align="right" alt="" />
This package offers an approach for determining the best
stratification of a sampling frame: the one that ensures the
minimum sample cost while satisfying precision constraints
in a multivariate and multidomain setting. The approach is
based on a genetic algorithm: each solution (i.e. a particular
partition of the sampling frame into strata) is considered as
an individual in a population; the fitness of each individual
is evaluated by applying the Bethel-Chromy algorithm to
calculate the sample size that satisfies the precision
constraints on the target estimates.
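To make the fitness evaluation more concrete, the base R sketch below computes the sample size that one candidate stratification needs in order to meet a CV constraint on a single target variable. It uses the textbook Neyman allocation as a simplified, univariate stand-in for the multivariate Bethel-Chromy algorithm applied by the package. The function name `neyman_size` and the toy strata are illustrative assumptions and are not part of SamplingStrata.

```{r, eval=FALSE, echo=TRUE}
# Toy stratification (hypothetical values):
# N = units per stratum, M = stratum means, S = stratum standard deviations
strata_toy <- data.frame(N = c(500, 300, 200),
                         M = c(10, 50, 200),
                         S = c(4, 20, 80))

# Minimum total sample size under Neyman allocation such that the CV of the
# estimated total does not exceed cv_target (single target variable only)
neyman_size <- function(N, M, S, cv_target) {
  total <- sum(N * M)              # population total of the target variable
  v_max <- (cv_target * total)^2   # largest admissible variance of the estimate
  n <- sum(N * S)^2 / (v_max + sum(N * S^2))
  ceiling(n)
}

n_needed <- neyman_size(strata_toy$N, strata_toy$M, strata_toy$S, cv_target = 0.05)
n_needed
# 56
```

Between two candidate stratifications evaluated under the same constraint, the one requiring the smaller sample is the fitter individual.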
The functions in the package make it possible to:
* prepare the required input data;
* execute the optimization step;
* analyse the results of the optimization step;
* select a sample from the new frame according to the best allocation.
Functions for the execution of the genetic algorithm are a modified
version of the functions in the 'genalg' package.
A complete illustration of all features and functions can be found at the link:
https://barcaroli.github.io/SamplingStrata/
Download the SamplingStrata cheatsheet from:
https://rstudio.com/resources/cheatsheets/
## Installation
You can install SamplingStrata from GitHub with:
```{r gh-installation, eval = FALSE}
install.packages("devtools")
devtools::install_github("barcaroli/SamplingStrata")
```
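The package is also listed on CRAN (see the CRAN link near the top of this README), so the released version can presumably be installed with the standard command below. Note that the CRAN release may lag behind the GitHub version.

```{r cran-installation, eval = FALSE}
# Install the released version from CRAN instead of the development version
install.packages("SamplingStrata")
```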
<img src="cheat_sheet_page1.png" />
<img src="cheat_sheet_page2.png" />
## Three different methods for the optimization step
The optimization step can be run with one of three methods, chosen as follows:
A. if the stratification variables are categorical (or have been reduced to categorical ones), use the "atomic" method;
B. if the stratification variables are continuous, use the "continuous" method;
C. if the stratification variables are continuous and there is spatial correlation among the units in the sampling frame, use the "spatial" method.
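As an illustration of what "reduced to categorical" in case A can mean, the base R sketch below bins a continuous variable into four classes at its quartiles. The data are simulated and the choice of quartile boundaries is an assumption made only for this example; it is not necessarily how the package's own helpers bin variables.

```{r, eval=FALSE, echo=TRUE}
set.seed(1)                                   # reproducible toy data
pop <- rlnorm(1000, meanlog = 7, sdlog = 1)   # hypothetical continuous variable

# Class boundaries at the quartiles, so each class holds about 25% of the units
breaks <- quantile(pop, probs = seq(0, 1, by = 0.25))
pop_class <- cut(pop, breaks = breaks, include.lowest = TRUE, labels = FALSE)

table(pop_class)   # number of units in each of the 4 classes
```

The resulting integer codes could then serve as a categorical stratification variable for the "atomic" method.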
## Complete example
Jupyter notebook: [launch on Binder](https://mybinder.org/v2/gh/barcaroli/SamplingStrata_binder/HEAD?filepath=SamplingStrata.ipynb)
## Example with the "atomic" method
```{r, eval=FALSE, echo=TRUE}
library(SamplingStrata)
# Load data ---------------------------------------------------------------------------------
data("swissmunicipalities")
head(swissmunicipalities[,c(2:6,9,22)])
# REG COM Nom HApoly Surfacesbois Airbat POPTOT
# 1 4 261 Zurich 8781 2326 2884 363273
# 2 1 6621 Geneve 1593 67 773 177964
# 3 3 2701 Basel 2391 97 1023 166558
# 4 2 351 Bern 5162 1726 1070 128634
# 5 1 5586 Lausanne 4136 1635 856 124914
# 6 4 230 Winterthur 6787 2807 972 90483
# Define the sampling frame -----------------------------------------------------------------
frame <- buildFrameDF(df = swissmunicipalities,
                      id = "COM",                       # unique identifier of sampling units
                      domainvalue = "REG",              # domain variable (region)
                      X = c("POPTOT", "HApoly"),        # stratification variables
                      Y = c("Surfacesbois", "Airbat"))  # target variables
head(frame)
# id X1 X2 Y1 Y2 domainvalue
# 1 261 363273 8781 2326 2884 4
# 2 6621 177964 1593 67 773 1
# 3 2701 166558 2391 97 1023 3
# 4 351 128634 5162 1726 1070 2
# 5 5586 124914 4136 1635 856 1
# 6 230 90483 6787 2807 972 4
# Define precision constraints ------------------------------------------------------------
ndom <- length(unique(frame$domainvalue))
cv <- as.data.frame(list(DOM = rep("DOM1", ndom),
                         CV1 = rep(0.10, ndom),     # precision (cv=10%) for 'Surfacesbois'
                         CV2 = rep(0.10, ndom),     # precision (cv=10%) for 'Airbat'
                         domainvalue = c(1:ndom)))  # same precision constraints for all domains
cv
# DOM CV1 CV2 domainvalue
# 1 DOM1 0.1 0.1 1
# 2 DOM1 0.1 0.1 2
# 3 DOM1 0.1 0.1 3
# 4 DOM1 0.1 0.1 4
# 5 DOM1 0.1 0.1 5
# 6 DOM1 0.1 0.1 6
# 7 DOM1 0.1 0.1 7
# Build atomic strata ---------------------------------------------------------------------
strata <- buildStrataDF(frame)
# Number of strata: 2895
# ... of which with only one unit: 2894
head(strata)
# STRATO N M1 M2 S1 S2 COST CENS DOM1 X1 X2
# 100*305 100*305 1 59 0 0 0 1 0 1 100 305
# 1010*1661 1010*1661 1 983 0 0 0 1 0 1 1010 1661
# 102*306 102*306 1 65 0 0 0 1 0 1 102 306
# 1020*5351 1020*5351 1 1375 2 0 0 1 0 1 1020 5351
# 10227*571 10227*571 1 73 48 0 0 1 0 1 10227 571
# 10230*330 10230*330 1 15 2 0 0 1 0 1 10230 330
# Find an initial solution and a suitable number of final strata in each domain -----------
solutionKmean <- KmeansSolution(strata = strata,    # atomic strata
                                errors = cv,        # precision constraints
                                maxclusters = 10)   # max number of strata to be evaluated
# number of strata to be obtained in each domain in final solution:
nstrat <- tapply(solutionKmean$suggestions, solutionKmean$domainvalue,
                 FUN = function(x) length(unique(x)))
nstrat
# 1 2 3 4 5 6 7
# 9 8 10 9 10 9 10
# Optimization step ------------------------------------------------------------------------
solution <- optimStrata(method = "atomic",            # method
                        framesamp = frame,            # sampling frame
                        errors = cv,                  # precision constraints
                        nStrata = nstrat,             # strata to be obtained in the final stratification
                        suggestions = solutionKmean,  # initial solution
                        iter = 50,                    # number of iterations
                        pops = 10)                    # number of stratifications evaluated at each iteration
# Number of strata: 2895
# ... of which with only one unit: 2894
# *** Starting parallel optimization for 7 domains using 5 cores
# |++++++++++++++++++++++++++++++++++++++++++++++++++| 100% elapsed=20s
#
# *** Sample size : 362
# *** Number of strata : 59
head(solution$aggr_strata)
# STRATO M1 M2 S1 S2 N DOM1 COST CENS SOLUZ
# 1 1 61.07407 17.37778 41.87780 13.22224 270 1 1 0 9.141966
# 2 2 1114.66667 64.80392 555.75540 53.48631 51 1 1 0 6.985276
# 3 3 57.05128 110.12821 50.51679 35.55146 39 1 1 0 3.550527
# 4 4 477.31472 33.92386 351.59986 37.68313 197 1 1 0 19.010081
# 5 5 3226.14286 184.00000 540.04720 80.64561 7 1 1 0 2.000000
# 6 6 1805.21429 150.28571 256.07733 210.69830 14 1 1 0 7.553702
# Sample selection --------------------------------------------------------------------------
s <- selectSample(frame = solution$framenew,        # frame with the indication of optimized strata
                  outstrata = solution$aggr_strata) # optimized strata with sampling units allocation
# *** Sample has been drawn successfully ***
# 362 units have been selected from 59 strata
#
# ==> There have been 6 take-all strata
# from which have been selected 9 units
head(s)
# DOMAINVALUE STRATO STRATUM ID X1 X2 Y1 Y2 LABEL WEIGHTS FPC
# 1 1 1 195*201 5534 195 201 37 10 1 30 0.03333333
# 2 1 1 172*193 5801 172 193 14 4 1 30 0.03333333
# 3 1 1 349*398 5499 349 398 19 15 1 30 0.03333333
# 4 1 1 2939*460 5582 2939 460 67 50 1 30 0.03333333
# 5 1 1 186*309 5663 186 309 65 10 1 30 0.03333333
# 6 1 1 290*421 5463 290 421 11 14 1 30 0.03333333
```
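The "Build atomic strata" step in the example above can be illustrated on a toy frame in base R. This is only a sketch of the idea, not the code of `buildStrataDF`: each distinct combination of the stratification variables becomes one stratum, for which the number of units, the mean and the standard deviation of the target variable are computed. The toy data are invented, and the use of a population standard deviation (divisor N) is an assumption inferred from the zero `S1` and `S2` values shown for single-unit strata.

```{r, eval=FALSE, echo=TRUE}
# Toy frame in the layout produced by buildFrameDF():
# two stratification variables, one target variable, one domain
frame_toy <- data.frame(id = 1:6,
                        X1 = c(1, 1, 1, 2, 2, 2),
                        X2 = c(1, 1, 2, 1, 1, 1),
                        Y1 = c(10, 14, 30, 50, 54, 58),
                        domainvalue = 1)

# An atomic stratum is one distinct combination of the X values
frame_toy$STRATO <- paste(frame_toy$X1, frame_toy$X2, sep = "*")

# Population standard deviation (divisor N): equals 0 for single-unit strata
pop_sd <- function(y) sqrt(mean((y - mean(y))^2))

# Summarise the target variable within each atomic stratum
groups <- split(frame_toy$Y1, frame_toy$STRATO)
atomic <- data.frame(STRATO = names(groups),
                     N  = vapply(groups, length, integer(1)),
                     M1 = vapply(groups, mean, numeric(1)),
                     S1 = vapply(groups, pop_sd, numeric(1)),
                     row.names = NULL)
atomic
```

With many distinct values of continuous stratification variables, almost every unit ends up in its own atomic stratum, as in the example above (2894 single-unit strata out of 2895).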