optimization_essay/husmann.tex at master · kaihusmann/optimization_essay · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
% !TeX root = RJwrapper.tex
\title{Flexible Global Optimization with Simulated-Annealing}
\author{by Kai Husmann, Alexander Lange and Elmar Spiegel}

\maketitle

\abstract{
Standard numerical optimization approaches require several restrictions. So do exact optimization methods such as the Linear Programming approach appeal for linearity and Nelder-Mead for unimodality of the loss function. One method to relax these assumptions is the Simulated Annealing approach, which reduces the risk of getting trapped in a local optimum. However, the standard implementation still requires regular parameter spaces and continuous loss functions. To address this issue, we implemented a version of the Simulated Annealing method that is able to deal with irregular and complex parameter spaces as well as with non-continuous and sophisticated loss functions. Moreover, in order to gain fast but reliable solutions, we included steps to shrink the parameter space during the iterations. All these steps are summarized in the R-package \CRANpkg{optimization}, which we will introduce in the following article. We also included generic and real world applications in order to test our approach.
}

\section{Introduction}
As early computer-based optimization methods developed simultaneously with the first digital computers \citep{corana_1987}, numerous optimization methods for various purposes are available today \citep{wegener_2005}. One of the main challenges in Operations Research is therefore to match the optimization problem with a reasonable method. The complexity of the problem determines possible methods. Optimization procedures in general can be distinguished into exact methods and heuristics \citep{kirkpatrick_1983}. For simple optimization problems, exact methods are often meaningful tools of choice. If all assumptions on model loss and restrictions are met, these methods will obligatorily find the exact solution without need for further parameters. They are the easiest way of solving optimization problems. The \dfn{Linear Simplex-Method} \citep{dantzig_1959} is one example that only needs the loss-function and optional restrictions as model input. If, however, any of the model assumptions, e.g. linearity or unimodality, is violated, exact methods are unable to solve problems in a valid way. With developing computer power, heuristics like the \dfn{Savings-Algorithm} \citep{clarke_1964} and metaheuristics like \dfn{Simulated Annealing} (SA) \citep{kirkpatrick_1983} became popular. They enable solving more complex optimization problems. Metaheuristics are a generalization of heuristics with aim to be even more flexible and efficient \citep{blum_2003} so that they can solve complex optimization problems, such as nonlinear problems. Direct search methods like the \dfn{Nelder-Mead} (NM) algorithm are comparatively efficient methods which directly converge to the functions optimum and need relatively fewer settings \citep{geiger_1999}. Random search methods are able to cope with multimodal objective functions. On the one hand, depending on the method of choice, more or fewer assumptions on the loss function can be neglected. On the other hand, heuristics and metaheuristics will always solve problems approximately. Precision of the solution depends on the optimization method and further parameters. There is even no guarantee of approximating the actual optimum since the solution also depends, contrary to exact methods, on parameterization \citep{blum_2003}. Defining proper parameters is thus a crucial point of those methods. The complexity of parameterization will by trend increase with the flexibility of the method while efficiency tends to decrease. The efficiency and accuracy of such models is strongly sensitive to their parameter specification \citep{corana_1987}. Heuristics are often programmed for multi-purpose usage such that there is a suitable method for many optimization problems. For complex optimization problems, however, multi-purpose optimizers often fail to find solutions. Additionally, multi-purpose optimizers are usually not suitable or not efficient for highly complex problems like problems with restricted parameter space. Whenever general-purpose optimizers are too restrictive or inflexible to solve a problem properly, specific methods are advantageous. They offer many variable parameters and can thus be parameterized in a way that is very specific to the optimization problem. They represent the most flexible and the most complex optimization methods \citep{blum_2003}.

SA \citep{kirkpatrick_1983} is known to be one of the oldest and most flexible metaheuristic methods, though the term metaheuristic was established after initial publication of SA \citep{blum_2003}. It is known to be better suited to multimodal loss functions with a very high number of covariates than many other methods \citep{corana_1987}. The method has been applied in many studies of several fields such as chemistry \citep{agostini_2006}, econometrics \citep{ingber_1993} or forest sciences \citep{baskent_2002, boston_1999}. Since its first implementation by \citet{kirkpatrick_1983}, many authors have modified the algorithm in order to adopt it for specific problems \citep[e.g.][]{desarbo_1989, goffe_1996} or more general applications \citep[e.g.][]{xiang_2013}. It combines systematic and stochastic components, and thus enables escaping local optima. It is hence typically used for global optimization of multimodal functions. As it offers many options, SA can be seen a hybrid method between a general optimizer (when default values are chosen) and a problem specific optimization algorithm \citep{wegener_2005}. \citet{corana_1987} developed a dynamic adoption method for the variation of the stochastic component during the optimization process. Their modification affects the efficiency as well as the accuracy of the SA algorithm. It has potential to substantially improve the method. \citet{pronzato_1984} suggests decreasing the search-domain of the stochastic component with increasing number of iterations. The stochastic component in general is the most sensitive part of the method since it determines the loss variables modification during the iterations.

The R software environment provides a platform for the simple and effective distribution of statistical models to a large user community \citep{xiang_2013}. Thus, not surprisingly, several optimization packages of high quality can currently be purchased via \dfn{Comprehensive R Archive Network} \citep{theussl_2016}, where even the SA method is recently listed five times. However, we believe that there is need for a specific stochastic optimization package for complex optimization problems. A package coping with very flexible user definable loss functions with multiple options could be an advantageous extension for R. There is demand for specifically definable optimization problems. We therefore present the package \CRANpkg{optimization} which is basically a modified version of SA. We used the properties of SA to program a stochastic optimization method for specific purposes. We therefore focused on flexibility and implemented many user specifiable parameters. Our method is, in its default configuration, usually not immediately efficient but flexibly adoptable to specific purposes. The main advantages of the package are the possibilities to specifically adjust covariate changing rules as well as the robustness of the loss function. For example, the changing rule allows the user to define an integer parameter space. The loss function can return any value; even \code{NA} or \code{NaN} are possible. Several further user specifications help to parameterize the model in a problem-specific way, so that the user can influence accuracy and speed in very detailed ways. It is also the first R function where the improvements of \citet{corana_1987} and \citet{pronzato_1984} are implemented into an SA based optimization software. This means that the search domain of the stochastic component of SA dynamically shrinks with increasing iterations. We also implemented a generic plot function for post-hoc inspection of the model convergence assessment and the solution quality.

In the following, we briefly introduce the algorithm methodologically and explain the most relevant parameters. We show in four examples in which ways our model is favorable against standard methods and explain how it can be parameterized. We develop two examples illustrating the basic model behavior with a focus on the covariate changing rule. Additionally, we include suggestions on how to adopt the numerous options to specific problems. Two practical examples where our function is recently used underpin the relevance of our specific-purpose optimization method. One of them, the optimization of forest harvesting schedules, is a relatively complex example which cannot be solved with any other optimization function in the R framework.

\section{The package optimization}
In this section, we explain the theory of our SA interpretation and the resulting parameters.

\subsection{Method}
\label{subsec:method}
Since the basic idea of classic SA is derived from the physical process of metal annealing, the nomenclature of SA comes particularly from metallurgy. Just as the classic SA, our function is composed of an inner for loop and an outer while loop \citep{kirkpatrick_1983}. The number of iterations in both loops can be defined by the user. For better overview, we displayed the important steps of our function in a pseudocode (Algorithm~\ref{alg:alg1}).

\subsubsection{Inner loop}
The function of the inner for loop (Algorithm~\ref{alg:alg1}, lines 4 to 29) is to draw covariate combinations stochastically and to compare the returns. The loop repeats $n_{inner}$ times. The first operation of the inner loop (line 5) is saving the covariate combinations of the last inner iteration as $x_j$. In the first iteration $x_j$ is the vector with user defined initial covariates.

In the second step (line 6), the covariates are changed. This changing process marks an essential difference between our approach and the other SA based method in the R framework. The shape and behavior of the variation process may be defined by the user and may be dynamic. Suggestions and examples for specifying this variation, which is a user defined R function, will be given in the following sections. The variation function \textit{vf} is used to create a temporary vector of covariates $x_{i*}$. Besides the former covariate combination $x_j$, the variation function can depend on a vector with random factors \textit{rf} and the temperature \textit{t}. Since \textit{rf} and \textit{t} change over time, the variation function can have dynamic components. Adjustment of \textit{rf} and \textit{t} is done in the outer loop which will be explained explicitly in the following subsection. In the classical SA approach, the covariates $x_{i*}$ are generated by adding or subtracting a uniformly distributed random number to $x_j$ \citep{kirkpatrick_1983}. The range of the uniform random number is, in our function, determined by \textit{rf} whereas \textit{rf} is relative to $x_j$. A random factor of 0.1 and a covariate expression of three e.g. leads to a uniform random number between 2.7 and 3.3. This standard variation function is also default in our approach. A very simple exemplary modification of the variation function could be a normally distributed random number with mean $x_j$ and standard deviation \textit{rf}.

\SetAlCapSkip{2ex}
\begin{algorithm}[H]
	initialize \textit{t}, \textit{vf} with user specifications\\
	calculate $f(x_0)$ with initial parameter vector $x_0$\\
	\While{\textit{t} $>$ $t_{min}$}{
		\For{i in 1: $n_{inner}$}{
			$x_j \gets x_{(i-1)}$\\
			call the variation function to generate $x_{i*}$ in dependence of $x_{j}$, \textit{rf} and \textit{t}\\
			check if all entries in $x_{i*}$ are within the boundaries\\
			\eIf{all $x_i$ valid}{calculate $f(x_{i*})$}{
				\While{any($x_{i*}$ invalid)}{
					call the variation function again\\
					count invalid combinations}
			}
			\eIf{$f(x_{i*}) < f(x_j)$}{
				$x_{i} \gets x_{i*}; f(x_{i}) \gets f(x_{i*})$
			}{calculate Metropolis Probability \textit{M} (Equation~\eqref{eq:eq1})\\ \eIf{uniformly distributed random number [0,1] $<$ \textit{M}}{$x_{i} \gets x_{i*}; f(x_{i}) \gets f(x_{i*})$}{$x_{i} \gets x_j; f(x_{i}) \gets f(x_j)$}}
			\If{threshold accepting criterion fulfilled}{break inner loop}
		}

		{
			reduce \textit{t} for the next iteration\\
			\textit{rf} adaptation for the next iteration\\
		}
	}
	\textbf{return} optimized parameter vector, function value and some additional information \\
	\caption{Pseudocode of the \code{optim\_sa} function in the \textbf{optimization} package exemplary for a minimization.}
	\label{alg:alg1}
\end{algorithm}

After generating $x_{i*}$, the boundaries are checked (lines 7 to 15). If all entries in $x_{i*}$ are within their respective boundaries, the response is calculated. Otherwise the invalid entries of $x_{i*}$ are drawn again until all entries are valid. According to \citet{corana_1987}, the number of invalid trials can be useful information in order to assess the quality of the search domain. The numbers of invalid trials are thus counted and stored (line 13) in order to make this information accessible for the outer loop. The count of valid and invalid trials is not reset after each inner loop repetition. Both are only initialized in iteration one of the inner loop and increase until the last iteration.

Next step is the comparison of loss function returns (lines 16 and 17). If the return of current variables combination $f(x_{i*})$ is better than $f(x_j)$, $x_{i*}$ and $f(x_{i*})$ are stored into $x_{i}$ and $f(x_{i})$, so $x_{i}$ are the initial covariates for the next iteration. The core idea of the classical SA approach is to cope with the problem of local optima. Thus, even if $f(x_{i*})$ is worse than $f(x_{j})$, there is a chance of storing $x_{i*}$ into $x_i$ (lines 18 to 25). The likelihood of keeping worse responses depends on the Metropolis probability \textit{M} \citep{metropolis_1953}.
\begin{equation}
\label{eq:eq1}
M = exp \left(-\frac{\mid f(i_*)-f(j)\mid}{kt}\right),
\end{equation}
% No line between eq. and text
with \textit{k} being a user definable constant. We adopted this strategy from classic SA without modifications. \textit{M} decreases with decreasing temperature t. The likelihood of keeping worse responses thus depends on the same parameters for the whole inner loop (lines 19 and 24). \textit{t} does not change during the entire inner for loop. The likelihood of keeping worse values is thus equal for each response until the inner loop is completed. Modification of \textit{t} is part of the outer loop which will be explained in the next paragraph. If a worse result is chosen the former optimal covariate combination is, of course, stored before it is overwritten since otherwise there is a sound chance of overwriting that actual global optimum. More details of the Metropolis probability can i.a. be found in \citet{kirkpatrick_1983} and \citet{metropolis_1953}.

Storing the information on development of covariates and response can help improving the perfor- mance of SA \citep{lin_1995, hansen_2012}. We implemented a threshold accepting strategy \citep{dueck_1990} into our SA interpretation (lines 26 to 28). This criterion is the only module that allows reducing the inner loop repetitions without direct user influence. It is simply a vector where the absolute differences of $f(x_{i})$ and $f(x_j)$ are stored. If the response oscillates for a user defined number of repetitions within a user defined threshold, the inner loop breaks.

\subsubsection{Outer loop}
The main functions of the outer while loop (lines 3 to 32) are calling the inner loop (lines 3 to 29) and modifying the parameters that are needed in the inner loop (lines 30 and 31). Therefore \textit{t} and \textit{rf} only change after completely finishing an inner loop. The outer loop repeats until \textit{t} is smaller than the user defined minimum temperature $t_0$.

After finishing the inner loop, firstly \textit{t} is adjusted (line 30). \textit{t} is necessary for the stochastic part in the inner loop (line 19, Equation~\eqref{eq:eq1}). In our function, \textit{t} decreases per definition as it is calculated by multiplying the temperature of the current iteration by \textit{r} which is a user defined real number between 0 and 1. The number of outer loop repetitions is thus implied by initial temperature $t_0$, $t_{min}$ and \textit{r}.

Afterwards \textit{rf} changes (line 31). The dynamic adaption of \textit{rf} after \citet{corana_1987} and \citet{pronzato_1984} is another major novelty of our function. As each covariate can have its own random factor, \textit{rf} is a vector of the same size as $x_i$. \textit{rf} is needed for the covariate variation in the inner loop (lines 6 and 12). Dividing the numbers of invalid trials distinctively for each covariate by the total number of trials of the respective covariate gives the ratio of invalid trials for each covariate. The numbers of invalid trials are counted in line 13. According to \citep{corana_1987}, this ratio of invalid trials can be used to find a trade-off between accuracy and the size of the search domain. They argue that if only valid covariate combinations are drawn, the search domain could be too small for multimodal problems. For this, the ratio of invalid trials in the current iteration is used to generate the \textit{rf} for the following outer loop repetition. They suggest ratios between 0.4 and 0.6. If any observed ratio of invalid trials is < 0.4 or > 0.6, the respective random factors are modified following the suggested equation by \citet{corana_1987}. This strategy allows an adjustment of \textit{rf} for the next iteration. \citet{pronzato_1984} who developed the \dfn{Adaptive Random Search method}, propose a time decreasing search domain. Thus they suggest a search domain adjustment that does not depend on former information, as \citet{corana_1987} did, but on the number of iterations. They argue that the search domain should be wide at the beginning to give the algorithm the chance to cope with local optima, and small in the end to allow higher precisions since the algorithm should converge to the global optimum at the end. Later iterations thus require smaller search domains than earlier iterations. We integrated the idea of \citet{pronzato_1984} into the dynamic search domain adjustment of \citep{corana_1987} by linearly shrinking the favorable range of ratios from the suggested ratios (0.4-0.6) to 0.04-0.06. As mentioned in the inner loop explanations, the variation function can be user defined (lines 6 and 12). The user thus has the option to define flexibly in which way \textit{t} and \textit{rf} influence the covariate variation. Per default, the search domain around the covariates shrinks by trend as the number of outer loop iterations increases.

\subsection{The function optim\_sa}
As \code{optim\_sa} shall be able to solve very specific optimization problems, several parameters can be defined by the user. The quality of solution and speed of convergence will thus substantially depend on accurate parametrization. In the following, we will explain the most important parameters briefly and make suggestions for useful specification. A complete parameter list can be found in the vignette of \CRANpkg{optimization}.
\begin{itemize}
\item \code{fun}: Loss function to be optimized. The statement is without default. The function must depend on a vector of covariates and return one numeric value. There are no assumptions on covariates and return. The covariates do not even need to be continuous. Missing (\code{NA}) or undefined (\code{NaN}) returns are also allowed. Any restriction on the parameter space, e.g. specific invalid covariate values within the boundaries, can be directly integrated into the loss function by simply returning \code{NA}. We will include more specific information on this in the practical examples.
\item \code{start}: Numeric vector with initial covariate combination. This statement has no default. It must be ensured that at least the initial covariate combination leads to a defined numeric response. The loss function at the initial variables combination must therefore return a defined numeric value. This might be relevant when the starting values are determined stochastically.
\item \code{trace}: If \code{TRUE}, the last inner loop iteration of each outer loop iteration is stored in a row in the trace matrix. This might help evaluating the solutions quality. However, storing interim results increases calculation time by up to 10 \%. Disabling \code{trace} can thus improve efficiency when the convergence of an optimization problem is known to be stable.
\item \code{lower, upper}: Numeric vector with lower boundaries of the covariates. The boundaries are needed since the dynamic \code{rf} adjustment \citep{corana_1987, pronzato_1984} depends on the number of invalid covariate combinations.
\item \code{control}: A list with optional further parameters.
\end{itemize}
All parameters in the list with \code{control} arguments have a default value. They are pre-parameterized for loss functions of medium complexity. \code{control} arguments are:
\begin{itemize}
\item \code{vf}: Variation function that allows the user to restrict the parameter space. This is one of the most important differences to classic SA. The function determines the variation of covariates during the iterations. It is allowed to depend on \code{rf}, \code{temperature} and the vector of covariates of the current iteration. The variation function is a crucial element of \code{optim\_sa} which enables flexible programming. It is (next to the loss function itself) the second possibility to define restrictions. The parameter space of the optimization program can be defined by \code{vf}. Per default, the covariates are changed by a continuous, uniformly distributed random number. It must be considered that defining specific \code{rf} can increase the calculation time. The default \code{rf} is a compiled C++ function whereas user specified \code{rf} must be defined as R functions. User specified \code{rf} are useful for optimization problems with non-continuous parameter space.
\item \code{rf}: Numeric vector with random factors. The random factors determine the range of the random number in the variation function \code{vf} relative to the dimension of the function variables. The \code{rf} can be stated separately for each variable. Default is a vector of ones. If \code{dyn\_rf} is enabled, the entries in \code{rf} change dynamically over time.
\item \code{dyn\_rf}: Boolean variable that indicates if the \code{rf} shall change dynamically over time to ensure increasing precision with increasing numbers of iterations. \code{rf} determines whether the adjustments of \citet{corana_1987} and \citet{pronzato_1984} are enabled (see method section for theoretical background). \code{dyn\_rf} ensures a relatively wide search domain at the beginning of the optimization process that shrinks over time. Disabling \code{dyn\_rf} can be useful when \code{rf} with high performance are known. The development of \code{rf} is documented in the \code{trace} matrix. Evaluation of former optimizations with dynamic \code{rf} can thus help finding efficient and reasonable fixed \code{rf}. Self-specified \code{vf} may not depend on \code{rf}. In this cases activating \code{dyn\_rf} will not have any advantage.
\item \code{t0}: Initial temperature. The temperature directly influences the likelihood of accepting worse responses and thus the stochastic part of the optimization. \code{t0} should be adopted to the loss function complexity. Higher temperatures lead to a higher ability of coping with local optima but also to more time-consuming function calls.
\item \code{t\_min}: Numeric value that determines the temperature where outer loop stops. As there is practically no chance of leaving local optima in iterations with low temperature \code{t\_min} mainly affects accuracy of the solution. Higher \code{t\_min} yields to lower accuracy and fewer function calls.
\item \code{nlimit}: Integer value which determines the maximum number of inner loop iterations. If the break criterion in the inner loop is not fulfilled, \code{nlimit} is the exact number of inner loop repetitions. It is therefore an important parameter for determining the number of iterations.
\item \code{r}: Numeric value that determines the reduction of the temperature at the end of each outer loop. Slower temperature reduction leads to an increasing number of function calls. It should be parameterized with respect to \code{nlimit}. High \code{nlimit} in combination with low \code{r} lead to many iterations with the same acceptance likelihood of worse responses. Low \code{nlimit} in combination with \code{r} near 1, by contrast, lead to a continuously decreasing acceptance likelihood of worse responses. It is thus the second crucial parameter for determining the number of iterations.
\end{itemize}

\section{Examples}
To show the benefits of our \CRANpkg{optimization} package we build four examples explaining where the \code{optim\_sa} function can be advantageous.

\subsection{Himmelblau Function with continuous parameter space}
Himmelblau’s function (Equation~\eqref{eq:eq2}) \citep{himmelblau_1972} was chosen as an initial example since it is a very simple multimodal equation and widely known in Operations Research. It has four equal minimum values ($min(f(x_1,x_2))=0)$) at \{-2.8, 3.1\}, \{3.0, 2.0\}, \{3.6, -1.8\} and \{-3.8, -3.3\}. In order to display the basic behavior of \CRANpkg{optimization}, it was compared with two other SA methods from the \pkg{stats} package. Himmelblau’s function is relatively simple. Therefore we also included the NM optimization method \citep{nelder_1965} which is a default of \code{optim} from the \pkg{stats} package, to examine the advantages of stochastic search against direct search.

\begin{equation}
\label{eq:eq2}
f(x_1,x_2)=(x_1^2+x_2-11)^2+(x_1+x_2^2-7)^2
\end{equation}

We performed 10,000 repetitions with each function in order to investigate the quality and speed of the solutions using parameters for relatively simple optimization problems for all examined methods. The optimizations were performed having the following parameters:

    \begin{example}
# stats package: optim (NM)
stats::optim(fn = hi, par = c(10, 10), method = "Nelder-Mead")

# optimization package: optim_sa
optimization::optim_sa(fun = hi, start = (c(10, 10)), trace = TRUE,
                       lower = c(-40, -40), upper = c(40, 40),
                       control = list(t0 = 500, nlimit = 50,r = 0.85,
                                      rf = 3, ac_acc = 0.1, dyn_rf = TRUE))

# stats package optim (SA)
stats::optim(fn = hi, par = c(10, 10), method = "SANN",
             control = list(tmax = 500, reltol = 0.1, temp = 50, trace = TRUE))
\end{example}
%% No line break
Since we have a multimodal optimization problem with multiple equal solutions, the evaluation of solutions quality is composed of response accuracy and covariate combination. With fixed starting parameters, the methods should be able to find all possible solutions. We parameterized the functions from the example such that thy fulfill a defined accuracy threshold. We defined the optimization problem to be solved properly when the responses of the 10,000 repetitions were in mean $\leq$ 0.01. We also looked at the frequency distribution of the covariate combination after minimization. Subsequent to the investigation of quality, we also compared the efficiencies by measuring calculation times using \CRANpkg{microbenchmark} and the iterations frequencies.

It became clear that parameterization of NM was quite simple. It only needed a vector with starting values. The other functions required a larger number of settings. With view to accuracy each method performed well. All functions returned in mean of 10,000 calls responses with values $\leq$ 0.01. Regarding frequency distribution, the functions performed differently (Table~\ref{tab:tab1}). \code{optim\_sa} and \code{optim (SA)} returned all possible solutions. The combination \{-3.8, -3.3\} was consistently least frequent. As \code{optim (NM)} is a direct search method, it only returned one solution \{-3.8, -3.3\}. Further investigation revealed the solutions of optim (NM) to be sensitive to the starting values. If \code{optim (NM)} was parameterized with randomly drawn starting values, all four results would have been possible. Thus one advantage of \code{optim\_sa} and \code{optim (SA)} against direct search methods, like \code{optim (NM)}, is its independence from starting values. Random search methods are advantageous for solving multimodal optimization problems.

% latex table generated in R 3.2.3 by xtable 1.8-2 package - Modified after pasting
% Tue Mar 7 14:26:30 2017
\begin{table}[]
\centering
\caption{Relative frequencies of covariate combinations in \% after optimization of Example 1 for the four examined methods. Number of repetitions: 10,000. We used the parameters given in the example, only the \code{trace} options was deactivated.}
\label{tab:tab1}
\begin{tabular}{cccccc} \cline{3-6}
& \multicolumn{1}{c}{} & \multicolumn{4}{c}{Result (rounded)}                    \\ \cline{3-6}
&                      & \{-2.8, 3.1\} & \{3.0, 2.0\} & \{3.6, -1.8\} & \{-3.8, -3.3\} \\ \cline{2-6}
\multirow{4}{*}{Method} & optim\_sa          & 22.19     & 33.49    & 28.05     & 16.27      \\
& optim (SA)            & 25.91     & 30.89    & 24.08     & 19.12      \\
& optim (NM)          & 0.00      & 0.00     & 0.00      & 100.00     \\ \cline{2-6}
\end{tabular}
\end{table}

As all functions were practically able to minimize Equation~\eqref{eq:eq2}, comparison of calculation times appears to be another important point for quality assessment. As expected, the direct search method \code{optim (NM)} was by far faster than all stochastic methods (Figure \ref{fig:fig1}, left). The two functions able to cope with equally valued optima (\code{optim (SA)} and \code{optim\_sa}) were significantly slower (Table~\ref{tab:tab1}; Figure~\ref{fig:fig1}, left). We parameterized the functions such that they fulfill our accuracy requirements. The parameterizations in the examples should thus be near to the most efficient parameterizations for the given problem with the given accuracy.

Another way of comparing algorithms' efficiency is to examine the frequency of necessary iterations solving the optimization problem. Again, \code{optim (NM)} performed best (Figure~\ref{fig:fig1}, right). \code{optim (SA)} required the most repetitions by far. Only \code{optim\_sa} showed varying frequencies because \code{optim\_sa} was, given the displayed parameters, the only function where the inner loop broke due to the threshold accepting criterion. \code{optim (SA)} always iterated till the user specified limit. It could be seen that \code{optim\_sa} performed well when compared to established SA based optimization function. \code{optim\_sa} is slower but it requires less iterations. The calculation of valid parameter combinations (Algorithm~\ref{alg:alg1}, lines 7 to 15) leads to multiple function calls within one iteration. For this, one iteration of \code{optim\_sa} can last longer than one iteration of the other methods. The number of function calls is not necessarily equal to the number of iterations when \code{dyn\_rf} is activated. The \code{dyn\_rf} which adopts the search domain dynamically to specific loss functions on the one hand thus leads to longer calculation times on the other hand. Improved flexibility therefore again corresponds with longer calculation time.

To conclude, all functions were generally able to solve the problem. The flexible stochastic search-grid of \code{optim (SA)} and \code{optim\_sa} enabled archiving each of the four solutions. Practically, if users are in doubt whether a problem has multiple solutions with equal responses, \code{optim (SA)} and \code{optim\_sa} can simply be repeated without re-parameterization. If well-specified, they will return all possible solutions. They thus have advantages for complex multimodal problems of that kind but are also slower. Due to the multiple additional options, which all have to be called in the code, \code{optim\_sa} is slowest. This example clearly reveals that higher flexibility and generality leads to significantly higher calculation times. Specific algorithms have advantages for more complex problems while more general functions are useful for optimization problems with simple responses.

\textsc{\begin{figure}[htbp]
	\centering
	\resizebox{1.04\linewidth}{!}{\input{Fig/fig1_ex1.tex}}
	\caption{Calculation times and frequency of iterations of the four examined optimization algorithms in Example 1. Note that the y-axis of the left diagram is truncated. \code{optim\_sa} showed 101 and \code{optim (SA)} 70 outliers between four and seven milliseconds. The frequency of iterations represents the total number of iterations. Thus, for the SA methods, all inner loops repetitions are counted. The packages parameterization are shown in example code.}
	\label{fig:fig1}
\end{figure}}


\subsection{Himmelblau Function with discrete parameter space}
A second example illustrating the advantage of flexibly defining the parameter spaces also bases on the function by \citet{himmelblau_1972}. It is an exemplary optimization problem which cannot be solved with any other examined method. If a user is interested in integer covariate combinations only, simply rounding the solution is usually not satisfying since the solution could be non-feasible or non-optimal \citep{cardoso_1997}. The flexible variation function \code{vf} of our models allows searching for integer temporary solutions only and thus searching for the global integer solution. \code{vf} can be used to restrict the parameter space to the specific problem. The following simple example shows a variation rule for an exemplary integer programming problem. The simple \code{var\_fun\_int} from the example returns a vector of integer covariates varying in dimension of \code{vf} around the passed covariates \code{para\_0} The \code{fun\_length} must be passed explicitly, although it is implicitly given by the length of \code{para\_0} because it might help simplifying the function \code{vf}. Dependence of \code{rf} and \code{temp} can be implemented. Optimization problems do not always need a dynamic component. If \code{vf} need not be dynamic, i.e. if the stochastic part shall be fixed during the entire iterations, \code{temp} and \code{rf} can separately be disabled by passing \code{NA} for the respective parameter. If \code{vf} does not depend on \code{rf}, the option \code{dyn\_rf} will not be useful any more. It should be disabled in such cases to save computation time.

\begin{example}
# Define vf
var_fun_int <- function (para_0, fun_length, rf, temp = NA) {
    ret_var_fun <- para_0 + sample.int(rf, fun_length, replace = TRUE) *
        ((rbinom(fun_length, 1, 0.5) * -2) + 1)
    return (ret_var_fun)
}

# Call optim_sa
int_programming <- optimization::optim_sa(fun = hi, start = c(10, 10), trace = TRUE,
                                          lower = c(-40, -40), upper=c(40, 40),
                                          control = list(t0 = 500, nlimit = 50,
                                          r = 0.85, rf = 3, ac_acc = 0.1,
                                          dyn_rf = TRUE, vf = var_func_int))
\end{example}

Calling the minimization function times led to the only integer solution \{3, 2\}. The generic plot function of \CRANpkg{optimization} (Figure~\ref{fig:fig2}) helps interpreting the convergence and thus the solution quality as well as the algorithm behavior. Since the example is 2-dimensional, a \code{contour\_plot} (Figure~\ref{fig:fig2}. Right) could be created. It shows the state space of the Himmelblau function in continuously changing colors for parameter spaces between -4 and 4. The results at the end of each outer loop iteration are shown as points within the parameter space. It became obvious that only integer covariate combinations were calculated during the entire optimization (Figure~\ref{fig:fig2}, right). Figure~\ref{fig:fig2} additionally helps explaining the stochastic part of the SA method. It becomes clear that though the response of iteration 10 (parameter combination \{0, 0\}) was already quite near 0, following iteration ended with relatively worse intermediate results. With response > 100, iteration 19 (parameter combination \{2, -3\}) was much worse than iteration 12. The likelihood of keeping worse solutions shrank over time till it became practically 0 after iteration 40 (Figure~\ref{fig:fig2}, left). For problems of such kind, 40 iterations therefore should be far enough. This information could help the user parameterizing the function for the next problem.

\code{optim (SA)} also provides possibility of defining a variation function via the \code{gr}. Using the variation function from the example code with fixed \code{rf}, however, it was not possible to define robust parameters for solving an integer programming problem. \code{optim\_sa} is thus advantageous against all other examined functions when the parameter space of the optimization problem underlies extended restrictions. It is the only function that enables global optimum integer search. To achieve this, the user must define a changing rule in form of a R function. Users must thus examine their optimization problem very carefully and translate it into suitable functions. Parameterization of \code{optim\_sa} is quite complex and time extensive but enables adopting the optimizer to problem specific needs. Other restrictions such as mixed integer problems may be defined analogously. \code{vf} can thus be used to flexibly restrict the parameter space. Further hints for interpretation can be found in the package documentation.

\begin{figure}[htbp]
\centering
\includegraphics[width=1.025\textwidth]{Fig/fig2-ex2-plot-eps-converted-to.pdf}
\caption{Exemplary examination plots created with the generic plot function. The left diagram shows the current optimal response over iteration of the outer loop. The left diagram displays the succession of the covariate values. The star points the covariate combination at optimum. The actual parameter space of the optimization is reduced for presentation purposes.}
\label{fig:fig2}
\end{figure}

\subsection{Structural vector autoregressive models with least dependent innovations}
\dfn{Vector autoregressive} (VAR) models are used to examine dynamic structures of several time series endogenously within one model. Usually, impulse response functions come to use for investigating structural shocks in multivariate time series systems. Those impulse responses need a decomposition of the covariance matrix, and hence, they are not unique. The interaction between the time series variables in a VAR model are not directly calculable without further assumptions or informations. The idea behind \dfn{structural} VAR (SVAR) is to define strategies to overcome this problem and enable unique impulse response definitions \citep{lutkepohl_2006}.

Least dependent innovations is one of plenty possibilities to obtain unique decompositions of the covariance matrix. The idea is to minimize a Cramer-von-Mises test statistic for stochastically independent structural errors \citep{genest_2007}. Basically, the structural errors are multiplied with $K(K-1)/2$ Givens rotation matrices with angles $\theta$ between \textit{0} and $2\pi$, until the test statistic from equation \eqref{eq: cvm} becomes minimal. \textit{K} is the number of time series in the SVAR model, \textit{T} the number of observations and $\tilde{\varepsilon}$ are the rotated structural errors. An advantage of this method is, that it is very liberal in sense of distribution assumptions \citep{herwartz_2014}.
\begin{equation}\label{eq: cvm}
\mathcal{B}_\theta = \int_{(0,1)^K}\left[\sqrt{T}\left(C(\tilde{\varepsilon}) - \prod^K_{i = 1}U(\tilde{\varepsilon}_i)\right)\right]^2d\tilde{\varepsilon}
\end{equation}
However, optimizing the test statistic is challenging, because the loss function has a large and unpredictable number of local optima and an irregular state space. Moreover, the function shows a relatively broad parameter space with very sensitive responses. It often reacts dramatically to small changes in the rotation angles. One approach to solve the optimization problem is the \dfn{grid optimization} \citep{herwartz_2015}. Grid optimization is, however, inefficient in large samples and -- depending on the step size -- sometimes unable to find the global optimum. Monte Carlo simulations have shown that direct search and gradient based optimization algorithms usually fail to find the optimal rotation angles. Performing several direct search optimizations sequentially with randomly changing starting values can be lead to satisfying solutions. The strategy is, however, also very inefficient.

The irregular response pattern and the high demands on the solution precision are the main challenges. The combination of SA with the adaptive variation of the search seems to be promising for the optimization problem. The relatively broad parameter variation at the beginning of the optimization process should ensure adequate search-grid size while the stochastic component of the SA should be able to tackle the irregular responses. Dynamically decreasing parameter variation is particular advantageous for optimizing the rotation angles as sufficient parameter accuracy is crucial for the reliability of the solution.

Practical application of \code{optim\_sa} revealed its advantages. Using parameterizations for low performance and high precision, e.g. an initial temperature of 100,000, a maximum number of 10,000 inner loops and temperature reductions between 0.7 and 0.9, still converged faster than all other investigated methods. Sensitivity analysis of exemplary optimization problems confirmed the general applicability of \code{optim\_sa}. The function was able to find the actual global optimum sufficiently often.

\subsection{Forest harvesting schedule optimization}
Forestry is traditionally a knowledge-based field with optimization playing only a minor role. However, optimization methods are very interesting for intermediate-term forest planning. Their popularity is increasing \citet{mohring_2010}. While Linear Programs are nowadays used in some states, e.g. Finland \citep{redsven_2012}, stochastic optimization programs are quite novel in forestry for optimization of harvesting intensity \citep{kangas_2015}. Our function is an integral part of the first stochastic optimization software of forest operation in Germany and one of the first software solutions worldwide on single tree scale. Optimization of forest harvesting planning represents an interesting and innovative practical example where \code{optim\_sa} is recently used. \CRANpkg{optimization} is part of the Forest Planner based decision support software for forest enterprises by \citet{hansen_2014}. Forest Planner is a user front end for \dfn{Tree Growth Open Source Software} (TreeGrOSS), which is a complex Java written tree growth and yield simulation software used to forecast the developments of forest management areas (forest sites). It is a tool able to simulate forest enterprises with hundreds of forest stands simultaneously where the smallest simulation element is the single tree. Each tree in the system is thus simulated individually. Optimization of forest activities is accordingly not trivial since TreeGrOSS is a complex network of rules and functions which are predominantly nonlinear. The entire optimization process is composed of TreeGrOSS, \CRANpkg{optimization}, an interface function to enable communication between TreeGrOSS and the optimizer as well as a data warehouse. This multi-factorial composition implies high demand for flexibility of the optimization function. The loss function, which represents in this example the interface between the three elements of the optimization system, must be composed of R, Java (using \CRANpkg{rJava}) and SQL (using \CRANpkg{RPostgreSQL}). Main tasks of the interface are enabling communication between TreeGrOSS and optimization algorithm and rating the TreeGrOSS output in terms of costs and revenue such that the TreeGrOSS output is translated into a response readable by the optimization algorithm. Flexibility of loss and variation functions is hence a prerequisite for forest harvesting schedule optimization via TreeGrOSS simulations. Each loss function call causes a TreeGrOSS simulation and a database operation. In order to save time, parts of the loss are therefore programmed parallel. The response is, accordingly, nonlinear and non-continuous. Random search methods are thus favorable for the problem. The concept of sustainability plays a central role in forestry. Forests must be treated such that recent and further generations can benefit from their economic, ecologic and social functions. Particular harvesting operations must not exceed the sustainable wood growth. Each function call returns, next to the actual response, a sustainability index. This index is used to restrict the optimization by returning \code{NA} responses whenever sustainability is violated. Sustainability is hence an important restriction forest schedule optimization that further increases complexity of the loss function.

The example again reveals the complexity of the problem and further reinforces the need for specific stochastic methods. Only \CRANpkg{optimization} was able to solve the complex loss function of the forest harvesting schedule problem. Sensitivity analysis using an exemplary forest enterprise comprised of five forest sites, with known global maximum, reinforced reliability of \code{optim\_sa} for harvesting optimization. A solution sufficiently near the global maximum was found in arguable time on a standard personal computer. To test the practical usability, the optimization system was additionally tested on a real forest enterprise with 100 forest sites. The problem was solved using a high performance cluster computer. Repeating the optimization three times led to equal responses but different harvesting volumes (thus covariate combinations).

\section{Discussion and outlook}
In conclusion, SA methods have considerable advantages over classical optimization procedures if the loss function is non-linear and multimodal. Our simulated annealing approach optimizes these functions quite well. It might not be faster than the standard simulated annealing approach of the \pkg{stats} package, but that was not its purpose. However, due to the shrinkage of the parameter space and the extra stopping criteria it manages to require fewer steps than the standard approach. Additionally, the main advantage of our new approach is to deal with irregular parameter spaces, as well as with non-continuous and sophisticated loss functions, where the standard approach reaches its limits. In our examples we show that problems with these characteristics exist and that our algorithm solves these problems. Furthermore there are several examples of loss functions in natural science where some combinations of variables are restricted and our approach shows its benefits.

\section{Acknowledgements}
The authors are grateful to the Project Management Jülich of the Federal Ministry of Education and Research (Germany) for funding this research through the Cluster \textit{Bio-Economy} (031A294). We also want to thank Michael Hill for reviewing the language of the manuscript.
\bibliography{husmann}

\address{Kai Husmann\\
	Department of Forest Economics and Forest Management\\
    University of Goettingen\\
    Germany\\}
\email{khusman@gwdg.de}

\address{Alexander Lange\\
	Chair of Econometrics\\
    University of Goettingen\\
    Germany\\}
\email{alexander.lange@stud.uni-goettingen.de}

\address{Elmar Spiegel\\
	Chair of Statistics\\
    University of Goettingen\\
    Germany\\}
\email{espiege@uni-goettingen.de}