\documentclass[article, nojss]{jss}

%\VignetteIndexEntry{Introduction to genlasso}
%% need no \usepackage{Sweave.sty}

\usepackage{tikz}
\usetikzlibrary{shapes,arrows}
\usepackage[utf8x]{inputenc}

\usepackage{amsfonts,amsmath,amssymb,amsthm}
\usepackage{verbatim,float}
\usepackage{graphicx,subfigure,url}
\usepackage{dsfont,bm,color,multirow}

\newtheorem{algorithm}{Algorithm}
\newtheorem{theorem}{Theorem}
\newtheorem{lemma}{Lemma}
\newtheorem{corollary}{Corollary}
\newtheorem{conjecture}{Conjecture}
\newcommand{\argmin}{\mathop{\mathrm{argmin}}}
\newcommand{\argmax}{\mathop{\mathrm{argmax}}}
\newcommand{\minimize}{\mathop{\mathrm{minimize}}}
\newcommand{\Hrule}{\noindent\rule{\linewidth}{0.1mm}}

\def\E{\mathrm{E}}
\def\P{\mathrm{P}}
\def\Cov{\mathrm{Cov}}
\def\sign{\mathrm{sign}}
\def\tr{\mathrm{tr}}
\def\col{\mathrm{col}}
\def\row{\mathrm{row}}
\def\nul{\mathrm{null}}
\def\rank{\mathrm{rank}}
\def\nuli{\mathrm{nuli}}
\def\half{\frac{1}{2}}
\def\hbeta{\hat\beta}
\def\hu{\hat{u}}
\def\hy{\hat{y}}
\def\tbeta{\tilde{\beta}}
\def\tu{\tilde{u}}
\def\ty{\tilde{y}}
\def\tA{\widetilde{A}}
\def\tD{\widetilde{D}}
\def\tG{\widetilde{G}}
\def\tP{\widetilde{P}}
\def\tQ{\widetilde{Q}}
\def\tR{\widetilde{R}}
\def\lone{1}
\def\ltwo{2}
\def\linf{\infty}
\def\cA{\mathcal{A}}
\def\cB{\mathcal{B}}
\def\cF{\mathcal{F}}
\def\cG{\mathcal{G}}
\def\cL{\mathcal{L}}
\def\cM{\mathcal{M}}
\def\cN{\mathcal{N}}
\def\cS{\mathcal{S}}
\def\cT{\mathcal{T}}
\def\T{^T}
\def\R{\mathbb{R}}
\def\old{\mathrm{old}}
\def\wBox{{\color{white} \Box}}
\def\sw{\setlength{\arraycolsep}{1pt}}
\def\sh{\renewcommand{\arraystretch}{0.75}}

\DefineVerbatimEnvironment{Sinput}{Verbatim} {xleftmargin=2em}
\DefineVerbatimEnvironment{Soutput}{Verbatim}{xleftmargin=2em}
\DefineVerbatimEnvironment{Scode}{Verbatim}{xleftmargin=2em}
\fvset{listparameters={\setlength{\topsep}{0pt}}}
\renewenvironment{Schunk}{\vspace{\topsep}}{\vspace{\topsep}}

\author{Taylor B. Arnold, Ryan Tibshirani}
\Plainauthor{Taylor Arnold, Ryan Tibshirani}

\title{Introduction to the \pkg{genlasso} package}
\Plaintitle{Introduction to the genlasso package}
\Abstract{
We present a short tutorial and introduction to using the \proglang{R} package
\pkg{genlasso}, which is used for computing the solution path of the generalized lasso
problem discussed in \cite{genlasso}. Use cases of the generalized lasso include the
fused lasso over an arbitrary graph, and trend fitting of any given polynomial
order. Our implementation includes a function to solve the generalized lasso is its most
general form, as well as special functions to cover the fused lasso and trend filtering
subproblems. The general implementation maintains and updates a matrix factorization to
successively solve related linear systems; the specialized implementations forego this
update routine and exploit subproblem structure, which improves both stability and speed
of the computation. Standard S3 methods such as \code{plot}, \code{predict}, and \code{coef}
are also included to assist in studying the output solution path objects.
}
\Keywords{generalized lasso, path algorithm, fused lasso, trend filtering}

\Address{
  Taylor B. Arnold\\
  AT\&T Labs Research\\
  33 Thomas Street\\
  New York, NY 10007\\
  Email: \email{taylor@research.att.com}

  Ryan Tibshirani\\
  Department of Statistics\\
  Carnegie Mellon University\\
  Email: \email{ryantibs@cmu.edu}\\
}

<<initial_settings, echo=FALSE>>=
options(stringsAsFactors=FALSE)
options(width=65)
options(prompt="> ")
@

\begin{document}

\section{Introduction}
\label{sec:intro}

A recent paper by \cite{genlasso} introduced the generalized lasso path
algorithm, which computes the solution path to the following optimization problem:
\begin{equation}
\label{eq:genlasso}
\hbeta \in \argmin_{\beta \in \R^p} \, \half \|y-X\beta\|_\ltwo^2 +
\lambda \|D\beta\|_\lone,
\end{equation}
where $y\in\R^n$ is a response vector, $X\in\R^{n\times p}$ is a matrix
of predictor variables, $D\in\R^{m\times p}$ is a penalty matrix, and
$\lambda \geq 0$ is a regularization parameter.
In other words, the path algorithm computes the solution $\hbeta=\hbeta(\lambda)$
as a function of the parameter $\lambda$. This is done by solving the equivalent
Lagrange dual problem,
\begin{equation}
\label{eq:xdual}
\hu \in \argmin_{u \in \R^m} \, \|XX^+ y - (X^+)\T D\T u\|_\ltwo^2 \;\;
\text{subject to}\; \|u\|_\linf \leq \lambda,
\end{equation}
where $X \in \R^{n\times p}$ is assumed to have full column rank\footnote{If the
predictor matrix $X$ does not have full column rank (occurring, say, if $p>n$)
then a small ridge penalty can be added to the criterion in \eqref{eq:genlasso},
and the problem can be subsequently rewritten as a generalized lasso problem with
a new full column rank predictor matrix.} and $X^+ = (X\T X)^{-1} X\T$ is the
Moore-Penrose generalized inverse of $X$. The relationship between the primal and dual
solutions is
\begin{equation}
\label{eq:xpd}
\hbeta = (X\T X)^{-1} (X\T y - D\T \hu),
\end{equation}
and hence solving for $\hu$ in \eqref{eq:xdual} yields the solution $\hbeta$
in \eqref{eq:genlasso} via the above relationship. When $X=I$, often called the
``signal approximator'' case, the dual and primal-dual relationship can be simplified,
\begin{equation}
\label{eq:dual}
\hu \in \argmin_{u \in \R^m} \, \|y - D\T u\|_\ltwo^2 \;\;
\text{subject to}\; \|u\|_\linf \leq \lambda,
\end{equation}
and
\begin{equation}
\label{eq:pd}
\hbeta = y - D\T \hu.
\end{equation}
The dual problem \eqref{eq:dual}, and more generally \eqref{eq:xdual} when $X\not=I$, is
``easier'' to consider because loosely speaking the non-differentiable box constraint is free
of any linear transformation. Computing the solution path of \eqref{eq:pd} (and of \eqref{eq:xpd})
essentially reduces to solving several linear systems that differ by one more or one less
variable. This is much like the lasso solution path. See \cite{genlasso} for details.

The problem \eqref{eq:genlasso} is solvable at a fixed value of $\lambda$ (or a fixed sequence
of $\lambda$ values) by several convex optimization tools, both open source and proprietary.
The novelty here is that the solution is computed for all values of the tuning parameter $\lambda$
simultaneously. Perhaps not surprisingly, computation of the full path does not scale as
efficiently as does computation at individual $\lambda$ values. The purpose of this package is
not to provide a solver for large scale generalized lasso problems, but instead to deliver the
entire solution path when computationally amenable.

There are three main functions: \code{genlasso},
\code{fusedlasso}, and \code{trendfilter}. The first function computes the solution path of
any problem of the general form \eqref{eq:genlasso}; the latter two are specialized functions
designed to compute the solution path of the fused lasso and trend filtering subproblems,
respectively (additionally, wrapper functions \code{fusedlasso1d} and \code{fusedlasso2d} are
available to fit the fused lasso over 1d and 2d grids, respectively).
These latter functions should be used whenever possible, as they provide a significant
improvement in both computational efficiency and numerical stability.

This vignette is intended to get new users quickly up to speed on using the
\pkg{genlasso} package for statistical modelling. Sections \ref{sec:lasso}--\ref{sec:tf}
give short code snippets for common use cases of the package, solving fused lasso problems,
trend filtering problems, and using cross-validation.
We discuss the details of the various implementation strategies used in
this package, as well as provide empirical results on computationally accuracy and run times
of the various strategies to help assist users in deciding what problems can and cannot be
reasonably solved with the path algorithm, in a seperate paper \cite{arnold2016efficient}.

\section{The standard lasso}
\label{sec:lasso}

As a first toy example to using the \pkg{genlasso} package, we show how to compute the
solution path of the usual lasso optimization problem. (This is intended as an example only,
and we do not recommend this strategy for solving lasso problems!)

We first generate some data, consisting of a predictor matrix $X$ with $10$ variables and
$100$ observations generated by independent draws from a standard normal distribution.
The response vector $y$ is made to be a noisy version of the first column of the data
matrix X:
<<eq1a, echo=TRUE>>=
library("genlasso")
set.seed(1)
n = 100
p = 10
X = matrix(rnorm(n*p), ncol=p)
y = X[,1] + rnorm(n)
@
In order to write the standard lasso as a generalized lasso problem \eqref{eq:genlasso},
we construct a penalty matrix $D$ equal to the $10$-dimensional identity matrix:
<<eq1b, echo=TRUE>>=
D = diag(1,p)
@
Now we can run the path solution for the (generalized) lasso:
<<eq1c, echo=TRUE>>=
out = genlasso(y, X=X, D=D)
@
Like the \code{lm} function in the \pkg{stats} package, the output of the generalized lasso
has a compact standard plot output. It gives the function call information as well as the
length of the computed path:
<<eq1d, echo=TRUE>>=
out
@
More information about each step, including an unbiased estimate of the degrees of freedom
of the fit and the residual sum of squares, can be obtained by printing the summary of the
output:
<<eq1e, echo=TRUE>>=
summary(out)
@
A simple plot of the solution path can be produced by calling the \code{plot} function on
the output object:
<<eq1f, include=FALSE>>=
plot(out)
@
% PLOT START
\setkeys{Gin}{width=4.5in,height=4.5in}
\begin{center}
<<label=figeq1f,fig=TRUE,echo=FALSE>>=
<<eq1f>>
@
\end{center}
% PLOT END
Each color shows a coordinate of the solution path over $\lambda$, Note that as
$\lambda$ increases, all coordinates of the primal solution are quickly shrunken to
zero, except the first coordinate, drawn in red.

Finally, it is easy to extract the coefficients $\beta$ for a particular value (or values) of
$\lambda$. Here we calculate the coefficients for the tuning parmeter $\lambda = \sqrt{n\log(p)}$,
as is suggested for model selection consistency by recent theory:
<<eq1g include=FALSE>>=
coef(out, lambda=sqrt(n*log(p)))
@
This is reassuring, as it essentially recovers the true model from which we had simulated
the data. Also, as expected, the lasso has somewhat underestimated the magnitude of the
nonzero coefficient due to the shrinking nature of the $\ell_1$ penalty.

\section{The fused lasso} \label{fgraph}
\label{sec:fused}

\subsection{The 1d fused lasso}
\label{sec:fused1d}

A simple example of a problem which fits naturally into the generalized lasso structure
is the 1d fused lasso. In the common signal approximator case, $X=I$, we assume that the
observed data $y=(y_1,\ldots y_n) \in \R^n$ is generated from a process whose mean changes
at only a smaller number of locations, when ordered sequentially from $1$ to $n$. The goal
is hence to find a piecewise constant vector of coefficients, of course, fitting well to $y$.
This is done by solving the following minimization problem:
\begin{equation}
\label{eq:fused1d}
\hbeta = \argmin_{\beta \in \R^n} \, \half \sum_{i=1}^n (y_i-\beta_i)^2 +
\lambda \sum_{i=1}^{n-1} |\beta_{i+1}-\beta_i|,
\end{equation}
which is a version of \eqref{eq:genlasso} with $D$ equal to the matrix of first
differences:
\begin{equation}
\label{eq:d1d}
D = \left[\begin{array}{rrrrrr}
-1 & 1 & 0 & \ldots & 0 & 0 \\
0 & -1 & 1 & \ldots & 0 & 0 \\
& & & \ldots & & \\
0 & 0 & 0 & \ldots & -1 & 1
\end{array}\right].
\end{equation}
We generate a data set with four change points:
<<eq2a, include=FALSE>>=
set.seed(1)
n = 100
i = 1:n
y = (i > 20 & i < 30) + 5*(i > 50 & i < 70) +
  rnorm(n, sd=0.1)
@
Now to fit the 1d fused lasso, we simply call the \code{fusedlasso1d} function
(and we pass no $X$ matrix, with indicates that $X=I$):
<<eq2b, include=FALSE>>=
out = fusedlasso1d(y)
@
An alternative method would be to call the \code{genlasso} function with argument
\code{D=getD1d(n)}, where \code{getD1d} is a convenience function to construct
the first difference matrix in \eqref{eq:d1d}. However, this is much less efficient
and less numerically stable, and it is highly recommended to use the speciality
implementations \code{fusedlasso} (wrappers \code{fusedlasso1d} and
\code{fusedlasso2d}) and \code{trendfilter} whenever possible.

The model output works essentially the same as that produced by the \code{genlasso}
function discussed in Section \ref{sec:lasso}, giving printing, summaries, and
coefficients in the exactly the same way. Additionally, a new plot
function is available to show how the piecewise constant model fits the data
at specified values of $\lambda$:
<<eq2c, include=FALSE, echo=TRUE>>=
plot(out, lambda=1)
@
% PLOT START
\setkeys{Gin}{width=4in}
\begin{center}
<<label=fig2c,fig=TRUE,echo=FALSE>>=
<<eq2c>>
@
\end{center}
% PLOT END
The estimate at $\lambda=1$ fits the data fairly well, and also provides
a good estimate of the change point locations. To retrieve coordinate plots as in
Section \ref{sec:lasso}, use the \code{plot} function with argument \code{style="path"}.

\subsection{The 2d fused lasso and arbitrary graphs}

The 1d fused lasso can be extended to an arbitrary graph structure, where the absolute
difference between the coefficients between neighboring nodes is penalized:
\begin{equation}
\label{eq:fused}
\hbeta = \argmin_{\beta \in \R^n} \, \half \sum_{i=1}^n (y_i-\beta_i)^2 +
\lambda \sum_{(i,j)\in E} |\beta_i-\beta_j|,
\end{equation}
where $E$ is the edge set of the graph. Note that this is still a generalized lasso
problem \eqref{eq:genlasso}, with the penalty matrix $D$ still a difference matrix
between the appropriate pairs of coefficients.
In the graph framework, underlying the 1d fused lasso is the chain graph, or 1d grid.
Another common structure is the 2d grid, and with this underlying graph problem
\eqref{eq:fused} is called the 2d fused lasso, a technique used for image denoising.

We generate data in the form of a 16 by 16 grid of points; the mean of the data is
equal to $2$ for points within a distance of $4$ from the middle of the grid, and $1$ for
all other points.
<<eq8a, echo=true>>=
set.seed(1)
y = matrix(runif(256), 16, 16)
i = (row(y) - 8.5)^2 + (col(y) - 8.5)^2 <= 4^2
y[i] = y[i] + 1
@
To compute the solution path of this problem, we call the \code{fusedlasso2d} function:
<<eq8b, echo=true>>=
out = fusedlasso2d(y)
@
Since the input data \code{y} is a matrix, the function knows the dimensions of the 2d grid.
Otherwise we could specify the dimensions as inputs to the function call, \code{dim1} and
\code{dim2}. We can extract 5 solutions along the solution path (evenly spaced on the log
scale):
<<eq8x, echo=TRUE>>=
co = coef(out, nlam=5)
@
and plotting them, along with the original data $y$ (in the top-left corner),
gives a rough understanding of what the solution path looks like:
<<eq8c, echo=false, include=false, height=4.5, width=6>>=
par(mar=c(1,1,2,1),mfrow=c(2,3))
cols = terrain.colors(30)
zlim = range(c(co$beta,y))
image(y,main=expression(y),col=cols,zlim=zlim,axes=FALSE)
for (i in 1:5) {
  image(matrix(co$beta[,i],nrow=16),col=cols,zlim=zlim,
  axes=FALSE)
  mtext(bquote(lambda==.(sprintf("%.3f",co$lambda[i]))))
}
@
% PLOT START
\setkeys{Gin}{width=6in,height=4.5in}
\begin{center}
<<label=fig8c,fig=TRUE,echo=FALSE>>=
<<eq8c>>
@
\end{center}
% PLOT END
We can see that the algorithm does a pretty good job of finding the region with a
different mean, particularly for the 2nd largest value of $\lambda$ displayed in the
plot. With more regularization, the entire image is assigned a constant coefficient;
with less, the coefficients are overly noisy.

Finally,
it is possible to specify a generic underlying graph for the fused lasso problem.
The function \code{fusedlasso} (for which both \code{fusedlasso1d} and
\code{fusedlasso2d} simply act as wrappers) takes either a generic difference matrix
$D$---i.e., a matrix such that each row contains a single $-1$ and $1$ and all $0$s
otherwise---or an \code{igraph} graph object from the \pkg{igraph} package \citep{igraph}.
The function \code{fusedlasso} (as well as \code{fusedlasso1d} and \code{fusedlasso2d})
also takes an optional argument \code{X} in the case that a non-identity predictor matrix
$X$ should be included. As with the \code{genlasso}, the predictor matrix $X$
should have full column rank; however, this is not check here, for efficiency.

\subsection{Sparse fused lasso and soft-thresholding}

A common variant of the fused lasso employs an additional $\ell_1$ penalty on
the coefficients themselves; e.g., the signal approximator fused lasso problem
in \eqref{eq:fused} can be extended as in
\begin{equation}
\label{eq:fusedl1}
\hbeta = \argmin_{\beta \in \R^n} \, \half \sum_{i=1}^n (y_i-\beta_i)^2 +
\lambda \sum_{(i,j)\in E} |\beta_i-\beta_j| + \gamma \cdot \lambda \sum_{i=1}^n |\beta_i|.
\end{equation}
Here $\gamma \geq 0$ is another parameter that controls the ratio between the fusion
and sparsity penalty terms. Note that \eqref{eq:fusedl1} also fits into the generalized
lasso framework, as it simply concatenates (a multiple of) the identity matrix to the rows
of a fused lasso penalty matrix. For a single fixed value of $\gamma$, the solution path of
the sparse fused lasso problem \eqref{eq:fusedl1} can be computed using by setting the
\code{gamma} argument in \code{fusedlasso} (or \code{fusedlasso1d}, \code{fusedlasso2d}).
(Additionally including a non-identity predictor matrix poses no problems, and is done by
setting the \code{X} argument, as before.)

When $X=I$, a particularly simple relationship exists between the solutions of
the problem \eqref{eq:fused} and its sparse variant \eqref{eq:fusedl1}: at any $\lambda$,
the solution of \eqref{eq:fusedl1} is simply given by soft-thresholding the solution of
\eqref{eq:fused} by an amount $\gamma\cdot\lambda$, as shown in \cite{pco}. If the solution
path of \eqref{eq:fusedl1} is desired at several levels of $\gamma$ (or even at a single
nonzero level, actually) it is more efficient to solve the problem with $\gamma=0$ and
then soft-threshold to yield the solutions.

We demonstrate this by revisiting the 1d fused lasso example of Section \ref{sec:fused1d}:
<<eq6a, echo=true>>=
set.seed(1)
n = 100
i = 1:n
y = (i > 20 & i < 30) + 5*(i > 50 & i < 70) +
  rnorm(n, sd=0.1)
out = fusedlasso1d(y)
beta1 = coef(out, lambda=1.5)$beta
@
To compute the solution at $\lambda=1.5$ and $\gamma=1$, we could call \code{fusedlasso1d}
with \code{gamma=1}, or simply soft-threshold:
<<eq6b, echo=true>>=
beta2 = softthresh(out, lambda=1.5, gamma=1)
@
The effect of soft-thresholding is apparent when looking at the original and thresholded
estimates (with the dashed line showing the thresholding level $\gamma\cdot\lambda=1.5$):
<<eq7x, echo=false, include=false>>=
plot(1:n, y, xlab="Position", ylab="Estimates")
abline(h=1.5, lty="dashed")
lines(1:n, beta1)
lines(1:n, beta2, col="red")
legend("topleft",lty=1,col=c("black","red"),
       legend=c(expression(gamma==0),expression(gamma==0.5)))
@
% PLOT START
\setkeys{Gin}{width=4in}
\begin{center}
<<label=fig7x,fig=TRUE,echo=FALSE>>=
<<eq7x>>
@
\end{center}
% PLOT END

\section{Trend filtering}
\label{sec:tf}

\subsection{Beyond piecewise constant fits}

Like the 1d fused lasso, trend filtering in the signal approximator case $X=I$ assumes
that the data $y=(y_1,\ldots y_n) \in \R^n$ is meaningfully ordered from $1$ to $n$,
and fits a piecewise polynomial of a specified degree. For example, linear trend filtering
(with $X=I$) solves the following minimization problem:
\begin{equation}
\label{eq:ltf}
\hbeta = \argmin_{\beta \in \R^n} \, \half \sum_{i=1}^n (y_i-\beta_i)^2 +
\lambda \sum_{i=1}^{n-2} |\beta_i - 2\beta_{i+1} + \beta_{i+1}|.
\end{equation}
Notice that here the discrete second deriative is penalized, as opposed to the discrete
first derivative in the 1d fused lasso criterion \eqref{eq:fused1d}. Quadratic and cubic
trend filtering are defined similarly, by penalizing the discrete third and fourth
derivative, respectively. (In this light, we can think of the 1d fused lasso as constant
or zeroth order trend filtering.)

We generate a data set with a piecewise linear mean:
<<eq3a, include=FALSE>>=
set.seed(1)
n = 100
y = 50 - abs(1:n-n/2) + rnorm(n, sd=5)
@
We use the function \code{trendfilter} to fit trend filtering model of order $1$
(i.e., linear). It has the same plotting functionality as the \code{fusedlasso1d}
function:
<<eq3b, include=FALSE>>=
out = trendfilter(y, ord=1)
plot(out, lambda=n)
@
% PLOT START
\setkeys{Gin}{width=4in}
\begin{center}
<<label=fig3b,fig=TRUE,echo=FALSE>>=
<<eq3b>>
@
\end{center}
% PLOT END
Higher order plots are just as easy, as shown by the following cubic fit
to data generated with a mean given by a sine curve:
<<eq4gib, echo=true, include=FALSE>>=
n = 100
y = sin(1:n/n*2*pi) + rnorm(n, sd=0.3)
out = trendfilter(y, ord=3)
plot(out, lambda=n)
@
% PLOT START
\setkeys{Gin}{width=4in}
\begin{center}
<<label=fig4abc,fig=TRUE,echo=FALSE>>=
<<eq4gib>>
@
\end{center}
% PLOT END

%% For a visual demonstrating of the how the estimates change with increasing
%% order, see Figure \ref{fig:highord}.

%% \begin{figure}
%% \begin{center}
%% \includegraphics[width=0.95\textwidth]{figures/article-trendPlot}
%% \end{center}
%% \caption[fig:highord]{Trend filtering example. Estimates correspond to the
%%   $35$th step of the algorithm.}
%% \label{fig:highord}
%% \end{figure}

Note that trend filtering fits can also be produced by using the \code{genlasso}
with \code{D=getDtf(n,k)}, where \code{getDtf} is a convenience function to
construct a trend filtering penalty matrix of a specified order (\code{k=1} being
linear). However, this is not recommended because the specialized implementation
\code{trendfilter} is significantly more efficient and stable. Trend filtering can
also be fit with a non-identity predictor; simply use the \code{X} argument.

\subsection{Cross-validation}

For automatic choice of $\lambda$ in trend filtering problems there is an easy to use,
pre-built cross-validation function. Here, we again generate some data which that has a
piecewise constant mean, and fit a piecewise constant model:
<<eq5a, echo=true>>=
set.seed(1)
n = 100
y = rep(sample(1:8,5), each=n/5) + rnorm(n, sd=0.8)
out = trendfilter(y, ord=0)
@
(Here we demonstrate that \code{trendfilter} with \code{ord=0} is equivalent to calling
\code{fusedlasso}.) Now we use the function \code{cv.trendfilter} to perform $k$-fold
cross-validation in to choose $\lambda$. This places every $k$th point in the same fold,
so the folds are non-random and calling \code{cv.trendfilter} twice will yield the same
result. The default is $k=10$ folds.
<<eq5b, echo=true, results=hide>>=
cv = cv.trendfilter(out)
@
Now we plot the estimate at the value of $\lambda$ the minimizes the cross-validated error:
<<eq5c, echo=TRUE>>=
plot(out, lambda=cv$lambda.min, main="Minimal CV error")
@
% PLOT START
\setkeys{Gin}{width=4in}
\begin{center}
<<label=fig5c,fig=TRUE,echo=FALSE>>=
<<eq5c>>
@
\end{center}
% PLOT END
Note that there is a considerable amount of noise here and this does not provide a good
estimate of the change points. Using the value of $\lambda$ chosen by the one standard
error rule produces a more regularized estimate:
<<eq5d, echo=TRUE>>=
plot(out, lambda=cv$lambda.1se, main="One standard error rule")
@
\begin{center}
\setkeys{Gin}{width=4in}
<<label=fig5d,fig=TRUE,echo=FALSE>>=
<<eq5d>>
@
\end{center}
This gives a cleaner estimate of the change points.

%% \section{Computational accuracy and speed} \label{comp}

%% We give here a brief overview of the package's computational accuracy and speed. This is by no means
%% exhaustive, with a more thorough analysis of the algorithm's computational complexity and scaling
%% being compiled for use in another document.

%% In terms of accuracy, we compared the output of the path solution at various tuning parameters along the
%% path, with the pointwise solution for the box constrained quadratic program as given by the commercial solver
%% \code{lssol} \citep{gill1986user}. Table \ref{acc} gives the maximum squared $\ell_2$-norm difference between
%% the predicted $\widehat{\beta}$ between two solvers, as well as the worse (for the \code{genlasso} function)
%% ratio between the critical values in the dual problem. We see that the worst solution occurs for a cubic fit, where
%% the generalized lasso has a critical value which is only $0.2$ percent larger than that of the \code{lssol} solver.
%% For the tall and dense matrix, there is a fairly large difference between the predicted $\widehat{\beta}$ of
%% $2^{-1.42}$ and similarly for the two-dimensional fused lasso, $2^{0.684}$;
%% however, it is the generalized lasso which has a lower critical value in both cases so therefore this discrepancy
%% is not of particular concern.

%% \begin{table}
%% \begin{tabular}{l | c |c}
%% & Squared difference in $\beta$ $(\log_2)$ & Critical Value Ratio \\
%% \hline \hline
%% Dense wide ($n = 400, m = 200$) & -88.5 & 1.11e-16 \\
%% Dense tall ($n = 200, m = 400$) & -1.42 & 3.44e-15 \\
%% Sparse wide ($n = 400, m = 200$) & -88.7 & 3.00e-15 \\
%% Trend filter ($n = 400, k = 0$) & -88.2 & 2.62e-14\\
%% Trend filter ($n = 400, k = 3$) & -33.9 & 2.80e-3 \\
%% 2d Fused ($n=m=30$) & 0.684 & 1.89e-15 \\
%% \hline
%% \end{tabular}
%% \label{acc}
%% \caption{Comparison at $50$ points along the solution path between the \pkg{genlasso} package
%% and the \code{lssol} function. The worst difference (for the \pkg{genlasso}) along the solution path is given.}
%% \end{table}

%% For large problems, computing the entire solution path quickly becomes computationally infeasible (see Section
%% \ref{algo} for details). Therefore, in our study of the empirical computational complexity of our implementation,
%% we fix the number of steps the algorithm traverses to be $100$. We then look at the scaling rate of computing
%% this first part of the path, which is (for large problems) often the most important part anyway. Our study conducted
%% 50 simulations from 6 different scenarios. The results of the computational tests are given in Figure \ref{comp}.


%% \begin{figure}
%% \begin{center}
%% \includegraphics[width=0.95\textwidth]{figures/article-compTime}
%% \end{center}
%% \caption{Empirical computational complexity for various generalized lasso ensembles on the log-log scale. The
%% x-axis gives the number of observations; wide matrices are twice as wide as tall and tall matrices are twice as tall
%% as wide. The red line is the log-log least squares fit, and the printed number is the estimated slope of the line. }
%% \label{comp}
%% \end{figure}

%% \section{The path algorithm} \label{algo}

%% The dual path algorithm for solving Equation \ref{dual} for all positive
%% values of $\lambda$, can be described as follows:
%% \begin{algorithm}[\textbf{Dual path algorithm for the generalized lasso}]
%% \label{alg:dualpath}
%% \hfill\par
%% \smallskip
%% \smallskip
%% Given $y \in \R^n$ and $D \in \R^{m\times n}$.
%% \begin{enumerate}
%% \item Compute the minimal $\ell_2$ norm minimizer of $\|y-D\T u\|^2$,
%% call this $\hu$.
%% \item As $\lambda$ decreases (from $\infty$), determine the value of
%% $\lambda$ at which a coordinate of the solution $\hu$ will hit the boundary
%% of the constraint region (the box $[-\lambda,\lambda]^m$).
%% Denote this by $\lambda_1$, and the hitting coordinate by $i$. Let
%% $\cB=\{i\}$, $s=\sign(\hu_i)$, and $k=1$.
%% \item While $\lambda_k>0$:
%% \begin{enumerate}
%% \item Compute the minimal $\ell_2$ norm minimizers of
%% \begin{equation*}
%% \|y-D_{-\cB}\T \alpha\|^2 \;\;\;\text{and}\;\;\;
%% \|D_\cB\T s - D_{-\cB}\T \gamma\|^2,
%% \end{equation*}
%% call these $\hat{\alpha}$ and $\hat{\gamma}$, respectively.
%% \item As $\lambda$ decreases (from $\lambda_k$), set $\hu_{-\cB}=\hat{\alpha}-
%% \lambda\hat{\gamma}$ and $\hu_\cB=\lambda s$. Determine the first value of $\lambda$
%% at which either: (i) a coordinate of the solution will hit the boundary,
%% or (ii) a coordinate of the solution must leave the boundary. Call this
%% $\lambda_k$, and update $\cB,s$ appropriately:
%% in the first case, add the hitting coordinate to $\cB$ and its sign to $s$,
%% else remove the leaving coordinate from $\cB$ and its sign from $s$. Increment
%% $k$.
%% \end{enumerate}
%% \end{enumerate}
%% \end{algorithm}

%% The main computational effort lies in Steps 1 and 3(a). To summarize: starting
%% with $\cB=\emptyset$, we repeatedly minimize least squares problems of the form
%% $\|c-D_{-\cB}\T x\|^2$ in $x$---which is the same as solving $D_{-\cB}D_{-\cB}\T
%% x = D_{-\cB} c$---as elements are added to or deleted from $\cB$, that is,
%% $D_{-\cB}$ either decreases by one row or increases by one row.
%% A caveat is that we always require the minimal $\ell_2$ norm
%% solution (this distinction is really only important when the solution is not
%% unique). From a computational point of view, Steps 2 and 3(b) are straightforward,
%% since they utilize the results of Steps 1 or 3(a) in a simple way.

%% A naive implementation would simply solve each of these least squares problem
%% independently, as we move from one iteration to the next. Each iteration would then
%% require $O(|\cB|^2 n)$ operations if $|\cB|\leq n$, or $O(|\cB|n^2)$ operations if
%% $|\cB|>n$. Alternatively, we could compute a QR decomposition of $D$ or $D\T$
%% to solve the initial least squares problem (depending on the dimensions of $D$),
%% and then update this decomposition as rows are removed from or added to $D_{-\cB}$
%% in order to solve the subsequent problems. With this strategy, each
%% iteration requires $O(\max\{|\cB|^2,n^2\})$ or $O(|\cB|n)$ operations
%% (depending on whether we are updating a
%% QR decomposition of $D$ or $D\T$). This improves upon the cost of the naive
%% strategy by essentially an order of magnitude.

%% The QR-based strategy assumes nothing about the penalty matrix $D$.
%% For certain problems, we can use more specialized implementations
%% that take advantage of the structure of $D$. We present
%% two such specialized implementations, for trend filtering problems in, and for fused lasso
%% problems.
%% In the former case, each iteration of the specialized implementation
%% requires $O(|\cB|)$ operations. In the latter case,
%% the specialized implementation does not admit a tight bound on the cost of
%% each iteration [a very loose upper bound is $O(n^3)$], but tends to work
%% efficiently in practice. It is important to
%% remind the reader that, in the presence of a (full column-rank) design
%% matrix $X$ in Equation \ref{primal}, the dual path algorithm
%% operates on the penalty matrix $DX^+$ (in place of $D$).
%% Generally speaking, multiplication by $X^+$ does not preserve structure of
%% $D$, and therefore the implementation in for arbitrary penalty matrices must be used.

\bibliography{bibfile}

\end{document}