Introduction
Below are my notes for the course 21-690: Methods of Optimization taught in the Spring 2025 semester by Professor Nicholas Boffi at Carnegie Mellon University.
Convex Sets
Affine and Convex Sets
Definition (Affine). A set $C \subseteq \mathbb{R}^n$ is said to be affine if for all $x, y \in C$, we have that $\theta x + (1 - \theta) y \in C$ for all $\theta \in \mathbb{R}$.
Geometrically, affine sets are sets in which the line formed by any two points in the set is entirely contained in the set as well.
Definition (Affine Combination). An affine combination of $x_1, \dots, x_k \in \mathbb{R}^n$ is a linear combination
$$\theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_k x_k,$$
where $\theta_i \in \mathbb{R}$ for all $i$ and $\sum_{i=1}^k \theta_i = 1$.
Proposition. Let $C$ be an affine set. Then, any affine combination of points of $C$ is contained in $C$.
Proof. By induction on the number of points $k$.
Definition (Affine Hull). The affine hull of a set $C \subseteq \mathbb{R}^n$ is the set
$$\operatorname{aff}(C) = \left\{ \theta_1 x_1 + \cdots + \theta_k x_k : x_i \in C, \ \sum_{i=1}^k \theta_i = 1 \right\}.$$
Exercise. Prove that the affine hull of $C$ is the smallest affine set containing $C$.
Affine sets (other than single points) are unbounded, as lines are unbounded. We can obtain bounded sets by considering only line segments rather than entire lines. Herein lies the idea behind convexity.
Definition (Convex). A set $C \subseteq \mathbb{R}^n$ is said to be convex if for all $x, y \in C$, we have that $\theta x + (1 - \theta) y \in C$ for all $\theta \in [0, 1]$.
Geometrically, convex sets are sets in which the line segment formed by any two points in the set is entirely contained in the set as well. Note that we only consider line segments by restricting our coefficients to $[0, 1]$.
We can similarly extend the idea of affine combinations and affine hulls to convexity.
Definition (Convex Combination). A convex combination of $x_1, \dots, x_k \in \mathbb{R}^n$ is a linear combination
$$\theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_k x_k,$$
where $\theta_i \geq 0$ for all $i$ and $\sum_{i=1}^k \theta_i = 1$.
Proposition. Let $C$ be a convex set. Then, any convex combination of points of $C$ is contained in $C$.
Proof. By induction on the number of points $k$.
Definition (Convex Hull). The convex hull of a set $C \subseteq \mathbb{R}^n$ is the set
$$\operatorname{conv}(C) = \left\{ \theta_1 x_1 + \cdots + \theta_k x_k : x_i \in C, \ \theta_i \geq 0, \ \sum_{i=1}^k \theta_i = 1 \right\}.$$
Exercise. Prove that the convex hull of $C$ is the smallest convex set containing $C$.
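The following is a small numerical sketch (my own illustration, not part of the original notes) of the fact that convex combinations of a finite point set stay inside its convex hull; it assumes numpy and scipy are available.

```python
# Random convex combinations of a finite point set always land inside the convex
# hull of that set. Hull membership is checked with scipy.spatial.Delaunay, whose
# find_simplex returns -1 for points outside the triangulated hull.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
points = rng.standard_normal((20, 2))        # 20 points in R^2
hull = Delaunay(points)                      # triangulation of conv(points)

for _ in range(1000):
    theta = rng.random(len(points))
    theta /= theta.sum()                     # theta_i >= 0, sum theta_i = 1
    x = theta @ points                       # convex combination of the points
    assert hull.find_simplex(x[None, :])[0] >= 0   # x lies in the convex hull
print("all sampled convex combinations lie in conv(points)")
```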
Examples
Cones
Definition (Cone). A set $C \subseteq \mathbb{R}^n$ is a cone if for all $x \in C$, $\theta x \in C$ for all $\theta \geq 0$.
The traditional image of a “cone” is itself a cone, if extended to infinity.
Not all cones are convex: the union of two different lines is a cone, but not convex.
Hyperplanes and Halfspaces
Definition (Hyperplane). Let $a \in \mathbb{R}^n \setminus \{0\}$ and $b \in \mathbb{R}$. The set
$$\{x \in \mathbb{R}^n : a^\top x = b\}$$
is a hyperplane.
Geometrically, hyperplanes are lines in $\mathbb{R}^2$ and planes in $\mathbb{R}^3$.
Proposition. All hyperplanes are affine.
Proof. Consider the hyperplane
$$H = \{x \in \mathbb{R}^n : a^\top x = b\}.$$
Consider $x, y \in H$ and $\theta \in \mathbb{R}$. Then,
$$a^\top(\theta x + (1 - \theta) y) = \theta a^\top x + (1 - \theta) a^\top y = \theta b + (1 - \theta) b = b.$$
Hence, $\theta x + (1 - \theta) y \in H$ and so $H$ is affine.
Definition (Halfspace). Let $a \in \mathbb{R}^n \setminus \{0\}$ and $b \in \mathbb{R}$. The set
$$\{x \in \mathbb{R}^n : a^\top x \leq b\}$$
is a halfspace.
Geometrically, halfspaces are one side of a hyperplane.
Proposition. All halfspaces are convex.
Balls and Ellipsoids
Definition (Ball). A ball is a set
$$B(x_c, r) = \{x \in \mathbb{R}^n : \|x - x_c\|_2 \leq r\},$$
where $x_c \in \mathbb{R}^n$ is the center and $r > 0$ is the radius.
Proposition. All balls are convex.
Proof. Consider some ball $B(x_c, r)$. Take $x, y \in B(x_c, r)$ and $\theta \in [0, 1]$. Let $z = \theta x + (1 - \theta) y$. We wish to show that $z \in B(x_c, r)$. Observe that by the triangle inequality,
$$\|z - x_c\|_2 = \|\theta (x - x_c) + (1 - \theta)(y - x_c)\|_2 \leq \theta \|x - x_c\|_2 + (1 - \theta)\|y - x_c\|_2 \leq \theta r + (1 - \theta) r = r,$$
hence $z \in B(x_c, r)$.
Definition (Ellipsoid). An ellipsoid is a set
$$\mathcal{E} = \{x \in \mathbb{R}^n : (x - x_c)^\top P^{-1}(x - x_c) \leq 1\},$$
where $x_c \in \mathbb{R}^n$ and $P \in \mathbb{S}^n_{++}$ (i.e. $P$ is symmetric positive definite).
Proposition. All ellipsoids are convex.
Proof. Consider some ellipsoid
$$\mathcal{E} = \{x \in \mathbb{R}^n : (x - x_c)^\top P^{-1}(x - x_c) \leq 1\}.$$
Take $x, y \in \mathcal{E}$ and $\theta \in [0, 1]$. Let $z = \theta x + (1 - \theta) y$. We wish to show that $z \in \mathcal{E}$. Observe that since $P^{-1} \succ 0$, the map $u \mapsto \|u\|_{P^{-1}} = \sqrt{u^\top P^{-1} u}$ is a norm, so by the triangle inequality,
$$\|z - x_c\|_{P^{-1}} = \|\theta(x - x_c) + (1 - \theta)(y - x_c)\|_{P^{-1}} \leq \theta \|x - x_c\|_{P^{-1}} + (1 - \theta)\|y - x_c\|_{P^{-1}} \leq \theta + (1 - \theta) = 1.$$
Hence, $z \in \mathcal{E}$, and so all ellipsoids are convex.
Polyhedra
Definition (Polyhedron). A polyhedron is a set
$$P = \{x \in \mathbb{R}^n : a_i^\top x \leq b_i, \ i = 1, \dots, m, \ c_j^\top x = d_j, \ j = 1, \dots, p\}.$$
Polyhedra are typically presented with the notation
$$P = \{x \in \mathbb{R}^n : Ax \preceq b, \ Cx = d\},$$
where $A \in \mathbb{R}^{m \times n}$, $C \in \mathbb{R}^{p \times n}$, and $\preceq$ denotes componentwise inequality.
Geometrically, polyhedra are the intersection of a finite number of hyperplanes and halfspaces.
Operations that Preserve Convexity
Intersection
Proposition. Intersection preserves convexity.
Proof. Let $\mathcal{A}$ be an index set such that for all $\alpha \in \mathcal{A}$, $C_\alpha$ is a convex set. Let $C = \bigcap_{\alpha \in \mathcal{A}} C_\alpha$. We claim that $C$ is convex.
Consider $x, y \in C$ and $\theta \in [0, 1]$. Set $z = \theta x + (1 - \theta) y$. It suffices to show that $z \in C$.
By definition of $C$, $x, y \in C_\alpha$ for all $\alpha \in \mathcal{A}$. By convexity of $C_\alpha$, we have that $z \in C_\alpha$ for all $\alpha \in \mathcal{A}$. Hence, $z \in C$, as desired.
Exercise. The positive semidefinite cone, $\mathbb{S}^n_+ = \{X \in \mathbb{S}^n : X \succeq 0\}$, is convex.
Solution. Observe that
$$\mathbb{S}^n_+ = \bigcap_{z \in \mathbb{R}^n} \{X \in \mathbb{S}^n : z^\top X z \geq 0\}.$$
Every set on the right-hand side is convex. The intersection of convex sets is convex, hence the positive semidefinite cone is convex.
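As a quick numerical sanity check (my own illustration, assuming numpy is available), a convex combination of two randomly generated positive semidefinite matrices should again be positive semidefinite:

```python
# Convex combinations of PSD matrices stay in the PSD cone: all eigenvalues of
# theta*X + (1-theta)*Y remain nonnegative (up to floating-point error).
import numpy as np

rng = np.random.default_rng(1)

def random_psd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T                           # A A^T is always PSD

X, Y = random_psd(5), random_psd(5)
for theta in np.linspace(0.0, 1.0, 11):
    Z = theta * X + (1 - theta) * Y
    assert np.linalg.eigvalsh(Z).min() >= -1e-10
print("theta*X + (1-theta)*Y stayed in the PSD cone for all sampled theta")
```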
Affine Functions
Definition (Affine Function). A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is affine if it is of the form $f(x) = Ax + b$ with $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$.
Proposition. The image of a convex set under an affine function is convex.
Proof. Let $C \subseteq \mathbb{R}^n$ be a convex set and $f : \mathbb{R}^n \to \mathbb{R}^m$ an affine function via $f(x) = Ax + b$. We wish to show that $f(C) = \{f(x) : x \in C\}$ is convex.
Consider $u, v \in f(C)$ and $\theta \in [0, 1]$. Define $w = \theta u + (1 - \theta) v$. It suffices to show that $w \in f(C)$.
As $u, v \in f(C)$, there exist $x, y \in C$ such that $f(x) = u$ and $f(y) = v$.
Let $z = \theta x + (1 - \theta) y$. As $C$ is convex, we have that $z \in C$.
Thus,
$$f(z) = A(\theta x + (1 - \theta) y) + b = \theta (Ax + b) + (1 - \theta)(Ay + b) = \theta u + (1 - \theta) v = w,$$
as desired.
The above proposition implies that scaling, translation, and projection preserve convexity.
Proposition. The pre-image of a convex set under an affine function is convex.
Proof. Let $C \subseteq \mathbb{R}^m$ be a convex set and $f : \mathbb{R}^n \to \mathbb{R}^m$ an affine function via $f(x) = Ax + b$. We wish to show that $f^{-1}(C) = \{x \in \mathbb{R}^n : f(x) \in C\}$ is convex.
Consider $x, y \in f^{-1}(C)$ and $\theta \in [0, 1]$. It suffices to show that $\theta x + (1 - \theta) y \in f^{-1}(C)$, for which in turn we must show that $f(\theta x + (1 - \theta) y) \in C$.
Observe that
$$f(\theta x + (1 - \theta) y) = \theta (Ax + b) + (1 - \theta)(Ay + b) = \theta f(x) + (1 - \theta) f(y) \in C$$
by convexity of $C$.
Perspective Function
Definition (Perspective Function). The perspective function $P : \mathbb{R}^n \times \mathbb{R}_{++} \to \mathbb{R}^n$ is defined via $P(x, t) = x / t$.
Proposition. The image of a convex set under the perspective function is convex.
Proof. Let $C \subseteq \mathbb{R}^n \times \mathbb{R}_{++}$ be a convex set. We wish to show that $P(C)$ is convex.
Consider $u, v \in P(C)$. We wish to show that for any $\theta \in [0, 1]$, $\theta u + (1 - \theta) v \in P(C)$. The proof is difficult if done directly, hence we will take a slightly different approach.
Fix $\theta \in [0, 1]$. Since $u, v \in P(C)$, there exist $(x, t), (y, s) \in C$ such that $P(x, t) = x/t = u$ and $P(y, s) = y/s = v$.
By convexity of $C$, we have that
$$\mu (x, t) + (1 - \mu)(y, s) = \left(\mu x + (1 - \mu) y, \ \mu t + (1 - \mu) s\right) \in C \quad \text{for all } \mu \in [0, 1].$$
Hence,
$$P\left(\mu x + (1 - \mu) y, \ \mu t + (1 - \mu) s\right) = \frac{\mu x + (1 - \mu) y}{\mu t + (1 - \mu) s} \in P(C).$$
Let
$$\mu = \frac{\theta s}{\theta s + (1 - \theta) t} \in [0, 1].$$
Through substitution, we have that
$$\frac{\mu x + (1 - \mu) y}{\mu t + (1 - \mu) s} = \theta \frac{x}{t} + (1 - \theta) \frac{y}{s} = \theta u + (1 - \theta) v.$$
As $\theta$ varies over $[0, 1]$, $\mu$ varies over $[0, 1]$, so every convex combination $\theta u + (1 - \theta) v$ lies in $P(C)$, implying that $P(C)$ is convex.
Proposition. The pre-image of a convex set under the perspective function is convex.
Proof. Let $C \subseteq \mathbb{R}^n$ be a convex set. We wish to show that $P^{-1}(C) = \{(x, t) \in \mathbb{R}^n \times \mathbb{R}_{++} : x/t \in C\}$ is convex.
Consider $(x, t), (y, s) \in P^{-1}(C)$ and $\theta \in [0, 1]$. It suffices to show that $\theta(x, t) + (1 - \theta)(y, s) \in P^{-1}(C)$. Thus, we wish to show that $\frac{\theta x + (1 - \theta) y}{\theta t + (1 - \theta) s} \in C$.
Observe that
$$\frac{\theta x + (1 - \theta) y}{\theta t + (1 - \theta) s} = \mu \frac{x}{t} + (1 - \mu) \frac{y}{s}, \quad \text{where } \mu = \frac{\theta t}{\theta t + (1 - \theta) s}.$$
Then, note that $\mu \in [0, 1]$, and
$$1 - \mu = \frac{(1 - \theta) s}{\theta t + (1 - \theta) s}.$$
Hence, by convexity of $C$,
$$\mu \frac{x}{t} + (1 - \mu) \frac{y}{s} \in C,$$
implying that $\theta(x, t) + (1 - \theta)(y, s) \in P^{-1}(C)$.
Separating Hyperplanes
Theorem (Separating Hyperplane). Two non-empty and disjoint convex sets can be separated by a hyperplane.
Formally, if there are non-empty and disjoint convex sets $C, D \subseteq \mathbb{R}^n$, there exist $a \in \mathbb{R}^n \setminus \{0\}$ and $b \in \mathbb{R}$ such that
$$a^\top x \leq b \ \text{ for all } x \in C \quad \text{and} \quad a^\top x \geq b \ \text{ for all } x \in D.$$
Geometrically, there exists a hyperplane $\{x : a^\top x = b\}$ between $C$ and $D$.
Proof. We only prove the theorem for the special case in which $C, D$ are closed and bounded, hence compact.
Note that $C \times D$ is then compact. We may define the distance function $f : C \times D \to \mathbb{R}$ via $f(c, d) = \|c - d\|_2$. By compactness, there exist $c^* \in C$, $d^* \in D$ such that $f$ is minimized.
Define $a = d^* - c^*$ and $b = \frac{\|d^*\|_2^2 - \|c^*\|_2^2}{2}$. We claim that the hyperplane $\{x : a^\top x = b\}$ separates $C$ and $D$.
Assume for the sake of contradiction not. Then, without loss of generality, there exists $u \in D$ such that $a^\top u - b < 0$, implying that
$$a^\top u - b = (d^* - c^*)^\top \left(u - \tfrac{1}{2}(d^* + c^*)\right) = (d^* - c^*)^\top (u - d^*) + \tfrac{1}{2}\|d^* - c^*\|_2^2 < 0.$$
Clearly, $\tfrac{1}{2}\|d^* - c^*\|_2^2 \geq 0$, meaning that $(d^* - c^*)^\top (u - d^*) < 0$.
Intuitively, we can move $d^*$ in the direction $u - d^*$ and decrease the distance from $c^*$.
Formally, we can take the derivative
$$\frac{d}{dt}\left\|d^* + t(u - d^*) - c^*\right\|_2^2 \bigg|_{t = 0} = 2 (d^* - c^*)^\top (u - d^*) < 0$$
per the above.
Thus, for some small $t > 0$, we have that
$$\|d^* + t(u - d^*) - c^*\|_2 < \|d^* - c^*\|_2.$$
By convexity of $D$, $d^* + t(u - d^*) \in D$, hence the above is a contradiction by definition of $(c^*, d^*)$ as minimizers of the distance.
Definition. We say that two convex sets $C, D$ are strictly separated if there exist $a \in \mathbb{R}^n \setminus \{0\}$, $b \in \mathbb{R}$ such that
$$a^\top x < b \ \text{ for all } x \in C \quad \text{and} \quad a^\top x > b \ \text{ for all } x \in D.$$
Example. Let $C$ be a closed convex set and $x_0$ a point not in $C$. Then, $C$ and $\{x_0\}$ are strictly separated.
To see why, note that as $C$ is closed, its complement is open. Hence, we can find $\varepsilon > 0$ such that $B(x_0, \varepsilon) \cap C = \emptyset$. Clearly, $B(x_0, \varepsilon)$ is convex. By the separating hyperplane theorem, we may find $a \neq 0$, $b$ such that $a^\top x \leq b$ for all $x \in C$, and $a^\top x \geq b$ for all $x \in B(x_0, \varepsilon)$. In particular, the last statement means that for any $u$ where $\|u\|_2 \leq 1$,
$$a^\top (x_0 + \varepsilon u) \geq b.$$
The left-hand side is minimized when $u = -a/\|a\|_2$, hence
$$a^\top x_0 - \varepsilon \|a\|_2 \geq b, \quad \text{i.e.} \quad a^\top x_0 \geq b + \varepsilon \|a\|_2 > b.$$
Thus, the hyperplane $\{x : a^\top x = b + \tfrac{\varepsilon}{2}\|a\|_2\}$ strictly separates $C$ and $\{x_0\}$.
We can use this result to show the following.
Proposition. Let $C$ be a closed convex set, and $\mathcal{H}$ be the set of all halfspaces that contain $C$ entirely. Then,
$$C = \bigcap_{H \in \mathcal{H}} H.$$
Proof.
($\subseteq$) Trivial by definition of $\mathcal{H}$.
($\supseteq$) Take $x \in \bigcap_{H \in \mathcal{H}} H$. Assume for the sake of contradiction that $x \notin C$. Then, we may find a strictly separating hyperplane between $\{x\}$ and $C$; the halfspace it bounds that contains $C$ belongs to $\mathcal{H}$ but excludes $x$. This implies that $x \notin \bigcap_{H \in \mathcal{H}} H$, a contradiction.
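The sketch below (my own illustration, with $C$ taken to be the closed unit ball) constructs a strictly separating hyperplane explicitly from the projection of the outside point onto $C$; it assumes numpy is available.

```python
# The hyperplane through the midpoint of x0 and its projection onto C strictly
# separates x0 from C, here with C = closed unit ball at the origin.
import numpy as np

rng = np.random.default_rng(2)
x0 = rng.standard_normal(3)
x0 *= 2.0 / np.linalg.norm(x0)       # a point with ||x0|| = 2 > 1, so x0 is not in C
proj = x0 / np.linalg.norm(x0)       # projection of x0 onto the unit ball C

a = x0 - proj                        # normal direction of the separating hyperplane
b = a @ (x0 + proj) / 2              # offset: hyperplane through the midpoint

sup_over_C = np.linalg.norm(a)       # sup_{x in C} a^T x = ||a||_2 for the unit ball
assert sup_over_C < b < a @ x0       # a^T x < b for all x in C, and a^T x0 > b
print("strict separation verified:", sup_over_C, "<", b, "<", a @ x0)
```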
Supporting Hyperplanes
Definition. For a convex set $C$ and a point $x_0$ on its boundary, we say that a hyperplane
$$\{x : a^\top x = a^\top x_0\}, \quad a \neq 0,$$
is a supporting hyperplane at $x_0$ if $a^\top x \leq a^\top x_0$ for all $x \in C$. Geometrically, the hyperplane is tangent to $C$ at a point on the boundary of $C$, and one of its halfspaces contains the entirety of $C$.
Convex Functions
Basic Properties and Definitions
Definition. A function $f : \mathbb{R}^n \to \mathbb{R}$ (with convex domain) is convex if for all $x, y \in \operatorname{dom}(f)$, $\theta \in [0, 1]$, we have that
$$f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y).$$
If the inequality is strict, i.e.
$$f(\theta x + (1 - \theta) y) < \theta f(x) + (1 - \theta) f(y)$$
for $x \neq y$ and $\theta \in (0, 1)$, then the function is said to be strictly convex.
Intuitively, convex functions are those in which the epigraph of the function (the area above the graph of the function) is a convex set.
Remark. A function is convex if and only if it is convex when restricted to any line in its domain, i.e. for all $x \in \operatorname{dom}(f)$ and $v \in \mathbb{R}^n$,
$$g(t) = f(x + t v)$$
is convex (on $\{t : x + tv \in \operatorname{dom}(f)\}$).
Theorem (First Order Characterization). A differentiable function $f$ (i.e. $f \in C^1$) is convex if and only if
$$f(y) \geq f(x) + \nabla f(x)^\top (y - x)$$
for all $x, y \in \operatorname{dom}(f)$.
Proof.
Case: $n = 1$.
First assume that $f$ is convex. Then for any $x, y \in \operatorname{dom}(f)$ and $t \in (0, 1]$, we have that
$$f(x + t(y - x)) = f((1 - t) x + t y) \leq (1 - t) f(x) + t f(y),$$
so, rearranging,
$$f(y) \geq f(x) + \frac{f(x + t(y - x)) - f(x)}{t}.$$
Taking $t \to 0^+$, the difference quotient tends to $f'(x)(y - x)$, hence
$$f(y) \geq f(x) + f'(x)(y - x),$$
as desired.
Now instead assume that for all $x, y \in \operatorname{dom}(f)$,
$$f(y) \geq f(x) + f'(x)(y - x).$$
We wish to show that $f$ is convex. Fix $x, y \in \operatorname{dom}(f)$ and $\theta \in [0, 1]$. Let $z = \theta x + (1 - \theta) y$. Then,
$$f(x) \geq f(z) + f'(z)(x - z).$$
Similarly, we can see that
$$f(y) \geq f(z) + f'(z)(y - z).$$
Combining these (multiply the first by $\theta$, the second by $1 - \theta$, and add),
$$\theta f(x) + (1 - \theta) f(y) \geq f(z) + f'(z)\left(\theta x + (1 - \theta) y - z\right) = f(z) = f(\theta x + (1 - \theta) y),$$
as desired.
Case: $n > 1$.
We can study the one-dimensional function which varies in the direction of $y - x$, i.e. $g(t) = f(x + t(y - x))$. The result then follows from the $n = 1$ case.
Remark. The above inequality is strict for all $x \neq y$ if and only if the function is strictly convex.
Corollary. Let $f$ be a differentiable convex function, and $x^*$ a point such that $\nabla f(x^*) = 0$. Then, $x^*$ is a global minimizer of $f$.
Proof. Consider some $y \in \operatorname{dom}(f)$. Per the first order characterization of convex functions,
$$f(y) \geq f(x^*) + \nabla f(x^*)^\top (y - x^*) = f(x^*).$$
Theorem (Second Order Characterization). A twice differentiable function $f$ (i.e. $f \in C^2$) is convex if and only if
$$\nabla^2 f(x) \succeq 0$$
for all $x \in \operatorname{dom}(f)$.
Proof.
First assume that $f$ is convex. Assume for the sake of contradiction that there exists some $x \in \operatorname{dom}(f)$ such that $\nabla^2 f(x)$ is not positive semi-definite. By definition, there exists some eigenvector $v$ of $\nabla^2 f(x)$ such that $v^\top \nabla^2 f(x) v = \lambda \|v\|_2^2 < 0$, where $\lambda < 0$ is the corresponding eigenvalue.
Define $g(t) = f(x + t v)$. Note that
$$g''(t) = v^\top \nabla^2 f(x + t v) v.$$
Hence,
$$g''(0) = v^\top \nabla^2 f(x) v < 0.$$
By continuity of the second derivative, for some small $\delta > 0$, $g''(t) < 0$ for all $t \in [0, \delta]$.
Then,
$$g(\delta) = g(0) + g'(0)\delta + \tfrac{1}{2} g''(\xi) \delta^2 < g(0) + g'(0)\delta \quad \text{for some } \xi \in (0, \delta)$$
by Taylor's theorem with the mean value form of the remainder. Hence, a contradiction, since $g$ inherits convexity from $f$ and the above violates the first order characterization of convexity. Thus, $\nabla^2 f(x) \succeq 0$ for all $x$.
Now assume that $\nabla^2 f(x) \succeq 0$ for all $x \in \operatorname{dom}(f)$. We claim that $f$ is convex.
Fix $x, y \in \operatorname{dom}(f)$. Define
$$g(t) = f(x + t(y - x)).$$
Then,
$$f(y) = g(1) = g(0) + g'(\xi) \geq g(0) + g'(0) = f(x) + \nabla f(x)^\top (y - x) \quad \text{for some } \xi \in (0, 1)$$
by the mean value theorem. The inequality follows from the fact that $g''(t) = (y - x)^\top \nabla^2 f(x + t(y - x))(y - x)$ is always non-negative, implying that $g'$ is non-decreasing.
Since $x, y$ are arbitrary, we have that $f$ is convex by the first order characterization of convexity.
Remark. If $\nabla^2 f(x) \succ 0$ for all $x$, then $f$ is strictly convex; the converse does not hold (for example, $f(x) = x^4$ is strictly convex yet $f''(0) = 0$).
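As a numerical illustration of the second-order characterization (my own example, not from the lecture), the log-sum-exp function $f(x) = \log \sum_i e^{x_i}$ is convex, and its Hessian $\operatorname{diag}(p) - p p^\top$ with $p = \operatorname{softmax}(x)$ should be positive semidefinite at every point:

```python
# Check that the Hessian of log-sum-exp is PSD at randomly sampled points.
import numpy as np

rng = np.random.default_rng(3)
for _ in range(100):
    x = rng.standard_normal(6)
    p = np.exp(x - x.max())
    p /= p.sum()                             # softmax(x)
    H = np.diag(p) - np.outer(p, p)          # Hessian of log-sum-exp at x
    assert np.linalg.eigvalsh(H).min() >= -1e-12
print("Hessian of log-sum-exp is PSD at all sampled points")
```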
Operations that Preserve Convexity
Nonnegative Weighted Sum
Proposition. Let $f_1, \dots, f_k$ be convex functions and $w_1, \dots, w_k \geq 0$. Then,
$$f(x) = \sum_{i=1}^k w_i f_i(x)$$
is convex.
Proof. Fix $x, y$ and $\theta \in [0, 1]$. Then,
$$f(\theta x + (1 - \theta) y) = \sum_{i=1}^k w_i f_i(\theta x + (1 - \theta) y) \leq \sum_{i=1}^k w_i \left(\theta f_i(x) + (1 - \theta) f_i(y)\right) = \theta f(x) + (1 - \theta) f(y),$$
as desired.
Remark. The above proposition generalizes to infinite sums (if they converge) as well as integrals. Specifically, if a function $f(x, y)$ is convex in $x$ for all $y \in \mathcal{A}$ and $w(y) \geq 0$, then
$$g(x) = \int_{\mathcal{A}} w(y) f(x, y) \, dy$$
is convex (provided the integral exists).
Affine Composition
Proposition. Let $f : \mathbb{R}^m \to \mathbb{R}$, $A \in \mathbb{R}^{m \times n}$, and $b \in \mathbb{R}^m$. Then,
$$g(x) = f(Ax + b)$$
is convex if $f$ is convex.
Proof. Fix $x, y$ and $\theta \in [0, 1]$. Then,
$$g(\theta x + (1 - \theta) y) = f\left(\theta (Ax + b) + (1 - \theta)(Ay + b)\right) \leq \theta f(Ax + b) + (1 - \theta) f(Ay + b) = \theta g(x) + (1 - \theta) g(y).$$
Maximum and Supremum
Proposition. Let $f_1, \dots, f_k$ be convex functions. Then,
$$f(x) = \max\{f_1(x), \dots, f_k(x)\}$$
is convex.
Proof. Fix $x, y$ and $\theta \in [0, 1]$. Then,
$$f(\theta x + (1 - \theta) y) = \max_i f_i(\theta x + (1 - \theta) y) \leq \max_i \left(\theta f_i(x) + (1 - \theta) f_i(y)\right) \leq \theta \max_i f_i(x) + (1 - \theta) \max_i f_i(y) = \theta f(x) + (1 - \theta) f(y).$$
Remark. The above proposition generalizes to the supremum. Specifically, if $f(x, y)$ is convex in $x$ for all $y \in \mathcal{A}$, then
$$g(x) = \sup_{y \in \mathcal{A}} f(x, y)$$
is convex.
Representation as Supremum of Affine Functions
Proposition. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex function. Then,
$$f(x) = \sup\{g(x) : g \text{ affine}, \ g \leq f\}.$$
Proof.
($\leq$) Define
$$\operatorname{epi}(f) = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : t \geq f(x)\}.$$
We claim that $\operatorname{epi}(f)$ is convex. Fix $(x, t), (y, s) \in \operatorname{epi}(f)$ and $\theta \in [0, 1]$. Then let $(z, r) = \theta(x, t) + (1 - \theta)(y, s)$. So,
$$r = \theta t + (1 - \theta) s \geq \theta f(x) + (1 - \theta) f(y).$$
By convexity of $f$,
$$\theta f(x) + (1 - \theta) f(y) \geq f(\theta x + (1 - \theta) y) = f(z),$$
implying that $(z, r) \in \operatorname{epi}(f)$, as desired. Hence, $\operatorname{epi}(f)$ is convex.
Now fix some $x_0$. We shall show that indeed,
$$f(x_0) \leq \sup\{g(x_0) : g \text{ affine}, \ g \leq f\}.$$
Observe that $(x_0, f(x_0))$ lies on the boundary of $\operatorname{epi}(f)$, hence we may find a supporting hyperplane there: $(a, c) \neq 0$ such that
$$a^\top x + c\, t \geq a^\top x_0 + c\, f(x_0)$$
for all $(x, t) \in \operatorname{epi}(f)$.
Then, $c \geq 0$ (taking $t \to \infty$ would otherwise violate the inequality). In fact $c > 0$: if $c = 0$, then $a^\top x \geq a^\top x_0$ for all $x \in \mathbb{R}^n$, forcing $a = 0$, a contradiction. Implying that $c > 0$.
We can then write the above (taking $t = f(x)$) as
$$f(x) \geq f(x_0) + \frac{a^\top (x_0 - x)}{c} =: g(x)$$
for all $x$.
Hence, $g$ is an affine function that underestimates $f$ over all $x$, and achieves $g(x_0) = f(x_0)$. We thus have the result.
($\geq$) Follows by definition: any affine $g$ with $g \leq f$ satisfies $g(x) \leq f(x)$, so the supremum is at most $f(x)$.
Perspective of a Function
Definition. Consider a function $f : \mathbb{R}^n \to \mathbb{R}$. We define the perspective of $f$ as $g : \mathbb{R}^n \times \mathbb{R}_{++} \to \mathbb{R}$ via
$$g(x, t) = t f(x / t).$$
Proposition. If $f$ is convex, then its perspective $g$ is convex.
Proof. It suffices to show that the epigraph of $g$ is convex. Note that $(x, t, s) \in \operatorname{epi}(g)$ iff $t f(x/t) \leq s$ iff $f(x/t) \leq s/t$ iff $(x/t, s/t) \in \operatorname{epi}(f)$; that is, $\operatorname{epi}(g)$ is the preimage of $\operatorname{epi}(f)$ under the perspective function $(x, t, s) \mapsto (x/t, s/t)$, hence it is convex.
Convex Conjugates
Definition. Let $f : \mathbb{R}^n \to \mathbb{R}$. The conjugate of $f$, denoted $f^*$, is defined via
$$f^*(y) = \sup_{x \in \operatorname{dom}(f)} \left(y^\top x - f(x)\right).$$
Geometrically, $f^*(y)$ is the greatest gap between the linear function $x \mapsto y^\top x$ and $f(x)$.
Proposition. For any function $f$, the conjugate $f^*$ is always convex.
Proof. Observe that $y^\top x - f(x)$ is affine (hence convex) in $y$ for each fixed $x$, and the pointwise supremum of convex functions is convex, hence $f^*$ is convex in $y$.
Basic Properties
Proposition (Fenchel's Inequality). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function. Then,
$$f(x) + f^*(y) \geq x^\top y$$
for all $x, y$.
Proof. Note that
$$f^*(y) = \sup_{z \in \operatorname{dom}(f)} \left(y^\top z - f(z)\right) \geq y^\top x - f(x)$$
for all $x \in \operatorname{dom}(f)$. The result is then immediate.
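A small sketch of Fenchel's inequality on a concrete example (my own, assuming numpy): for $f(x) = \tfrac{1}{2} x^\top Q x$ with $Q \succ 0$, the conjugate is $f^*(y) = \tfrac{1}{2} y^\top Q^{-1} y$, and the inequality can be checked on random samples.

```python
# f(x) = (1/2) x^T Q x has conjugate f*(y) = (1/2) y^T Q^{-1} y (the supremum of
# y^T x - f(x) is attained at x = Q^{-1} y). Verify f(x) + f*(y) >= x^T y.
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))
Q = A @ A.T + np.eye(4)                      # positive definite
Q_inv = np.linalg.inv(Q)

f = lambda x: 0.5 * x @ Q @ x
f_star = lambda y: 0.5 * y @ Q_inv @ y

for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    assert f(x) + f_star(y) >= x @ y - 1e-9
print("Fenchel's inequality holds on all samples")
```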
Convex Optimization Problems
Optimization Problems
Definition. An optimization problem is of the form
$$\begin{aligned} \min_{x} \quad & f_0(x) \\ \text{s.t.} \quad & f_i(x) \leq 0, \quad i = 1, \dots, m, \\ & h_j(x) = 0, \quad j = 1, \dots, p. \end{aligned}$$
Its domain is the intersection of the domains of each function, i.e.
$$\mathcal{D} = \bigcap_{i=0}^m \operatorname{dom}(f_i) \cap \bigcap_{j=1}^p \operatorname{dom}(h_j).$$
Definition. The feasible set of an optimization problem is the set
$$\mathcal{F} = \{x \in \mathcal{D} : f_i(x) \leq 0 \text{ for all } i, \ h_j(x) = 0 \text{ for all } j\}.$$
We say that an optimization problem is feasible if its feasible set is not empty.
Definition. We define the optimal value of an optimization problem as the value
$$p^* = \inf\{f_0(x) : x \in \mathcal{F}\}.$$
If $\mathcal{F} = \emptyset$, then $p^* = +\infty$.
Definition. If there exists a sequence of feasible points $x_k$ such that $f_0(x_k) \to -\infty$ as $k \to \infty$, we say that the optimization problem is unbounded below and $p^* = -\infty$.
Definition. We say that a feasible $x$ is $\varepsilon$-suboptimal if $f_0(x) \leq p^* + \varepsilon$.
Definition. We say that a feasible $x$ is locally optimal if there exists some $R > 0$ such that $f_0(x) \leq f_0(y)$ for all feasible $y$ with $\|y - x\|_2 \leq R$.
Definition. If $f_i(x) = 0$ for some feasible $x$ and some $i \in \{1, \dots, m\}$, we say that constraint $i$ is active at $x$.
Definition. We call an optimization problem of the form
$$\begin{aligned} \text{find} \quad & x \\ \text{s.t.} \quad & f_i(x) \leq 0, \quad i = 1, \dots, m, \\ & h_j(x) = 0, \quad j = 1, \dots, p \end{aligned}$$
a feasibility problem.
Remark. Note that maximization problems can be formulated as minimization problems by replacing the objective $f_0$ with $-f_0$.
Slack Variables
Slack variables allow us to express inequalities as equalities.
In particular, note that
$$f_i(x) \leq 0 \iff f_i(x) + s_i = 0 \text{ for some } s_i \geq 0$$
(in particular, $s_i = -f_i(x)$). Here, $s_i$ is a slack variable.
More generally, we can reformulate the optimization problem
$$\begin{aligned} \min_{x} \quad & f_0(x) \\ \text{s.t.} \quad & f_i(x) \leq 0, \quad i = 1, \dots, m, \\ & h_j(x) = 0, \quad j = 1, \dots, p \end{aligned}$$
as
$$\begin{aligned} \min_{x, s} \quad & f_0(x) \\ \text{s.t.} \quad & f_i(x) + s_i = 0, \quad i = 1, \dots, m, \\ & s_i \geq 0, \quad i = 1, \dots, m, \\ & h_j(x) = 0, \quad j = 1, \dots, p. \end{aligned}$$
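The following sketch (my own example, using scipy.optimize.linprog) verifies on a tiny LP that the slack-variable reformulation attains the same optimal value as the original inequality-constrained form.

```python
# Minimizing c^T x subject to G x <= h, x >= 0 gives the same optimal value as
# minimizing over (x, s) subject to G x + s = h, s >= 0, x >= 0.
import numpy as np
from scipy.optimize import linprog

c = np.array([-1.0, -2.0])
G = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
h = np.array([4.0, 3.0, 2.0])

# Original form: inequality constraints.
res_ineq = linprog(c, A_ub=G, b_ub=h, bounds=[(0, None)] * 2)

# Slack form: G x + s = h with s >= 0 appended as extra variables.
m, n = G.shape
c_slack = np.concatenate([c, np.zeros(m)])
A_eq = np.hstack([G, np.eye(m)])
res_slack = linprog(c_slack, A_eq=A_eq, b_eq=h, bounds=[(0, None)] * (n + m))

assert abs(res_ineq.fun - res_slack.fun) < 1e-8
print("both formulations give optimal value", res_ineq.fun)
```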
Convex Problems
Definition. A convex problem is an optimization problem of the form
$$\begin{aligned} \min_{x} \quad & f_0(x) \\ \text{s.t.} \quad & f_i(x) \leq 0, \quad i = 1, \dots, m, \\ & a_j^\top x = b_j, \quad j = 1, \dots, p, \end{aligned}$$
where $f_i$ is convex for all $i = 0, \dots, m$ (note that the equality constraints are affine).
Remark. The feasible set of a convex optimization problem
is convex, as it is the intersection of convex sets (the sublevel sets $\{x : f_i(x) \leq 0\}$ and the hyperplanes $\{x : a_j^\top x = b_j\}$).
Proposition. Any local solution to a convex optimization problem is a global solution.
Proof. Let $x$ be a local solution to the typical convex optimization problem. Then, there exists $R > 0$ such that $x$ is optimal within the ball of radius $R$ around it.
Assume for the sake of contradiction that $x$ is not globally optimal. Then, there exists some feasible $y$ such that $f_0(y) < f_0(x)$.
We can find some $\theta \in (0, 1)$ such that $z = \theta y + (1 - \theta) x$ satisfies $\|z - x\|_2 \leq R$; note that $z$ is feasible by convexity of the feasible set. Then,
$$f_0(z) \leq \theta f_0(y) + (1 - \theta) f_0(x) < f_0(x),$$
a contradiction.
The above proposition provides some intuition as to why convex optimization problems are particularly nice to work with.
Proposition. Let $f_0$ be a differentiable convex function. Then, a feasible $x$ is optimal if and only if
$$\nabla f_0(x)^\top (y - x) \geq 0$$
for all feasible $y$.
Proof. First assume that $x$ is optimal. Assume for the sake of contradiction that $\nabla f_0(x)^\top (y - x) < 0$ for some feasible $y$. Define
$$z(t) = x + t(y - x), \quad t \in [0, 1].$$
Observe then that
$$\frac{d}{dt} f_0(z(t)) \bigg|_{t = 0} = \nabla f_0(x)^\top (y - x) < 0.$$
Thus, for some small $t > 0$, we have that $f_0(z(t)) < f_0(x)$, implying that
$$f_0(z(t)) < f_0(x) = p^*,$$
a contradiction, since $z(t)$ is feasible by convexity of the feasible set.
Now assume that
$$\nabla f_0(x)^\top (y - x) \geq 0$$
holds for all feasible $y$. We claim that $x$ is optimal. Observe that, by the first order characterization of convexity,
$$f_0(y) \geq f_0(x) + \nabla f_0(x)^\top (y - x) \geq f_0(x)$$
for all feasible $y$, as desired.
Corollary. Consider some convex optimization problem where there are no constraints (so every $y \in \operatorname{dom}(f_0)$ is feasible). If $\operatorname{dom}(f_0)$ is open and $x$ is the optimal point, then
$$\nabla f_0(x) = 0.$$
Proof.
As $\operatorname{dom}(f_0)$ is open, we can find $t > 0$ small enough such that $y = x - t \nabla f_0(x) \in \operatorname{dom}(f_0)$. Then,
$$\nabla f_0(x)^\top (y - x) = -t \|\nabla f_0(x)\|_2^2 \geq 0,$$
which is only true if the gradient is zero.
Proposition: Consider the convex optimization problem
$$\begin{aligned} \min_{x} \quad & f_0(x) \\ \text{s.t.} \quad & Ax = b, \end{aligned}$$
where $f_0$ is differentiable and convex. Then, a point $x$ is an optimal point if and only if
$$\nabla f_0(x) + A^\top \nu = 0$$
for some $\nu$, and $Ax = b$.
Proof. We first prove the forwards direction. Assume that $x$ is optimal. Then, $Ax = b$ trivially. Furthermore, we know that for all feasible $y$,
$$\nabla f_0(x)^\top (y - x) \geq 0.$$
As $Ay = b$ and $Ax = b$, we know that $y - x \in \operatorname{null}(A)$ as well. Thus, we must have that $\nabla f_0(x)^\top v \geq 0$ where $v = y - x$ ranges over all of $\operatorname{null}(A)$. Since $\operatorname{null}(A)$ is a subspace, we may replace $v$ with $-v$, so we may then rewrite the above as
$$\nabla f_0(x)^\top v = 0$$
for all $v \in \operatorname{null}(A)$.
Thus, $\nabla f_0(x)$ is orthogonal to $\operatorname{null}(A)$, hence $\nabla f_0(x) \in \operatorname{null}(A)^\perp = \operatorname{range}(A^\top)$. Similarly, its negative is in the range of $A^\top$. Meaning that there must exist some $\nu$ such that
$$\nabla f_0(x) = -A^\top \nu, \quad \text{i.e.} \quad \nabla f_0(x) + A^\top \nu = 0.$$
We now prove the backwards direction.
If $Ax = b$, then clearly $x$ is feasible.
If
$$\nabla f_0(x) = -A^\top \nu,$$
then $\nabla f_0(x) \in \operatorname{range}(A^\top) = \operatorname{null}(A)^\perp$. Hence, it is orthogonal to $\operatorname{null}(A)$, and so
$$\nabla f_0(x)^\top (y - x) = 0 \geq 0$$
for all feasible $y$ (since $y - x \in \operatorname{null}(A)$).
Thus, $x$ is optimal.
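For a concrete instance of this optimality condition (my own example, assuming numpy), take a convex quadratic objective with equality constraints; the condition $\nabla f_0(x) + A^\top \nu = 0$, $Ax = b$ becomes a single linear (KKT) system:

```python
# For f0(x) = (1/2) x^T P x + q^T x with constraint A x = b, optimality is the
# linear system [[P, A^T], [A, 0]] [x; nu] = [-q; b], which we solve directly.
import numpy as np

rng = np.random.default_rng(5)
n, p = 5, 2
M = rng.standard_normal((n, n))
P = M @ M.T + np.eye(n)                      # positive definite => f0 is convex
q = rng.standard_normal(n)
A = rng.standard_normal((p, n))
b = rng.standard_normal(p)

KKT = np.block([[P, A.T], [A, np.zeros((p, p))]])
rhs = np.concatenate([-q, b])
sol = np.linalg.solve(KKT, rhs)
x, nu = sol[:n], sol[n:]

assert np.allclose(A @ x, b)                 # primal feasibility
assert np.allclose(P @ x + q + A.T @ nu, 0)  # grad f0(x) + A^T nu = 0
print("optimal x:", x)
```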
Linear Problems
Definition. A linear problem (LP) is an optimization problem of the form
$$\begin{aligned} \min_{x} \quad & c^\top x + d \\ \text{s.t.} \quad & Gx \preceq h, \\ & Ax = b. \end{aligned}$$
One may convert the inequality constraints into equality constraints (plus nonnegativity constraints) via slack variables.
Quadratic Problems
Definition. A quadratic problem (QP) is an optimization problem of the form
$$\begin{aligned} \min_{x} \quad & \tfrac{1}{2} x^\top P x + q^\top x + r \\ \text{s.t.} \quad & Gx \preceq h, \\ & Ax = b, \end{aligned}$$
where $P \in \mathbb{S}^n_+$.
Definition. A quadratically constrained quadratic problem (QCQP) is an optimization problem of the form
$$\begin{aligned} \min_{x} \quad & \tfrac{1}{2} x^\top P_0 x + q_0^\top x + r_0 \\ \text{s.t.} \quad & \tfrac{1}{2} x^\top P_i x + q_i^\top x + r_i \leq 0, \quad i = 1, \dots, m, \\ & Ax = b, \end{aligned}$$
where $P_i \in \mathbb{S}^n_+$ for all $i$.
Definition. A second-order cone program (SOCP) is an optimization problem of the form
$$\begin{aligned} \min_{x} \quad & f^\top x \\ \text{s.t.} \quad & \|A_i x + b_i\|_2 \leq c_i^\top x + d_i, \quad i = 1, \dots, m, \\ & Fx = g. \end{aligned}$$
Duality
Lagrange Dual Function
Definition. The Lagrangian of a (not necessarily convex) optimization problem
$$\begin{aligned} \min_{x} \quad & f_0(x) \\ \text{s.t.} \quad & f_i(x) \leq 0, \quad i = 1, \dots, m, \\ & h_j(x) = 0, \quad j = 1, \dots, p \end{aligned}$$
is a function $L : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ defined via
$$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{j=1}^p \nu_j h_j(x).$$
Definition: The Lagrange dual of an optimization problem is the function $g : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R} \cup \{-\infty\}$ defined via
$$g(\lambda, \nu) = \inf_{x \in \mathcal{D}} L(x, \lambda, \nu),$$
where $L$ is the Lagrangian of the optimization problem and $\mathcal{D}$ is the domain of the optimization problem (not necessarily feasible).
Proposition. The Lagrange dual of any optimization problem is concave.
Proof. Observe that the Lagrangian is affine in $(\lambda, \nu)$ for each fixed $x$. Concavity is preserved under the point-wise infimum of a family of affine (hence concave) functions, so $g$ is concave.
Proposition (Weak Duality). For some optimization problem, let $p^*$ be its optimal value and $g$ be its Lagrange dual. Then, for any $\lambda \succeq 0$ and any $\nu$,
$$g(\lambda, \nu) \leq p^*.$$
Proof. Fix $\lambda \succeq 0$ and $\nu$. Consider some feasible $\tilde{x}$. Then,
$$g(\lambda, \nu) = \inf_{x \in \mathcal{D}} L(x, \lambda, \nu) \leq L(\tilde{x}, \lambda, \nu) = f_0(\tilde{x}) + \sum_{i=1}^m \lambda_i f_i(\tilde{x}) + \sum_{j=1}^p \nu_j h_j(\tilde{x}) \leq f_0(\tilde{x}),$$
since $\lambda_i \geq 0$, $f_i(\tilde{x}) \leq 0$, and $h_j(\tilde{x}) = 0$. Taking the infimum over feasible $\tilde{x}$ gives $g(\lambda, \nu) \leq p^*$, as desired.
Definition. If $\lambda \succeq 0$ and $g(\lambda, \nu) > -\infty$, then we say that $(\lambda, \nu)$ is dual feasible.
Lagrange Dual Problem
In light of weak duality, we can think of finding the best lower bound on the optimal value using the Lagrange dual function. This is the motivation for the Lagrange dual problem, defined below. Note that the Lagrange dual problem is particularly nice due to the fact that the Lagrange dual function is concave, established previously.
Definition. Let $g$ be the Lagrange dual for an optimization problem. We define the Lagrange dual problem for the optimization problem as
$$\begin{aligned} \max_{\lambda, \nu} \quad & g(\lambda, \nu) \\ \text{s.t.} \quad & \lambda \succeq 0. \end{aligned}$$
We call this problem the dual, and the original problem the primal.
Remark. Let $p^*$ be the optimal value of the primal problem, and $d^*$ be the optimal value of the dual problem. Then, by weak duality,
$$d^* \leq p^*.$$
There may be more implicit constraints in the dual, particularly the requirement $g(\lambda, \nu) > -\infty$ (recall that $g$ is an infimum and can take the value $-\infty$).
Definition. The optimal duality gap of a problem is the value $p^* - d^*$.
Definition. We say that strong duality holds if $p^* = d^*$.
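The sketch below (my own example, using scipy.optimize.linprog) forms the Lagrange dual of a small inequality-constrained LP by hand and solves both problems, illustrating weak duality and, in this case, strong duality.

```python
# For the LP  min c^T x  s.t.  A x <= b,  the Lagrange dual is
#   max -b^T lam  s.t.  A^T lam = -c,  lam >= 0.
# Solving both numerically shows d* <= p*, with equality here.
import numpy as np
from scipy.optimize import linprog

c = np.array([-1.0, -2.0])
A = np.array([[ 1.0,  1.0],
              [ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0,  0.0],
              [ 0.0, -1.0]])
b = np.array([4.0, 3.0, 2.0, 0.0, 0.0])

# Primal: min c^T x s.t. A x <= b (x otherwise free).
primal = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)

# Dual: max -b^T lam s.t. A^T lam = -c, lam >= 0, written as min b^T lam.
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * len(b))

p_star, d_star = primal.fun, -dual.fun
assert d_star <= p_star + 1e-8               # weak duality
assert abs(p_star - d_star) < 1e-8           # strong duality for this LP
print("p* =", p_star, " d* =", d_star)
```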
Geometric Intuition
We now build some geometric intuition regarding duality.
Fix some optimization problem and define the set
$$\mathcal{G} = \{(f_1(x), \dots, f_m(x), h_1(x), \dots, h_p(x), f_0(x)) : x \in \mathcal{D}\} \subseteq \mathbb{R}^m \times \mathbb{R}^p \times \mathbb{R}.$$
$\mathcal{G}$ essentially expresses all value combinations of the constraints and objective. We now interpret many prior results, along with some new results, using the geometry of $\mathcal{G}$.
Optimal Value
It is easy to see that
$$p^* = \inf\{t : (u, v, t) \in \mathcal{G}, \ u \preceq 0, \ v = 0\},$$
i.e. we restrict the set we consider to only feasible points.
Lagrange Dual Function
We can also see that
$$g(\lambda, \nu) = \inf\{(\lambda, \nu, 1)^\top (u, v, t) : (u, v, t) \in \mathcal{G}\},$$
and hence
$$(\lambda, \nu, 1)^\top (u, v, t) \geq g(\lambda, \nu) \quad \text{for all } (u, v, t) \in \mathcal{G}.$$
Thus, for any $(\lambda, \nu)$, we have that the inequality above defines a (non-vertical) halfspace containing $\mathcal{G}$.
If the infimum in $g(\lambda, \nu)$ is attained, we can think of $\{(u, v, t) : (\lambda, \nu, 1)^\top (u, v, t) = g(\lambda, \nu)\}$ as a supporting hyperplane to $\mathcal{G}$.
Weak Duality
Say that $\lambda \succeq 0$. Then, for any $(u, v, t) \in \mathcal{G}$ with $u \preceq 0$ and $v = 0$,
$$\lambda^\top u + \nu^\top v + t \leq t.$$
Thus, for any $(u, v, t) \in \mathcal{G}$ where $u \preceq 0$ and $v = 0$, we have that $g(\lambda, \nu) \leq t$. As $d^*$ is the maximum value of $g(\lambda, \nu)$ over all $(\lambda, \nu)$ where $\lambda \succeq 0$, and $p^*$ is the infimum of such $t$, we thus have that
$$d^* \leq p^*,$$
proving weak duality.
Epigraph Variation
Define
$$\mathcal{A} = \{(u, v, t) : \exists x \in \mathcal{D}, \ f_i(x) \leq u_i \ \forall i, \ h_j(x) = v_j \ \forall j, \ f_0(x) \leq t\} = \mathcal{G} + \left(\mathbb{R}^m_+ \times \{0\} \times \mathbb{R}_+\right).$$
$\mathcal{A}$ can be thought of as sort of an epigraph of $\mathcal{G}$, with the exception that we enforce equality on the equality-constraint coordinates $v$.
Once again, the optimal value can be expressed as
$$p^* = \inf\{t : (0, 0, t) \in \mathcal{A}\}.$$
For $\lambda \succeq 0$, note that
$$g(\lambda, \nu) = \inf\{(\lambda, \nu, 1)^\top (u, v, t) : (u, v, t) \in \mathcal{A}\},$$
as $\mathcal{G}$ is a subset of $\mathcal{A}$, but the added points in $\mathcal{A}$ (which only increase $u$ and $t$) do not decrease the value of the infimum.
Once again, we may say that if the infimum is attained, then $\{(u, v, t) : (\lambda, \nu, 1)^\top (u, v, t) = g(\lambda, \nu)\}$ is a supporting hyperplane to $\mathcal{A}$, since for all $(u, v, t) \in \mathcal{A}$, we have
$$(\lambda, \nu, 1)^\top (u, v, t) \geq g(\lambda, \nu).$$
Note that $(0, 0, p^*)$ is in the boundary of $\mathcal{A}$, hence
$$g(\lambda, \nu) \leq (\lambda, \nu, 1)^\top (0, 0, p^*) = p^*$$
once again gives us weak duality.
Slater’s Condition
Proposition (Slater's Condition). If there exists an "interior" point to the inequality constraints of a convex optimization problem, i.e. a point $\tilde{x}$ with
$$f_i(\tilde{x}) < 0 \quad \text{for all } i = 1, \dots, m,$$
and $\tilde{x}$ is feasible ($A\tilde{x} = b$), then strong duality holds.
Proof. We will use the geometric interpretation of duality to prove Slater’s condition.
First note that strong duality holds if and only if
$$g(\lambda^*, \nu^*) = p^*$$
for some $\lambda^* \succeq 0$ and $\nu^*$. In other words, there exists a (non-vertical) supporting hyperplane to $\mathcal{A}$ (defined above) passing through $(0, 0, p^*)$.
The idea behind the proof is that we will separate $\mathcal{A}$ from the set
$$\mathcal{B} = \{(0, 0, s) \in \mathbb{R}^m \times \mathbb{R}^p \times \mathbb{R} : s < p^*\}$$
with a hyperplane that proves strong duality.
In doing so, we will make the following simplifying assumptions: $\operatorname{int}(\mathcal{D}) \neq \emptyset$, $\operatorname{rank}(A) = p$, and $p^* > -\infty$ (otherwise $d^* = p^* = -\infty$ by weak duality).
Note that as we are only considering a convex optimization problem, $\mathcal{A}$ is convex; $\mathcal{B}$ is convex as it is the Cartesian product of convex sets.
Furthermore, see that
$$\mathcal{A} \cap \mathcal{B} = \emptyset,$$
since a point $(0, 0, s) \in \mathcal{A}$ with $s < p^*$ would give a feasible point with objective value below $p^*$, contradicting that $p^*$ is optimal.
As $\mathcal{A}, \mathcal{B}$ are convex and disjoint, we can separate them. More precisely, there exist $(\tilde{\lambda}, \tilde{\nu}, \mu) \neq 0$ and $\alpha$ such that
$$(u, v, t) \in \mathcal{A} \implies \tilde{\lambda}^\top u + \tilde{\nu}^\top v + \mu t \geq \alpha,$$
and
$$(u, v, t) \in \mathcal{B} \implies \tilde{\lambda}^\top u + \tilde{\nu}^\top v + \mu t \leq \alpha.$$
From the first inequality, note that $\tilde{\lambda} \succeq 0$ and $\mu \geq 0$; otherwise we could send the left-hand side to negative infinity (since $u$ and $t$ may be taken arbitrarily large within $\mathcal{A}$).
We can rewrite the last inequality as
$$\mu s \leq \alpha \quad \text{for all } s < p^*,$$
meaning that
$$\mu p^* \leq \alpha.$$
Thus, for all $(u, v, t) \in \mathcal{A}$,
$$\tilde{\lambda}^\top u + \tilde{\nu}^\top v + \mu t \geq \alpha \geq \mu p^*.$$
For now, assume that $\mu > 0$. We will address the $\mu = 0$ case later.
Dividing both sides by $\mu$ and writing $\lambda = \tilde{\lambda}/\mu$, $\nu = \tilde{\nu}/\mu$, we have that
$$\lambda^\top u + \nu^\top v + t \geq p^*$$
for all $(u, v, t) \in \mathcal{A}$ (recall that $(u, v, t)$ was an arbitrary element of $\mathcal{A}$).
We can then minimize the left-hand side over $\mathcal{A}$ to recover
$$g(\lambda, \nu) \geq p^*.$$
Weak duality, however, grants us
$$g(\lambda, \nu) \leq p^*.$$
Hence, when $\mu > 0$, we have strong duality.
We now consider the $\mu = 0$ case. Then, since $\alpha \geq \mu p^* = 0$,
$$\tilde{\lambda}^\top u + \tilde{\nu}^\top v \geq 0 \quad \text{for all } (u, v, t) \in \mathcal{A}.$$
As $(u, v, t)$ is an arbitrary element of $\mathcal{A}$, we have that for all $x \in \mathcal{D}$,
$$\sum_{i=1}^m \tilde{\lambda}_i f_i(x) + \tilde{\nu}^\top (Ax - b) \geq 0.$$
Then let $\tilde{x}$ be a Slater point. Plugging this into the above inequality, we have that
$$\sum_{i=1}^m \tilde{\lambda}_i f_i(\tilde{x}) \geq 0,$$
but $f_i(\tilde{x}) < 0$ for all $i$ and $\tilde{\lambda} \succeq 0$. Hence, $\tilde{\lambda} = 0$. But, $(\tilde{\lambda}, \tilde{\nu}, \mu) \neq 0$ and $\mu = 0$, thus $\tilde{\nu} \neq 0$.
Returning to the original inequality, we have that for all $x \in \mathcal{D}$,
$$\tilde{\nu}^\top (Ax - b) \geq 0.$$
Note that $\tilde{\nu}^\top (A\tilde{x} - b) = 0$ as $A\tilde{x} = b$, but $\tilde{\nu} \neq 0$.
Let $\tilde{x}$ once again be the Slater point. Then, we have that $\tilde{\nu}^\top (A\tilde{x} - b) = 0$ but $\tilde{x} \in \operatorname{int}(\mathcal{D})$.
As $\tilde{x}$ is in the interior, there must exist some other point $x \in \mathcal{D}$ near $\tilde{x}$ (for instance $x = \tilde{x} - \epsilon A^\top \tilde{\nu}$ for small $\epsilon > 0$) such that
$$\tilde{\nu}^\top (Ax - b) < 0,$$
unless $A^\top \tilde{\nu} = 0$. But, we stated that $\operatorname{rank}(A) = p$, so $A^\top \tilde{\nu} = 0$ would force $\tilde{\nu} = 0$; hence we have a contradiction. Thus, $\mu > 0$ and strong duality holds by the other case.
Optimality Conditions
Certificate of Suboptimality
The dual function provides us with a method to "certify" the suboptimality of a solution. In particular, say that we are given a feasible solution $x$ to some optimization problem and wish to provide a guarantee on how suboptimal it is. We can use a dual feasible solution $(\lambda, \nu)$ as our certificate. By weak duality, we have that
$$g(\lambda, \nu) \leq p^*.$$
Hence,
$$f_0(x) - p^* \leq f_0(x) - g(\lambda, \nu).$$
Thus, our certificate tells us that the suboptimality of $x$ is at most
$$f_0(x) - g(\lambda, \nu),$$
which is typically called the duality gap (of the primal-dual pair).
Complementary Slackness
Proposition (Complementary Slackness). Consider some optimization problem in which strong duality holds. Let $x^*$ be primal optimal and $(\lambda^*, \nu^*)$ be dual optimal. Then for all $i$,
$$\lambda_i^* f_i(x^*) = 0.$$
Proof. Observe that
$$f_0(x^*) = g(\lambda^*, \nu^*) = \inf_{x} L(x, \lambda^*, \nu^*) \leq L(x^*, \lambda^*, \nu^*) = f_0(x^*) + \sum_{i=1}^m \lambda_i^* f_i(x^*) + \sum_{j=1}^p \nu_j^* h_j(x^*) \leq f_0(x^*),$$
meaning that both inequalities are in fact equalities, i.e.
$$\sum_{i=1}^m \lambda_i^* f_i(x^*) = 0.$$
As $\lambda_i^* \geq 0$ and $f_i(x^*) \leq 0$, every term in the sum is non-positive, so each must be zero, and we immediately recover complementary slackness.
One can also see from the equalities that $x^*$ is a minimizer of $L(x, \lambda^*, \nu^*)$ over $x$.
KKT Conditions
Definition. Consider an optimization problem in which $f_0, \dots, f_m, h_1, \dots, h_p$ are differentiable and strong duality holds. The KKT conditions for primal solution $x$ and dual solutions $(\lambda, \nu)$ refer to the following conditions:
- $x$ is primal feasible: $f_i(x) \leq 0$ for all $i$ and $h_j(x) = 0$ for all $j$.
- $(\lambda, \nu)$ is dual feasible: $\lambda \succeq 0$.
- $\lambda_i f_i(x) = 0$ for all $i$ (complementary slackness).
- $\nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) + \sum_{j=1}^p \nu_j \nabla h_j(x) = 0$ (stationarity).
Proposition. Consider an optimization problem with the above conditions. Furthermore, let $x^*$ be an optimal primal solution and $(\lambda^*, \nu^*)$ be an optimal dual solution. Then, these optimal points satisfy the KKT conditions.
Proof. It is clear that $x^*$ is primal feasible and $(\lambda^*, \nu^*)$ is dual feasible. Complementary slackness holds from earlier. Furthermore, see that $x^*$ minimizes $L(x, \lambda^*, \nu^*)$ over $x$ (from the proof of complementary slackness), meaning that
$$\nabla_x L(x^*, \lambda^*, \nu^*) = \nabla f_0(x^*) + \sum_{i=1}^m \lambda_i^* \nabla f_i(x^*) + \sum_{j=1}^p \nu_j^* \nabla h_j(x^*) = 0,$$
as desired.
Proposition. Consider a convex optimization problem with differentiable functions. Furthermore, let $x^*$ and $(\lambda^*, \nu^*)$ be points that satisfy the KKT conditions. Then, $x^*$ is primal optimal, $(\lambda^*, \nu^*)$ is dual optimal, and strong duality holds.
Proof. Clearly, $x^*$ is primal feasible and $(\lambda^*, \nu^*)$ is dual feasible by the KKT conditions. We now show optimality.
As we are considering a convex optimization problem and $\lambda^* \succeq 0$, see that $L(x, \lambda^*, \nu^*)$ is convex in $x$. Hence, if its gradient vanishes at some $x$, that $x$ must be a global minimizer of the Lagrangian. Implying that
$$g(\lambda^*, \nu^*) = \inf_{x} L(x, \lambda^*, \nu^*) = L(x^*, \lambda^*, \nu^*) = f_0(x^*) + \sum_{i=1}^m \lambda_i^* f_i(x^*) + \sum_{j=1}^p \nu_j^* h_j(x^*) = f_0(x^*)$$
by invoking complementary slackness and primal feasibility.
As the point has the dual objective equal to the primal objective, there is zero duality gap, implying that $x^*$ is primal optimal and $(\lambda^*, \nu^*)$ is dual optimal. Furthermore, strong duality holds.
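As a concrete illustration (my own example, assuming numpy), consider projecting a point $a$ onto the halfspace $\{x : \mathbf{1}^\top x \leq 1\}$, i.e. minimizing $\tfrac{1}{2}\|x - a\|_2^2$ subject to $\mathbf{1}^\top x - 1 \leq 0$; the analytic solution can be checked against all four KKT conditions.

```python
# Verify the KKT conditions for projection onto the halfspace {x : 1^T x <= 1}.
import numpy as np

rng = np.random.default_rng(6)
n = 4
a = rng.standard_normal(n) + 1.0   # shifted so the constraint is usually active
ones = np.ones(n)

if ones @ a <= 1:                  # constraint inactive: x* = a, lam* = 0
    x_star, lam_star = a.copy(), 0.0
else:                              # constraint active: stationarity + 1^T x* = 1
    lam_star = (ones @ a - 1) / n
    x_star = a - lam_star * ones

assert ones @ x_star <= 1 + 1e-9                       # primal feasibility
assert lam_star >= 0                                   # dual feasibility
assert abs(lam_star * (ones @ x_star - 1)) < 1e-9      # complementary slackness
assert np.allclose((x_star - a) + lam_star * ones, 0)  # stationarity
print("KKT conditions verified, lam* =", lam_star)
```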
Unconstrained Minimization
We now discuss methods to solve unconstrained minimization problems, i.e. problems of the form
$$\min_{x} f(x),$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is convex and twice differentiable.
As we are in the unconstrained setting, a point $x^*$ is optimal if and only if $\nabla f(x^*) = 0$. There are typically no analytical solutions to this equation. Hence, the general idea of these algorithms is to iteratively solve for such a point, i.e. find a sequence $x^{(k)}$ such that $\nabla f(x^{(k)}) \to 0$ and consequently $f(x^{(k)}) \to p^*$.
Strong Convexity
Definition. We say that a function $f$ is strongly convex on $S$ if there exists $m > 0$ such that
$$\nabla^2 f(x) \succeq m I$$
for all $x \in S$, which means that $\nabla^2 f(x) - m I$ is positive semidefinite. We also say that $f$ is $m$-strongly convex.
Proposition. Let $f$ be an $m$-strongly convex function on $S$. Then, for all $x, y \in S$,
$$f(y) \geq f(x) + \nabla f(x)^\top (y - x) + \frac{m}{2}\|y - x\|_2^2,$$
i.e. we have stronger guarantees than the first-order characterization of convexity.
Proof. By Taylor's theorem and the mean value theorem, there exists some $z$ on the line segment $[x, y]$ such that
$$f(y) = f(x) + \nabla f(x)^\top (y - x) + \frac{1}{2}(y - x)^\top \nabla^2 f(z)(y - x).$$
By strong convexity,
$$\frac{1}{2}(y - x)^\top \nabla^2 f(z)(y - x) \geq \frac{m}{2}\|y - x\|_2^2,$$
as desired.
Proposition. Let $f$ be an $m$-strongly convex function on $S$ and $p^*$ the minimum value of $f$. Then for any $x \in S$,
$$f(x) - p^* \leq \frac{1}{2m}\|\nabla f(x)\|_2^2.$$
In words, one can bound the suboptimality of a point using its gradient.
Proof. We know that for all $x, y \in S$,
$$f(y) \geq f(x) + \nabla f(x)^\top (y - x) + \frac{m}{2}\|y - x\|_2^2.$$
We will find the $y$ that minimizes the right-hand side; the minimized right-hand side then serves as a lower bound for $f(y)$ for any $y$.
The right-hand side is a convex quadratic function in $y$, hence we take the gradient and solve for $y$:
$$\nabla f(x) + m(y - x) = 0 \implies \tilde{y} = x - \frac{1}{m}\nabla f(x).$$
Plugging this in,
$$f(y) \geq f(x) - \frac{1}{2m}\|\nabla f(x)\|_2^2.$$
This holds for any $y$, hence we set $y = x^*$ (the minimizer) and see that
$$p^* \geq f(x) - \frac{1}{2m}\|\nabla f(x)\|_2^2,$$
as desired.
Corollary. Let $f$ be an $m$-strongly convex function on $S$ and $p^*$ the minimum value of $f$. Then for any $x \in S$, if
$$\|\nabla f(x)\|_2 \leq \sqrt{2 m \varepsilon},$$
we have that
$$f(x) - p^* \leq \varepsilon.$$
We can also bound a point’s distance from the minimizer using the gradient.
Proposition. Let $f$ be an $m$-strongly convex function on $S$ and $x^*$ the minimizer of $f$. Then for any $x \in S$,
$$\|x - x^*\|_2 \leq \frac{2}{m}\|\nabla f(x)\|_2.$$
Proof. By the first-order characterization of strong convexity,
$$p^* = f(x^*) \geq f(x) + \nabla f(x)^\top (x^* - x) + \frac{m}{2}\|x^* - x\|_2^2 \geq f(x) - \|\nabla f(x)\|_2 \|x^* - x\|_2 + \frac{m}{2}\|x^* - x\|_2^2$$
via Cauchy-Schwarz.
As $f(x) \geq p^*$,
$$p^* \geq p^* - \|\nabla f(x)\|_2 \|x^* - x\|_2 + \frac{m}{2}\|x^* - x\|_2^2,$$
hence
$$\frac{m}{2}\|x^* - x\|_2^2 \leq \|\nabla f(x)\|_2 \|x^* - x\|_2,$$
meaning that
$$\|x^* - x\|_2 \leq \frac{2}{m}\|\nabla f(x)\|_2,$$
as desired.
Smoothness
Strong convexity imposes a lower bound on the Hessian of a function. We can similarly impose an upper bound.
Definition. We say that a function $f$ is $M$-smooth on $S$ if
$$\nabla^2 f(x) \preceq M I$$
for all $x \in S$.
Proposition. Let $f$ be an $M$-smooth function. Then for all $x, y \in S$,
$$f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{M}{2}\|y - x\|_2^2.$$
Proof. Once again by Taylor's theorem and the mean value theorem, we have that
$$f(y) = f(x) + \nabla f(x)^\top (y - x) + \frac{1}{2}(y - x)^\top \nabla^2 f(z)(y - x)$$
for some $z$ on the line segment $[x, y]$.
By smoothness, we then have that
$$\frac{1}{2}(y - x)^\top \nabla^2 f(z)(y - x) \leq \frac{M}{2}\|y - x\|_2^2.$$
Proposition. Let $f$ be an $M$-smooth function with optimal value $p^*$. Then for any $x \in S$,
$$p^* \leq f(x) - \frac{1}{2M}\|\nabla f(x)\|_2^2.$$
Proof. We will employ a similar strategy as in the strong convexity case, with some changes. We know that for all $x, y \in S$,
$$f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{M}{2}\|y - x\|_2^2.$$
We first find the $y$ which minimizes the right-hand side. From before, we found that
$$\tilde{y} = x - \frac{1}{M}\nabla f(x).$$
Plugging this in,
$$p^* = \inf_y f(y) \leq f(\tilde{y}) \leq f(x) - \frac{1}{2M}\|\nabla f(x)\|_2^2.$$
Conditioning
Definition (Condition Number). Consider an unconstrained optimization problem with an objective that is $m$-strongly convex and $M$-smooth. We call $\kappa = M / m$ the condition number of the problem.
Definition (Width). We define the width of a set $C \subseteq \mathbb{R}^n$ in a direction $q$ with unit norm as
$$W(C, q) = \sup_{z \in C} q^\top z - \inf_{z \in C} q^\top z.$$
Definition. We define the maximum width of a set $C$ as
$$W_{\max}(C) = \sup_{\|q\|_2 = 1} W(C, q).$$
We define the minimum width of a set $C$ as
$$W_{\min}(C) = \inf_{\|q\|_2 = 1} W(C, q).$$
Definition. The condition number of a set $C$ is
$$\operatorname{cond}(C) = \frac{W_{\max}(C)^2}{W_{\min}(C)^2}.$$
Definition. The $\alpha$-sublevel set of $f$ is the set
$$C_\alpha = \{x \in \operatorname{dom}(f) : f(x) \leq \alpha\}.$$
Proposition. Consider a function $f$ that is $m$-strongly convex and $M$-smooth, with minimizer $x^*$ and minimum value $p^*$. Then, for any $\alpha > p^*$,
$$\operatorname{cond}(C_\alpha) \leq \frac{M}{m}.$$
Proof. Observe that by the first-order characterizations of strong convexity and smoothness (applied at $x^*$, where the gradient vanishes),
$$p^* + \frac{m}{2}\|x - x^*\|_2^2 \leq f(x) \leq p^* + \frac{M}{2}\|x - x^*\|_2^2.$$
Hence, defining
$$B_{\mathrm{inner}} = B\left(x^*, \sqrt{2(\alpha - p^*)/M}\right)$$
and
$$B_{\mathrm{outer}} = B\left(x^*, \sqrt{2(\alpha - p^*)/m}\right),$$
we have that
$$B_{\mathrm{inner}} \subseteq C_\alpha \subseteq B_{\mathrm{outer}}.$$
Dividing the squared radii of the balls, we have an upper bound on the condition number of $C_\alpha$:
$$\operatorname{cond}(C_\alpha) \leq \frac{2(\alpha - p^*)/m}{2(\alpha - p^*)/M} = \frac{M}{m},$$
as desired.
Descent Methods
Descent methods are algorithms that solve unconstrained minimization problems by iteratively computing a direction in which to perturb the current solution, along with the scale of said direction. Formally, on iteration $k$, they update the current solution via
$$x^{(k+1)} = x^{(k)} + t^{(k)} \Delta x^{(k)},$$
where $t^{(k)} > 0$ is the step size and $\Delta x^{(k)}$ is the descent direction. $t^{(k)}$ and $\Delta x^{(k)}$ are chosen such that $f(x^{(k+1)}) < f(x^{(k)})$, i.e. we gradually approach the optimal solution.
Note that by the first-order characterization of convexity, we require that
$$\nabla f(x^{(k)})^\top \Delta x^{(k)} < 0,$$
otherwise there is no hope of finding a more optimal solution.
How the direction $\Delta x^{(k)}$ is computed depends on the specific algorithm. Several methods exist to compute the step size $t^{(k)}$, some of which are covered below.
Exact Line Search
Simply set $t$ such that the objective along the ray is minimized:
$$t = \operatorname*{arg\,min}_{s \geq 0} f(x + s \Delta x).$$
While it is true that we must solve another optimization problem, this problem is one-dimensional and in practice is often easy to solve.
Backtracking
A simplified backtracking algorithm to compute $t$ would be to initially set $t$ to $1$, then, while $f(x + t \Delta x) > f(x) + \alpha t \nabla f(x)^\top \Delta x$ for a fixed parameter $\alpha \in (0, 1/2)$ (the sufficient-decrease condition), halve $t$.
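A minimal implementation sketch of this backtracking rule (my own code; the parameters alpha and beta below are assumptions, with beta = 0.5 corresponding to halving):

```python
# Backtracking line search with the sufficient-decrease condition.
import numpy as np

def backtracking(f, grad_f, x, dx, alpha=0.3, beta=0.5):
    """Return a step size t for the descent direction dx at the point x."""
    t = 1.0
    slope = grad_f(x) @ dx                   # negative for a descent direction
    while f(x + t * dx) > f(x) + alpha * t * slope:
        t *= beta                            # shrink until sufficient decrease holds
    return t

# Tiny usage example on f(x) = ||x||^2 with the negative gradient as direction.
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x0 = np.array([3.0, -4.0])
print(backtracking(f, grad_f, x0, -grad_f(x0)))
```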
Gradient Descent
Gradient descent provides one natural option to choose the descent direction: the negative of the gradient. Until the stopping criterion is satisfied (e.g. the current gradient is small enough), we update the current solution via
$$x^{(k+1)} = x^{(k)} - t^{(k)} \nabla f(x^{(k)}),$$
where $t^{(k)}$ is computed by either exact line search or backtracking.
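A sketch of the full gradient descent loop (my own implementation, not course-provided code), with the exact line search carried out numerically via scipy.optimize.minimize_scalar:

```python
# Gradient descent with a numerical exact line search on each iteration.
import numpy as np
from scipy.optimize import minimize_scalar

def gradient_descent(f, grad_f, x0, tol=1e-6, max_iters=500):
    x = x0.astype(float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:         # stopping criterion: small gradient
            break
        dx = -g                              # descent direction: negative gradient
        # exact line search: minimize phi(t) = f(x + t * dx) over a bounded interval
        phi = lambda t: f(x + t * dx)
        t = minimize_scalar(phi, bounds=(0.0, 10.0), method="bounded").x
        x = x + t * dx
    return x

# Usage on a strongly convex quadratic f(x) = (1/2) x^T Q x - b^T x.
Q = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ Q @ x - b @ x
grad_f = lambda x: Q @ x - b
x_star = gradient_descent(f, grad_f, np.zeros(2))
print(x_star, np.linalg.solve(Q, b))         # the two should agree
```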
Proposition. Consider some unconstrained minimization problem over $f$, where $f$ is $m$-strongly convex and $M$-smooth. If we were to perform gradient descent with exact line search beginning at $x^{(0)}$, we would reach an $\varepsilon$-suboptimal solution by step $k$ whenever
$$k \geq \frac{\log\left((f(x^{(0)}) - p^*)/\varepsilon\right)}{\log(1/c)},$$
where $c = 1 - m/M$; note that $c < 1$.
Proof. By smoothness and the first-order characterization, we have that
$$f\left(x^{(k)} - t \nabla f(x^{(k)})\right) \leq f(x^{(k)}) - t \|\nabla f(x^{(k)})\|_2^2 + \frac{M t^2}{2}\|\nabla f(x^{(k)})\|_2^2.$$
As we use exact line search, we can improve our bound by finding the $t$ that minimizes the right-hand side, which is a convex quadratic in $t$. We take the derivative with respect to $t$ and set it to 0:
$$-\|\nabla f(x^{(k)})\|_2^2 + M t \|\nabla f(x^{(k)})\|_2^2 = 0 \implies t = \frac{1}{M}.$$
Substituting this in,
$$f(x^{(k+1)}) \leq f(x^{(k)}) - \frac{1}{2M}\|\nabla f(x^{(k)})\|_2^2.$$
Hence,
$$f(x^{(k+1)}) - p^* \leq f(x^{(k)}) - p^* - \frac{1}{2M}\|\nabla f(x^{(k)})\|_2^2,$$
i.e. we always improve our optimality gap by at least
$$\frac{1}{2M}\|\nabla f(x^{(k)})\|_2^2.$$
Recall that with strong convexity, the gradient provides us with a bound on our suboptimality:
$$\|\nabla f(x^{(k)})\|_2^2 \geq 2m\left(f(x^{(k)}) - p^*\right).$$
Hence, we can restate our bound as
$$f(x^{(k+1)}) - p^* \leq \left(1 - \frac{m}{M}\right)\left(f(x^{(k)}) - p^*\right).$$
By induction,
$$f(x^{(k)}) - p^* \leq \left(1 - \frac{m}{M}\right)^k \left(f(x^{(0)}) - p^*\right) = c^k \left(f(x^{(0)}) - p^*\right).$$
To have that $f(x^{(k)}) - p^* \leq \varepsilon$, it is sufficient to have
$$c^k \left(f(x^{(0)}) - p^*\right) \leq \varepsilon,$$
meaning that it is sufficient for
$$k \geq \frac{\log\left((f(x^{(0)}) - p^*)/\varepsilon\right)}{\log(1/c)},$$
as desired.
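The rate proven above can be checked numerically (my own example, assuming numpy): for a quadratic $f(x) = \tfrac{1}{2} x^\top Q x$, exact line search has a closed form and the bound $f(x^{(k)}) - p^* \leq c^k (f(x^{(0)}) - p^*)$ should hold at every iteration.

```python
# Verify the linear convergence bound for gradient descent with exact line search
# on f(x) = (1/2) x^T Q x, where m and M are the extreme eigenvalues of Q.
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((5, 5))
Q = A @ A.T + np.eye(5)
eigs = np.linalg.eigvalsh(Q)
m, M = eigs.min(), eigs.max()
c = 1 - m / M

f = lambda x: 0.5 * x @ Q @ x                # p* = 0, minimizer x* = 0
x = rng.standard_normal(5)
gap0 = f(x)

for k in range(1, 51):
    g = Q @ x
    t = (g @ g) / (g @ Q @ g)                # exact line search step for a quadratic
    x = x - t * g
    assert f(x) <= c**k * gap0 + 1e-12       # the proven bound holds at every step
print("final suboptimality:", f(x), " bound:", c**50 * gap0)
```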