Introduction
Below are my notes for the course 21-690: Methods of Optimization taught in the Spring 2025 semester by Professor Nicholas Boffi at Carnegie Mellon University.
Convex Sets
Affine and Convex Sets
Definition (Affine). A set $C \subseteq \mathbb{R}^n$ is said to be affine if for all $x, y \in C$, we have that $\theta x + (1 - \theta) y \in C$ for all $\theta \in \mathbb{R}$.
Geometrically, affine sets are sets in which the line formed by any two points in the set is entirely contained in the set as well.
Definition (Affine Combination). An affine combination of $x_1, \dots, x_k \in \mathbb{R}^n$ is a linear combination
$$\theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_k x_k,$$
where $\theta_i \in \mathbb{R}$ for all $i$ and $\sum_{i=1}^k \theta_i = 1$.
Proposition. Let $C$ be an affine set. Then, any affine combination of points of $C$ is contained in $C$.
Proof. By induction on the number of points $k$.
Definition (Affine Hull). The affine hull of a set $C \subseteq \mathbb{R}^n$ is the set
$$\operatorname{aff}(C) = \left\{ \theta_1 x_1 + \cdots + \theta_k x_k : x_i \in C, \ \sum_{i=1}^k \theta_i = 1 \right\}.$$
Exercise. Prove that the affine hull of $C$ is the smallest affine set containing $C$.
Affine sets (other than single points) are unbounded, as lines are unbounded. We can obtain bounded sets by considering only line segments rather than entire lines. Herein lies the idea behind convexity.
Definition (Convex). A set $C \subseteq \mathbb{R}^n$ is said to be convex if for all $x, y \in C$, we have that $\theta x + (1 - \theta) y \in C$ for all $\theta \in [0, 1]$.
Geometrically, convex sets are sets in which the line segment formed by any two points in the set is entirely contained in the set as well. Note that we only consider line segments by restricting our coefficients to $[0, 1]$.
We can similarly extend the idea of affine combinations and affine hulls to convexity.
Definition (Convex Combination). A convex combination of $x_1, \dots, x_k \in \mathbb{R}^n$ is a linear combination
$$\theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_k x_k,$$
where $\theta_i \geq 0$ for all $i$ and $\sum_{i=1}^k \theta_i = 1$.
Proposition. Let $C$ be a convex set. Then, any convex combination of points of $C$ is contained in $C$.
Proof. By induction on the number of points $k$.
Definition (Convex Hull). The convex hull of a set $C \subseteq \mathbb{R}^n$ is the set
$$\operatorname{conv}(C) = \left\{ \theta_1 x_1 + \cdots + \theta_k x_k : x_i \in C, \ \theta_i \geq 0, \ \sum_{i=1}^k \theta_i = 1 \right\}.$$
Exercise. Prove that the convex hull of $C$ is the smallest convex set containing $C$.
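The following is a small numerical sketch (my own illustration, not part of the original notes) of the fact that convex combinations of a finite point set stay inside its convex hull; it assumes numpy and scipy are available.

```python
# Random convex combinations of a finite point set always land inside the convex
# hull of that set. Hull membership is checked with scipy.spatial.Delaunay, whose
# find_simplex returns -1 for points outside the triangulated hull.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
points = rng.standard_normal((20, 2))        # 20 points in R^2
hull = Delaunay(points)                      # triangulation of conv(points)

for _ in range(1000):
    theta = rng.random(len(points))
    theta /= theta.sum()                     # theta_i >= 0, sum theta_i = 1
    x = theta @ points                       # convex combination of the points
    assert hull.find_simplex(x[None, :])[0] >= 0   # x lies in the convex hull
print("all sampled convex combinations lie in conv(points)")
```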
Examples
Cones
Definition (Cone). A set $C \subseteq \mathbb{R}^n$ is a cone if for all $x \in C$, $\theta x \in C$ for all $\theta \geq 0$.
The traditional image of a “cone” is itself a cone, if extended to infinity.
Not all cones are convex: the union of two different lines is a cone, but not convex.
Hyperplanes and Halfspaces
Definition (Hyperplane). Let $a \in \mathbb{R}^n \setminus \{0\}$ and $b \in \mathbb{R}$. The set
$$\{x \in \mathbb{R}^n : a^\top x = b\}$$
is a hyperplane.
Geometrically, hyperplanes are lines in $\mathbb{R}^2$ and planes in $\mathbb{R}^3$.
Proposition. All hyperplanes are affine.
Proof. Consider the hyperplane
$$H = \{x \in \mathbb{R}^n : a^\top x = b\}.$$
Consider $x, y \in H$ and $\theta \in \mathbb{R}$. Then,
$$a^\top(\theta x + (1 - \theta) y) = \theta a^\top x + (1 - \theta) a^\top y = \theta b + (1 - \theta) b = b.$$
Hence, $\theta x + (1 - \theta) y \in H$ and so $H$ is affine.
Definition (Halfspace). Let $a \in \mathbb{R}^n \setminus \{0\}$ and $b \in \mathbb{R}$. The set
$$\{x \in \mathbb{R}^n : a^\top x \leq b\}$$
is a halfspace.
Geometrically, halfspaces are one side of a hyperplane.
Proposition. All halfspaces are convex.
Balls and Ellipsoids
Definition (Ball). A ball is a set
$$B(x_c, r) = \{x \in \mathbb{R}^n : \|x - x_c\|_2 \leq r\},$$
where $x_c \in \mathbb{R}^n$ is the center and $r > 0$ is the radius.
Proposition. All balls are convex.
Proof. Consider some ball $B(x_c, r)$. Take $x, y \in B(x_c, r)$ and $\theta \in [0, 1]$. Let $z = \theta x + (1 - \theta) y$. We wish to show that $z \in B(x_c, r)$. Observe that by the triangle inequality,
$$\|z - x_c\|_2 = \|\theta (x - x_c) + (1 - \theta)(y - x_c)\|_2 \leq \theta \|x - x_c\|_2 + (1 - \theta)\|y - x_c\|_2 \leq \theta r + (1 - \theta) r = r,$$
hence $z \in B(x_c, r)$.
Definition (Ellipsoid). An ellipsoid is a set
$$\mathcal{E} = \{x \in \mathbb{R}^n : (x - x_c)^\top P^{-1}(x - x_c) \leq 1\},$$
where $x_c \in \mathbb{R}^n$ and $P \in \mathbb{S}^n_{++}$ (i.e. $P$ is symmetric positive definite).
Proposition. All ellipsoids are convex.
Proof. Consider some ellipsoid
$$\mathcal{E} = \{x \in \mathbb{R}^n : (x - x_c)^\top P^{-1}(x - x_c) \leq 1\}.$$
Take $x, y \in \mathcal{E}$ and $\theta \in [0, 1]$. Let $z = \theta x + (1 - \theta) y$. We wish to show that $z \in \mathcal{E}$. Observe that since $P^{-1} \succ 0$, the map $u \mapsto \|u\|_{P^{-1}} = \sqrt{u^\top P^{-1} u}$ is a norm, so by the triangle inequality,
$$\|z - x_c\|_{P^{-1}} = \|\theta(x - x_c) + (1 - \theta)(y - x_c)\|_{P^{-1}} \leq \theta \|x - x_c\|_{P^{-1}} + (1 - \theta)\|y - x_c\|_{P^{-1}} \leq \theta + (1 - \theta) = 1.$$
Hence, $z \in \mathcal{E}$, and so all ellipsoids are convex.
Polyhedra
Definition (Polyhedron). A polyhedron is a set
$$P = \{x \in \mathbb{R}^n : a_i^\top x \leq b_i, \ i = 1, \dots, m, \ c_j^\top x = d_j, \ j = 1, \dots, p\}.$$
Polyhedra are typically presented with the notation
$$P = \{x \in \mathbb{R}^n : Ax \preceq b, \ Cx = d\},$$
where $A \in \mathbb{R}^{m \times n}$, $C \in \mathbb{R}^{p \times n}$, and $\preceq$ denotes componentwise inequality.
Geometrically, polyhedra are the intersection of a finite number of hyperplanes and halfspaces.
Operations that Preserve Convexity
Intersection
Proposition. Intersection preserves convexity.
Proof. Let $\mathcal{A}$ be an index set such that for all $\alpha \in \mathcal{A}$, $C_\alpha$ is a convex set. Let $C = \bigcap_{\alpha \in \mathcal{A}} C_\alpha$. We claim that $C$ is convex.
Consider $x, y \in C$ and $\theta \in [0, 1]$. Set $z = \theta x + (1 - \theta) y$. It suffices to show that $z \in C$.
By definition of $C$, $x, y \in C_\alpha$ for all $\alpha \in \mathcal{A}$. By convexity of $C_\alpha$, we have that $z \in C_\alpha$ for all $\alpha \in \mathcal{A}$. Hence, $z \in C$, as desired.
Exercise. The positive semidefinite cone, $\mathbb{S}^n_+ = \{X \in \mathbb{S}^n : X \succeq 0\}$, is convex.
Solution. Observe that
$$\mathbb{S}^n_+ = \bigcap_{z \in \mathbb{R}^n} \{X \in \mathbb{S}^n : z^\top X z \geq 0\}.$$
Every set on the right-hand side is convex. The intersection of convex sets is convex, hence the positive semidefinite cone is convex.
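As a quick numerical sanity check (my own illustration, assuming numpy is available), a convex combination of two randomly generated positive semidefinite matrices should again be positive semidefinite:

```python
# Convex combinations of PSD matrices stay in the PSD cone: all eigenvalues of
# theta*X + (1-theta)*Y remain nonnegative (up to floating-point error).
import numpy as np

rng = np.random.default_rng(1)

def random_psd(n):
    A = rng.standard_normal((n, n))
    return A @ A.T                           # A A^T is always PSD

X, Y = random_psd(5), random_psd(5)
for theta in np.linspace(0.0, 1.0, 11):
    Z = theta * X + (1 - theta) * Y
    assert np.linalg.eigvalsh(Z).min() >= -1e-10
print("theta*X + (1-theta)*Y stayed in the PSD cone for all sampled theta")
```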
Affine Functions
Definition (Affine Function). A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is affine if it is of the form $f(x) = Ax + b$ with $A \in \mathbb{R}^{m \times n}$ and $b \in \mathbb{R}^m$.
Proposition. The image of a convex set under an affine function is convex.
Proof. Let $C \subseteq \mathbb{R}^n$ be a convex set and $f : \mathbb{R}^n \to \mathbb{R}^m$ an affine function via $f(x) = Ax + b$. We wish to show that $f(C) = \{f(x) : x \in C\}$ is convex.
Consider $u, v \in f(C)$ and $\theta \in [0, 1]$. Define $w = \theta u + (1 - \theta) v$. It suffices to show that $w \in f(C)$.
As $u, v \in f(C)$, there exist $x, y \in C$ such that $f(x) = u$ and $f(y) = v$.
Let $z = \theta x + (1 - \theta) y$. As $C$ is convex, we have that $z \in C$.
Thus,
$$f(z) = A(\theta x + (1 - \theta) y) + b = \theta (Ax + b) + (1 - \theta)(Ay + b) = \theta u + (1 - \theta) v = w,$$
as desired.
The above proposition implies that scaling, translation, and projection preserve convexity.
Proposition. The pre-image of a convex set under an affine function is convex.
Proof. Let $C \subseteq \mathbb{R}^m$ be a convex set and $f : \mathbb{R}^n \to \mathbb{R}^m$ an affine function via $f(x) = Ax + b$. We wish to show that $f^{-1}(C) = \{x \in \mathbb{R}^n : f(x) \in C\}$ is convex.
Consider $x, y \in f^{-1}(C)$ and $\theta \in [0, 1]$. It suffices to show that $\theta x + (1 - \theta) y \in f^{-1}(C)$, for which in turn we must show that $f(\theta x + (1 - \theta) y) \in C$.
Observe that
$$f(\theta x + (1 - \theta) y) = \theta (Ax + b) + (1 - \theta)(Ay + b) = \theta f(x) + (1 - \theta) f(y) \in C$$
by convexity of $C$.
Perspective Function
Definition (Perspective Function). The perspective function $P : \mathbb{R}^n \times \mathbb{R}_{++} \to \mathbb{R}^n$ is defined via $P(x, t) = x / t$.
Proposition. The image of a convex set under the perspective function is convex.
Proof. Let $C \subseteq \mathbb{R}^n \times \mathbb{R}_{++}$ be a convex set. We wish to show that $P(C)$ is convex.
Consider $u, v \in P(C)$. We wish to show that for any $\theta \in [0, 1]$, $\theta u + (1 - \theta) v \in P(C)$. The proof is difficult if done directly, hence we will take a slightly different approach.
Fix $\theta \in [0, 1]$. Since $u, v \in P(C)$, there exist $(x, t), (y, s) \in C$ such that $P(x, t) = x/t = u$ and $P(y, s) = y/s = v$.
By convexity of $C$, we have that
$$\mu (x, t) + (1 - \mu)(y, s) = \left(\mu x + (1 - \mu) y, \ \mu t + (1 - \mu) s\right) \in C \quad \text{for all } \mu \in [0, 1].$$
Hence,
$$P\left(\mu x + (1 - \mu) y, \ \mu t + (1 - \mu) s\right) = \frac{\mu x + (1 - \mu) y}{\mu t + (1 - \mu) s} \in P(C).$$
Let
$$\mu = \frac{\theta s}{\theta s + (1 - \theta) t} \in [0, 1].$$
Through substitution, we have that
$$\frac{\mu x + (1 - \mu) y}{\mu t + (1 - \mu) s} = \theta \frac{x}{t} + (1 - \theta) \frac{y}{s} = \theta u + (1 - \theta) v.$$
As $\theta$ varies over $[0, 1]$, $\mu$ varies over $[0, 1]$, so every convex combination $\theta u + (1 - \theta) v$ lies in $P(C)$, implying that $P(C)$ is convex.
Proposition. The pre-image of a convex set under the perspective function is convex.
Proof. Let $C \subseteq \mathbb{R}^n$ be a convex set. We wish to show that $P^{-1}(C) = \{(x, t) \in \mathbb{R}^n \times \mathbb{R}_{++} : x/t \in C\}$ is convex.
Consider $(x, t), (y, s) \in P^{-1}(C)$ and $\theta \in [0, 1]$. It suffices to show that $\theta(x, t) + (1 - \theta)(y, s) \in P^{-1}(C)$. Thus, we wish to show that $\frac{\theta x + (1 - \theta) y}{\theta t + (1 - \theta) s} \in C$.
Observe that
$$\frac{\theta x + (1 - \theta) y}{\theta t + (1 - \theta) s} = \mu \frac{x}{t} + (1 - \mu) \frac{y}{s}, \quad \text{where } \mu = \frac{\theta t}{\theta t + (1 - \theta) s}.$$
Then, note that $\mu \in [0, 1]$, and
$$1 - \mu = \frac{(1 - \theta) s}{\theta t + (1 - \theta) s}.$$
Hence, by convexity of $C$,
$$\mu \frac{x}{t} + (1 - \mu) \frac{y}{s} \in C,$$
implying that $\theta(x, t) + (1 - \theta)(y, s) \in P^{-1}(C)$.
Separating Hyperplanes
Theorem (Separating Hyperplane). Two non-empty and disjoint convex sets can be separated by a hyperplane.
Formally, if there are non-empty and disjoint convex sets $C, D \subseteq \mathbb{R}^n$, there exist $a \in \mathbb{R}^n \setminus \{0\}$ and $b \in \mathbb{R}$ such that
$$a^\top x \leq b \ \text{ for all } x \in C \quad \text{and} \quad a^\top x \geq b \ \text{ for all } x \in D.$$
Geometrically, there exists a hyperplane $\{x : a^\top x = b\}$ between $C$ and $D$.
Proof. We only prove the theorem for the special case in which $C, D$ are closed and bounded, hence compact.
Note that $C \times D$ is then compact. We may define the distance function $f : C \times D \to \mathbb{R}$ via $f(c, d) = \|c - d\|_2$. By compactness, there exist $c^* \in C$, $d^* \in D$ such that $f$ is minimized.
Define $a = d^* - c^*$ and $b = \frac{\|d^*\|_2^2 - \|c^*\|_2^2}{2}$. We claim that the hyperplane $\{x : a^\top x = b\}$ separates $C$ and $D$.
Assume for the sake of contradiction not. Then, without loss of generality, there exists $u \in D$ such that $a^\top u - b < 0$, implying that
$$a^\top u - b = (d^* - c^*)^\top \left(u - \tfrac{1}{2}(d^* + c^*)\right) = (d^* - c^*)^\top (u - d^*) + \tfrac{1}{2}\|d^* - c^*\|_2^2 < 0.$$
Clearly, $\tfrac{1}{2}\|d^* - c^*\|_2^2 \geq 0$, meaning that $(d^* - c^*)^\top (u - d^*) < 0$.
Intuitively, we can move $d^*$ in the direction $u - d^*$ and decrease the distance from $c^*$.
Formally, we can take the derivative
$$\frac{d}{dt}\left\|d^* + t(u - d^*) - c^*\right\|_2^2 \bigg|_{t = 0} = 2 (d^* - c^*)^\top (u - d^*) < 0$$
per the above.
Thus, for some small $t > 0$, we have that
$$\|d^* + t(u - d^*) - c^*\|_2 < \|d^* - c^*\|_2.$$
By convexity of $D$, $d^* + t(u - d^*) \in D$, hence the above is a contradiction by definition of $(c^*, d^*)$ as minimizers of the distance.
Definition. We say that two convex sets $C, D$ are strictly separated if there exist $a \in \mathbb{R}^n \setminus \{0\}$, $b \in \mathbb{R}$ such that
$$a^\top x < b \ \text{ for all } x \in C \quad \text{and} \quad a^\top x > b \ \text{ for all } x \in D.$$
Example. Let $C$ be a closed convex set and $x_0$ a point not in $C$. Then, $C$ and $\{x_0\}$ are strictly separated.
To see why, note that as $C$ is closed, its complement is open. Hence, we can find $\varepsilon > 0$ such that $B(x_0, \varepsilon) \cap C = \emptyset$. Clearly, $B(x_0, \varepsilon)$ is convex. By the separating hyperplane theorem, we may find $a \neq 0$, $b$ such that $a^\top x \leq b$ for all $x \in C$, and $a^\top x \geq b$ for all $x \in B(x_0, \varepsilon)$. In particular, the last statement means that for any $u$ where $\|u\|_2 \leq 1$,
$$a^\top (x_0 + \varepsilon u) \geq b.$$
The left-hand side is minimized when $u = -a/\|a\|_2$, hence
$$a^\top x_0 - \varepsilon \|a\|_2 \geq b, \quad \text{i.e.} \quad a^\top x_0 \geq b + \varepsilon \|a\|_2 > b.$$
Thus, the hyperplane $\{x : a^\top x = b + \tfrac{\varepsilon}{2}\|a\|_2\}$ strictly separates $C$ and $\{x_0\}$.
We can use this result to show the following.
Proposition. Let $C$ be a closed convex set, and $\mathcal{H}$ be the set of all halfspaces that contain $C$ entirely. Then,
$$C = \bigcap_{H \in \mathcal{H}} H.$$
Proof.
($\subseteq$) Trivial by definition of $\mathcal{H}$.
($\supseteq$) Take $x \in \bigcap_{H \in \mathcal{H}} H$. Assume for the sake of contradiction that $x \notin C$. Then, we may find a strictly separating hyperplane between $\{x\}$ and $C$; the halfspace it bounds that contains $C$ belongs to $\mathcal{H}$ but excludes $x$. This implies that $x \notin \bigcap_{H \in \mathcal{H}} H$, a contradiction.
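The sketch below (my own illustration, with $C$ taken to be the closed unit ball) constructs a strictly separating hyperplane explicitly from the projection of the outside point onto $C$; it assumes numpy is available.

```python
# The hyperplane through the midpoint of x0 and its projection onto C strictly
# separates x0 from C, here with C = closed unit ball at the origin.
import numpy as np

rng = np.random.default_rng(2)
x0 = rng.standard_normal(3)
x0 *= 2.0 / np.linalg.norm(x0)       # a point with ||x0|| = 2 > 1, so x0 is not in C
proj = x0 / np.linalg.norm(x0)       # projection of x0 onto the unit ball C

a = x0 - proj                        # normal direction of the separating hyperplane
b = a @ (x0 + proj) / 2              # offset: hyperplane through the midpoint

sup_over_C = np.linalg.norm(a)       # sup_{x in C} a^T x = ||a||_2 for the unit ball
assert sup_over_C < b < a @ x0       # a^T x < b for all x in C, and a^T x0 > b
print("strict separation verified:", sup_over_C, "<", b, "<", a @ x0)
```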
Supporting Hyperplanes
Definition. For a convex set $C$ and a point $x_0$ on its boundary, we say that a hyperplane
$$\{x : a^\top x = a^\top x_0\}, \quad a \neq 0,$$
is a supporting hyperplane at $x_0$ if $a^\top x \leq a^\top x_0$ for all $x \in C$. Geometrically, the hyperplane is tangent to $C$ at a point on the boundary of $C$, and one of its halfspaces contains the entirety of $C$.
Convex Functions
Basic Properties and Definitions
Definition. A function $f : \mathbb{R}^n \to \mathbb{R}$ (with convex domain) is convex if for all $x, y \in \operatorname{dom}(f)$, $\theta \in [0, 1]$, we have that
$$f(\theta x + (1 - \theta) y) \leq \theta f(x) + (1 - \theta) f(y).$$
If the inequality is strict, i.e.
$$f(\theta x + (1 - \theta) y) < \theta f(x) + (1 - \theta) f(y)$$
for $x \neq y$ and $\theta \in (0, 1)$, then the function is said to be strictly convex.
Intuitively, convex functions are those in which the epigraph of the function (the area above the graph of the function) is a convex set.
Remark. A function is convex if and only if it is convex when restricted to any line in its domain, i.e. for all $x \in \operatorname{dom}(f)$ and $v \in \mathbb{R}^n$,
$$g(t) = f(x + t v)$$
is convex (on $\{t : x + tv \in \operatorname{dom}(f)\}$).
Theorem (First Order Characterization). A differentiable function $f$ (i.e. $f \in C^1$) is convex if and only if
$$f(y) \geq f(x) + \nabla f(x)^\top (y - x)$$
for all $x, y \in \operatorname{dom}(f)$.
Proof.
Case: $n = 1$.
First assume that $f$ is convex. Then for any $x, y \in \operatorname{dom}(f)$ and $t \in (0, 1]$, we have that
$$f(x + t(y - x)) = f((1 - t) x + t y) \leq (1 - t) f(x) + t f(y),$$
so, rearranging,
$$f(y) \geq f(x) + \frac{f(x + t(y - x)) - f(x)}{t}.$$
Taking $t \to 0^+$, the difference quotient tends to $f'(x)(y - x)$, hence
$$f(y) \geq f(x) + f'(x)(y - x),$$
as desired.
Now instead assume that for all $x, y \in \operatorname{dom}(f)$,
$$f(y) \geq f(x) + f'(x)(y - x).$$
We wish to show that $f$ is convex. Fix $x, y \in \operatorname{dom}(f)$ and $\theta \in [0, 1]$. Let $z = \theta x + (1 - \theta) y$. Then,
$$f(x) \geq f(z) + f'(z)(x - z).$$
Similarly, we can see that
$$f(y) \geq f(z) + f'(z)(y - z).$$
Combining these (multiply the first by $\theta$, the second by $1 - \theta$, and add),
$$\theta f(x) + (1 - \theta) f(y) \geq f(z) + f'(z)\left(\theta x + (1 - \theta) y - z\right) = f(z) = f(\theta x + (1 - \theta) y),$$
as desired.
Case: $n > 1$.
We can study the one-dimensional function which varies in the direction of $y - x$, i.e. $g(t) = f(x + t(y - x))$. The result then follows from the $n = 1$ case.
Remark. The above inequality is strict for all $x \neq y$ if and only if the function is strictly convex.
Corollary. Let $f$ be a differentiable convex function, and $x^*$ a point such that $\nabla f(x^*) = 0$. Then, $x^*$ is a global minimizer of $f$.
Proof. Consider some $y \in \operatorname{dom}(f)$. Per the first order characterization of convex functions,
$$f(y) \geq f(x^*) + \nabla f(x^*)^\top (y - x^*) = f(x^*).$$
Theorem (Second Order Characterization). A twice differentiable function $f$ (i.e. $f \in C^2$) is convex if and only if
$$\nabla^2 f(x) \succeq 0$$
for all $x \in \operatorname{dom}(f)$.
Proof.
First assume that $f$ is convex. Assume for the sake of contradiction that there exists some $x \in \operatorname{dom}(f)$ such that $\nabla^2 f(x)$ is not positive semi-definite. By definition, there exists some eigenvector $v$ of $\nabla^2 f(x)$ such that $v^\top \nabla^2 f(x) v = \lambda \|v\|_2^2 < 0$, where $\lambda < 0$ is the corresponding eigenvalue.
Define $g(t) = f(x + t v)$. Note that
$$g''(t) = v^\top \nabla^2 f(x + t v) v.$$
Hence,
$$g''(0) = v^\top \nabla^2 f(x) v < 0.$$
By continuity of the second derivative, for some small $\delta > 0$, $g''(t) < 0$ for all $t \in [0, \delta]$.
Then,
$$g(\delta) = g(0) + g'(0)\delta + \tfrac{1}{2} g''(\xi) \delta^2 < g(0) + g'(0)\delta \quad \text{for some } \xi \in (0, \delta)$$
by Taylor's theorem with the mean value form of the remainder. Hence, a contradiction, since $g$ inherits convexity from $f$ and the above violates the first order characterization of convexity. Thus, $\nabla^2 f(x) \succeq 0$ for all $x$.
Now assume that $\nabla^2 f(x) \succeq 0$ for all $x \in \operatorname{dom}(f)$. We claim that $f$ is convex.
Fix $x, y \in \operatorname{dom}(f)$. Define
$$g(t) = f(x + t(y - x)).$$
Then,
$$f(y) = g(1) = g(0) + g'(\xi) \geq g(0) + g'(0) = f(x) + \nabla f(x)^\top (y - x) \quad \text{for some } \xi \in (0, 1)$$
by the mean value theorem. The inequality follows from the fact that $g''(t) = (y - x)^\top \nabla^2 f(x + t(y - x))(y - x)$ is always non-negative, implying that $g'$ is non-decreasing.
Since $x, y$ are arbitrary, we have that $f$ is convex by the first order characterization of convexity.
Remark. If $\nabla^2 f(x) \succ 0$ for all $x$, then $f$ is strictly convex; the converse does not hold (for example, $f(x) = x^4$ is strictly convex yet $f''(0) = 0$).
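As a numerical illustration of the second-order characterization (my own example, not from the lecture), the log-sum-exp function $f(x) = \log \sum_i e^{x_i}$ is convex, and its Hessian $\operatorname{diag}(p) - p p^\top$ with $p = \operatorname{softmax}(x)$ should be positive semidefinite at every point:

```python
# Check that the Hessian of log-sum-exp is PSD at randomly sampled points.
import numpy as np

rng = np.random.default_rng(3)
for _ in range(100):
    x = rng.standard_normal(6)
    p = np.exp(x - x.max())
    p /= p.sum()                             # softmax(x)
    H = np.diag(p) - np.outer(p, p)          # Hessian of log-sum-exp at x
    assert np.linalg.eigvalsh(H).min() >= -1e-12
print("Hessian of log-sum-exp is PSD at all sampled points")
```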
Operations that Preserve Convexity
Nonnegative Weighted Sum
Proposition. Let $f_1, \dots, f_k$ be convex functions and $w_1, \dots, w_k \geq 0$. Then,
$$f(x) = \sum_{i=1}^k w_i f_i(x)$$
is convex.
Proof. Fix $x, y$ and $\theta \in [0, 1]$. Then,
$$f(\theta x + (1 - \theta) y) = \sum_{i=1}^k w_i f_i(\theta x + (1 - \theta) y) \leq \sum_{i=1}^k w_i \left(\theta f_i(x) + (1 - \theta) f_i(y)\right) = \theta f(x) + (1 - \theta) f(y),$$
as desired.
Remark. The above proposition generalizes to infinite sums (if they converge) as well as integrals. Specifically, if a function $f(x, y)$ is convex in $x$ for all $y \in \mathcal{A}$ and $w(y) \geq 0$, then
$$g(x) = \int_{\mathcal{A}} w(y) f(x, y) \, dy$$
is convex (provided the integral exists).
Affine Composition
Proposition. Let $f : \mathbb{R}^m \to \mathbb{R}$, $A \in \mathbb{R}^{m \times n}$, and $b \in \mathbb{R}^m$. Then,
$$g(x) = f(Ax + b)$$
is convex if $f$ is convex.
Proof. Fix $x, y$ and $\theta \in [0, 1]$. Then,
$$g(\theta x + (1 - \theta) y) = f\left(\theta (Ax + b) + (1 - \theta)(Ay + b)\right) \leq \theta f(Ax + b) + (1 - \theta) f(Ay + b) = \theta g(x) + (1 - \theta) g(y).$$
Maximum and Supremum
Proposition. Let $f_1, \dots, f_k$ be convex functions. Then,
$$f(x) = \max\{f_1(x), \dots, f_k(x)\}$$
is convex.
Proof. Fix $x, y$ and $\theta \in [0, 1]$. Then,
$$f(\theta x + (1 - \theta) y) = \max_i f_i(\theta x + (1 - \theta) y) \leq \max_i \left(\theta f_i(x) + (1 - \theta) f_i(y)\right) \leq \theta \max_i f_i(x) + (1 - \theta) \max_i f_i(y) = \theta f(x) + (1 - \theta) f(y).$$
Remark. The above proposition generalizes to the supremum. Specifically, if $f(x, y)$ is convex in $x$ for all $y \in \mathcal{A}$, then
$$g(x) = \sup_{y \in \mathcal{A}} f(x, y)$$
is convex.
Representation as Supremum of Affine Functions
Proposition. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex function. Then,
$$f(x) = \sup\{g(x) : g \text{ affine}, \ g \leq f\}.$$
Proof.
($\leq$) Define
$$\operatorname{epi}(f) = \{(x, t) \in \mathbb{R}^n \times \mathbb{R} : t \geq f(x)\}.$$
We claim that $\operatorname{epi}(f)$ is convex. Fix $(x, t), (y, s) \in \operatorname{epi}(f)$ and $\theta \in [0, 1]$. Then let $(z, r) = \theta(x, t) + (1 - \theta)(y, s)$. So,
$$r = \theta t + (1 - \theta) s \geq \theta f(x) + (1 - \theta) f(y).$$
By convexity of $f$,
$$\theta f(x) + (1 - \theta) f(y) \geq f(\theta x + (1 - \theta) y) = f(z),$$
implying that $(z, r) \in \operatorname{epi}(f)$, as desired. Hence, $\operatorname{epi}(f)$ is convex.
Now fix some $x_0$. We shall show that indeed,
$$f(x_0) \leq \sup\{g(x_0) : g \text{ affine}, \ g \leq f\}.$$
Observe that $(x_0, f(x_0))$ lies on the boundary of $\operatorname{epi}(f)$, hence we may find a supporting hyperplane there: $(a, c) \neq 0$ such that
$$a^\top x + c\, t \geq a^\top x_0 + c\, f(x_0)$$
for all $(x, t) \in \operatorname{epi}(f)$.
Then, $c \geq 0$ (taking $t \to \infty$ would otherwise violate the inequality). In fact $c > 0$: if $c = 0$, then $a^\top x \geq a^\top x_0$ for all $x \in \mathbb{R}^n$, forcing $a = 0$, a contradiction. Implying that $c > 0$.
We can then write the above (taking $t = f(x)$) as
$$f(x) \geq f(x_0) + \frac{a^\top (x_0 - x)}{c} =: g(x)$$
for all $x$.
Hence, $g$ is an affine function that underestimates $f$ over all $x$, and achieves $g(x_0) = f(x_0)$. We thus have the result.
($\geq$) Follows by definition: any affine $g$ with $g \leq f$ satisfies $g(x) \leq f(x)$, so the supremum is at most $f(x)$.
Perspective of a Function
Definition. Consider a function $f : \mathbb{R}^n \to \mathbb{R}$. We define the perspective of $f$ as $g : \mathbb{R}^n \times \mathbb{R}_{++} \to \mathbb{R}$ via
$$g(x, t) = t f(x / t).$$
Proposition. If $f$ is convex, then its perspective $g$ is convex.
Proof. It suffices to show that the epigraph of $g$ is convex. Note that $(x, t, s) \in \operatorname{epi}(g)$ iff $t f(x/t) \leq s$ iff $f(x/t) \leq s/t$ iff $(x/t, s/t) \in \operatorname{epi}(f)$; that is, $\operatorname{epi}(g)$ is the preimage of $\operatorname{epi}(f)$ under the perspective function $(x, t, s) \mapsto (x/t, s/t)$, hence it is convex.
Convex Conjugates
Definition. Let $f : \mathbb{R}^n \to \mathbb{R}$. The conjugate of $f$, denoted $f^*$, is defined via
$$f^*(y) = \sup_{x \in \operatorname{dom}(f)} \left(y^\top x - f(x)\right).$$
Geometrically, $f^*(y)$ is the greatest gap between the linear function $x \mapsto y^\top x$ and $f(x)$.
Proposition. For any function $f$, the conjugate $f^*$ is always convex.
Proof. Observe that $y^\top x - f(x)$ is affine (hence convex) in $y$ for each fixed $x$, and the pointwise supremum of convex functions is convex, hence $f^*$ is convex in $y$.
Basic Properties
Proposition (Fenchel's Inequality). Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function. Then,
$$f(x) + f^*(y) \geq x^\top y$$
for all $x, y$.
Proof. Note that
$$f^*(y) = \sup_{z \in \operatorname{dom}(f)} \left(y^\top z - f(z)\right) \geq y^\top x - f(x)$$
for all $x \in \operatorname{dom}(f)$. The result is then immediate.
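A small sketch of Fenchel's inequality on a concrete example (my own, assuming numpy): for $f(x) = \tfrac{1}{2} x^\top Q x$ with $Q \succ 0$, the conjugate is $f^*(y) = \tfrac{1}{2} y^\top Q^{-1} y$, and the inequality can be checked on random samples.

```python
# f(x) = (1/2) x^T Q x has conjugate f*(y) = (1/2) y^T Q^{-1} y (the supremum of
# y^T x - f(x) is attained at x = Q^{-1} y). Verify f(x) + f*(y) >= x^T y.
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((4, 4))
Q = A @ A.T + np.eye(4)                      # positive definite
Q_inv = np.linalg.inv(Q)

f = lambda x: 0.5 * x @ Q @ x
f_star = lambda y: 0.5 * y @ Q_inv @ y

for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    assert f(x) + f_star(y) >= x @ y - 1e-9
print("Fenchel's inequality holds on all samples")
```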
Convex Optimization Problems
Optimization Problems
Definition. An optimization problem is of the form
$$\begin{aligned} \min_{x} \quad & f_0(x) \\ \text{s.t.} \quad & f_i(x) \leq 0, \quad i = 1, \dots, m, \\ & h_j(x) = 0, \quad j = 1, \dots, p. \end{aligned}$$
Its domain is the intersection of the domains of each function, i.e.
$$\mathcal{D} = \bigcap_{i=0}^m \operatorname{dom}(f_i) \cap \bigcap_{j=1}^p \operatorname{dom}(h_j).$$
Definition. The feasible set of an optimization problem is the set
$$\mathcal{F} = \{x \in \mathcal{D} : f_i(x) \leq 0 \text{ for all } i, \ h_j(x) = 0 \text{ for all } j\}.$$
We say that an optimization problem is feasible if its feasible set is not empty.
Definition. We define the optimal value of an optimization problem as the value
$$p^* = \inf\{f_0(x) : x \in \mathcal{F}\}.$$
If $\mathcal{F} = \emptyset$, then $p^* = +\infty$.
Definition. If there exists a sequence of feasible points $x_k$ such that $f_0(x_k) \to -\infty$ as $k \to \infty$, we say that the optimization problem is unbounded below and $p^* = -\infty$.
Definition. We say that a feasible $x$ is $\varepsilon$-suboptimal if $f_0(x) \leq p^* + \varepsilon$.
Definition. We say that a feasible $x$ is locally optimal if there exists some $R > 0$ such that $f_0(x) \leq f_0(y)$ for all feasible $y$ with $\|y - x\|_2 \leq R$.
Definition. If $f_i(x) = 0$ for some feasible $x$ and some $i \in \{1, \dots, m\}$, we say that constraint $i$ is active at $x$.
Definition. We call an optimization problem of the form
$$\begin{aligned} \text{find} \quad & x \\ \text{s.t.} \quad & f_i(x) \leq 0, \quad i = 1, \dots, m, \\ & h_j(x) = 0, \quad j = 1, \dots, p \end{aligned}$$
a feasibility problem.
Remark. Note that maximization problems can be formulated as minimization problems by replacing the objective $f_0$ with $-f_0$.
Slack Variables
Slack variables allow us to express inequalities as equalities.
In particular, note that
$$f_i(x) \leq 0 \iff f_i(x) + s_i = 0 \text{ for some } s_i \geq 0$$
(in particular, $s_i = -f_i(x)$). Here, $s_i$ is a slack variable.
More generally, we can reformulate the optimization problem
$$\begin{aligned} \min_{x} \quad & f_0(x) \\ \text{s.t.} \quad & f_i(x) \leq 0, \quad i = 1, \dots, m, \\ & h_j(x) = 0, \quad j = 1, \dots, p \end{aligned}$$
as
$$\begin{aligned} \min_{x, s} \quad & f_0(x) \\ \text{s.t.} \quad & f_i(x) + s_i = 0, \quad i = 1, \dots, m, \\ & s_i \geq 0, \quad i = 1, \dots, m, \\ & h_j(x) = 0, \quad j = 1, \dots, p. \end{aligned}$$
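The following sketch (my own example, using scipy.optimize.linprog) verifies on a tiny LP that the slack-variable reformulation attains the same optimal value as the original inequality-constrained form.

```python
# Minimizing c^T x subject to G x <= h, x >= 0 gives the same optimal value as
# minimizing over (x, s) subject to G x + s = h, s >= 0, x >= 0.
import numpy as np
from scipy.optimize import linprog

c = np.array([-1.0, -2.0])
G = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])
h = np.array([4.0, 3.0, 2.0])

# Original form: inequality constraints.
res_ineq = linprog(c, A_ub=G, b_ub=h, bounds=[(0, None)] * 2)

# Slack form: G x + s = h with s >= 0 appended as extra variables.
m, n = G.shape
c_slack = np.concatenate([c, np.zeros(m)])
A_eq = np.hstack([G, np.eye(m)])
res_slack = linprog(c_slack, A_eq=A_eq, b_eq=h, bounds=[(0, None)] * (n + m))

assert abs(res_ineq.fun - res_slack.fun) < 1e-8
print("both formulations give optimal value", res_ineq.fun)
```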
Convex Problems
Definition. A convex problem is an optimization problem of the form
$$\begin{aligned} \min_{x} \quad & f_0(x) \\ \text{s.t.} \quad & f_i(x) \leq 0, \quad i = 1, \dots, m, \\ & a_j^\top x = b_j, \quad j = 1, \dots, p, \end{aligned}$$
where $f_i$ is convex for all $i = 0, \dots, m$ (note that the equality constraints are affine).
Remark. The feasible set of a convex optimization problem
is convex, as it is the intersection of convex sets (the sublevel sets $\{x : f_i(x) \leq 0\}$ and the hyperplanes $\{x : a_j^\top x = b_j\}$).
Proposition. Any local solution to a convex optimization problem is a global solution.
Proof. Let $x$ be a local solution to the typical convex optimization problem. Then, there exists $R > 0$ such that $x$ is optimal within the ball of radius $R$ around it.
Assume for the sake of contradiction that $x$ is not globally optimal. Then, there exists some feasible $y$ such that $f_0(y) < f_0(x)$.
We can find some $\theta \in (0, 1)$ such that $z = \theta y + (1 - \theta) x$ satisfies $\|z - x\|_2 \leq R$; note that $z$ is feasible by convexity of the feasible set. Then,
$$f_0(z) \leq \theta f_0(y) + (1 - \theta) f_0(x) < f_0(x),$$
a contradiction.
The above proposition provides some intuition as to why convex optimization problems are particularly nice to work with.
Proposition. Let $f_0$ be a differentiable convex function. Then, a feasible $x$ is optimal if and only if
$$\nabla f_0(x)^\top (y - x) \geq 0$$
for all feasible $y$.
Proof. First assume that $x$ is optimal. Assume for the sake of contradiction that $\nabla f_0(x)^\top (y - x) < 0$ for some feasible $y$. Define
$$z(t) = x + t(y - x), \quad t \in [0, 1].$$
Observe then that
$$\frac{d}{dt} f_0(z(t)) \bigg|_{t = 0} = \nabla f_0(x)^\top (y - x) < 0.$$
Thus, for some small $t > 0$, we have that $f_0(z(t)) < f_0(x)$, implying that
$$f_0(z(t)) < f_0(x) = p^*,$$
a contradiction, since $z(t)$ is feasible by convexity of the feasible set.
Now assume that
$$\nabla f_0(x)^\top (y - x) \geq 0$$
holds for all feasible $y$. We claim that $x$ is optimal. Observe that, by the first order characterization of convexity,
$$f_0(y) \geq f_0(x) + \nabla f_0(x)^\top (y - x) \geq f_0(x)$$
for all feasible $y$, as desired.
Corollary. Consider some convex optimization problem where there are no constraints (so every $y \in \operatorname{dom}(f_0)$ is feasible). If $\operatorname{dom}(f_0)$ is open and $x$ is the optimal point, then
$$\nabla f_0(x) = 0.$$
Proof.
As $\operatorname{dom}(f_0)$ is open, we can find $t > 0$ small enough such that $y = x - t \nabla f_0(x) \in \operatorname{dom}(f_0)$. Then,
$$\nabla f_0(x)^\top (y - x) = -t \|\nabla f_0(x)\|_2^2 \geq 0,$$
which is only true if the gradient is zero.
Proposition: Consider the convex optimization problem
$$\begin{aligned} \min_{x} \quad & f_0(x) \\ \text{s.t.} \quad & Ax = b, \end{aligned}$$
where $f_0$ is differentiable and convex. Then, a point $x$ is an optimal point if and only if
$$\nabla f_0(x) + A^\top \nu = 0$$
for some $\nu$, and $Ax = b$.
Proof. We first prove the forwards direction. Assume that $x$ is optimal. Then, $Ax = b$ trivially. Furthermore, we know that for all feasible $y$,
$$\nabla f_0(x)^\top (y - x) \geq 0.$$
As $Ay = b$ and $Ax = b$, we know that $y - x \in \operatorname{null}(A)$ as well. Thus, we must have that $\nabla f_0(x)^\top v \geq 0$ where $v = y - x$ ranges over all of $\operatorname{null}(A)$. Since $\operatorname{null}(A)$ is a subspace, we may replace $v$ with $-v$, so we may then rewrite the above as
$$\nabla f_0(x)^\top v = 0$$
for all $v \in \operatorname{null}(A)$.
Thus, $\nabla f_0(x)$ is orthogonal to $\operatorname{null}(A)$, hence $\nabla f_0(x) \in \operatorname{null}(A)^\perp = \operatorname{range}(A^\top)$. Similarly, its negative is in the range of $A^\top$. Meaning that there must exist some $\nu$ such that
$$\nabla f_0(x) = -A^\top \nu, \quad \text{i.e.} \quad \nabla f_0(x) + A^\top \nu = 0.$$
We now prove the backwards direction.
If $Ax = b$, then clearly $x$ is feasible.
If
$$\nabla f_0(x) = -A^\top \nu,$$
then $\nabla f_0(x) \in \operatorname{range}(A^\top) = \operatorname{null}(A)^\perp$. Hence, it is orthogonal to $\operatorname{null}(A)$, and so
$$\nabla f_0(x)^\top (y - x) = 0 \geq 0$$
for all feasible $y$ (since $y - x \in \operatorname{null}(A)$).
Thus, $x$ is optimal.
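For a concrete instance of this optimality condition (my own example, assuming numpy), take a convex quadratic objective with equality constraints; the condition $\nabla f_0(x) + A^\top \nu = 0$, $Ax = b$ becomes a single linear (KKT) system:

```python
# For f0(x) = (1/2) x^T P x + q^T x with constraint A x = b, optimality is the
# linear system [[P, A^T], [A, 0]] [x; nu] = [-q; b], which we solve directly.
import numpy as np

rng = np.random.default_rng(5)
n, p = 5, 2
M = rng.standard_normal((n, n))
P = M @ M.T + np.eye(n)                      # positive definite => f0 is convex
q = rng.standard_normal(n)
A = rng.standard_normal((p, n))
b = rng.standard_normal(p)

KKT = np.block([[P, A.T], [A, np.zeros((p, p))]])
rhs = np.concatenate([-q, b])
sol = np.linalg.solve(KKT, rhs)
x, nu = sol[:n], sol[n:]

assert np.allclose(A @ x, b)                 # primal feasibility
assert np.allclose(P @ x + q + A.T @ nu, 0)  # grad f0(x) + A^T nu = 0
print("optimal x:", x)
```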
Linear Problems
Definition. A linear problem (LP) is an optimization problem of the form
$$\begin{aligned} \min_{x} \quad & c^\top x + d \\ \text{s.t.} \quad & Gx \preceq h, \\ & Ax = b. \end{aligned}$$
One may convert the inequality constraints into equality constraints (plus nonnegativity constraints) via slack variables.
Quadratic Problems
Definition. A quadratic problem (QP) is an optimization problem of the form
$$\begin{aligned} \min_{x} \quad & \tfrac{1}{2} x^\top P x + q^\top x + r \\ \text{s.t.} \quad & Gx \preceq h, \\ & Ax = b, \end{aligned}$$
where $P \in \mathbb{S}^n_+$.
Definition. A quadratically constrained quadratic problem (QCQP) is an optimization problem of the form
$$\begin{aligned} \min_{x} \quad & \tfrac{1}{2} x^\top P_0 x + q_0^\top x + r_0 \\ \text{s.t.} \quad & \tfrac{1}{2} x^\top P_i x + q_i^\top x + r_i \leq 0, \quad i = 1, \dots, m, \\ & Ax = b, \end{aligned}$$
where $P_i \in \mathbb{S}^n_+$ for all $i$.
Definition. A second-order cone program (SOCP) is an optimization problem of the form
$$\begin{aligned} \min_{x} \quad & f^\top x \\ \text{s.t.} \quad & \|A_i x + b_i\|_2 \leq c_i^\top x + d_i, \quad i = 1, \dots, m, \\ & Fx = g. \end{aligned}$$
Duality
Lagrange Dual Function
Definition. The Lagrangian of a (not necessarily convex) optimization problem
$$\begin{aligned} \min_{x} \quad & f_0(x) \\ \text{s.t.} \quad & f_i(x) \leq 0, \quad i = 1, \dots, m, \\ & h_j(x) = 0, \quad j = 1, \dots, p \end{aligned}$$
is a function $L : \mathbb{R}^n \times \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R}$ defined via
$$L(x, \lambda, \nu) = f_0(x) + \sum_{i=1}^m \lambda_i f_i(x) + \sum_{j=1}^p \nu_j h_j(x).$$
Definition: The Lagrange dual of an optimization problem is the function $g : \mathbb{R}^m \times \mathbb{R}^p \to \mathbb{R} \cup \{-\infty\}$ defined via
$$g(\lambda, \nu) = \inf_{x \in \mathcal{D}} L(x, \lambda, \nu),$$
where $L$ is the Lagrangian of the optimization problem and $\mathcal{D}$ is the domain of the optimization problem (not necessarily feasible).
Proposition. The Lagrange dual of any optimization problem is concave.
Proof. Observe that the Lagrangian is affine in $(\lambda, \nu)$ for each fixed $x$. Concavity is preserved under the point-wise infimum of a family of affine (hence concave) functions, so $g$ is concave.
Proposition (Weak Duality). For some optimization problem, let $p^*$ be its optimal value and $g$ be its Lagrange dual. Then, for any $\lambda \succeq 0$ and any $\nu$,
$$g(\lambda, \nu) \leq p^*.$$
Proof. Fix $\lambda \succeq 0$ and $\nu$. Consider some feasible $\tilde{x}$. Then,
$$g(\lambda, \nu) = \inf_{x \in \mathcal{D}} L(x, \lambda, \nu) \leq L(\tilde{x}, \lambda, \nu) = f_0(\tilde{x}) + \sum_{i=1}^m \lambda_i f_i(\tilde{x}) + \sum_{j=1}^p \nu_j h_j(\tilde{x}) \leq f_0(\tilde{x}),$$
since $\lambda_i \geq 0$, $f_i(\tilde{x}) \leq 0$, and $h_j(\tilde{x}) = 0$. Taking the infimum over feasible $\tilde{x}$ gives $g(\lambda, \nu) \leq p^*$, as desired.
Definition. If $\lambda \succeq 0$ and $g(\lambda, \nu) > -\infty$, then we say that $(\lambda, \nu)$ is dual feasible.
Lagrange Dual Problem
In light of weak duality, we can think of finding the best lower bound on the optimal value using the Lagrange dual function. This is the motivation for the Lagrange dual problem, defined below. Note that the Lagrange dual problem is particularly nice due to the fact that the Lagrange dual function is concave, established previously.
Definition. Let $g$ be the Lagrange dual for an optimization problem. We define the Lagrange dual problem for the optimization problem as
$$\begin{aligned} \max_{\lambda, \nu} \quad & g(\lambda, \nu) \\ \text{s.t.} \quad & \lambda \succeq 0. \end{aligned}$$
We call this problem the dual, and the original problem the primal.
Remark. Let $p^*$ be the optimal value of the primal problem, and $d^*$ be the optimal value of the dual problem. Then, by weak duality,
$$d^* \leq p^*.$$
There may be more implicit constraints in the dual, particularly the requirement $g(\lambda, \nu) > -\infty$ (recall that $g$ is an infimum and can take the value $-\infty$).
Definition. The optimal duality gap of a problem is the value $p^* - d^*$.
Definition. We say that strong duality holds if $p^* = d^*$.
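The sketch below (my own example, using scipy.optimize.linprog) forms the Lagrange dual of a small inequality-constrained LP by hand and solves both problems, illustrating weak duality and, in this case, strong duality.

```python
# For the LP  min c^T x  s.t.  A x <= b,  the Lagrange dual is
#   max -b^T lam  s.t.  A^T lam = -c,  lam >= 0.
# Solving both numerically shows d* <= p*, with equality here.
import numpy as np
from scipy.optimize import linprog

c = np.array([-1.0, -2.0])
A = np.array([[ 1.0,  1.0],
              [ 1.0,  0.0],
              [ 0.0,  1.0],
              [-1.0,  0.0],
              [ 0.0, -1.0]])
b = np.array([4.0, 3.0, 2.0, 0.0, 0.0])

# Primal: min c^T x s.t. A x <= b (x otherwise free).
primal = linprog(c, A_ub=A, b_ub=b, bounds=[(None, None)] * 2)

# Dual: max -b^T lam s.t. A^T lam = -c, lam >= 0, written as min b^T lam.
dual = linprog(b, A_eq=A.T, b_eq=-c, bounds=[(0, None)] * len(b))

p_star, d_star = primal.fun, -dual.fun
assert d_star <= p_star + 1e-8               # weak duality
assert abs(p_star - d_star) < 1e-8           # strong duality for this LP
print("p* =", p_star, " d* =", d_star)
```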
Geometric Intuition
We now build some geometric intuition regarding duality.
Fix some optimization problem and define the set
$$\mathcal{G} = \{(f_1(x), \dots, f_m(x), h_1(x), \dots, h_p(x), f_0(x)) : x \in \mathcal{D}\} \subseteq \mathbb{R}^m \times \mathbb{R}^p \times \mathbb{R}.$$
$\mathcal{G}$ essentially expresses all value combinations of the constraints and objective. We now interpret many prior results, along with some new results, using the geometry of $\mathcal{G}$.
Optimal Value
It is easy to see that
$$p^* = \inf\{t : (u, v, t) \in \mathcal{G}, \ u \preceq 0, \ v = 0\},$$
i.e. we restrict the set we consider to only feasible points.
Lagrange Dual Function
We can also see that
$$g(\lambda, \nu) = \inf\{(\lambda, \nu, 1)^\top (u, v, t) : (u, v, t) \in \mathcal{G}\},$$
and hence
$$(\lambda, \nu, 1)^\top (u, v, t) \geq g(\lambda, \nu) \quad \text{for all } (u, v, t) \in \mathcal{G}.$$
Thus, for any $(\lambda, \nu)$, we have that the inequality above defines a (non-vertical) halfspace containing $\mathcal{G}$.
If the infimum in $g(\lambda, \nu)$ is attained, we can think of $\{(u, v, t) : (\lambda, \nu, 1)^\top (u, v, t) = g(\lambda, \nu)\}$ as a supporting hyperplane to $\mathcal{G}$.
Weak Duality
Say that $\lambda \succeq 0$. Then, for any $(u, v, t) \in \mathcal{G}$ with $u \preceq 0$ and $v = 0$,
$$\lambda^\top u + \nu^\top v + t \leq t.$$
Thus, for any $(u, v, t) \in \mathcal{G}$ where $u \preceq 0$ and $v = 0$, we have that $g(\lambda, \nu) \leq t$. As $d^*$ is the maximum value of $g(\lambda, \nu)$ over all $(\lambda, \nu)$ where $\lambda \succeq 0$, and $p^*$ is the infimum of such $t$, we thus have that
$$d^* \leq p^*,$$
proving weak duality.
Epigraph Variation
Define
$$\mathcal{A} = \{(u, v, t) : \exists x \in \mathcal{D}, \ f_i(x) \leq u_i \ \forall i, \ h_j(x) = v_j \ \forall j, \ f_0(x) \leq t\} = \mathcal{G} + \left(\mathbb{R}^m_+ \times \{0\} \times \mathbb{R}_+\right).$$
$\mathcal{A}$ can be thought of as sort of an epigraph of $\mathcal{G}$, with the exception that we enforce equality on the equality-constraint coordinates $v$.
Once again, the optimal value can be expressed as
$$p^* = \inf\{t : (0, 0, t) \in \mathcal{A}\}.$$
For $\lambda \succeq 0$, note that
$$g(\lambda, \nu) = \inf\{(\lambda, \nu, 1)^\top (u, v, t) : (u, v, t) \in \mathcal{A}\},$$
as $\mathcal{G}$ is a subset of $\mathcal{A}$, but the added points in $\mathcal{A}$ (which only increase $u$ and $t$) do not decrease the value of the infimum.
Once again, we may say that if the infimum is attained, then $\{(u, v, t) : (\lambda, \nu, 1)^\top (u, v, t) = g(\lambda, \nu)\}$ is a supporting hyperplane to $\mathcal{A}$, since for all $(u, v, t) \in \mathcal{A}$, we have
$$(\lambda, \nu, 1)^\top (u, v, t) \geq g(\lambda, \nu).$$
Note that $(0, 0, p^*)$ is in the boundary of $\mathcal{A}$, hence
$$g(\lambda, \nu) \leq (\lambda, \nu, 1)^\top (0, 0, p^*) = p^*$$
once again gives us weak duality.
Slater’s Condition
Proposition (Slater's Condition). If there exists an "interior" point to the inequality constraints of a convex optimization problem, i.e. a point $\tilde{x}$ with
$$f_i(\tilde{x}) < 0 \quad \text{for all } i = 1, \dots, m,$$
and $\tilde{x}$ is feasible ($A\tilde{x} = b$), then strong duality holds.
Proof. We will use the geometric interpretation of duality to prove Slater’s condition.
First note that strong duality holds if and only if
$$g(\lambda^*, \nu^*) = p^*$$
for some $\lambda^* \succeq 0$ and $\nu^*$. In other words, there exists a (non-vertical) supporting hyperplane to $\mathcal{A}$ (defined above) passing through $(0, 0, p^*)$.
The idea behind the proof is that we will separate $\mathcal{A}$ from the set
$$\mathcal{B} = \{(0, 0, s) \in \mathbb{R}^m \times \mathbb{R}^p \times \mathbb{R} : s < p^*\}$$
with a hyperplane that proves strong duality.
In doing so, we will make the following simplifying assumptions: $\operatorname{int}(\mathcal{D}) \neq \emptyset$, $\operatorname{rank}(A) = p$, and $p^* > -\infty$ (otherwise $d^* = p^* = -\infty$ by weak duality).
Note that as we are only considering a convex optimization problem, $\mathcal{A}$ is convex; $\mathcal{B}$ is convex as it is the Cartesian product of convex sets.
Furthermore, see that
$$\mathcal{A} \cap \mathcal{B} = \emptyset,$$
since a point $(0, 0, s) \in \mathcal{A}$ with $s < p^*$ would give a feasible point with objective value below $p^*$, contradicting that $p^*$ is optimal.
As $\mathcal{A}, \mathcal{B}$ are convex and disjoint, we can separate them. More precisely, there exist $(\tilde{\lambda}, \tilde{\nu}, \mu) \neq 0$ and $\alpha$ such that
$$(u, v, t) \in \mathcal{A} \implies \tilde{\lambda}^\top u + \tilde{\nu}^\top v + \mu t \geq \alpha,$$
and
$$(u, v, t) \in \mathcal{B} \implies \tilde{\lambda}^\top u + \tilde{\nu}^\top v + \mu t \leq \alpha.$$
From the first inequality, note that $\tilde{\lambda} \succeq 0$ and $\mu \geq 0$; otherwise we could send the left-hand side to negative infinity (since $u$ and $t$ may be taken arbitrarily large within $\mathcal{A}$).
We can rewrite the last inequality as
$$\mu s \leq \alpha \quad \text{for all } s < p^*,$$
meaning that
$$\mu p^* \leq \alpha.$$
Thus, for all $(u, v, t) \in \mathcal{A}$,
$$\tilde{\lambda}^\top u + \tilde{\nu}^\top v + \mu t \geq \alpha \geq \mu p^*.$$
For now, assume that $\mu > 0$. We will address the $\mu = 0$ case later.
Dividing both sides by $\mu$ and writing $\lambda = \tilde{\lambda}/\mu$, $\nu = \tilde{\nu}/\mu$, we have that
$$\lambda^\top u + \nu^\top v + t \geq p^*$$
for all $(u, v, t) \in \mathcal{A}$ (recall that $(u, v, t)$ was an arbitrary element of $\mathcal{A}$).
We can then minimize the left-hand side over $\mathcal{A}$ to recover
$$g(\lambda, \nu) \geq p^*.$$
Weak duality, however, grants us
$$g(\lambda, \nu) \leq p^*.$$
Hence, when $\mu > 0$, we have strong duality.
We now consider the $\mu = 0$ case. Then, since $\alpha \geq \mu p^* = 0$,
$$\tilde{\lambda}^\top u + \tilde{\nu}^\top v \geq 0 \quad \text{for all } (u, v, t) \in \mathcal{A}.$$
As $(u, v, t)$ is an arbitrary element of $\mathcal{A}$, we have that for all $x \in \mathcal{D}$,
$$\sum_{i=1}^m \tilde{\lambda}_i f_i(x) + \tilde{\nu}^\top (Ax - b) \geq 0.$$
Then let $\tilde{x}$ be a Slater point. Plugging this into the above inequality, we have that
$$\sum_{i=1}^m \tilde{\lambda}_i f_i(\tilde{x}) \geq 0,$$
but $f_i(\tilde{x}) < 0$ for all $i$ and $\tilde{\lambda} \succeq 0$. Hence, $\tilde{\lambda} = 0$. But, $(\tilde{\lambda}, \tilde{\nu}, \mu) \neq 0$ and $\mu = 0$, thus $\tilde{\nu} \neq 0$.
Returning to the original inequality, we have that for all $x \in \mathcal{D}$,
$$\tilde{\nu}^\top (Ax - b) \geq 0.$$
Note that $\tilde{\nu}^\top (A\tilde{x} - b) = 0$ as $A\tilde{x} = b$, but $\tilde{\nu} \neq 0$.
Let $\tilde{x}$ once again be the Slater point. Then, we have that $\tilde{\nu}^\top (A\tilde{x} - b) = 0$ but $\tilde{x} \in \operatorname{int}(\mathcal{D})$.
As $\tilde{x}$ is in the interior, there must exist some other point $x \in \mathcal{D}$ near $\tilde{x}$ (for instance $x = \tilde{x} - \epsilon A^\top \tilde{\nu}$ for small $\epsilon > 0$) such that
$$\tilde{\nu}^\top (Ax - b) < 0,$$
unless $A^\top \tilde{\nu} = 0$. But, we stated that $\operatorname{rank}(A) = p$, so $A^\top \tilde{\nu} = 0$ would force $\tilde{\nu} = 0$; hence we have a contradiction. Thus, $\mu > 0$ and strong duality holds by the other case.
Optimality Conditions
Certificate of Suboptimality
The dual function provides us with a method to "certify" the suboptimality of a solution. In particular, say that we are given a feasible solution $x$ to some optimization problem and wish to provide a guarantee on how suboptimal it is. We can use a dual feasible solution $(\lambda, \nu)$ as our certificate. By weak duality, we have that
$$g(\lambda, \nu) \leq p^*.$$
Hence,
$$f_0(x) - p^* \leq f_0(x) - g(\lambda, \nu).$$
Thus, our certificate tells us that the suboptimality of $x$ is at most
$$f_0(x) - g(\lambda, \nu),$$
which is typically called the duality gap (of the primal-dual pair).
Complementary Slackness
Proposition (Complementary Slackness). Consider some optimization problem in which strong duality holds. Let $x^*$ be primal optimal and $(\lambda^*, \nu^*)$ be dual optimal. Then for all $i$,
$$\lambda_i^* f_i(x^*) = 0.$$
Proof. Observe that
$$f_0(x^*) = g(\lambda^*, \nu^*) = \inf_{x} L(x, \lambda^*, \nu^*) \leq L(x^*, \lambda^*, \nu^*) = f_0(x^*) + \sum_{i=1}^m \lambda_i^* f_i(x^*) + \sum_{j=1}^p \nu_j^* h_j(x^*) \leq f_0(x^*),$$
meaning that both inequalities are in fact equalities, i.e.
$$\sum_{i=1}^m \lambda_i^* f_i(x^*) = 0.$$
As $\lambda_i^* \geq 0$ and $f_i(x^*) \leq 0$, every term in the sum is non-positive, so each must be zero, and we immediately recover complementary slackness.
One can also see from the equalities that $x^*$ is a minimizer of $L(x, \lambda^*, \nu^*)$ over $x$.
KKT Conditions
Definition. Consider an optimization problem in which $f_0, \dots, f_m, h_1, \dots, h_p$ are differentiable and strong duality holds. The KKT conditions for primal solution $x$ and dual solutions $(\lambda, \nu)$ refer to the following conditions:
- $x$ is primal feasible: $f_i(x) \leq 0$ for all $i$ and $h_j(x) = 0$ for all $j$.
- $(\lambda, \nu)$ is dual feasible: $\lambda \succeq 0$.
- $\lambda_i f_i(x) = 0$ for all $i$ (complementary slackness).
- $\nabla f_0(x) + \sum_{i=1}^m \lambda_i \nabla f_i(x) + \sum_{j=1}^p \nu_j \nabla h_j(x) = 0$ (stationarity).
Proposition. Consider an optimization problem with the above conditions. Furthermore, let $x^*$ be an optimal primal solution and $(\lambda^*, \nu^*)$ be an optimal dual solution. Then, these optimal points satisfy the KKT conditions.
Proof. It is clear that $x^*$ is primal feasible and $(\lambda^*, \nu^*)$ is dual feasible. Complementary slackness holds from earlier. Furthermore, see that $x^*$ minimizes $L(x, \lambda^*, \nu^*)$ over $x$ (from the proof of complementary slackness), meaning that
$$\nabla_x L(x^*, \lambda^*, \nu^*) = \nabla f_0(x^*) + \sum_{i=1}^m \lambda_i^* \nabla f_i(x^*) + \sum_{j=1}^p \nu_j^* \nabla h_j(x^*) = 0,$$
as desired.
Proposition. Consider a convex optimization problem with differentiable functions. Furthermore, let $x^*$ and $(\lambda^*, \nu^*)$ be points that satisfy the KKT conditions. Then, $x^*$ is primal optimal, $(\lambda^*, \nu^*)$ is dual optimal, and strong duality holds.
Proof. Clearly, $x^*$ is primal feasible and $(\lambda^*, \nu^*)$ is dual feasible by the KKT conditions. We now show optimality.
As we are considering a convex optimization problem and $\lambda^* \succeq 0$, see that $L(x, \lambda^*, \nu^*)$ is convex in $x$. Hence, if its gradient vanishes at some $x$, that $x$ must be a global minimizer of the Lagrangian. Implying that
$$g(\lambda^*, \nu^*) = \inf_{x} L(x, \lambda^*, \nu^*) = L(x^*, \lambda^*, \nu^*) = f_0(x^*) + \sum_{i=1}^m \lambda_i^* f_i(x^*) + \sum_{j=1}^p \nu_j^* h_j(x^*) = f_0(x^*)$$
by invoking complementary slackness and primal feasibility.
As the point has the dual objective equal to the primal objective, there is zero duality gap, implying that $x^*$ is primal optimal and $(\lambda^*, \nu^*)$ is dual optimal. Furthermore, strong duality holds.
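As a concrete illustration (my own example, assuming numpy), consider projecting a point $a$ onto the halfspace $\{x : \mathbf{1}^\top x \leq 1\}$, i.e. minimizing $\tfrac{1}{2}\|x - a\|_2^2$ subject to $\mathbf{1}^\top x - 1 \leq 0$; the analytic solution can be checked against all four KKT conditions.

```python
# Verify the KKT conditions for projection onto the halfspace {x : 1^T x <= 1}.
import numpy as np

rng = np.random.default_rng(6)
n = 4
a = rng.standard_normal(n) + 1.0   # shifted so the constraint is usually active
ones = np.ones(n)

if ones @ a <= 1:                  # constraint inactive: x* = a, lam* = 0
    x_star, lam_star = a.copy(), 0.0
else:                              # constraint active: stationarity + 1^T x* = 1
    lam_star = (ones @ a - 1) / n
    x_star = a - lam_star * ones

assert ones @ x_star <= 1 + 1e-9                       # primal feasibility
assert lam_star >= 0                                   # dual feasibility
assert abs(lam_star * (ones @ x_star - 1)) < 1e-9      # complementary slackness
assert np.allclose((x_star - a) + lam_star * ones, 0)  # stationarity
print("KKT conditions verified, lam* =", lam_star)
```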
Unconstrained Minimization
We now discuss methods to solve unconstrained minimization problems, i.e. problems of the form
$$\min_{x} f(x),$$
where $f : \mathbb{R}^n \to \mathbb{R}$ is convex and twice differentiable.
As we are in the unconstrained setting, a point $x^*$ is optimal if and only if $\nabla f(x^*) = 0$. There are typically no analytical solutions to this equation. Hence, the general idea of these algorithms is to iteratively solve for such a point, i.e. find a sequence $x^{(k)}$ such that $\nabla f(x^{(k)}) \to 0$ and consequently $f(x^{(k)}) \to p^*$.
Strong Convexity
Definition. We say that a function $f$ is strongly convex on $S$ if there exists $m > 0$ such that
$$\nabla^2 f(x) \succeq m I$$
for all $x \in S$, which means that $\nabla^2 f(x) - m I$ is positive semidefinite. We also say that $f$ is $m$-strongly convex.
Proposition. Let $f$ be an $m$-strongly convex function on $S$. Then, for all $x, y \in S$,
$$f(y) \geq f(x) + \nabla f(x)^\top (y - x) + \frac{m}{2}\|y - x\|_2^2,$$
i.e. we have stronger guarantees than the first-order characterization of convexity.
Proof. By Taylor's theorem and the mean value theorem, there exists some $z$ on the line segment $[x, y]$ such that
$$f(y) = f(x) + \nabla f(x)^\top (y - x) + \frac{1}{2}(y - x)^\top \nabla^2 f(z)(y - x).$$
By strong convexity,
$$\frac{1}{2}(y - x)^\top \nabla^2 f(z)(y - x) \geq \frac{m}{2}\|y - x\|_2^2,$$
as desired.
Proposition. Let $f$ be an $m$-strongly convex function on $S$ and $p^*$ the minimum value of $f$. Then for any $x \in S$,
$$f(x) - p^* \leq \frac{1}{2m}\|\nabla f(x)\|_2^2.$$
In words, one can bound the suboptimality of a point using its gradient.
Proof. We know that for all $x, y \in S$,
$$f(y) \geq f(x) + \nabla f(x)^\top (y - x) + \frac{m}{2}\|y - x\|_2^2.$$
We will find the $y$ that minimizes the right-hand side; the minimized right-hand side then serves as a lower bound for $f(y)$ for any $y$.
The right-hand side is a convex quadratic function in $y$, hence we take the gradient and solve for $y$:
$$\nabla f(x) + m(y - x) = 0 \implies \tilde{y} = x - \frac{1}{m}\nabla f(x).$$
Plugging this in,
$$f(y) \geq f(x) - \frac{1}{2m}\|\nabla f(x)\|_2^2.$$
This holds for any $y$, hence we set $y = x^*$ (the minimizer) and see that
$$p^* \geq f(x) - \frac{1}{2m}\|\nabla f(x)\|_2^2,$$
as desired.
Corollary. Let $f$ be an $m$-strongly convex function on $S$ and $p^*$ the minimum value of $f$. Then for any $x \in S$, if
$$\|\nabla f(x)\|_2 \leq \sqrt{2 m \varepsilon},$$
we have that
$$f(x) - p^* \leq \varepsilon.$$
We can also bound a point’s distance from the minimizer using the gradient.
Proposition. Let $f$ be an $m$-strongly convex function on $S$ and $x^*$ the minimizer of $f$. Then for any $x \in S$,
$$\|x - x^*\|_2 \leq \frac{2}{m}\|\nabla f(x)\|_2.$$
Proof. By the first-order characterization of strong convexity,
$$p^* = f(x^*) \geq f(x) + \nabla f(x)^\top (x^* - x) + \frac{m}{2}\|x^* - x\|_2^2 \geq f(x) - \|\nabla f(x)\|_2 \|x^* - x\|_2 + \frac{m}{2}\|x^* - x\|_2^2$$
via Cauchy-Schwarz.
As $f(x) \geq p^*$,
$$p^* \geq p^* - \|\nabla f(x)\|_2 \|x^* - x\|_2 + \frac{m}{2}\|x^* - x\|_2^2,$$
hence
$$\frac{m}{2}\|x^* - x\|_2^2 \leq \|\nabla f(x)\|_2 \|x^* - x\|_2,$$
meaning that
$$\|x^* - x\|_2 \leq \frac{2}{m}\|\nabla f(x)\|_2,$$
as desired.
Smoothness
Strong convexity imposes a lower bound on the Hessian of a function. We can similarly impose an upper bound.
Definition. We say that a function $f$ is $M$-smooth on $S$ if
$$\nabla^2 f(x) \preceq M I$$
for all $x \in S$.
Proposition. Let $f$ be an $M$-smooth function. Then for all $x, y \in S$,
$$f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{M}{2}\|y - x\|_2^2.$$
Proof. Once again by Taylor's theorem and the mean value theorem, we have that
$$f(y) = f(x) + \nabla f(x)^\top (y - x) + \frac{1}{2}(y - x)^\top \nabla^2 f(z)(y - x)$$
for some $z$ on the line segment $[x, y]$.
By smoothness, we then have that
$$\frac{1}{2}(y - x)^\top \nabla^2 f(z)(y - x) \leq \frac{M}{2}\|y - x\|_2^2.$$
Proposition. Let $f$ be an $M$-smooth function with optimal value $p^*$. Then for any $x \in S$,
$$p^* \leq f(x) - \frac{1}{2M}\|\nabla f(x)\|_2^2.$$
Proof. We will employ a similar strategy as in the strong convexity case, with some changes. We know that for all $x, y \in S$,
$$f(y) \leq f(x) + \nabla f(x)^\top (y - x) + \frac{M}{2}\|y - x\|_2^2.$$
We first find the $y$ which minimizes the right-hand side. From before, we found that
$$\tilde{y} = x - \frac{1}{M}\nabla f(x).$$
Plugging this in,
$$p^* = \inf_y f(y) \leq f(\tilde{y}) \leq f(x) - \frac{1}{2M}\|\nabla f(x)\|_2^2.$$
Conditioning
Definition (Condition Number). Consider an unconstrained optimization problem with an objective that is $m$-strongly convex and $M$-smooth. We call $\kappa = M / m$ the condition number of the problem.
Definition (Width). We define the width of a set $C \subseteq \mathbb{R}^n$ in a direction $q$ with unit norm as
$$W(C, q) = \sup_{z \in C} q^\top z - \inf_{z \in C} q^\top z.$$
Definition. We define the maximum width of a set $C$ as
$$W_{\max}(C) = \sup_{\|q\|_2 = 1} W(C, q).$$
We define the minimum width of a set $C$ as
$$W_{\min}(C) = \inf_{\|q\|_2 = 1} W(C, q).$$
Definition. The condition number of a set $C$ is
$$\operatorname{cond}(C) = \frac{W_{\max}(C)^2}{W_{\min}(C)^2}.$$
Definition. The $\alpha$-sublevel set of $f$ is the set
$$C_\alpha = \{x \in \operatorname{dom}(f) : f(x) \leq \alpha\}.$$
Proposition. Consider a function $f$ that is $m$-strongly convex and $M$-smooth, with minimizer $x^*$ and minimum value $p^*$. Then, for any $\alpha > p^*$,
$$\operatorname{cond}(C_\alpha) \leq \frac{M}{m}.$$
Proof. Observe that by the first-order characterizations of strong convexity and smoothness (applied at $x^*$, where the gradient vanishes),
$$p^* + \frac{m}{2}\|x - x^*\|_2^2 \leq f(x) \leq p^* + \frac{M}{2}\|x - x^*\|_2^2.$$
Hence, defining
$$B_{\mathrm{inner}} = B\left(x^*, \sqrt{2(\alpha - p^*)/M}\right)$$
and
$$B_{\mathrm{outer}} = B\left(x^*, \sqrt{2(\alpha - p^*)/m}\right),$$
we have that
$$B_{\mathrm{inner}} \subseteq C_\alpha \subseteq B_{\mathrm{outer}}.$$
Dividing the squared radii of the balls, we have an upper bound on the condition number of $C_\alpha$:
$$\operatorname{cond}(C_\alpha) \leq \frac{2(\alpha - p^*)/m}{2(\alpha - p^*)/M} = \frac{M}{m},$$
as desired.
Descent Methods
Descent methods are algorithms that solve unconstrained minimization problems by iteratively computing a direction in which to perturb the current solution, along with the scale of said direction. Formally, on iteration $k$, they update the current solution via
$$x^{(k+1)} = x^{(k)} + t^{(k)} \Delta x^{(k)},$$
where $t^{(k)} > 0$ is the step size and $\Delta x^{(k)}$ is the descent direction. $t^{(k)}$ and $\Delta x^{(k)}$ are chosen such that $f(x^{(k+1)}) < f(x^{(k)})$, i.e. we gradually approach the optimal solution.
Note that by the first-order characterization of convexity, we require that
$$\nabla f(x^{(k)})^\top \Delta x^{(k)} < 0,$$
otherwise there is no hope of finding a more optimal solution.
How the direction $\Delta x^{(k)}$ is computed depends on the specific algorithm. Several methods exist to compute the step size $t^{(k)}$, some of which are covered below.
Exact Line Search
Simply set $t$ such that the objective along the ray is minimized:
$$t = \operatorname*{arg\,min}_{s \geq 0} f(x + s \Delta x).$$
While it is true that we must solve another optimization problem, this problem is one-dimensional and in practice is often easy to solve.
Backtracking
A simplified backtracking algorithm to compute $t$ would be to initially set $t$ to $1$, then, while $f(x + t \Delta x) > f(x) + \alpha t \nabla f(x)^\top \Delta x$ for a fixed parameter $\alpha \in (0, 1/2)$ (the sufficient-decrease condition), halve $t$.
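A minimal implementation sketch of this backtracking rule (my own code; the parameters alpha and beta below are assumptions, with beta = 0.5 corresponding to halving):

```python
# Backtracking line search with the sufficient-decrease condition.
import numpy as np

def backtracking(f, grad_f, x, dx, alpha=0.3, beta=0.5):
    """Return a step size t for the descent direction dx at the point x."""
    t = 1.0
    slope = grad_f(x) @ dx                   # negative for a descent direction
    while f(x + t * dx) > f(x) + alpha * t * slope:
        t *= beta                            # shrink until sufficient decrease holds
    return t

# Tiny usage example on f(x) = ||x||^2 with the negative gradient as direction.
f = lambda x: x @ x
grad_f = lambda x: 2 * x
x0 = np.array([3.0, -4.0])
print(backtracking(f, grad_f, x0, -grad_f(x0)))
```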
Gradient Descent
Gradient descent provides one natural option to choose the descent direction: the negative of the gradient. Until the stopping criterion is satisfied (e.g. the current gradient is small enough), we update the current solution via
$$x^{(k+1)} = x^{(k)} - t^{(k)} \nabla f(x^{(k)}),$$
where $t^{(k)}$ is computed by either exact line search or backtracking.
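A sketch of the full gradient descent loop (my own implementation, not course-provided code), with the exact line search carried out numerically via scipy.optimize.minimize_scalar:

```python
# Gradient descent with a numerical exact line search on each iteration.
import numpy as np
from scipy.optimize import minimize_scalar

def gradient_descent(f, grad_f, x0, tol=1e-6, max_iters=500):
    x = x0.astype(float)
    for _ in range(max_iters):
        g = grad_f(x)
        if np.linalg.norm(g) <= tol:         # stopping criterion: small gradient
            break
        dx = -g                              # descent direction: negative gradient
        # exact line search: minimize phi(t) = f(x + t * dx) over a bounded interval
        phi = lambda t: f(x + t * dx)
        t = minimize_scalar(phi, bounds=(0.0, 10.0), method="bounded").x
        x = x + t * dx
    return x

# Usage on a strongly convex quadratic f(x) = (1/2) x^T Q x - b^T x.
Q = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
f = lambda x: 0.5 * x @ Q @ x - b @ x
grad_f = lambda x: Q @ x - b
x_star = gradient_descent(f, grad_f, np.zeros(2))
print(x_star, np.linalg.solve(Q, b))         # the two should agree
```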
Proposition. Consider some unconstrained minimization problem over $f$, where $f$ is $m$-strongly convex and $M$-smooth. If we were to perform gradient descent with exact line search beginning at $x^{(0)}$, we would reach an $\varepsilon$-suboptimal solution by step $k$ whenever
$$k \geq \frac{\log\left((f(x^{(0)}) - p^*)/\varepsilon\right)}{\log(1/c)},$$
where $c = 1 - m/M$; note that $c < 1$.
Proof. By smoothness and the first-order characterization, we have that
$$f\left(x^{(k)} - t \nabla f(x^{(k)})\right) \leq f(x^{(k)}) - t \|\nabla f(x^{(k)})\|_2^2 + \frac{M t^2}{2}\|\nabla f(x^{(k)})\|_2^2.$$
As we use exact line search, we can improve our bound by finding the $t$ that minimizes the right-hand side, which is a convex quadratic in $t$. We take the derivative with respect to $t$ and set it to 0:
$$-\|\nabla f(x^{(k)})\|_2^2 + M t \|\nabla f(x^{(k)})\|_2^2 = 0 \implies t = \frac{1}{M}.$$
Substituting this in,
$$f(x^{(k+1)}) \leq f(x^{(k)}) - \frac{1}{2M}\|\nabla f(x^{(k)})\|_2^2.$$
Hence,
$$f(x^{(k+1)}) - p^* \leq f(x^{(k)}) - p^* - \frac{1}{2M}\|\nabla f(x^{(k)})\|_2^2,$$
i.e. we always improve our optimality gap by at least
$$\frac{1}{2M}\|\nabla f(x^{(k)})\|_2^2.$$
Recall that with strong convexity, the gradient provides us with a bound on our suboptimality:
$$\|\nabla f(x^{(k)})\|_2^2 \geq 2m\left(f(x^{(k)}) - p^*\right).$$
Hence, we can restate our bound as
$$f(x^{(k+1)}) - p^* \leq \left(1 - \frac{m}{M}\right)\left(f(x^{(k)}) - p^*\right).$$
By induction,
$$f(x^{(k)}) - p^* \leq \left(1 - \frac{m}{M}\right)^k \left(f(x^{(0)}) - p^*\right) = c^k \left(f(x^{(0)}) - p^*\right).$$
To have that $f(x^{(k)}) - p^* \leq \varepsilon$, it is sufficient to have
$$c^k \left(f(x^{(0)}) - p^*\right) \leq \varepsilon,$$
meaning that it is sufficient for
$$k \geq \frac{\log\left((f(x^{(0)}) - p^*)/\varepsilon\right)}{\log(1/c)},$$
as desired.
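The rate proven above can be checked numerically (my own example, assuming numpy): for a quadratic $f(x) = \tfrac{1}{2} x^\top Q x$, exact line search has a closed form and the bound $f(x^{(k)}) - p^* \leq c^k (f(x^{(0)}) - p^*)$ should hold at every iteration.

```python
# Verify the linear convergence bound for gradient descent with exact line search
# on f(x) = (1/2) x^T Q x, where m and M are the extreme eigenvalues of Q.
import numpy as np

rng = np.random.default_rng(7)
A = rng.standard_normal((5, 5))
Q = A @ A.T + np.eye(5)
eigs = np.linalg.eigvalsh(Q)
m, M = eigs.min(), eigs.max()
c = 1 - m / M

f = lambda x: 0.5 * x @ Q @ x                # p* = 0, minimizer x* = 0
x = rng.standard_normal(5)
gap0 = f(x)

for k in range(1, 51):
    g = Q @ x
    t = (g @ g) / (g @ Q @ g)                # exact line search step for a quadratic
    x = x - t * g
    assert f(x) <= c**k * gap0 + 1e-12       # the proven bound holds at every step
print("final suboptimality:", f(x), " bound:", c**50 * gap0)
```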