# 1. 绪论

## Machine learning studies how to improve a system's performance through computational means, using experience.

| Item | Description |
| --- | --- |
| data | Different applications have different data. |
| model (function) | Define the model according to the specific problem. |
| loss (prediction) | Use a loss function to evaluate the model. |

### 监督学习原理

#### There is some unknown underlying rule relating y and X; we guess it has the following form

$$y=f(X)=\omega X+b$$

#### This formula represents a family of concrete functions; using known pairs (X0, y0) we gradually adjust it to approximate the true relationship

1. define a set of function
2. goodness of function
3. pick the best function
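As a toy illustration of the three steps (data and names invented here, not from the course), a 1-D linear model might be handled like this:

```python
# Step 1: define a set of functions f(x) = w*x + b, indexed by (w, b)
def f(x, w, b):
    return w * x + b

# Step 2: goodness of a function, measured by total squared error
def loss(w, b, xs, ys):
    return sum((f(x, w, b) - y) ** 2 for x, y in zip(xs, ys))

# Step 3: pick the best function from a small candidate grid
xs, ys = [1, 2, 3], [3, 5, 7]                      # data from y = 2x + 1
candidates = [(w, b) for w in (1, 2, 3) for b in (0, 1, 2)]
best = min(candidates, key=lambda p: loss(p[0], p[1], xs, ys))
```

In practice the "pick the best" step is gradient-based optimization rather than a grid search; the grid just keeps the sketch self-contained.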

# 2. 模型评估与选择

### 经验误差与过拟合

- error rate: the proportion of misclassified samples among all samples
- accuracy: the proportion of correctly classified samples among all samples
- error: the difference between the learner's predicted output and the sample's true output
- training error / empirical error: the error on the training set
- generalization error: the error on new samples

### 评估方法

#### Hold-out

k-fold cross validation: split the data into k folds; each fold serves once as the test set.

The special case of n samples split into n folds (k = n) is leave-one-out cross-validation.
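A minimal sketch of how the k-fold index split can be implemented (function name and data sizes are illustrative):

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists: indices 0..n-1 are split
    into k roughly equal folds and each fold serves once as validation."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i, val in enumerate(folds):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, val

splits = list(k_fold_indices(10, 5))   # 5-fold CV over 10 samples
```

Calling `k_fold_indices(n, n)` gives one held-out sample per split, i.e. leave-one-out.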

### 调参与最终模型

parameter tuning: choosing the values of the learning algorithm's parameters
validation set: data held out from the training data to select among candidate models

### Precision, Recall and F1

$$P = { TP \over TP + FP}.$$
$$R = { TP \over TP + FN}.$$

Break-Even Point (BEP): the value at which precision equals recall.

### F1

$$F1 = { {2 \times P \times R} \over {P + R} } = { {2 \times TP} \over { \text{total number of examples} + TP - TN } }$$

#### When P and R are weighted differently

$$F_\beta=\frac{(1 + \beta^2) \times P \times R}{(\beta^2 \times P) + R}$$

• When $\beta = 1$, it reduces to F1
• When $\beta > 1$, recall matters more
• When $\beta < 1$, precision matters more
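These formulas can be checked with a small helper computed directly from the confusion counts (the function and the counts below are invented for illustration):

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Precision, recall and F_beta from confusion-matrix counts."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f = (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
    return p, r, f

# e.g. 8 true positives, 2 false positives, 2 false negatives
p, r, f1 = precision_recall_f(tp=8, fp=2, fn=2)
```

With `beta=1` the returned `f` is exactly the F1 of the formula above.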

### ROC 与 AUC

#### ROC = Receiver Operating Characteristic

$$TPR = {TP \over TP + FN}$$

$$FPR = {FP \over TN + FP}$$

AUC is the area under the ROC curve; the larger, the better.
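One way to see how TPR, FPR and AUC fit together is to sweep a threshold over predicted scores and integrate the resulting curve with the trapezoid rule (a sketch with invented data, not a reference implementation):

```python
def roc_auc(scores, labels):
    """ROC points by threshold sweep; AUC by the trapezoid rule.
    labels are 0/1; a higher score means 'more likely positive'."""
    pos = sum(labels)
    neg = len(labels) - pos
    pts = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        pts.append((fp / neg, tp / pos))          # (FPR, TPR)
    pts = [(0.0, 0.0)] + pts
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])  # perfect ranking
```

A perfect ranking gives AUC = 1.0; random scores hover around 0.5.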

## 线性模型

#### 基本形式

$$f(x)=\omega_1x_1+\omega_2x_2+\cdots+\omega_dx_d+b$$

$$f(x)=\omega^Tx+b$$
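For a concrete (made-up) 3-feature example, the vector form is just a dot product plus a bias:

```python
# f(x) = w^T x + b for a hypothetical 3-feature input
w = [0.5, -1.0, 2.0]
b = 0.1
x = [2.0, 1.0, 3.0]
fx = sum(wi * xi for wi, xi in zip(w, x)) + b   # 0.5*2 - 1.0*1 + 2.0*3 + 0.1
```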

# 复习提纲

## Basic Concepts (all examinable)

1. Basic Concepts about Machine Learning
Q1. What is Machine Learning?
A1. Machine learning consists of three parts:
• Data
• Model (function)
• Loss (prediction)

| Item | Description |
| --- | --- |
| data | Different applications have different data. |
| model (function) | Define the model according to the specific problem. |
| loss (prediction) | Use a loss function to evaluate the model. |

Supervised learning is the machine learning task of inferring a function from labeled training data

1. Linear Regression
Simple linear regression describes the linear relationship between a predictor variable, plotted on the x-axis, and a response variable, plotted on the y-axis

2. Closed-form solution

Learning rate $\eta$ has a large impact on convergence:
Too large $\eta$ → oscillatory, may even diverge
Too small $\eta$ → too slow to converge

Set a larger learning rate at the beginning
Use a relatively smaller learning rate in the later epochs
Decrease the learning rate: $\eta_{t+1} = \eta_t / (t+1)$

## Linear Classification (Dual SVM and its derivation not examinable)

1. Linear Classification

$$f(x_i)\begin{cases} \ge 0 & \Rightarrow y_i = +1 \\ < 0 & \Rightarrow y_i = -1 \end{cases}$$

i.e. $y_i f(x_i) > 0$ for a correct classification

1. Support Vector Machine
What is a Good Decision Boundary?
Maximum margin solution: most stable under perturbations of the inputs

Select two parallel hyperplanes that separate the two classes of data and make the distance between them as large as possible.
The region bounded by these two hyperplanes is called the "margin".

In general there is a trade-off between the margin and the number of mistakes on the training data.
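One way this trade-off is handled in practice is the soft-margin objective, minimized here by SGD on the hinge loss. This is an illustrative sketch on invented data, not the course's reference implementation:

```python
import random

def train_linear_svm(data, lam=0.01, epochs=200, eta0=0.1, seed=0):
    """SGD on the soft-margin objective
    lam * ||w||^2 / 2 + mean(max(0, 1 - y * (w.x + b))).
    data is a list of (x_vector, y) pairs with y in {-1, +1}."""
    rng = random.Random(seed)
    d = len(data[0][0])
    w, b = [0.0] * d, 0.0
    for epoch in range(epochs):
        eta = eta0 / (1 + epoch)                 # decreasing learning rate
        rng.shuffle(data)
        for x, y in data:
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            w = [wi - eta * lam * wi for wi in w]        # regularization
            if margin < 1:                               # hinge subgradient
                w = [wi + eta * y * xi for wi, xi in zip(w, x)]
                b += eta * y
    return w, b

pts = [([2.0, 2.0], 1), ([3.0, 1.0], 1), ([-2.0, -1.0], -1), ([-1.0, -3.0], -1)]
w, b = train_linear_svm(pts)
```

The regularization strength `lam` is exactly the knob that trades margin width against training mistakes.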

True gradient descent is a batch algorithm: slow but sure.
Information is redundant.
Sufficient samples mean we can afford more frequent, noisy updates.
A never-ending stream means we should not wait for all data.
Tracking non-stationary data means that the target is moving.

Better for large datasets and often faster convergence.
Hard to reach high accuracy.
The best classical optimization methods cannot handle stochastic approximation.
Theoretical convergence guarantees are less well defined.

#### Advantages of SGD

1. Gradient is easy to calculate (instantaneous)
2. Less prone to local minima
3. Small memory footprint
4. Get to a reasonable solution quickly
5. Works for non-stationary environments as well as online settings
6. Can be used for more complex models and error surfaces

#### Is convergence necessary?

Non-stationary: convergence may not be required.
Stationary: the learning rate should decrease with time; the Robbins-Monro sequence $\eta_t = 1/t$ is adequate.

Mini-batch gradient descent

## SGD Recommendations

7. Randomly shuffle training examples
Although theory says you should randomly pick examples, it is easier to make a sequential pass through your training set.
Shuffling before each iteration eliminates the effect of ordering.
8. Monitor both training cost and validation error
Set aside samples for a decent validation set
Compute the objective on the training set and validation set (expensive but better than overfitting or wasting computation)
9. Check gradient using finite differences
If the gradient computation is slightly incorrect, it can yield an erratic, slow algorithm.
Verify your code by slightly perturbing the parameters and inspecting the difference between the analytic and numerical gradients.
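Such a finite-difference check might be sketched as follows (central differences; the test function and names are illustrative):

```python
def numerical_grad(f, x, eps=1e-6):
    """Central differences: df/dx_i ≈ (f(x + eps·e_i) - f(x - eps·e_i)) / (2·eps)."""
    grad = []
    for i in range(len(x)):
        xp = list(x); xp[i] += eps
        xm = list(x); xm[i] -= eps
        grad.append((f(xp) - f(xm)) / (2 * eps))
    return grad

# Verify an analytic gradient of f(x) = x0^2 + 3*x1 at x = (2, 5)
f = lambda x: x[0] ** 2 + 3 * x[1]
analytic = [2 * 2.0, 3.0]
numeric = numerical_grad(f, [2.0, 5.0])
```

If the two gradients differ by more than the finite-difference error (roughly `eps`-sized), the analytic gradient code is suspect.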
10. Experiment with the learning rates using a small sample of the training set
SGD convergence rates are independent of sample size.
Use traditional optimization algorithms as a reference point.
11. Leverage sparsity of the training examples
For very high-dimensional vectors with few non zero coefficients, you only need to update the weight coefficients corresponding to nonzero pattern in x
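A sketch of such a sparse update, assuming the input is given as an index-to-value map (names invented; the scalar `grad_coeff` stands for the example's residual under, e.g., squared loss):

```python
def sparse_sgd_update(w, x_nonzero, grad_coeff, eta):
    """Update only the weights whose input feature is nonzero.
    w: dense weight list; x_nonzero: {index: value} sparse input."""
    for i, xi in x_nonzero.items():
        w[i] -= eta * grad_coeff * xi
    return w

w = [0.0] * 1000                                  # 1000-dimensional weights
w = sparse_sgd_update(w, {3: 2.0, 997: -1.0}, grad_coeff=0.5, eta=0.1)
```

Only the two touched coordinates change, so the cost per update is proportional to the number of nonzeros, not the dimension.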
12. Use learning rates of the form $\eta_t = \eta_0/(1 + \eta_0\lambda t)$
This allows you to start from a reasonable learning rate determined by testing on a small sample.
It works well in most situations if the initial rate is slightly smaller than the best value observed on the training sample.
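The schedule might be written out as (values chosen purely for illustration):

```python
def lr_schedule(eta0, lam, t):
    """Learning rate of the form eta_t = eta0 / (1 + eta0 * lam * t)."""
    return eta0 / (1 + eta0 * lam * t)

rates = [lr_schedule(0.5, 0.1, t) for t in range(5)]   # monotonically decreasing
```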

## Cross-Validation (all examinable)

### Overfitting, Underfitting and Cross-Validation

#### Problem of Model Validation

Q1. Why We All Need Validation
• Need to choose the best model.
• Measure accuracy/power of the selected model.
• Good to measure ROI of the modeling project.

Statistical Reasons
• Model building techniques are inherently designed to minimize "loss" or "bias".
• To an extent, a model will always fit "noise" as well as "signal".
• If you just fit a bunch of models on a given dataset and choose the "best" one, it will likely be overly "optimistic".

#### Underfitting and Overfitting

Underfitting: high bias
The model cannot capture the underlying trend of the data.
Overfitting: high variance
The model is too complex; instead of the true pattern it fits the noise and outliers of the data.

The error on the dataset used to fit the model can be misleading: it does not predict future performance.
Too much complexity can diminish the model's accuracy on future data.

Complex model:
• Low "bias": the model fits the training data well; the model value is close to the data's expected value.
• High "variance": the model is more likely to make a wrong prediction.

##### How do we know if we are underfitting or overfitting?

If increasing capacity decreases generalization error, we are underfitting; otherwise we are overfitting.
– If the error on the training set is relatively large and the generalization error is large, then underfitting: need to increase capacity (the complexity of the model).
– If the error on the training set is relatively small and the generalization error is large, then overfitting: need to decrease capacity or increase the training set.
– With many features and a relatively small training set, if you have chosen a large capacity to complement the many features, you might overfit the data: need to decrease the capacity.

## Clustering (Hierarchical Agglomerative Clustering not examinable)

1. Clustering
2. K-Means Clustering
3. Hierarchical Agglomerative Clustering (not examinable)
4. Conclusion

## Recommendation systems: only the problem definitions are examinable — what Collaborative Filtering is, the typical recommendation methods with their pros and cons, and the concept of Cold-start Recommendation. The methods themselves are not examinable; they are covered in the course project.

Recommender System applies statistical and knowledge discovery techniques to the problem of making product recommendations.

Improve conversion rate: help customers find a product they want to buy.

Improve customer loyalty: create a value-added relationship.

Q1. What is Collaborative Filtering?
Make automatic predictions (filtering) about the interests of a user by collecting preferences or taste information from many other users (collaboration).

A1. Collaborative filtering uses the behavior of many users to predict what another user will do. Neighborhood-based collaborative filtering splits into user-based (UserCF) and item-based (ItemCF) algorithms.

Q2. What are the typical recommendation methods, and what are their pros and cons?
Memory-based CF: utilize the entire user-item database to generate a prediction.

• User-based CF: find similar users to predict ratings.
• Item-based CF: use similar items to predict ratings.
Model-based CF: build a model from the rating data (matrix factorization, etc.) and use this model to predict missing ratings.

User-based similarity is more dynamic.
Precomputing user neighbourhoods may lead to poor predictions.
Item-based similarity is static.
Precomputing item neighbourhoods may provide accurate results.
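Both user-based and item-based neighbourhoods are typically built from a similarity measure over rating vectors; cosine similarity is a common choice. A small invented example (users and ratings made up here):

```python
from math import sqrt

def cosine_sim(a, b):
    """Cosine similarity between two rating vectors (0 = unrated)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

alice = [5, 3, 0, 1]   # ratings over 4 items
bob   = [5, 3, 0, 1]   # identical taste -> similarity 1
carol = [1, 0, 5, 4]   # roughly opposite taste -> low similarity
```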

Memory-based CF
Pros:
• Requires minimal knowledge-engineering effort.
• Users and products are symbols without any internal structure or characteristics.
• Produces good-enough results in most cases.
Cons:
• Requires a large number of explicit and reliable ratings.
• Highly time-consuming to compute similarity with millions of users and items.

Model-based CF
Models are learned from the underlying data rather than heuristics

Models of user ratings (or purchases):

Matrix Factorization is the most widely used algorithm.
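A minimal sketch of matrix factorization trained by SGD, approximating the rating matrix as R ≈ PQᵀ (hyperparameters and the tiny rating set are invented for illustration):

```python
import random

def mf_sgd(ratings, n_users, n_items, k=2, lam=0.02, eta=0.02,
           epochs=1000, seed=0):
    """Factorize R ≈ P Q^T by SGD over observed (user, item, rating) triples."""
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][f] * Q[i][f] for f in range(k))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                P[u][f] += eta * (err * qi - lam * pu)   # gradient + L2 step
                Q[i][f] += eta * (err * pu - lam * qi)
    return P, Q

ratings = [(0, 0, 5.0), (0, 1, 1.0), (1, 0, 4.0), (1, 1, 1.0)]
P, Q = mf_sgd(ratings, n_users=2, n_items=2)
```

Missing ratings are then predicted as the dot product of the learned user and item factor vectors.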

Comparison between SGD and ALS:
• ALS is easier to parallelize than SGD.
• ALS converges faster than SGD.
• SGD has lower storage complexity than ALS (ALS needs to store the rating matrix R).
• SGD has lower computational complexity than ALS (ALS needs to compute matrix-vector multiplications).

• Content-based recommendation
• Collaborative filtering
• Latent factor model (LFM) recommendation

Q3. The concept of Cold-start Recommendation

1) User cold start: how to make personalized recommendations for a new user.
2) Item cold start: how to recommend a new item to users who may be interested in it; very important on time-sensitive sites such as news sites.
3) System cold start: how to design personalized recommendation for a newly launched site, so that users experience personalized recommendations from the start, when there are no users yet and only some item information.