
Machine Learning

Interviews take some preparation and technique, but the real work happens long before the interview itself. Day to day, focus on building a knowledge system, and keep adding to it through papers and experiments. This chapter focuses on the theory side; for system design, see 3.3 Machine Learning System Design.

1. Interview Requirements

  • Be familiar with common models: principles, code, practical application, pros and cons, common pitfalls

    • Inductive bias; independent and identically distributed (IID) data
  • Coverage includes ML breadth, ML depth, ML application, and coding

    • The math behind each algorithm: write out the key formulas and be able to derive them on a whiteboard
    • Newer areas such as large language models may be probed on paper details
    • Expect persistent follow-ups: why? Why does a given trick work?
    • How each algorithm scales, and how to cast it as a map-reduce computation
    • The complexity, parameter count, and compute cost of each algorithm
  • Be ready to present the machine learning projects on your resume

2. Standard Interview Questions

For model-specific details and standard questions, see the individual model pages.

  • Generative vs Discriminative

    • A generative model learns what each category of data looks like, while a discriminative model learns only the distinction between categories.
    • Discriminative models generally outperform generative models on classification tasks: a discriminative model learns the predictive distribution p(y|x) directly, while a generative model learns the joint distribution p(x, y) and then obtains the predictive distribution via Bayes' rule. A minimal sketch of the contrast follows below.
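
A minimal sketch of the contrast, assuming scikit-learn and a synthetic dataset chosen purely for illustration: Gaussian Naive Bayes is a generative classifier that models p(x|y) and p(y), while logistic regression models p(y|x) directly.

```python
# Sketch: a generative classifier (GaussianNB) vs. a discriminative one
# (LogisticRegression). Assumes scikit-learn; the dataset is synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Generative: models p(x|y) and p(y), then applies Bayes' rule for p(y|x).
gen = GaussianNB().fit(X_tr, y_tr)
# Discriminative: models the predictive distribution p(y|x) directly.
disc = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

print("GaussianNB accuracy:", gen.score(X_te, y_te))
print("LogReg accuracy:   ", disc.score(X_te, y_te))
```
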
  • The bias-variance tradeoff

    • Bias-variance decomposition: Error = Bias² + Variance + Irreducible Error (an empirical sketch follows after this list)
    • Ideally, one wants a model that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously.
    • High-variance learning methods may be able to represent their training set well but are at risk of overfitting to noisy or unrepresentative training data.
    • In contrast, algorithms with high bias typically produce simpler models that don't tend to overfit but may underfit their training data, failing to capture important regularities.
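
A minimal empirical sketch of the decomposition; the "true" function, noise level, and model class are assumptions chosen for illustration. Refitting the same model class on many resampled training sets lets you estimate bias² and variance at a test point.

```python
# Sketch: estimating bias^2 and variance empirically. The ground-truth
# function, noise level, and model class (degree-1 polynomial) are assumed.
import numpy as np

rng = np.random.default_rng(0)

def f(x):                              # assumed "true" function
    return np.sin(2 * np.pi * x)

x_test, sigma, degree = 0.3, 0.2, 1    # test point, noise std, model degree

preds = []
for _ in range(500):                   # many independent training sets
    x = rng.uniform(0, 1, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    coefs = np.polyfit(x, y, degree)   # refit the same model class
    preds.append(np.polyval(coefs, x_test))

preds = np.array(preds)
bias_sq = (preds.mean() - f(x_test)) ** 2
variance = preds.var()
# Expected squared error at x_test = bias^2 + variance + sigma^2 (irreducible)
print(f"bias^2={bias_sq:.4f}  variance={variance:.4f}  noise={sigma**2:.4f}")
```
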
  • How to address overfitting

    • Diagnostic trick: underfitting means large training error and large generalization error; overfitting means small training error but large generalization error
    • Data: collect more training data (more data); failing that, data augmentation, or start from a pretrained model
    • Features: feature selection
    • Model
      • Reduce model complexity, e.g. fewer or narrower layers in a neural network, shallower trees or pruning for tree models
      • Regularization, e.g. an L2 penalty or dropout
      • Ensemble methods such as bagging
    • Training: early stopping, weight decay (a minimal regularization sketch follows below)
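
A minimal sketch of the regularization remedy; the synthetic data, polynomial degree, and alpha value are assumptions for illustration. An L2 penalty (Ridge) tames a badly overfitting high-degree polynomial fit.

```python
# Sketch: L2 regularization (Ridge) reducing overfitting of a degree-15
# polynomial. Data, degree, and alpha are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
x_tr = rng.uniform(0, 1, (40, 1))
y_tr = np.sin(2 * np.pi * x_tr).ravel() + rng.normal(0, 0.3, 40)
x_te = rng.uniform(0, 1, (200, 1))
y_te = np.sin(2 * np.pi * x_te).ravel() + rng.normal(0, 0.3, 200)

for name, reg in [("no regularization", LinearRegression()),
                  ("L2 regularization", Ridge(alpha=1e-3))]:
    model = make_pipeline(PolynomialFeatures(degree=15), reg).fit(x_tr, y_tr)
    print(f"{name}: train R^2={model.score(x_tr, y_tr):.3f} "
          f"test R^2={model.score(x_te, y_te):.3f}")
```
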
  • How to address underfitting

    • Features: add new features
    • Model: increase model complexity, reduce the regularization coefficient
    • Training: the first step of training is making sure the model can overfit at all; train for more epochs
  • How to handle class imbalance

    • https://imbalanced-learn.org/en/stable/user_guide.html
    • Evaluation metric: AP (average_precision_score)
    • Downsampling: faster convergence, saves disk space, helps calibration. The question of sample quantity extends naturally to sample difficulty
    • Upweighting: so that after downsampling, every original sample still contributes to the loss equally (sketch below)
    • Long-tail classification: keep only the head labels that cover 80% of the data and mark the remaining labels as "others"
    • Extreme imbalance, e.g. 99.99% vs. 0.01%: use outlier detection methods
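
A minimal sketch of downsample-and-upweight; the downsampling factor, data, and classifier are illustrative assumptions. Keep all positives, keep a fraction of negatives, and give the kept negatives a weight equal to the downsampling factor so the loss (and hence calibration) stays unbiased.

```python
# Sketch: downsample the negative (majority) class, then upweight the kept
# negatives so each original example still contributes equally to the loss.
# The factor and classifier choice are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

def downsample_upweight(X, y, factor, rng):
    """Keep all positives; keep 1/factor of negatives with weight=factor."""
    neg = np.where(y == 0)[0]
    pos = np.where(y == 1)[0]
    kept_neg = rng.choice(neg, size=len(neg) // factor, replace=False)
    idx = np.concatenate([pos, kept_neg])
    w = np.where(y[idx] == 0, float(factor), 1.0)  # upweight kept negatives
    return X[idx], y[idx], w

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 5))
y = (rng.uniform(size=10000) < 0.02).astype(int)   # ~2% positives

X_s, y_s, w = downsample_upweight(X, y, factor=10, rng=rng)
clf = LogisticRegression().fit(X_s, y_s, sample_weight=w)
```
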
  • How to handle missing data

  • How to handle high-cardinality categorical features (sketches of two encodings below)

    • Feature Hashing
    • Target Encoding
    • Clustering Encoding
    • Embedding Encoding
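
Hedged sketches of two of these encodings; the bucket count, smoothing constant, and toy DataFrame are assumptions. Feature hashing maps each category into a fixed number of buckets, and smoothed target encoding blends the per-category target mean with the global mean.

```python
# Sketch: feature hashing and smoothed target encoding for a high-cardinality
# categorical column. Bucket count and smoothing are illustrative assumptions.
import hashlib
import pandas as pd

def hash_encode(value: str, n_buckets: int = 1024) -> int:
    """Feature hashing: map a category string to one of n_buckets."""
    digest = hashlib.md5(value.encode()).hexdigest()
    return int(digest, 16) % n_buckets

def target_encode(df, col, target, smoothing=10.0):
    """Smoothed target encoding: blend category mean with the global mean."""
    global_mean = df[target].mean()
    stats = df.groupby(col)[target].agg(["mean", "count"])
    smooth = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing)
    return df[col].map(smooth)

df = pd.DataFrame({"zip": ["94301", "10001", "94301", "60601"],
                   "clicked": [1, 0, 1, 0]})
df["zip_hashed"] = df["zip"].apply(hash_encode)
df["zip_te"] = target_encode(df, "zip", "clicked")
```

In practice, target statistics should be computed out-of-fold (or with a time split) to avoid label leakage.
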
  • How to choose an optimizer

    • MSE or log-likelihood objectives: (batch) gradient descent
    • SGD: when there is too much training data
    • Adam: sparse input
  • How to address gradient vanishing & exploding

    • Vanishing gradients
      • Activation functions, e.g. ReLU
      • Residual networks
      • Batch normalization
    • Exploding gradients
      • Gradient clipping (sketch below)
      • LSTM gates
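
A minimal PyTorch sketch of gradient clipping in a training step; the model, data, and max_norm value are illustrative assumptions.

```python
# Sketch: gradient clipping in a PyTorch training step. The model, data,
# and max_norm value are illustrative assumptions.
import torch
import torch.nn as nn

model = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(8, 20, 16)                       # (batch, seq, features)
target = torch.randn(8, 20, 32)

out, _ = model(x)
loss = nn.functional.mse_loss(out, target)
opt.zero_grad()
loss.backward()
# Rescale gradients so their global L2 norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
opt.step()
```
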
  • Data collection

    • Production data and labels
    • Public Internet datasets
  • How to handle distribution mismatch

    • Distribution mismatch can involve both features and labels. For labels, collect as much data as possible; it again comes down to balancing the data
    • When the data distribution drifts, automate retraining and deployment; if performance drops too sharply, intervene manually and retrain
    • Feature leakage (using information from the future) can also present as distribution mismatch; address it by preventing the leakage
  • Online/offline inconsistency

    • Model behavior in production: data/feature distribution drift, feature bugs
    • Model generalization: alignment with offline metrics
  • Curse of dimensionality (see the PCA sketch below)

    • Feature Selection
    • PCA
    • Embeddings
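
A minimal scikit-learn PCA sketch; the low-rank synthetic data and the 95% variance threshold are illustrative assumptions.

```python
# Sketch: PCA for dimensionality reduction. The synthetic low-rank data and
# the 95% explained-variance threshold are illustrative assumptions.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
Z = rng.normal(size=(1000, 10))        # latent low-rank structure
X = Z @ rng.normal(size=(10, 200)) + 0.1 * rng.normal(size=(1000, 200))

pca = PCA(n_components=0.95)           # keep components explaining 95% variance
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)  # roughly (1000, 200) -> (1000, ~10)
```
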
  • How to reduce model latency

    • Smaller models
    • Knowledge distillation
    • Quantize the model to 8-bit or 4-bit (sketch below)
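
A minimal sketch of post-training dynamic quantization with PyTorch; the toy model is an assumption, and this particular API stores Linear-layer weights as int8.

```python
# Sketch: post-training dynamic quantization of Linear layers to int8 with
# PyTorch. The toy model architecture is an illustrative assumption.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8)   # int8 weights for Linear layers

x = torch.randn(1, 512)
print(quantized(x).shape)
```
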
  • Model parallelism

    • Linear/logistic regression
    • XGBoost
    • CNN
    • RNN
    • Transformer
    • In deep learning frameworks, the multiplication of a single tensor is automatically parallelized internally

3. Handwritten ML Code Examples

ML code challenges

  • Implement KNN from scratch
  • Implement K-means from scratch
  • Implement backpropagation for softmax
  • Implement AUC
  • Implement SGD
  • Implement a two-layer fully connected network
  • Implement a CNN
    • How do you compute a convolution layer's output size? Write out the formula (see the sketch after this list)
  • Implement dropout, forward and backward passes
  • Implement focal loss
  • Implement an LSTM
    • Given an LSTM network's structure, compute its parameter count (see the sketch after this list)
  • NLP:
  • Vision:
    • Implement IoU/NMS
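
For the two formula prompts above, a minimal sketch of the standard formulas; the function and variable names are illustrative.

```python
# Sketch: the conv output-size and LSTM parameter-count formulas asked for
# in the prompts above. Names are illustrative.

def conv_output_size(n: int, k: int, p: int = 0, s: int = 1) -> int:
    """Conv layer output size: floor((n + 2p - k) / s) + 1,
    for input size n, kernel size k, padding p, stride s."""
    return (n + 2 * p - k) // s + 1

def lstm_param_count(input_dim: int, hidden_dim: int) -> int:
    """A standard LSTM has 4 gates, each with an input weight matrix,
    a recurrent weight matrix, and a bias vector."""
    return 4 * (input_dim * hidden_dim + hidden_dim * hidden_dim + hidden_dim)

assert conv_output_size(n=32, k=3, p=1, s=1) == 32
assert lstm_param_count(input_dim=128, hidden_dim=256) == 4 * (128 * 256 + 256 * 256 + 256)
```
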

References