Abstract

In short: when working with a linear regression model, taking the log of the features apparently works well as a way to bring them onto a comparable scale.

http://www.kenbenoit.net/courses/ME104/logmodels2.pdf

stats.stackexchange.com

For the details of why this works, see the links above.

I originally came across this technique in the discussion for a Kaggle competition on predicting house prices from property data.

https://www.kaggle.com/apapiu/house-prices-advanced-regression-techniques/regularized-linear-models

 I log transformed certain features for which the skew was > 0.75. This will make the feature more normally distributed and this makes linear regression perform better - since linear regression is sensitive to outliers. Note that if I used a tree-based model I wouldn't need to transform the variables.

The above is an excerpt from a comment at that link. Summarized, it seems to say the following.

Linear regression is sensitive to outliers. Taking the log of a skewed feature brings its distribution closer to normal, and that is said to improve the model's performance.
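As a rough illustration of that summary, here is a minimal sketch using only the Python standard library. The synthetic log-normal feature and the `skewness` helper are my own assumptions for the example; only the 0.75 threshold and the log(feature + 1) transform come from the quoted comment.

```python
import math
import random
import statistics


def skewness(values):
    """Skewness as the mean of cubed z-scores (population std)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


random.seed(0)
# Hypothetical right-skewed feature (log-normal), standing in for
# something like a lot area. Not taken from the Kaggle data.
feature = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]

SKEW_THRESHOLD = 0.75  # threshold mentioned in the quoted comment

raw_skew = skewness(feature)
if raw_skew > SKEW_THRESHOLD:
    # log(feature + 1), as in the quoted preprocessing notes
    transformed = [math.log1p(v) for v in feature]
else:
    transformed = feature
log_skew = skewness(transformed)

print(f"skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

The skew after the transform should come out much smaller than before, which is the "more normally distributed" effect the comment describes.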

how to

For the concrete preprocessing steps, I quote from the link below.

https://www.kaggle.com/apapiu/house-prices-advanced-regression-techniques/regularized-linear-models

Data preprocessing:

We're not going to do anything fancy here:

First I'll transform the skewed numeric features by taking log(feature + 1) - this will make the features more normal
Create Dummy variables for the categorical features
Replace the numeric missing values (NaN's) with the mean of their respective columns
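The three steps quoted above could be sketched roughly as follows with pandas and NumPy. This is not the kernel's actual code: the toy DataFrame and its column names are made up for illustration, and I use pandas' own `skew()` as the skewness measure, which is an assumption on my part.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the values and column names are illustrative only.
df = pd.DataFrame({
    "LotArea": [8450.0, 9600.0, 11250.0, 9550.0, 215245.0, np.nan],
    "YearBuilt": [2003.0, 1976.0, 2001.0, 1915.0, 2000.0, 1993.0],
    "Neighborhood": ["CollgCr", "Veenker", "CollgCr",
                     "Crawfor", "NoRidge", "Veenker"],
})

# 1. log(feature + 1) for numeric columns whose skew exceeds 0.75.
numeric_cols = df.select_dtypes(include="number").columns
skews = df[numeric_cols].skew()  # NaNs are skipped by default
skewed_cols = skews[skews > 0.75].index
df[skewed_cols] = np.log1p(df[skewed_cols])

# 2. Dummy (one-hot) variables for the categorical columns.
df = pd.get_dummies(df)

# 3. Replace the remaining numeric NaNs with the column mean.
df = df.fillna(df.mean())

print(df.head())
```

Here only `LotArea` is skewed enough to be transformed, and its missing value is filled with the mean of the already log-transformed column, because the fill happens after the transform.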

Noting this down so I don't forget.

Happy Coding

This blog is for my memorandum about programming and English.

Linear Regression Models with Logarithmic Transformations
