machine learning - Why should my training set also be skewed in terms of class distribution just because my test set is skewed?
My question is: why should the training set be skewed (i.e., have fewer instances of the positive class compared to the negative class) just because the test set is skewed? I have read that it is important to maintain the same class distribution in both the training and test sets in order to get a realistic measure of performance. For example, if the test set has a 90%-10% distribution of class instances, should the training set have the same proportions?
I am finding it difficult to understand why it is important to maintain in the training set the same proportions of class instances that are present in the test set.
The reason I find this difficult to understand: don't we want the classifier to learn patterns from both classes? So why should the skewness of the test set dictate that the training set be skewed as well?
Any thoughts would be helpful.
IIUC, you're asking about the rationale for using stratified sampling (e.g., as used in scikit-learn's StratifiedKFold).
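For concreteness, here is a minimal sketch of stratified splitting in scikit-learn (the toy data and the ~5% prevalence are my own illustration, not from the question):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

# Toy data (illustrative): 1000 samples, positive class at roughly 5% prevalence.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
y = (rng.random(1000) < 0.05).astype(int)

# A single stratified split: both parts keep the ~5% prevalence.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
print(y_train.mean(), y_test.mean())  # both close to 0.05

# StratifiedKFold: every fold preserves the class proportions too.
for train_idx, test_idx in StratifiedKFold(n_splits=5).split(X, y):
    print(y[test_idx].mean())
```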
Once you've divided your data into train and test sets, you have three datasets to consider:
1. the "real world" set, on which your classifier will actually run
2. the train set, on which you'll learn patterns
3. the test set, which you'll use to evaluate the performance of the classifier
(So the use of 2. + 3. is for estimating how things will run on 1., including possibly tuning parameters.)
Suppose your data has some class represented far from uniformly - it appears in only 5% of the instances, far fewer than it would if the classes were generated uniformly. Moreover, you believe this is not a GIGO case - in the real world too, the probability of this class really is 5%.
When you divide into 2. + 3., you run the chance that things will be skewed relative to 1.:
- It's possible that the class won't appear in 5% of the instances (in the train or test set), but rather more or less often.
- It's possible that some of the feature instances of the class will be skewed in the train or test set, relative to 1.
In these cases, when you make decisions based on the 2. + 3. combination, it's probable that they won't indicate well the effect on 1., which is what you're really after.
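To make the first point concrete, here is a small simulation (a sketch with made-up numbers, assuming the same ~5% prevalence as above): over repeated plain random splits, the minority class's share of the test set drifts around 5%, while stratified splits pin it down.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
y = (rng.random(1000) < 0.05).astype(int)  # true prevalence ~5%
X = np.zeros((len(y), 1))                  # features don't matter here

plain, strat = [], []
for seed in range(200):
    # Plain random split: test-set prevalence fluctuates around 5%.
    _, _, _, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    plain.append(y_te.mean())
    # Stratified split: test-set prevalence is pinned to ~5% every time.
    _, _, _, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    strat.append(y_te.mean())

print("plain split std:", np.std(plain))  # noticeable spread
print("stratified std: ", np.std(strat))  # essentially zero
```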
Incidentally, I don't think the emphasis is on skewing the train set to fit the test set, but rather on making the train and test sets each fit the entire sampled data.