2/19/2023

Train caret

The function dummyVars can be used to generate a complete (less than full rank parameterized) set of dummy variables from one or more factors. The function takes a formula and a data set and outputs an object that can be used to create the dummy variables using the predict method.

For example, the etitanic data set in the earth package includes two factors: pclass (passenger class, with levels 1st, 2nd, 3rd) and sex (with levels female, male). The base R function model.matrix would generate the following variables:

    library(earth)
    library(caret)  # provides dummyVars, used further down
    data(etitanic)
    # Treatment contrasts: an intercept plus one column per non-reference level
    head(model.matrix(survived ~ ., data = etitanic))
    # (Intercept) pclass2nd pclass3rd sexmale age sibsp parch

Using dummyVars:

    dummies <- dummyVars(survived ~ ., data = etitanic)
    # One column per factor level, no intercept
    head(predict(dummies, newdata = etitanic))
    # pclass.1st pclass.2nd pclass.3rd sex.female sex.male age sibsp parch

Note there is no intercept and each factor has a dummy variable for each level, so this parameterization may not be useful for some model functions, such as lm.

3.2 Zero- and Near Zero-Variance Predictors

In some situations, the data generating mechanism can create predictors that only have a single unique value (i.e. a "zero-variance predictor"). For many models (excluding tree-based models), this may cause the model to crash or the fit to be unstable.

Similarly, predictors might have only a handful of unique values that occur with very low frequencies. For example, in the drug resistance data, the nR11 descriptor (number of 11-membered rings) data have a few unique numeric values that are highly unbalanced:

    data(mdrr)  # the mdrr data ship with caret
    data.frame(table(mdrrDescr$nR11))
    # Var1 Freq

The concern here is that these predictors may become zero-variance predictors when the data are split into cross-validation/bootstrap sub-samples, or that a few samples may have an undue influence on the model. These "near-zero-variance" predictors may need to be identified and eliminated prior to modeling.

To identify these types of predictors, the following two metrics can be calculated:

- the frequency of the most prevalent value over the second most frequent value (called the "frequency ratio"), which would be near one for well-behaved predictors and very large for highly unbalanced data, and
- the "percent of unique values", which is the number of unique values divided by the total number of samples (times 100) and approaches zero as the granularity of the data increases.

If the frequency ratio is greater than a pre-specified threshold and the unique value percentage is less than a threshold, we might consider a predictor to be near zero-variance.

We would not want to falsely identify data that have low granularity but are evenly distributed, such as data from a discrete uniform distribution.
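The frequency ratio and the percent of unique values can be computed by hand in a few lines of base R. The sketch below is only an illustration: the helper names nzv_metrics and is_near_zero_var, the toy vectors, and the cutoffs of 19 and 10 are assumptions made for this example and are not taken from the text. caret ships its own nearZeroVar function for this screening, which is what you would normally use.

```r
# Hypothetical helper (not part of caret): the two screening metrics for one predictor.
nzv_metrics <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Frequency ratio: count of the most common value over the second most common.
  # With a single unique value there is no second count; Inf is an assumed convention.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Percent of unique values: distinct values over number of samples, times 100.
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# Hypothetical rule: flag when BOTH conditions hold (cutoffs are illustrative).
is_near_zero_var <- function(x, freq_cut = 19, unique_cut = 10) {
  m <- nzv_metrics(x)
  m[["freq_ratio"]] > freq_cut && m[["pct_unique"]] < unique_cut
}

# Toy data, made up for illustration
unbalanced <- c(rep(0, 95), rep(1, 4), 2)  # one dominant value
uniform    <- rep(1:4, times = 25)         # low granularity, evenly spread

nzv_metrics(unbalanced)       # freq_ratio 23.75, pct_unique 3
nzv_metrics(uniform)          # freq_ratio 1,     pct_unique 4
is_near_zero_var(unbalanced)  # TRUE
is_near_zero_var(uniform)     # FALSE
```

The uniform vector has just as few unique values as the unbalanced one, but its frequency ratio is 1, so requiring both conditions keeps it from being flagged. That is the discrete uniform case the text warns about.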