Monday, November 30, 2020

Feature Engineering - Numerical Data Scaling

Numerical features can be scaled to ensure that each has a proportionate influence on the model's predictions.


Common techniques for scaling

So how do we do it, exactly? How can we bring different features onto the same scale?

Keep in mind that not all ML algorithms are sensitive to differences in the scale of input features. Here are the scaling and normalizing transformations most commonly used in data science and ML projects (a sketch of all five follows the list):

  • Mean/variance standardization
  • MinMax scaling
  • Maxabs scaling
  • Robust scaling
  • Normalizer
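
All five of these are available in scikit-learn's preprocessing module. Here is a minimal sketch on a made-up two-feature matrix (the data values are assumptions for illustration):

import numpy as np
from sklearn.preprocessing import (StandardScaler, MinMaxScaler,
                                   MaxAbsScaler, RobustScaler, Normalizer)

# Toy data: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 800.0]])

scalers = {
    "Mean/variance standardization": StandardScaler(),  # zero mean, unit variance
    "MinMax scaling": MinMaxScaler(),        # rescales each feature to [0, 1]
    "Maxabs scaling": MaxAbsScaler(),        # divides by the max absolute value
    "Robust scaling": RobustScaler(),        # uses median and IQR, resists outliers
    "Normalizer": Normalizer(),              # scales each ROW to unit norm
}

for name, scaler in scalers.items():
    print(name)
    print(scaler.fit_transform(X))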
==================================================================

So how do you encode nominal variables? The one-hot encoding method is a good choice. Here’s how it works.



In this example, you have a single column called home type, with three different levels: House, Apartment, and Condo. The data frame has five observations for that particular feature.

With one-hot encoding, you convert this single home type column into three columns: a column for House, a column for Apartment, and a column for Condo. You encode each observation with either a 1 or a 0: 1 to indicate the home type of that particular observation, and 0 for the other options.
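
Here is a minimal sketch of that example with pandas (the column name home_type and the five values are assumptions chosen to match the description above):

import pandas as pd

df = pd.DataFrame({"home_type": ["House", "Apartment", "Condo",
                                 "House", "Apartment"]})

# get_dummies() replaces the single home_type column with one 0/1
# column per level: House, Apartment, and Condo
encoded = pd.get_dummies(df, columns=["home_type"])
print(encoded)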


============================================================


Topics related to this subdomain

Here are some topics you may want to study for more in-depth information related to this subdomain:



  • Scaling

  • Normalizing

  • Dimensionality reduction

  • Date formatting

  • One-hot encoding

Monday, November 16, 2020

Univariate Analysis

VISUALIZING UNIVARIATE CONTINUOUS DATA


Univariate plots are of two types:

1) Enumerative plots

2) Summary plots


Univariate enumerative plots:

These plots enumerate/show every observation in the data and provide information about the distribution of the observations for a single variable. We now look at different enumerative plots (a minimal sketch follows the list below).


Examples:

1. UNIVARIATE SCATTER PLOT  

2. LINE PLOT (with markers)

3. STRIP PLOT

4. SWARM PLOT 
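
As a minimal sketch, here is how two of these enumerative plots (strip and swarm) can be drawn with seaborn; the randomly generated sample is an assumption for illustration:

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

values = np.random.normal(loc=50, scale=10, size=100)  # toy continuous variable

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.stripplot(x=values, ax=axes[0])   # every observation, with jitter
axes[0].set_title("Strip plot")
sns.swarmplot(x=values, ax=axes[1])   # every observation, no overlap
axes[1].set_title("Swarm plot")
plt.show()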


Univariate summary plots:

These plots give a more concise description of the location, dispersion, and distribution of a variable than an enumerative plot. A summary plot cannot show every individual data value, but it represents the whole data set efficiently, making it easier to draw conclusions about the entire distribution (a sketch follows the list below).



  5. HISTOGRAMS

  6. DENSITY PLOTS

  7. RUG PLOTS

  8. BOX PLOTS

  9. distplot()

  10. VIOLIN PLOTS
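
Here is a minimal sketch of several of these summary plots with seaborn (the random sample is an assumption; note that newer seaborn versions deprecate distplot() in favor of histplot() and displot()):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

values = np.random.normal(loc=50, scale=10, size=500)  # toy continuous variable

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
sns.histplot(values, kde=True, ax=axes[0, 0])  # histogram with a density curve
axes[0, 0].set_title("Histogram + density")
sns.rugplot(values, ax=axes[0, 1])             # one tick per observation
axes[0, 1].set_title("Rug plot")
sns.boxplot(x=values, ax=axes[1, 0])           # median, quartiles, outliers
axes[1, 0].set_title("Box plot")
sns.violinplot(x=values, ax=axes[1, 1])        # box plot + density in one view
axes[1, 1].set_title("Violin plot")
plt.show()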



VISUALIZING CATEGORICAL VARIABLES:

11. BAR CHART

12. PIE CHART
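
A minimal sketch of both charts using pandas plotting; the categories and counts are made-up assumptions:

import pandas as pd
import matplotlib.pyplot as plt

home_type = pd.Series(["House", "Apartment", "Condo",
                       "House", "Apartment", "House"])
counts = home_type.value_counts()  # frequency of each category

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
counts.plot(kind="bar", ax=axes[0], title="Bar chart")
counts.plot(kind="pie", ax=axes[1], title="Pie chart", autopct="%1.0f%%")
plt.show()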



https://www.analyticsvidhya.com/blog/2020/07/univariate-analysis-visualization-with-illustrations-in-python/




Monday, November 9, 2020

Hyperparameters for each ML algorithm

In machine learning, a hyperparameter (sometimes called a tuning or training parameter) is any parameter whose value is set at the onset of the learning process, whereas the values of the model's other parameters are computed during training.
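
A minimal sketch of the distinction with scikit-learn's LogisticRegression: C is a hyperparameter chosen before training, while coef_ and intercept_ are parameters learned from the data.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

model = LogisticRegression(C=1.0, max_iter=200)  # hyperparameters, set up front
model.fit(X, y)                                  # training computes the parameters

print(model.coef_)       # learned weights
print(model.intercept_)  # learned biases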


Key hyperparameters for some common algorithms:

  • K-Nearest Neighbors: K (n_neighbors), leaf_size, weights, and metric

  • Decision Trees and Random Forests: n_estimators, max_depth, min_samples_split, min_samples_leaf, and criterion

  • AdaBoost and Gradient Boosting: n_estimators, learning_rate, and base_estimator (AdaBoost) / loss (Gradient Boosting)

  • Support Vector Machines: C, kernel, and gamma


Specifically, I will focus on the hyperparameters that tend to have the greatest effect on the bias-variance tradeoff. It is very important to keep this tradeoff in mind, as well as the tradeoff between computational cost and scoring metrics; ideally, we want a model with low bias and low variance to limit overall error.
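
Here is a minimal sketch of tuning some of the random forest hyperparameters listed above with scikit-learn's GridSearchCV (the grid values and the iris data set are assumptions for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
    "min_samples_split": [2, 5],
}

# 5-fold cross-validated search over every combination in the grid
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # best hyperparameter combination found
print(search.best_score_)   # its mean cross-validated accuracy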

https://medium.com/swlh/the-hyperparameter-cheat-sheet-770f1fed32ff

The Hyperparameter Cheat Sheet: A quick guide to hyperparameter tuning utilizing Scikit-Learn's GridSearchCV, and the bias/variance trade-off

J.P. Rinfret

=====================================================================


https://towardsdatascience.com/model-parameters-and-hyperparameters-in-machine-learning-what-is-the-difference-702d30970f6


Examples of hyperparameters used in the scikit-learn package

1. Perceptron Classifier

from sklearn.linear_model import Perceptron
Perceptron(max_iter=40, eta0=0.1, random_state=0)


2. Train/Test Split

from sklearn.model_selection import train_test_split
train_test_split(X, y, test_size=0.4, random_state=0)


3. Logistic Regression Classifier

from sklearn.linear_model import LogisticRegression
LogisticRegression(C=1000.0, random_state=0)


4. KNN (k-Nearest Neighbors) Classifier

from sklearn.neighbors import KNeighborsClassifier
KNeighborsClassifier(n_neighbors=5, p=2, metric='minkowski')


5. Support Vector Machine Classifier

from sklearn.svm import SVC
SVC(kernel='linear', C=1.0, random_state=0)


6. Decision Tree Classifier

from sklearn.tree import DecisionTreeClassifier
DecisionTreeClassifier(criterion='entropy', max_depth=3, random_state=0)

7. Lasso Regression

from sklearn.linear_model import Lasso
Lasso(alpha=0.1)

8. Principal Component Analysis

from sklearn.decomposition import PCA
PCA(n_components=4)