Thursday, May 5, 2022

Azure

 https://www.youtube.com/watch?v=Z0Xuvwi0838

Delivering services privately in your VNet with Azure Private Link BRK3168


YouTube: Marc Kean (need to subscribe)


https://www.keepsecure.ca/blog/azure-storage-network-security/

https://www.keepsecure.ca/blog/snowflake-storage-integration-and-azure/












Sunday, March 20, 2022

Data Mesh

 

https://towardsdatascience.com/data-domains-and-data-products-64cc9d28283e

https://towardsdatascience.com/data-mesh-topologies-85f4cad14bf2


https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/scenarios/data-management/


Search for Piethein Strengholt to get Azure Data Mesh Patterns.

https://piethein.medium.com/

Snowflake Data Lake Patterns

 












Wednesday, March 2, 2022

Python LINKS

Pandas - Links

 https://github.com/tommyod/awesome-pandas


SQLAlchemy

https://docs.sqlalchemy.org/en/14/core/engines.html


https://www.sentryone.com/blog/access-sql-server-databases-from-python


https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/data-levels-and-measurement/



Sunday, March 28, 2021

Visualization Books

Visualization Analysis and Design, Tamara Munzner, A K Peters. Much of the course material is taken from this book.

Making Data Visual, Danyel Fisher & Miriah Meyer, O'Reilly. Very good introductory text.


Monday, January 18, 2021

DataFrames vs Datasets vs RDD

 https://www.youtube.com/watch?v=9yNmTucj6HU&list=PLfxl5dzojKr4K_NtVFKDnecP3mUp8K6Fj&index=5



















Tuesday, January 12, 2021

SPARK Websites

 https://www.composablesystems.org/17-400/fa2020/schedule/


https://www.composablesystems.org/17-400/fa2020/#course-information



https://heather.miller.am/index.html#teaching


https://heather.miller.am/teaching/cs4240/spring2018/



Friday, December 25, 2020

Deep Learning Hyperparameter Tuning example

 https://www.kaggle.com/jamesleslie/titanic-neural-network-for-beginners


titanic-neural-network-for-beginners:


Summary:


create_model is the key function in the whole notebook.

def create_model(lyrs=[8], act='linear', opt='Adam', dr=0.0):


The notebook uses GridSearchCV to find the best value for each hyperparameter.

Hyperparameters: batch_size, epochs, optimizer, layers and drops



Hyperparameter Tuning


GridSearchCV - batch size and epochs

batch_size = [16, 32, 64]

epochs = [50, 100]


Best: 0.822671 using {'batch_size': 32, 'epochs': 50}

===================================================

GridSearchCV - Optimization Algorithm

optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Nadam']


Best: 0.822671 using {'opt': 'Adam'}


===================================================


GridSearchCV - Hidden neurons

layers = [[8],[10],[10,5],[12,6],[12,8,4]]


Best: 0.822671 using {'lyrs': [8]}


===================================================



GridSearchCV - Dropout

drops = [0.0, 0.01, 0.05, 0.1, 0.2, 0.5]


Best: 0.824916 using {'dr': 0.2}


===================================================


model = create_model(lyrs=[8], dr=0.2)


training = model.fit(X_train, y_train, epochs=50, batch_size=32, 

                     validation_split=0.2, verbose=0)




I still have a few questions:


a. The initial model gave val_acc: 86.53%, whereas the final model gave acc: 83.16%.

b. Batch size and epochs were fixed first, and then the best value of each remaining hyperparameter was found separately,

   rather than adding one tuned hyperparameter after another to the search.
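
Question b can be made concrete with a small sketch. It compares tuning one hyperparameter at a time (keeping each winner before moving to the next) with a joint grid search over every combination. The `score` function is a made-up stand-in for cross-validated accuracy; its numbers and its interaction term are assumptions for illustration, not results from the notebook.

```python
from itertools import product

# Search space taken from the notes above.
GRID = {
    "batch_size": [16, 32, 64],
    "epochs": [50, 100],
    "dr": [0.0, 0.01, 0.05, 0.1, 0.2, 0.5],
}


def score(params):
    """Hypothetical stand-in for cross-validated accuracy (NOT real results)."""
    bs, ep, dr = params["batch_size"], params["epochs"], params["dr"]
    value = 0.80 - abs(bs - 32) / 1000 - abs(ep - 50) / 5000 - (dr - 0.2) ** 2
    # Assumed interaction: in this toy model large batches benefit from dropout.
    if bs == 64:
        value += 0.2 * dr
    return value


def sequential_search(grid, start):
    """Tune one hyperparameter at a time, keeping each winner before moving on."""
    best = dict(start)
    evaluations = 0
    for name, values in grid.items():
        scored = []
        for value in values:
            trial = {**best, name: value}
            scored.append((score(trial), value))
            evaluations += 1
        best[name] = max(scored, key=lambda pair: pair[0])[1]
    return best, score(best), evaluations


def joint_search(grid):
    """Evaluate every combination of hyperparameter values."""
    names = list(grid)
    best_params, best_score, evaluations = None, float("-inf"), 0
    for combo in product(*grid.values()):
        params = dict(zip(names, combo))
        current = score(params)
        evaluations += 1
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score, evaluations


start = {name: values[0] for name, values in GRID.items()}
seq_params, seq_score, seq_evals = sequential_search(GRID, start)
joint_params, joint_score, joint_evals = joint_search(GRID)

print("sequential:", seq_params, round(seq_score, 4), "evaluations:", seq_evals)
print("joint:     ", joint_params, round(joint_score, 4), "evaluations:", joint_evals)
```

The sequential search needs far fewer evaluations (the sum of the list lengths rather than their product), but it can miss combinations that only work well together. The joint search can never score lower on the same grid, because it also visits the point the sequential search ends on.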



Monday, December 7, 2020

Evolution of XGBoost Algorithm from Decision Trees

Credit to 

 https://towardsdatascience.com/https-medium-com-vishalmorde-xgboost-algorithm-long-she-may-rein-edd9f99be63d




XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.), artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision-tree-based algorithms are considered best-in-class right now.
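
To make "gradient boosting framework" concrete, here is a minimal sketch of the idea using one-split regression trees (stumps) on a tiny hypothetical data set. This is plain gradient boosting with squared-error loss, not XGBoost itself, which adds regularization and second-order gradient information among other things. All names and data values are assumptions for illustration.

```python
def fit_stump(xs, residuals):
    """Find the single split on x that minimises squared error on the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(xs)
    for _ in range(n_rounds):
        # For squared-error loss the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical toy data: one numeric feature, one numeric target.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 3.0, 2.9, 5.2, 5.0]

mean_y = sum(ys) / len(ys)
baseline = lambda x: mean_y
boosted = gradient_boost(xs, ys)

baseline_mse = mse(baseline, xs, ys)
boosted_mse = mse(boosted, xs, ys)
print("baseline MSE:", round(baseline_mse, 4))
print("boosted MSE: ", round(boosted_mse, 4))
```

Each round adds a weak tree that corrects what the ensemble so far still gets wrong, which is the sense in which boosting evolves from a single decision tree.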




Machine Learning Validation Techniques

Credit to 

 https://towardsdatascience.com/validating-your-machine-learning-model-25b4c8643fb7


The article demonstrates the following validation methods:

  • k-Fold Cross-Validation
  • Leave-one-out Cross-Validation
  • Leave-one-group-out Cross-Validation
  • Nested Cross-Validation
  • Time-series Cross-Validation
  • Wilcoxon signed-rank test
  • McNemar’s test
  • 5x2CV paired t-test
  • 5x2CV combined F test
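
As a reminder of what the first two methods in the list do mechanically, here is a minimal sketch of k-fold index splitting without shuffling. The function name `k_fold_indices` is made up for this note; leave-one-out is treated as the special case where k equals the number of samples.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample is in the test set exactly once."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    # Spread the remainder over the first folds so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)

# Leave-one-out cross-validation: k-fold with one sample per fold.
loo = list(k_fold_indices(5, 5))
```

A model would be fitted on each `train_idx` and scored on the matching `test_idx`, and the k scores averaged.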

Monday, November 30, 2020

Feature Engineering - Numerical Data Scaling

 








Numerical data can be scaled to ensure proportionate influence on the prediction.


Common techniques for scaling

So how do we do it, exactly? How can we align different features into the same scale?

Keep in mind that not all ML algorithms are sensitive to the scale of their input features. Here is a collection of scaling and normalizing transformations commonly used in data science and ML projects:

  • Mean/variance standardization
  • MinMax scaling
  • Maxabs scaling
  • Robust scaling
  • Normalizer
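
The five transformations above can be sketched in plain Python as follows. This is an illustration under assumptions: the sample values are hypothetical, standardization uses the population standard deviation, robust scaling uses the median and interquartile range, and "Normalizer" is taken to mean scaling each sample (row) to unit L2 norm, as scikit-learn's Normalizer does by default.

```python
import math
from statistics import mean, pstdev, quantiles


def standardize(xs):
    """Mean/variance standardization: zero mean, unit (population) standard deviation."""
    m, s = mean(xs), pstdev(xs)
    return [(x - m) / s for x in xs]


def min_max(xs):
    """MinMax scaling: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def max_abs(xs):
    """Maxabs scaling: divide by the largest absolute value, so results lie in [-1, 1]."""
    m = max(abs(x) for x in xs)
    return [x / m for x in xs]


def robust_scale(xs):
    """Robust scaling: centre on the median and divide by the interquartile range."""
    q1, q2, q3 = quantiles(xs, n=4, method="inclusive")
    return [(x - q2) / (q3 - q1) for x in xs]


def l2_normalize(row):
    """Normalizer: scale one sample (a row of features) to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]


# Hypothetical feature column with one large value.
feature = [10.0, 20.0, 30.0, 40.0, 100.0]

standardized = standardize(feature)
scaled_01 = min_max(feature)
scaled_abs = max_abs(feature)
robust = robust_scale(feature)
unit_row = l2_normalize([3.0, 4.0])

print(robust)    # the large value stands out but does not distort the centre
print(unit_row)
```

Note that the first four functions work on one feature (column) at a time, while the normalizer works on one observation (row) at a time.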
==================================================================

So how do you encode nominal variables? The one-hot encoding method is a good choice. Here’s how it works.



In this example, you have the one column called home Type, and three different levels: House, Apartment, and Condo. The data frame has five observations for that particular feature.  

With one-hot encoding, you convert this one column of home Type into three columns: a column for House, a column for Apartment, and a column for Condo. You encode each observation with either a 1 or 0: 1 to indicate the home type of that particular observation, or 0 for the other options.
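
The description above can be sketched in a few lines of plain Python. The five observations below are hypothetical, since the original example data is not reproduced in these notes, and `one_hot` is a name made up for illustration.

```python
def one_hot(values, levels=None):
    """Return one dict per observation with a 1 for its level and 0 for the others."""
    if levels is None:
        # Unique levels in first-seen order.
        levels = list(dict.fromkeys(values))
    return [{level: int(value == level) for level in levels} for value in values]


# Hypothetical column with five observations and three levels.
home_type = ["House", "Apartment", "Condo", "House", "Condo"]
encoded = one_hot(home_type, levels=["House", "Apartment", "Condo"])

for original, row in zip(home_type, encoded):
    print(original, row)
```

Each observation ends up with exactly one 1 across the three new columns.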


============================================================


Topics related to this subdomain

Here are some topics you may want to study for more in-depth information related to this subdomain:



  • Scaling

  • Normalizing

  • Dimensionality reduction

  • Date formatting

  • One-hot encoding

Monday, November 16, 2020

UNIVARIATE Analysis

 VISUALIZING UNIVARIATE CONTINUOUS DATA 


Univariate plots are of two types:

1) Enumerative plots and

2) Summary plots


Univariate enumerative plots:

These plots enumerate/show every observation in data and provide information about the distribution of the observations on a single data variable. We now look at different enumerative plots.


examples: 

1. UNIVARIATE SCATTER PLOT  

2. LINE PLOT (with markers)

3. STRIP PLOT

4. SWARM PLOT 


Univariate summary plots:

These plots give a more concise description of the location, dispersion, and distribution of a variable than an enumerative plot. A summary plot does not let you recover every individual data value, but it represents the whole data set efficiently, which makes it easier to draw conclusions about the data as a whole.



  5. HISTOGRAMS

  6. DENSITY PLOTS

  7. RUG PLOTS 

  8. BOX PLOTS 

  9. distplot()

  10. VIOLIN PLOTS
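
To see what a summary plot actually summarizes, here is a minimal sketch that computes the numbers behind a histogram (bin counts) and a box plot (quartiles, whiskers, outliers) without drawing anything. The data values are hypothetical, and the 1.5 x IQR whisker rule and the "inclusive" quartile method are assumed conventions; plotting libraries may use slightly different defaults.

```python
from statistics import quantiles


def histogram_counts(xs, bins):
    """Count how many values fall into each of `bins` equal-width intervals."""
    lo, hi = min(xs), max(xs)
    if lo == hi:
        raise ValueError("all values are identical; cannot form bins")
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in xs:
        # The maximum value belongs to the last bin rather than a new one.
        index = min(int((x - lo) / width), bins - 1)
        counts[index] += 1
    return counts


def box_plot_stats(xs):
    """Quartiles, whisker ends, and outliers using the 1.5 * IQR rule."""
    q1, med, q3 = quantiles(xs, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [x for x in xs if low_fence <= x <= high_fence]
    return {
        "q1": q1,
        "median": med,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": sorted(x for x in xs if x < low_fence or x > high_fence),
    }


# Hypothetical sample with one unusually large value.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 12]

counts = histogram_counts(values, bins=4)
stats = box_plot_stats(values)
print(counts)
print(stats)
```

Ten values are reduced to four bin counts or a handful of box plot statistics, which is why individual observations cannot be read back from a summary plot.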



VISUALIZING CATEGORICAL VARIABLES

11. BAR CHART

12. PIE CHART



https://www.analyticsvidhya.com/blog/2020/07/univariate-analysis-visualization-with-illustrations-in-python/