ML notes

satya - 12/25/2022, 12:38:41 PM

Category variables

1. When categorical variables are inputs, translate them from strings to integers. Ex: cities when predicting prices

2. Nominal: a categorical variable can be nominal. Nominal categories have no implicit numerical order

3. Ordinal: the second type of category has a natural order. Ex: your degrees.
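
For the ordinal case, a minimal sketch of mapping categories to integers that keep their order. The column name, the degree labels, and their ranking are made up for illustration.

```python
import pandas as pd

# Hypothetical data: column name and degree labels are assumptions.
df = pd.DataFrame({"degree": ["bachelors", "phd", "masters", "bachelors"]})

# Ordinal categories get integers that preserve their order.
degree_order = {"bachelors": 0, "masters": 1, "phd": 2}
df["degree_code"] = df["degree"].map(degree_order)

print(df["degree_code"].tolist())  # [0, 2, 1, 0]
```

Nominal categories such as cities should not be coded this way, because the integers would suggest an order that does not exist.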

satya - 12/25/2022, 12:39:25 PM

One hot encoding

1. Create a new column for each cat

2. Make them binary encoded 0 or 1

3. Leave out one column (for example the last one): all 0s in the other columns already indicate that option

satya - 12/25/2022, 12:40:29 PM

pd.get_dummies(column): api

1. This translation of cat variables to binary columns produces what are called dummy variables
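
A small sketch of what pd.get_dummies returns for one column. The city names and prices are invented sample data.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "city": ["austin", "boston", "chicago", "austin"],
    "price": [300, 500, 400, 320],
})

# One new column per category; each row has exactly one "on" value.
# Depending on the pandas version the values are 0/1 or False/True.
dummies = pd.get_dummies(df["city"])

print(dummies.columns.tolist())  # ['austin', 'boston', 'chicago']
print(dummies)
```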

satya - 12/25/2022, 12:41:57 PM

API: pd.concat() to merge by columns


#Signature
pd.concat([df1,df2],axis="columns")

satya - 12/25/2022, 12:43:35 PM

api: df.drop()

1. Drop the cat columns

2. Drop one of the extra dummy columns. Just one.
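
The concat and drop steps can be sketched together like this. The data and column names are hypothetical, and the choice of which dummy column to drop is arbitrary.

```python
import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({
    "city": ["austin", "boston", "chicago", "austin"],
    "price": [300, 500, 400, 320],
})
dummies = pd.get_dummies(df["city"])

# Merge the dummy columns next to the original columns.
merged = pd.concat([df, dummies], axis="columns")

# Drop the original cat column and one of the dummy columns.
final = merged.drop(["city", "chicago"], axis="columns")

print(final.columns.tolist())  # ['price', 'austin', 'boston']
```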

satya - 12/25/2022, 12:46:01 PM

Concept: Dummy variable trap, multicollinearity

When a variable can be derived from a combination of other variables, the variables are called multicollinear. With a full set of dummy columns this is called the dummy variable trap.

Rule of thumb: drop one of the dummy columns
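
A quick check of why a full set of dummy columns is redundant, using invented city names. The drop_first option of pd.get_dummies is one way to apply the rule of thumb; it drops the first category rather than the last.

```python
import pandas as pd

# Hypothetical category values for illustration.
cities = pd.Series(["austin", "boston", "chicago", "austin"])
dummies = pd.get_dummies(cities).astype(int)

# Any one dummy column can be derived from the others.
derived_chicago = 1 - dummies["austin"] - dummies["boston"]
print((derived_chicago == dummies["chicago"]).all())  # True

# drop_first=True removes one column and avoids the redundancy.
reduced = pd.get_dummies(cities, drop_first=True)
print(reduced.columns.tolist())  # ['boston', 'chicago']
```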

satya - 12/25/2022, 12:46:14 PM

Dummy variable trap, Multi-colinear


satya - 12/25/2022, 12:48:37 PM

api: model.score(X,y)


satya - 12/25/2022, 12:57:45 PM

API alternate to pd.get_dummies(), LabelEncoder


#Imports
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder


#Convert cats to integer codes (0..n-1), not booleans
le = LabelEncoder()
tcol = le.fit_transform(df["cat_column"])

#Assign it to a column in df
#replace the orig cat column with the integer-coded column
df["cat_column"] = tcol

satya - 12/25/2022, 1:01:55 PM

Some of these notes are taken from

satya - 12/25/2022, 1:02:20 PM

How to use OneHotEncoder


satya - 12/25/2022, 1:04:51 PM

Quick summary of Cats

1. Convert them to integers

2. you can use pd.get_dummies that converts one column into multiple binary columns

3. Or use OneHotEncoder (OHE) from scikit-learn

satya - 12/25/2022, 4:08:38 PM

Classification types


Binary
Multiclass

satya - 12/25/2022, 4:13:40 PM

API: plt.scatter()


satya - 12/25/2022, 4:23:58 PM

APIs


#Some basic apis
plt.scatter
df.shape

#***************************
# training and test split
#***************************
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    input_df, output_df, train_size=0.9)


#***************************
# Logistic regression
#***************************
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(x_train, y_train)
output = model.predict(x_test)

#mean accuracy on the test set
model.score(x_test, y_test)
#per-class probabilities for each row
model.predict_proba(x_test)

satya - 12/25/2022, 4:24:26 PM

Use Kaggle to download sample data sets for a number of your projects


satya - 12/25/2022, 4:28:44 PM

Sklearn.datasets


satya - 12/25/2022, 4:29:42 PM

The available datasets at Sklearn


satya - 12/25/2022, 4:34:44 PM

what does dir() method do in python?


satya - 12/25/2022, 4:35:11 PM

It lists the names of the attributes (properties and methods) of an object
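
A small illustration of dir() on a made-up class. The Point class is hypothetical; names starting with an underscore are filtered out to keep the output short.

```python
class Point:
    """Hypothetical class used only to demonstrate dir()."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def norm(self):
        return (self.x ** 2 + self.y ** 2) ** 0.5


p = Point(3, 4)

# dir() returns a sorted list of attribute names, including inherited ones.
public_names = [name for name in dir(p) if not name.startswith("_")]
print(public_names)  # ['norm', 'x', 'y']
```

The same call works on modules and library objects, for example dir(some_dataset) to see which fields a loaded dataset exposes.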

satya - 12/25/2022, 6:05:31 PM

codebasics ML github link


satya - 12/25/2022, 6:13:43 PM

Scikit-learn LogisticRegression fit method inputs and outputs

satya - 12/25/2022, 6:15:15 PM

LogisticRegression API


satya - 12/27/2022, 11:10:16 AM

Summary of logistic regression

1. Used for binary classification

2. Also used for multiclass classification

satya - 12/27/2022, 11:10:44 AM

From ChatGPT

1. Logistic Regression is a supervised machine learning algorithm that is used for classification tasks. It is a type of regression analysis that is used to predict a binary outcome (i.e., an outcome that can only have two possible values, such as 0 or 1, yes or no, true or false).

2. The goal of logistic regression is to find the best fitting model to predict the probability of an event occurring. The model is developed by fitting a logistic function to the data. The logistic function is a sigmoid curve that maps any real-valued number to a value between 0 and 1. This curve can be interpreted as a probability, with values closer to 0 indicating a low probability and values closer to 1 indicating a high probability.

3. Logistic Regression is widely used in a variety of fields, including finance, economics, and psychology. It is often used to predict the likelihood of a certain event occurring (e.g., whether a customer will churn or not, whether a patient will have a certain disease or not) based on a set of independent variables. It is also commonly used in binary classification tasks, where the goal is to predict which of two classes an observation belongs to.

4. In summary, Logistic Regression is a statistical model used for predicting binary outcomes based on a set of independent variables. It is widely used in a variety of fields and is particularly useful for classification tasks.
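
The logistic (sigmoid) function mentioned above can be sketched in a few lines. This shows only the curve itself, not how the model is fitted.

```python
import math


def sigmoid(z):
    """Map any real number to a value strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


for z in (-4, 0, 4):
    print(z, round(sigmoid(z), 3))
# -4 0.018
# 0 0.5
# 4 0.982
```

In logistic regression, z is a weighted sum of the input features plus an intercept, and the sigmoid output is read as the probability of the positive class.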

satya - 12/27/2022, 11:13:43 AM

Binary classification

1. Given age find out if they buy insurance or not

2. X - age

3. Y - yes or no
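
A minimal sketch of this age-to-insurance setup. The ages and labels below are invented for illustration, and the model is trained on all rows just to keep the example short.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data: 1 means the person bought insurance.
df = pd.DataFrame({
    "age": [22, 25, 28, 30, 35, 40, 45, 50, 55, 60],
    "bought_insurance": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
})

X = df[["age"]]             # 2D input: one feature column
y = df["bought_insurance"]  # 1D output: yes (1) or no (0)

model = LogisticRegression()
model.fit(X, y)

new_ages = pd.DataFrame({"age": [20, 65]})
print(model.predict(new_ages))        # predicted class per row
print(model.predict_proba(new_ages))  # [P(no), P(yes)] per row
```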

satya - 12/27/2022, 11:18:03 AM

Multi-class

1. Given a set of 64 pixel values (an 8x8 grid of dark and light), tell what kind of a digit or shape it is

2. The way you call this is identical to bin classification

3. Except there are more inputs and more output classes
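
A sketch of the multiclass case using the digits dataset that ships with scikit-learn. The split size, random_state, and max_iter values are arbitrary choices, not recommendations.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
print(digits.data.shape)  # (1797, 64): 64 pixel values per image

X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

# Same calls as in the binary case; max_iter is raised so the solver converges.
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

print(model.classes_)                # the ten digit classes
print(model.score(X_test, y_test))   # mean accuracy on the test set
```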

satya - 12/27/2022, 11:45:35 AM

What is confusion matrix?


satya - 12/27/2022, 11:45:46 AM

What is seaborn library?


satya - 12/27/2022, 11:50:59 AM

Sample code for multiclass logistic regression

satya - 12/27/2022, 12:19:42 PM

Functions used


# load data sets from scikit-learn
# load digits dataset
from sklearn.datasets import load_digits

# plotting images
plt.matshow

#Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predicted)

satya - 12/27/2022, 12:22:51 PM

Confusion matrix


          Predicted
          spam  not spam
Actual
 spam      85      15
not spam   10      90

satya - 12/27/2022, 12:23:58 PM

Explanation

1. 15 spam emails are wrongly identified as "not spam"

2. 10 emails that are not spam are wrongly identified as "spam"
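
The usual summary numbers can be worked out from the matrix above. A short sketch, treating "spam" as the positive class (that choice is an assumption for the example):

```python
import numpy as np

# Rows are actual classes, columns are predicted classes,
# in the order [spam, not spam] as in the table above.
cm = np.array([[85, 15],
               [10, 90]])

tp, fn = cm[0]  # actual spam: predicted spam / predicted not spam
fp, tn = cm[1]  # actual not spam: predicted spam / predicted not spam

accuracy = (tp + tn) / cm.sum()  # 175 / 200
precision = tp / (tp + fp)       # 85 / 95
recall = tp / (tp + fn)          # 85 / 100

print(accuracy)             # 0.875
print(round(precision, 3))  # 0.895
print(recall)               # 0.85
```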

satya - 2/4/2023, 6:26:50 PM

Test

  1. Imbalanced Classes: If the classes are imbalanced, meaning that one class has significantly more instances than the others, logistic regression can produce biased results towards the larger class.
  2. Non-Linear Relationships: If the relationship between the features and the classes is non-linear, logistic regression may not be able to capture the relationship and produce poor results.
  3. Overlapping Classes: If the classes overlap, meaning that instances of one class have similar features to instances of another class, logistic regression can have difficulty distinguishing between the classes.
  4. High Dimensionality: If the number of features is very high, logistic regression can become computationally expensive and may overfit the data.
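
For the imbalanced-classes point, one commonly used option in scikit-learn is class_weight="balanced". The sketch below uses synthetic data from make_classification; the sample size, class ratio, and seeds are arbitrary, and the size of the effect will vary with the data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 95% of rows in class 0 (assumed setup).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
balanced = LogisticRegression(
    max_iter=1000, class_weight="balanced").fit(X_train, y_train)

# Recall on the minority class (label 1) for each model.
print(recall_score(y_test, plain.predict(X_test)))
print(recall_score(y_test, balanced.predict(X_test)))
```

Reweighting typically trades some precision for better recall on the smaller class, so both metrics are worth checking.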