ML notes
satya - 12/25/2022, 12:38:41 PM
Category variables
1. When cat variables are inputs, translate them to integers from strings. Ex: Cities for predicting prices
2. Nominal cats: Cat variables can be nominal. They don't have an implicit numerical order
3. Ordinal: Second type is category has a numerical order ex: your degrees.
satya - 12/25/2022, 12:39:25 PM
One hot encoding
1. Create a new column for each cat
2. Make them binary encoded 0 or 1
3. Leave out the last one as the 0s in others will indicate that option
satya - 12/25/2022, 12:40:29 PM
pd.getdummies(col-name): api
1. These translation of cat variables to bin columns is called dummry variables
satya - 12/25/2022, 12:41:57 PM
API: pd.concat() to merge by columns
#Signature
pd.concat([df1,df2],axis="columns")
satya - 12/25/2022, 12:43:35 PM
api: pd.drop()
1. Drop the cat columns
2. Drop one of the extra dummy columns. Just one.
satya - 12/25/2022, 12:46:01 PM
Concept: Dummy variable trap, Multi-colinear
When a variable can be derived from a combination of other variables, then the variables are called multi-colinear variables. This is called the trap.
Rule of thumb is one of them
satya - 12/25/2022, 12:46:14 PM
Dummy variable trap, Multi-colinear
Dummy variable trap, Multi-colinear
satya - 12/25/2022, 12:48:37 PM
api: model.score(X,y)
api: model.score(X,y)
satya - 12/25/2022, 12:57:45 PM
API alternate to pd.get_dummies(), LabelEncoder
#Imports
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
#Covert cats to booleans
le=LabelEncoder()
tcol = le.fit_transform(df.cat-column)
#Assign it to a column in df
#replace the orig cat column with bin column
df.original-cat-column-name = tcol
satya - 12/25/2022, 1:01:55 PM
Some of these notes are taken from
satya - 12/25/2022, 1:02:20 PM
How to use OneHotEncoder
How to use OneHotEncoder
satya - 12/25/2022, 1:04:51 PM
Quick summary of Cats
1. Convert them to integers
2. you can use pd.get_dummies that converts one column into multiple binary columns
3. Or use OHE from Scikit
satya - 12/25/2022, 4:08:38 PM
Classfication types
Binary
Multiclass
satya - 12/25/2022, 4:23:58 PM
APIs
#Some basic apis
plt.scatter
df.shape
#***************************
# training and test split
#***************************
from skelearn.model_selection import train_test_split
x-train, x-test, y-train, y-test
= train_test_split(input-df, output-df, train_size=0.9)
#***************************
# LRegression
#***************************
from sklearn.linear_model import LogisticRegression
model = LR()
model.fit(x-train, y-train)
output = model.predict(x-test)
model.score(x-test,y-test)
model.predict_proba(x-test)
satya - 12/25/2022, 4:24:26 PM
Use Kaggle to download sample data sets for a number of your projects
Use Kaggle to download sample data sets for a number of your projects
satya - 12/25/2022, 4:34:44 PM
what does dir() method do in python?
what does dir() method do in python?
satya - 12/25/2022, 4:35:11 PM
It describes the properties and methods of an object
It describes the properties and methods of an object
satya - 12/25/2022, 6:13:43 PM
Scikit learn LogisticsRegressin fit method inputs and outputs
Scikit learn LogisticsRegressin fit method inputs and outputs
Search for: Scikit learn LogisticsRegressin fit method inputs and outputs
satya - 12/27/2022, 11:10:16 AM
Summary of logistic regression
1. Used for binary classification
2. Multiclass classfication
satya - 12/27/2022, 11:10:44 AM
From ChatGPT
1. Logistic Regression is a supervised machine learning algorithm that is used for classification tasks. It is a type of regression analysis that is used to predict a binary outcome (i.e., a outcome that can only have two possible values, such as 0 or 1, yes or no, true or false).
2. The goal of logistic regression is to find the best fitting model to predict the probability of an event occurring. The model is developed by fitting a logistic function to the data. The logistic function is a sigmoid curve that maps any real-valued number to a value between 0 and 1. This curve can be interpreted as a probability, with values closer to 0 indicating a low probability and values closer to 1 indicating a high probability.
3. Logistic Regression is widely used in a variety of fields, including finance, economics, and psychology. It is often used to predict the likelihood of a certain event occurring (e.g., whether a customer will churn or not, whether a patient will have a certain disease or not) based on a set of independent variables. It is also commonly used in binary classification tasks, where the goal is to predict which of two classes an observation belongs to.
4. In summary, Logistic Regression is a statistical model used for predicting binary outcomes based on a set of independent variables. It is widely used in a variety of fields and is particularly useful for classification tasks.
satya - 12/27/2022, 11:13:43 AM
Bin classifcation
1. Given age find out if they buy insurance or not
2. X - age
2. Y - yes or no
satya - 12/27/2022, 11:18:03 AM
Multi-class
1. Given a set of 64 bits of dark and light tell what kind of a digit or shape it is
2. The way you call this is identical to bin classification
3. Except there are more inputs and outputs
satya - 12/27/2022, 11:45:35 AM
What is confusion matrix?
What is confusion matrix?
satya - 12/27/2022, 11:45:46 AM
What is seaborn library?
What is seaborn library?
satya - 12/27/2022, 11:50:59 AM
Sample code for multiclass regression
satya - 12/27/2022, 12:19:42 PM
Functions used
# load data sets from scikitlearn
# load digits dataset
load_digits
# plotting images
plt.matshow
#Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_predicted)
satya - 12/27/2022, 12:22:51 PM
Confusion matrix
Predicted
spam not spam
Actual
spam 85 15
not spam 10 90
satya - 12/27/2022, 12:23:58 PM
Explanation
1. 15 spams are identified wrongly as "not spam"
2. 10 emails that are not spam are identified as "spam"
satya - 2/4/2023, 6:26:50 PM
Test
- Imbalanced Classes: If the classes are imbalanced, meaning that one class has significantly more instances than the others, logistic regression can produce biased results towards the larger class.
- Non-Linear Relationships: If the relationship between the features and the classes is non-linear, logistic regression may not be able to capture the relationship and produce poor results.
- Overlapping Classes: If the classes overlap, meaning that instances of one class have similar features to instances of another class, logistic regression can have difficulty distinguishing between the classes.
- High Dimensionality: If the number of features is very high, logistic regression can become computationally expensive and may overfit the data.