CaptainStorm21/Hands-on-Python-Ai-Pandas

Some code source files skip a couple of numbers. Those numbers follow the lessons of the Udemy course
Hands-on Machine Learning Tutorial with Pandas, Numpy, Seaborn, Scikit-learn in Python and ChatGPT: A Complete Work-flow

It's a fun and very educational class. Note that most ChatGPT-generated code does not look exactly the same as in the videos.



# Lesson 1 

https://chat.openai.com/

Give me a python code to load an excel data 

Change the file name to excel_data
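
A minimal sketch of what such a prompt typically produces, assuming the workbook is saved as `excel_data.xlsx` in the working directory. The `.xlsx` extension, the `load_excel` helper name, and the choice of the first sheet are illustrative assumptions, not taken from the course. Reading `.xlsx` files with pandas also needs an engine such as `openpyxl` installed.

```python
from pathlib import Path

import pandas as pd


def load_excel(path, sheet_name=0):
    """Load one sheet of an Excel workbook into a DataFrame.

    `load_excel` is a hypothetical helper; `pd.read_excel` does the real work.
    """
    path = Path(path)
    if not path.exists():
        # Fail early with a clear message instead of a long pandas traceback
        raise FileNotFoundError(f"Excel file not found: {path}")
    return pd.read_excel(path, sheet_name=sheet_name)


# Usage (assumed file name):
# df = load_excel("excel_data.xlsx")
# print(df.head())
```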

# Lesson 2 

Give me a python code to find out missing values in a data

Add:

# Display the head
df.head()
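
As a rough illustration of the kind of code this prompt returns, the sketch below counts missing values per column and lists the affected rows. The small DataFrame is made-up sample data, not the course dataset.

```python
import numpy as np
import pandas as pd

# Made-up sample data with a few missing entries
df = pd.DataFrame({
    "age": [25, 30, np.nan, 40],
    "city": ["New York", None, "Chicago", None],
})

# Count missing values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Rows that contain at least one missing value
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)

# Display the head
print(df.head())
```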

Give me a python code for SimpleImputer to impute missing values in a variable of a data

SimpleImputer is a machine learning utility from scikit-learn that imputes missing values, that is, it replaces the missing values of a variable with a standard measure such as the mean, median, or most frequent value.

pip install scikit-learn
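
To make the idea of a "standard measure" concrete, here is a small sketch that fills the same gap with different `SimpleImputer` strategies. The `score` column and its values are invented for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented sample column with one missing value at position 1
df = pd.DataFrame({"score": [10.0, np.nan, 30.0, 100.0]})

filled_values = {}
for strategy in ("mean", "median"):
    imputer = SimpleImputer(strategy=strategy)
    # fit_transform expects 2-D input and returns a 2-D array
    filled = imputer.fit_transform(df[["score"]]).ravel()
    filled_values[strategy] = filled[1]

# A constant fill uses fill_value instead of a statistic
constant_imputer = SimpleImputer(strategy="constant", fill_value=0)
filled_values["constant"] = constant_imputer.fit_transform(df[["score"]]).ravel()[1]

print(filled_values)
```

The mean is pulled upward by the large value 100, while the median is not, which is one reason to pick the strategy per column.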



## Solution to exercise 1
import pandas as pd

data = {
    'age': [25, 30, None, 40, 35, None],
    'income': [50000, 60000, 'XXX', 80000, 70000, 32477],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York', 'Chicago']
}

df = pd.DataFrame(data)

duplicate_row = pd.DataFrame({'age': [30, None], 'income': [60000, 'YYY'], 'city': ['Los Angeles', 'Los Angeles']})
df = pd.concat([df, duplicate_row], ignore_index=True)

print(df)

## Solution to exercise 2 
import pandas as pd
from sklearn.impute import SimpleImputer
 
# Create DataFrame
data = {
    'age': [25, 30, None, 40, 35],  # One missing value
    'income': [50000.3, 60000.34, 'XXX', 80000, 70000],  # Inconsistent value 'XXX'
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York']
}
 
df = pd.DataFrame(data)

# Convert 'income' to numeric; the inconsistent value 'XXX' becomes NaN
df['income'] = pd.to_numeric(df['income'], errors='coerce')

age_strategy = 'mean'  # Choose an appropriate strategy for the 'age' column
income_strategy = 'median'  # Choose an appropriate strategy for the 'income' column

age_imputer = SimpleImputer(strategy=age_strategy)
income_imputer = SimpleImputer(strategy=income_strategy)

# fit_transform returns a 2-D array, so flatten it before assigning to one column
df['age'] = age_imputer.fit_transform(df[['age']]).ravel()
df['income'] = income_imputer.fit_transform(df[['income']]).ravel()

# Casting to int truncates the imputed mean age (32.5 becomes 32)
df['age'] = df['age'].astype(int)
df['income'] = df['income'].astype(float)


print(df)

# Lesson 3

Give me a python code to find the data types of the variables in a data
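
A short sketch of a typical answer, using made-up sample data: `df.dtypes` lists the type of every column, and `select_dtypes` picks out the numeric ones. Because `income` mixes numbers with the string 'XXX', pandas stores it as `object` rather than as a numeric type.

```python
import pandas as pd

# Made-up sample data; 'income' mixes numbers and a string
df = pd.DataFrame({
    "age": [25, 30, 40],
    "income": [50000, "XXX", 80000],
    "city": ["New York", "Chicago", "Houston"],
})

# Data type of every column
print(df.dtypes)

# Names of the columns pandas considers numeric
numeric_columns = df.select_dtypes(include="number").columns.tolist()
print(numeric_columns)
```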

Give me a python code to find out non-numeric values from the numeric values in a variable
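
One possible approach, sketched on an invented `income` series: convert with `errors='coerce'` and then flag the entries that were present originally but turned into NaN. This keeps genuinely missing values separate from inconsistent ones.

```python
import pandas as pd

# Invented sample: one inconsistent value and one truly missing value
income = pd.Series([50000, 60000, "XXX", 80000, None])

# Values that fail numeric conversion become NaN
converted = pd.to_numeric(income, errors="coerce")

# Non-numeric = present in the original but NaN after conversion
non_numeric_mask = converted.isna() & income.notna()
non_numeric_values = income[non_numeric_mask]
print(non_numeric_values)
```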



## Exercise 3
Write the following code for creating the dataframe:

import pandas as pd
 
# Create DataFrame
data = {
    'age': [25, 30, None, 40, 35],  # One missing value
    'income': [50000, 60000, 'XXX', 80000, 70000],  # Inconsistent value 'XXX'
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York']
}
 
df = pd.DataFrame(data)
Then, write the code needed to drop the inconsistent value 'XXX' from income.

Solution

import pandas as pd
 
# Create DataFrame
data = {
    'age': [25, 30, None, 40, 35],  # One missing value
    'income': [50000, 60000, 'XXX', 80000, 70000],  # Inconsistent value 'XXX'
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York']
}
 
df = pd.DataFrame(data)

# Convert 'income' column to numeric, coerce errors to NaN
df['income'] = pd.to_numeric(df['income'], errors='coerce')

# Drop rows with NaN values in the 'income' column
df = df.dropna(subset=['income'])

print(df)


In Python, specifically in the context of Pandas, dropna() is a method used to remove missing (NaN) values from a DataFrame or Series object. It allows you to drop either entire rows or entire columns that contain missing values.
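
The row, column, and subset variants can be sketched as follows. The DataFrame is made-up sample data.

```python
import numpy as np
import pandas as pd

# Made-up sample data with one NaN in 'age' and one in 'income'
df = pd.DataFrame({
    "age": [25, np.nan, 40],
    "income": [50000.0, 60000.0, np.nan],
    "city": ["New York", "Chicago", "Houston"],
})

# Drop every row that contains at least one NaN
rows_dropped = df.dropna()

# Drop every column that contains at least one NaN
cols_dropped = df.dropna(axis=1)

# Only look at 'age' when deciding which rows to drop
subset_dropped = df.dropna(subset=["age"])

print(rows_dropped)
print(cols_dropped)
print(subset_dropped)
```

Note that `dropna()` returns a new DataFrame and leaves `df` unchanged unless the result is assigned back.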






## Exercise 4
Removing inconsistent value
Write the following code for creating the dataframe:

import pandas as pd
 
# Create DataFrame
data = {
    'age': [25, 30, None, 40, 35],  # One missing value
    'income': [50000, 60000, 'XXX', 80000, 70000],  # Inconsistent value 'XXX'
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York']
}
 
df = pd.DataFrame(data)
Then, write the code needed to drop the inconsistent value 'XXX' from income.


Solution 
import pandas as pd
 
# Create DataFrame
data = {
    'age': [25, 30, None, 40, 35],  # One missing value
    'income': [50000, 60000, 'XXX', 80000, 70000],  # Inconsistent value 'XXX'
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York']
}
 
df = pd.DataFrame(data)

# Remove rows with inconsistent value 'XXX' from 'income' column
df = df[df['income'] != 'XXX']

print(df)


## Exercise 5
Assigning correct data types
Write the following code first:

import pandas as pd
 
# Create DataFrame
data = {
    'age': [25, 30, None, 40, 35],  # One missing value
    'income': [50000, 60000, 'XXX', 80000, 70000],  # Inconsistent value 'XXX'
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York']
}
 
df = pd.DataFrame(data)
 
# Drop rows with 'XXX' in 'income' column
df = df[df['income'] != 'XXX']
Now assign the correct data type to the variable income.


Solution
import pandas as pd

# Create DataFrame
data = {
    'age': [25, 30, None, 40, 35],  # One missing value
    'income': [50000, 60000, 'XXX', 80000, 70000],  # Inconsistent value 'XXX'
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'New York']
}

print(data)

df = pd.DataFrame(data)

# Drop rows with 'XXX' in 'income' column; copy so the later assignment
# does not trigger a SettingWithCopyWarning
df = df[df['income'] != 'XXX'].copy()

# Convert 'income' column to numeric
df['income'] = pd.to_numeric(df['income'])

# Confirm that 'income' now has a numeric data type
print(df.dtypes)


# Lesson 5
Give me a python code to deal with duplicates

To handle duplicates in a DataFrame using Pandas, you can use several methods like drop_duplicates(), duplicated(), or groupby(). Here's how you can use these methods:

import pandas as pd

# Sample DataFrame
data = {
    'name': ['John', 'Alice', 'Bob', 'Alice', 'John'],
    'age': [25, 30, 35, 30, 25],
    'city': ['New York', 'Los Angeles', 'Chicago', 'Los Angeles', 'New York']
}

df = pd.DataFrame(data)

### Method 1: Drop duplicates
df_dedup = df.drop_duplicates()
print("DataFrame after dropping duplicates:")
print(df_dedup)

### Method 2: Identify duplicates
duplicates_mask = df.duplicated()
print("\nMask of duplicated rows:")
print(duplicates_mask)

### Method 3: Group by and aggregate (e.g., take the first occurrence)
df_grouped = df.groupby(['name', 'age', 'city']).first().reset_index()
print("\nDataFrame after grouping and taking first occurrence:")
print(df_grouped)
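
The methods above treat rows as duplicates only when every column matches. As a supplementary sketch (the sample data and the choice of key columns are assumptions for illustration), the `subset` and `keep` parameters handle the case where only some columns define a duplicate.

```python
import pandas as pd

# Made-up sample: John appears twice with a different age
df = pd.DataFrame({
    "name": ["John", "Alice", "John"],
    "age": [25, 30, 26],
    "city": ["New York", "Los Angeles", "New York"],
})

# No row is a full duplicate, so this changes nothing
full_dedup = df.drop_duplicates()

# Treat rows as duplicates when name and city match; keep the last one
partial_dedup = df.drop_duplicates(subset=["name", "city"], keep="last")

# keep=False flags every member of a duplicate group
all_dupes_mask = df.duplicated(subset=["name", "city"], keep=False)

print(partial_dedup)
print(all_dupes_mask)
```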


