Working with Null Values
Real-world data is rarely clean. Often, you will encounter missing records, blank cells, or corrupted data.
In Python/Pandas, these are represented as NaN (Not a Number) or None.
In this tutorial, we will learn how to detect, remove, and fill these missing values using the provided dataset.
๐น 1. Setup: Creating Data with Nulls
To simulate missing data, we use np.nan from the NumPy library.
๐งช Code Example
import pandas as pd
import numpy as np
# Creating a dataset with intentional missing values (np.nan)
data = {
'name': ['Atul', 'Aman', 'Riya', 'Sneha', 'Dev'],
'class': [10, 10, 11, 12, np.nan],
'roll': [1, 2, 3, 4, np.nan],
'subject': ['Math', 'Science', 'English', np.nan, np.nan],
'status': ['pass', 'fail', np.nan, 'pass', 'fail']
}
table = pd.DataFrame(data)
print(table)
๐ค Output
name class roll subject status
0 Atul 10.0 1.0 Math pass
1 Aman 10.0 2.0 Science fail
2 Riya 11.0 3.0 English NaN
3 Sneha 12.0 4.0 NaN pass
4 Dev NaN NaN NaN fail
๐ Explanation
- We imported
numpy as npto accessnp.nan. - Pandas automatically detects
np.nanas a missing value (displayed asNaN). - Notice that rows 2, 3, and 4 have holes in the data.
๐น 2. Finding Null Values
Before fixing data, we need to know exactly how much data is missing.
A. The isna() Method
This checks every single cell in the table. If it's null, it returns True; otherwise, False.
๐งช Code Example
print(table.isna())
๐ค Output
name class roll subject status
0 False False False False False
1 False False False False False
2 False False False False True
3 False False False True False
4 False True True True False
B. The isna().sum() Method (Count)
The boolean table above is hard to read. Instead, we sum up the "True" values to see the total errors per column.
๐งช Code Example
# Sums up the True values for each column
print(table.isna().sum())
๐ค Output
name 0
class 1
roll 1
subject 2
status 1
dtype: int64
๐ Explanation
- subject: 2 means there are 2 missing values in the "subject" column.
- name: 0 means the name column is perfect (no missing data).
๐น 3. Removing Null Values (dropna)
One strategy to handle nulls is to simply delete the rows that contain them.
However, if we delete every row with a null, we might lose too much data. We use the thresh (Threshold) parameter to be smarter.
Rule: thresh=3 means "Keep the row ONLY if it has at least 3 valid (non-null) values."
๐งช Code Example
# Drop rows that have less than 3 valid values
print(table.dropna(thresh=3))
๐ค Output
name class roll subject status
0 Atul 10.0 1.0 Math pass
1 Aman 10.0 2.0 Science fail
2 Riya 11.0 3.0 English NaN
3 Sneha 12.0 4.0 NaN pass
๐ Explanation
- Row 4 (Dev) was deleted. Why?
- Valid values: "Dev" (Name), "fail" (Status).
- Total Valid = 2.
- Since 2 < 3, it is dropped.
- Row 3 (Sneha) was kept. Why?
- Valid values: Name, Class, Roll, Status. (Subject is missing).
- Total Valid = 4.
- Since 4 >= 3, it is saved.
๐น 4. Filling Null Values (fillna)
Deleting data is risky. A safer option is Imputation: filling the empty cells with a meaningful default value.
A. Simple Fill (Fill All)
This replaces every single NaN in the entire DataFrame with the same value.
๐งช Code Example
# Replace ALL nulls with "Khali Hai"
print(table.fillna(value="Khali Hai"))
๐ค Output
name class roll subject status
0 Atul 10.0 1.0 Math pass
1 Aman 10.0 2.0 Science fail
2 Riya 11.0 3.0 English Khali Hai
3 Sneha 12.0 4.0 Khali Hai pass
4 Dev Khali Hai Khali Hai Khali Hai fail
B. Smart Fill (Dictionary Method)
Usually, you need different default values for different columns. (e.g., "Class" should be 0, but "Subject" should be text). We use a dictionary to map column names to their specific fill values.
๐งช Code Example
# Define rules for specific columns
fill_values = {
"name": "name not found",
"class": "class not found",
"roll": "roll no not found",
"subject": "no subject selected"
}
# Apply these rules
print(table.fillna(value=fill_values))
๐ค Output
name class roll subject status
0 Atul 10.0 1.0 Math pass
1 Aman 10.0 2.0 Science fail
2 Riya 11.0 3.0 English NaN
3 Sneha 12.0 4.0 no subject selected pass
4 Dev class not found roll no not found no subject selected fail
๐ Explanation
- Row 4 (Dev): Class was NaN → became "class not found". Roll was NaN → became "roll no not found".
- Row 3 (Sneha): Subject was NaN → became "no subject selected".
- Row 2 (Riya): Status was NaN. Since we did not include "status" in our dictionary, it remains NaN.
๐ Summary
| Method | Description |
|---|---|
df.isna().sum() |
Counts missing values per column. |
df.dropna(thresh=N) |
Keeps row ONLY if it has N valid values. |
df.fillna("Val") |
Fills all missing cells with "Val". |
df.fillna(dict) |
Fills specific columns with specific values. |