Handling Null Values

Working with Null Values

Real-world data is rarely clean. Often, you will encounter missing records, blank cells, or corrupted data. In Python/Pandas, these are represented as NaN (Not a Number) or None.

In this tutorial, we will learn how to detect, remove, and fill these missing values using the provided dataset.


๐Ÿ”น 1. Setup: Creating Data with Nulls

To simulate missing data, we use np.nan from the NumPy library.

๐Ÿงช Code Example


import pandas as pd
import numpy as np

# Creating a dataset with intentional missing values (np.nan)
data = {
    'name': ['Atul', 'Aman', 'Riya', 'Sneha', 'Dev'],
    'class': [10, 10, 11, 12, np.nan],       
    'roll': [1, 2, 3, 4, np.nan],
    'subject': ['Math', 'Science', 'English', np.nan, np.nan],
    'status': ['pass', 'fail', np.nan, 'pass', 'fail']
}

table = pd.DataFrame(data)
print(table)

๐Ÿ“ค Output


    name  class  roll  subject status
0   Atul   10.0   1.0     Math   pass
1   Aman   10.0   2.0  Science   fail
2   Riya   11.0   3.0  English    NaN
3  Sneha   12.0   4.0      NaN   pass
4    Dev    NaN   NaN      NaN   fail

๐Ÿ“ Explanation

  • We imported numpy as np to access np.nan.
  • Pandas automatically detects np.nan as a missing value (displayed as NaN).
  • Notice that rows 2, 3, and 4 have holes in the data.

๐Ÿ”น 2. Finding Null Values

Before fixing data, we need to know exactly how much data is missing.

A. The isna() Method

This checks every single cell in the table. If it's null, it returns True; otherwise, False.

๐Ÿงช Code Example


print(table.isna())

๐Ÿ“ค Output


    name  class   roll  subject  status
0  False  False  False    False   False
1  False  False  False    False   False
2  False  False  False    False    True
3  False  False  False     True   False
4  False   True   True     True   False

B. The isna().sum() Method (Count)

The boolean table above is hard to read. Instead, we sum up the "True" values to see the total errors per column.

๐Ÿงช Code Example


# Sums up the True values for each column
print(table.isna().sum())

๐Ÿ“ค Output


name       0
class      1
roll       1
subject    2
status     1
dtype: int64

๐Ÿ“ Explanation

  • subject: 2 means there are 2 missing values in the "subject" column.
  • name: 0 means the name column is perfect (no missing data).

๐Ÿ”น 3. Removing Null Values (dropna)

One strategy to handle nulls is to simply delete the rows that contain them.

However, if we delete every row with a null, we might lose too much data. We use the thresh (Threshold) parameter to be smarter.

Rule: thresh=3 means "Keep the row ONLY if it has at least 3 valid (non-null) values."

๐Ÿงช Code Example


# Drop rows that have less than 3 valid values
print(table.dropna(thresh=3))

๐Ÿ“ค Output


    name  class  roll  subject status
0   Atul   10.0   1.0     Math   pass
1   Aman   10.0   2.0  Science   fail
2   Riya   11.0   3.0  English    NaN
3  Sneha   12.0   4.0      NaN   pass

๐Ÿ“ Explanation

  • Row 4 (Dev) was deleted. Why?
    • Valid values: "Dev" (Name), "fail" (Status).
    • Total Valid = 2.
    • Since 2 < 3, it is dropped.
  • Row 3 (Sneha) was kept. Why?
    • Valid values: Name, Class, Roll, Status. (Subject is missing).
    • Total Valid = 4.
    • Since 4 >= 3, it is saved.

๐Ÿ”น 4. Filling Null Values (fillna)

Deleting data is risky. A safer option is Imputation: filling the empty cells with a meaningful default value.

A. Simple Fill (Fill All)

This replaces every single NaN in the entire DataFrame with the same value.

๐Ÿงช Code Example


# Replace ALL nulls with "Khali Hai"
print(table.fillna(value="Khali Hai"))

๐Ÿ“ค Output


    name      class       roll    subject     status
0   Atul       10.0        1.0       Math       pass
1   Aman       10.0        2.0    Science       fail
2   Riya       11.0        3.0    English  Khali Hai
3  Sneha       12.0        4.0  Khali Hai       pass
4    Dev  Khali Hai  Khali Hai  Khali Hai       fail

B. Smart Fill (Dictionary Method)

Usually, you need different default values for different columns. (e.g., "Class" should be 0, but "Subject" should be text). We use a dictionary to map column names to their specific fill values.

๐Ÿงช Code Example


# Define rules for specific columns
fill_values = {
    "name": "name not found",
    "class": "class not found",
    "roll": "roll no not found",
    "subject": "no subject selected"
}

# Apply these rules
print(table.fillna(value=fill_values))

๐Ÿ“ค Output


    name             class               roll              subject status
0   Atul              10.0                1.0                 Math   pass
1   Aman              10.0                2.0              Science   fail
2   Riya              11.0                3.0              English    NaN
3  Sneha              12.0                4.0  no subject selected   pass
4    Dev   class not found  roll no not found  no subject selected   fail

๐Ÿ“ Explanation

  • Row 4 (Dev): Class was NaN → became "class not found". Roll was NaN → became "roll no not found".
  • Row 3 (Sneha): Subject was NaN → became "no subject selected".
  • Row 2 (Riya): Status was NaN. Since we did not include "status" in our dictionary, it remains NaN.

๐Ÿ“Œ Summary

Method Description
df.isna().sum() Counts missing values per column.
df.dropna(thresh=N) Keeps row ONLY if it has N valid values.
df.fillna("Val") Fills all missing cells with "Val".
df.fillna(dict) Fills specific columns with specific values.

Post a Comment

Do Leave Your Comments...

Previous Post Next Post

Contact Form