Remove Duplicate Images using Python
Do you have a folder full of images and want to delete the duplicates? Here's a simple Python script that scans all the images in a folder and keeps only the unique ones based on their visual hash. This is especially useful for meme folders, downloaded wallpapers, or social media collections.
Features:
- Uses imagehash and Pillow to identify duplicates visually (not just by filename or size).
- Skips corrupted or unreadable image files.
- Prints duplicate image pairs for reference.
- Automatically creates an output folder and saves only unique images there.
Python Code:
import os
import shutil

from PIL import Image, UnidentifiedImageError
import imagehash

# Input and output folder paths. They must differ: copying a file onto
# itself raises shutil.SameFileError (a subclass of OSError).
input_folder = r'images_output'   # Folder containing the images to scan
output_folder = r'unique_images'  # Folder that receives the unique images

# Create the output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Dictionary mapping each image hash to the first file that produced it
image_hashes = {}

# Iterate through files in the input folder
for filename in os.listdir(input_folder):
    if filename.lower().endswith(('.png', '.jpg', '.jpeg', '.gif', '.bmp', '.webp')):
        img_path = os.path.join(input_folder, filename)
        try:
            with Image.open(img_path) as img:
                img.load()  # Force load image to catch truncation errors
                # Calculate image hash
                img_hash = imagehash.average_hash(img)
            # Check if hash already exists (the file is closed at this point)
            if img_hash in image_hashes:
                print(f"Duplicate found: {img_path} and {image_hashes[img_hash]}")
            else:
                # Copy unique image to the output folder
                shutil.copy(img_path, output_folder)
                image_hashes[img_hash] = img_path
        except (OSError, UnidentifiedImageError) as e:
            print(f"Skipping file due to error: {img_path}: {e}")

print("✅ Unique images have been saved to the output folder.")
How to Use:
- Install the required libraries (if not already installed):
  pip install Pillow imagehash
- Put all your images inside the input folder (e.g., images_output).
- Run the script in your Python environment (VS Code, IDLE, or any terminal).
- All unique images will be copied to the output folder. Set output_folder to a different path than input_folder; copying a file onto itself raises an error, and the script would then report every image as skipped.
Notes:
- It uses average_hash from imagehash, which gives visually similar images the same or nearly the same hash even after minor edits. The script only flags images whose hashes match exactly.
- You can replace it with phash, whash, or dhash depending on your needs.
- If you want to move (not copy) files, change shutil.copy to shutil.move.
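To make the idea behind average_hash more concrete, here is a minimal, dependency-free sketch of the technique. It assumes the image has already been shrunk to a small grayscale grid (imagehash does that resizing itself through Pillow). The names average_hash_bits and hamming_distance, and the sample grids, are made up for illustration and are not part of imagehash.

```python
def average_hash_bits(gray_grid):
    """Return a tuple of bits: 1 where a pixel is brighter than the mean."""
    pixels = [p for row in gray_grid for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)


def hamming_distance(bits_a, bits_b):
    """Count the positions where two equal-length hashes differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(bits_a, bits_b))


# Hypothetical 4x4 grayscale thumbnails (values 0-255).
original = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
# Same picture, slightly brightened: every pixel shifted by +5.
brightened = [[p + 5 for p in row] for row in original]
# A different picture: bright on top, dark on the bottom.
different = [
    [200, 200, 200, 200],
    [200, 200, 200, 200],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]

hash_original = average_hash_bits(original)
hash_brightened = average_hash_bits(brightened)
hash_different = average_hash_bits(different)

print(hamming_distance(hash_original, hash_brightened))  # 0
print(hamming_distance(hash_original, hash_different))   # 8
```

With imagehash, subtracting two hash objects returns this same distance, so a check such as hash_a - hash_b <= 5 would also treat near-identical images as duplicates. The cutoff of 5 is an assumption to tune for your own collection.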
Pro Tip: This can be used to clean meme folders before uploading them to Instagram, YouTube Shorts, etc.
Like this tip? Let me know in the comments!
AI and Machine Learning Roadmap: From Basic to Advanced
Stage 1: Python & Programming Fundamentals
----------------------------------------
1. Python & Programming Fundamentals
----------------------------------------
1.1 Environment Setup
• Install Python 3.x, VS Code / PyCharm
• Configure linting, formatters (e.g., Pylint, Black)
• Jupyter Notebook / Google Colab basics
1.2 Core Python Syntax
• Variables, Data Types (int, float, str, bool)
• Operators: arithmetic, comparison, logical, bitwise
• Control Flow: if / else / elif
• Loops: for, while, break/continue
1.3 Functions & Modules
• Defining functions, return values
• Parameters: positional, keyword, default args
• *args, **kwargs
• Organizing code: modules and packages
• Standard library exploration (os, sys, datetime, random, math)
1.4 Data Structures
• Lists, Tuples, Sets, Dictionaries
• List/dict comprehensions
• Built-in functions: map, filter, zip, enumerate
• When to use which structure
1.5 File Handling & Exceptions
• Reading/Writing text and binary files
• Context managers (`with` statement)
• Exception handling: try/except/finally
• Custom exceptions
1.6 Object-Oriented Programming (OOP)
• Classes, Instances, Attributes, Methods
• __init__, self, class vs instance attributes
• Inheritance, Polymorphism, Encapsulation
• Magic methods: __str__, __repr__, __add__, etc.
• Use-cases in structuring larger projects
1.7 Virtual Environments & Package Management
• venv / pipenv / poetry basics
• Installing and managing dependencies
• requirements.txt and environment.yml
Tools: VS Code, Git for version control, Jupyter/Colab
Stage 2: Mathematics for Machine Learning
----------------------------------------
2. Mathematics for Machine Learning
----------------------------------------
2.1 Linear Algebra
• Scalars, Vectors, Matrices, Tensors
• Operations: addition, multiplication, dot product
• Matrix properties: transpose, inverse, rank
• Eigenvalues & Eigenvectors (intuition)
• Applications: data transformations, PCA
2.2 Calculus
• Functions and limits (intuitive overview)
• Derivatives: gradient of single-variable and multi-variable functions
• Chain rule (key for backpropagation in neural networks)
• Partial derivatives
• Basic integration (overview; less often used directly)
2.3 Probability & Statistics
• Basic probability theory: events, conditional probability, Bayes’ theorem
• Random variables, distributions (normal, binomial, Poisson, etc.)
• Descriptive statistics: mean, median, mode, variance, standard deviation
• Inferential statistics: hypothesis testing, p-values, confidence intervals
• Sampling methods, bias, variance concepts
2.4 Optimization Basics
• Concept of optimization in ML (finding minima of loss functions)
• Gradient descent: batch, stochastic, mini-batch
• Learning rate intuition
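As a rough illustration of the gradient descent idea in 2.4, here is a minimal sketch on a one-variable loss. The loss function, starting point, and step count are arbitrary choices for the example, and the helper name gradient_descent is hypothetical.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient from a starting point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


# Illustrative loss: f(x) = (x - 3)^2, whose derivative is 2 * (x - 3).
def loss_gradient(x):
    return 2 * (x - 3)


minimum = gradient_descent(loss_gradient, start=0.0)
print(round(minimum, 4))  # 3.0
```

For this particular loss, a learning rate above 1.0 makes each step overshoot by more than it corrects, so the iterates diverge instead of settling at 3. Stochastic and mini-batch variants use the same update but estimate the gradient from a subset of the data.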
Tools / References:
• Interactive calculators: Desmos, GeoGebra
• Python libraries: NumPy for experimentation
Stage 3: Data Handling & Preprocessing
----------------------------------------
3. Data Handling & Preprocessing
----------------------------------------
3.1 NumPy Essentials
• ndarrays: creation, indexing, slicing
• Vectorized operations vs Python loops
• Broadcasting rules
• Random number generation
3.2 Pandas for Tabular Data
• Series & DataFrame: creation and basic ops
• Reading data: CSV, Excel, JSON
• Indexing, selection (loc/iloc), filtering rows
• Handling missing values: dropna, fillna
• Detecting/removing duplicates
• Combining datasets: merge, join, concat
• GroupBy operations, aggregation, pivot tables
3.3 Feature Engineering
• Feature scaling: normalization (Min-Max), standardization (Z-score)
• Encoding categorical variables: one-hot, ordinal encoding
• Date/time feature extraction (if applicable)
• Creating new features via domain knowledge
• Feature selection: variance threshold, correlation analysis
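The two scaling methods named in 3.3 can be sketched in plain Python as follows. The ages column is made-up data and the function names are illustrative; in practice you would more likely reach for scikit-learn's MinMaxScaler and StandardScaler.

```python
from statistics import mean, pstdev


def min_max_scale(values):
    """Rescale values linearly into the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]  # constant feature: no spread to scale
    return [(v - low) / (high - low) for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


ages = [20, 30, 40, 50, 60]  # made-up feature column
print(min_max_scale(ages))                       # [0.0, 0.25, 0.5, 0.75, 1.0]
print([round(z, 3) for z in standardize(ages)])  # [-1.414, -0.707, 0.0, 0.707, 1.414]
```

Min-Max keeps the original shape of the distribution inside fixed bounds, while Z-score scaling centers it, which tends to suit models that assume roughly zero-centered inputs.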
3.4 Data Visualization
• Matplotlib basics: line plot, scatter plot, histograms, bar charts
• Seaborn overview: higher-level plots (heatmap, pairplot)
• Visualizing distributions, relationships, outliers
• Plot customization: titles, labels, legends
3.5 Handling Real-World Data Challenges
• Imbalanced datasets: oversampling (SMOTE), undersampling, class weights
• Outlier detection and treatment
• Data leakage awareness
• Pipeline creation in scikit-learn
Tools: NumPy, Pandas, Matplotlib, Seaborn, scikit-learn utilities
Stage 4: Core Machine Learning
----------------------------------------
4. Core Machine Learning
----------------------------------------
4.1 ML Concepts & Workflow
• What is ML? Supervised vs Unsupervised vs Semi-supervised vs Reinforcement
• Training, Validation, Testing splits
• Overfitting vs Underfitting, bias-variance trade-off
• Cross-validation techniques: k-fold, stratified
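To show what k-fold cross-validation actually splits, here is a small sketch that only produces index lists. The helper k_fold_indices is hypothetical, and it does no shuffling or stratification (scikit-learn's KFold and StratifiedKFold cover those cases).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


for train, validation in k_fold_indices(n_samples=6, k=3):
    print(train, validation)
# [2, 3, 4, 5] [0, 1]
# [0, 1, 4, 5] [2, 3]
# [0, 1, 2, 3] [4, 5]
```

Every sample lands in the validation set exactly once, so averaging the k validation scores uses all of the data for evaluation without ever scoring a model on rows it was trained on.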
4.2 Supervised Learning: Regression
• Linear Regression: assumptions, cost function, normal equation
• Regularized Regression: Ridge, Lasso, Elastic Net
• Polynomial Regression
• Evaluation metrics: MSE, RMSE, MAE, R²
4.3 Supervised Learning: Classification
• Logistic Regression: sigmoid, decision boundary, loss
• k-Nearest Neighbors (KNN)
• Decision Trees: entropy/gini, pruning
• Ensemble Methods:
- Bagging: Random Forest
- Boosting: AdaBoost, Gradient Boosting, XGBoost (intro)
• Support Vector Machines (SVM): kernel trick overview
• Naive Bayes: Gaussian, Multinomial
• Evaluation: accuracy, precision, recall, F1-score, ROC-AUC
• Confusion matrix analysis
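The classification metrics listed above all derive from the four confusion-matrix counts. Here is a minimal sketch for the binary case, with made-up labels and illustrative function names.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1, returning 0.0 where undefined."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
print(confusion_counts(y_true, y_pred))     # (3, 1, 1, 5)
print(precision_recall_f1(y_true, y_pred))  # (0.75, 0.75, 0.75)
```

Accuracy here is 8 out of 10, which looks better than the 0.75 precision and recall. That gap is why accuracy alone can mislead on imbalanced classes.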
4.4 Unsupervised Learning
• Clustering:
- K-Means: elbow method, silhouette score
- Hierarchical clustering: dendrograms
- DBSCAN
• Dimensionality Reduction:
- PCA: variance explained
- t-SNE / UMAP (visualization-focused)
• Anomaly Detection overview
4.5 Model Selection & Tuning
• Hyperparameter tuning: grid search, random search, Bayesian optimization (overview)
• Automated tuning libraries (e.g., scikit-learn’s GridSearchCV, RandomizedSearchCV)
• Pipeline building to avoid leakage
• Feature importance and model interpretability basics
Tools: scikit-learn, pandas, NumPy
Stage 5: Deep Learning Foundations
----------------------------------------
5. Deep Learning Foundations
----------------------------------------
5.1 Neural Network Basics
• Artificial neuron model, activation functions (ReLU, Sigmoid, Tanh)
• Architecture: input, hidden, output layers
• Forward propagation, loss functions (Cross-entropy, MSE)
• Backpropagation: gradient computation, chain rule
5.2 Deep Learning Frameworks
• TensorFlow & Keras: Sequential and Functional APIs
• PyTorch basics: tensors, autograd, nn.Module
• Comparing TF/Keras vs PyTorch (choose one to start)
5.3 Training Deep Models
• Optimizers: SGD, Adam, RMSprop
• Learning rate scheduling
• Regularization: Dropout, Batch Normalization, Weight Decay
• Handling overfitting: early stopping, data augmentation
5.4 Basic DL Projects
• MNIST digit classification
• CIFAR-10 image classification (small CNN)
• Simple feedforward network on tabular data
Tools: TensorFlow/Keras or PyTorch, GPU if available (Colab/GPU runtime)
Stage 6: Advanced Deep Learning & Architectures
----------------------------------------
6. Advanced Deep Learning & Architectures
----------------------------------------
6.1 Convolutional Neural Networks (CNNs)
• Convolution operations, filters, feature maps
• Pooling layers, padding, stride
• Famous architectures overview: LeNet, AlexNet, VGG, ResNet (intuition)
• Transfer Learning: fine-tuning pre-trained models
6.2 Recurrent Neural Networks (RNNs) & Sequence Models
• RNN basics: hidden states, vanishing gradients
• LSTM, GRU: gating mechanisms
• Sequence-to-sequence models (intro)
• Attention mechanism: intuition
6.3 Transformers & Attention
• Self-attention mechanism
• Transformer architecture: encoder, decoder overview
• Pre-trained transformer models: BERT, GPT family (conceptual)
• Fine-tuning transformers for tasks
6.4 Generative Models
• Autoencoders: basic, variational autoencoders (VAE) overview
• Generative Adversarial Networks (GANs): generator/discriminator intuition
• Applications and basic experiments
6.5 Advanced Techniques
• Multi-task learning, meta-learning (intro)
• Few-shot learning, transfer learning deeper dive
• Neural architecture search (overview)
• Model compression, pruning, quantization (deployment considerations)
Tools: TensorFlow / PyTorch, Hugging Face Transformers library
Stage 7: Natural Language Processing (NLP) Advanced
----------------------------------------
7. Natural Language Processing (NLP)
----------------------------------------
7.1 Text Preprocessing & Representation
• Tokenization (word, subword/BPE)
• Stopwords removal, lemmatization vs stemming
• Word embeddings: Word2Vec, GloVe, FastText
• Contextual embeddings: ELMo, BERT embeddings
7.2 Transformer-based NLP
• Pre-trained models: BERT, RoBERTa, GPT, T5
• Fine-tuning for classification, QA, summarization
• Sequence generation tasks using GPT-like models
7.3 Specialized NLP Tasks
• Named Entity Recognition (NER)
• Machine Translation overview
• Question Answering pipelines
• Text Summarization (extractive vs abstractive)
• Sentiment Analysis deep dive
7.4 Evaluation Metrics in NLP
• BLEU, ROUGE, METEOR (for generation)
• Accuracy, F1 for classification tasks
Tools: Hugging Face Transformers, spaCy, NLTK
Stage 8: Computer Vision Advanced
----------------------------------------
8. Computer Vision (CV)
----------------------------------------
8.1 Image Preprocessing & Augmentation
• OpenCV basics: reading, resizing, color conversions
• Data augmentation techniques: flips, rotations, crops, color jitter
8.2 Advanced CNN Architectures
• Inception, ResNet, DenseNet, EfficientNet (conceptual)
• Transfer learning and fine-tuning advanced models
• Object detection frameworks: YOLOvX, SSD, Faster R-CNN (overview)
• Semantic segmentation: U-Net, Mask R-CNN
• Instance segmentation concepts
8.3 Vision Transformers (ViT)
• Applying transformer concepts to images
• Fine-tuning ViT for classification
8.4 Specialized CV Tasks
• Face recognition pipelines
• Video analysis basics: action recognition, object tracking
• 3D vision intro (depth estimation)
Tools: OpenCV, TensorFlow/PyTorch, libraries like Detectron2 or YOLO implementations
Stage 9: Reinforcement Learning & Advanced Topics
----------------------------------------
9. Reinforcement Learning & Advanced Topics
----------------------------------------
9.1 Reinforcement Learning Foundations
• Markov Decision Process (MDP)
• Value functions, policy functions
• Q-Learning, SARSA (tabular methods)
• Exploration vs Exploitation
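To see the tabular Q-learning update in action, here is a small sketch on a toy problem. The corridor environment, the reward of 1 at the goal, and the hyperparameters are all assumptions made up for this example.

```python
import random


def train_q_table(n_states=5, episodes=500, alpha=0.5, gamma=0.9, seed=0):
    """Tabular Q-learning on a toy corridor: start at state 0, goal at the far end."""
    rng = random.Random(seed)
    goal = n_states - 1
    # q[state][action]; action 0 = move left, action 1 = move right.
    q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        state = 0
        while state != goal:
            # Behave randomly; Q-learning is off-policy, so it still
            # learns values for the greedy policy.
            action = rng.randrange(2)
            next_state = max(state - 1, 0) if action == 0 else state + 1
            reward = 1.0 if next_state == goal else 0.0
            # No future value beyond the terminal goal state.
            future = 0.0 if next_state == goal else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state = next_state
    return q


q_table = train_q_table()
greedy_policy = [row.index(max(row)) for row in q_table[:-1]]
print(greedy_policy)  # [1, 1, 1, 1]
```

The behavior policy here is pure exploration. An epsilon-greedy agent would instead pick the best-known action most of the time and a random one with a small probability, which is the exploration vs exploitation trade-off in practice.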
9.2 Deep Reinforcement Learning
• Deep Q-Networks (DQN)
• Policy Gradient Methods: REINFORCE, Actor-Critic
• Advanced: A3C, PPO, DDPG overview
9.3 Other Advanced AI Topics
• Graph Neural Networks (GNNs): node/graph embeddings (overview)
• Time Series Forecasting with ML/DL: RNN/LSTM, Prophet intro
• Bayesian Methods overview
• AutoML and neural architecture search concepts
• Federated Learning basics (privacy-aware training)
• MLOps fundamentals:
- Model versioning
- Continuous integration/continuous deployment (CI/CD) for ML
- Monitoring models in production
- Tools: MLflow, Kubeflow (intro)
• Edge AI / TinyML overview (deploying models on devices)
Tools: RL libraries (Stable Baselines3), MLflow, Kubernetes intro, Docker
Stage 10: Deployment, Production & MLOps
----------------------------------------
10. Deployment, Production & MLOps
----------------------------------------
10.1 Model Serving & APIs
• REST API with Flask / FastAPI
• gRPC basics (overview)
• Dockerizing ML applications
• Serving with TensorFlow Serving or TorchServe
10.2 Cloud Deployment
• Deploy on AWS SageMaker / GCP AI Platform / Azure ML (basic workflow)
• Serverless deployments (AWS Lambda, Cloud Functions) for small models
• CI/CD pipelines for ML: GitHub Actions or Jenkins integration
10.3 Monitoring & Maintenance
• Logging model inputs/outputs
• Drift detection (data/model drift)
• Retraining pipelines (automated or scheduled)
• Scaling considerations
10.4 MLOps Tools & Practices
• Experiment tracking (MLflow, Weights & Biases)
• Data versioning (DVC)
• Model registry concepts
• Infrastructure as Code (Terraform intro)
Tools: Docker, Kubernetes basics, CI/CD tools, cloud consoles
Stage 11: Real-World Projects & Portfolio
----------------------------------------
11. Real-World Projects & Portfolio
----------------------------------------
11.1 Project Ideas by Domain
• Tabular Data: Predictive analytics (e.g., churn prediction)
• NLP: Chatbot, summarizer, translation prototype
• CV: Image classifier, object detector, image segmentation app
• Time Series: Forecasting stock or weather data
• RL: Simple game-playing agent
• Generative: GAN art generation or style transfer demo
11.2 End-to-End Pipeline
• Data collection & preprocessing
• Model training & validation
• Deployment as API or web app (Streamlit/Flask)
• Monitoring & iteration
• Documentation & README
11.3 Collaboration & Open Source
• Participate in Kaggle competitions (beginner → intermediate)
• Contribute to open-source ML projects
• Write blog posts/tutorials documenting your projects
11.4 Soft Skills & Communication
• Clear README, code comments
• Presentation slides or videos of project demos
• Networking: sharing work on LinkedIn, GitHub
Tools: GitHub Pages, Streamlit, Heroku/Netlify, Docker
Stage 12: Ethics, Explainability & Continuous Learning
----------------------------------------
12. Ethics, Explainability & Continuous Learning
----------------------------------------
12.1 AI Ethics & Responsible AI
• Bias & Fairness: identifying and mitigating bias
• Privacy concerns: GDPR, data protection best practices
• Transparency: documenting data sources and model decisions
12.2 Explainable AI (XAI)
• Model interpretability: SHAP, LIME (basic usage)
• Interpreting black-box models vs inherently interpretable models
• Communicating explanations to stakeholders
12.3 Continuous Learning & Staying Updated
• Following research: arXiv alerts, ML conferences (NeurIPS, ICML, CVPR summaries)
• Blogs, podcasts, newsletters (e.g., “The Batch” by deeplearning.ai)
• Reading codebases of popular libraries, exploring new architectures
• Community involvement: forums, study groups
12.4 Advanced Research Topics (Optional/For Aspirants)
• Research paper reading workflow
• Experimentation frameworks
• Contributing to academic research or advanced industrial research
Tools: arXiv, Google Scholar alerts, RSS readers, community forums