Customer Segmentation Model

Utkarsh Tripathi (2025EM1100146); Juwaria Qadri (2025EM1100132); Mohammed Omar (2025EM1100235); Merin Ann Cherian (2025EM1100211)

Phase 1

Segmenting Blinkit customers based on their spending behaviour and delivery experience to identify distinct customer clusters (high, medium, and low spending) for targeted marketing strategies and increasing sales

Proposal

Project Statement
Across customer-centric businesses, one of the major goals for the organization is being able to predict the needs of its customers and how to best cater to their needs while maintaining high profitability. However, to make the solution feasible, the way forward would be to create groups of consumers that share similar spending patterns, allowing the businesses to create targeted marketing campaigns, improve customer satisfaction, demand-oriented product development, and therefore improve profitability.
Business Goal
The objective of this project is to develop a machine learning model that can create suitable and usable clusters/segments given the profile and purchases of the customer. The model should then be able to predict, within bounds of acceptable error, which segment a customer is likely to belong to. The model should allow stakeholders to make informed and data-backed decisions on product demand, campaign successes and customer retention.
Data Source
We will use the “Blinkit Sales Dataset” for this project. This dataset provides detailed information on product sales, visibility, item types, and outlet performance, making it ideal for performing sales data analysis and gaining insights into business trends. The dataset is well-structured and suitable for data preprocessing, exploratory data analysis, and predictive modeling tasks.
- Source Platform: Kaggle
- Dataset: Blinkit Sales Data (Vaghasiya 2025)
Tools and Technology
Following languages and libraries are planned to be used for this project. More libraries may be used and every non trivial library shall be updated.
- Python
  - Core Libraries
    - Data Manipulation: Pandas, NumPy
    - ML: scikit-learn
    - Data Visualization & Storytelling: Matplotlib, Seaborn
  - Development Environment: VS Code, Google Colab, GitHub
- Dashboard: Power BI
Project Workflow
The project is to follow a standard data science development lifecycle,
1. Data Acquisition: Fetch the dataset from Kaggle using its API.
2. Preprocessing: Handle missing values (if any), encode categorical variables, and
  check for data inconsistencies.
3. EDA: Analyze features to understand their relationship with attrition using statistical summaries and visualizations.
4. Feature Engineering: Create new features from existing ones if necessary to improve model performance.
5. Modeling: Train several clustering models (e.g., K Means Clustering, PCA, Decision Trees).
6. Evaluation: Assess model performance using metrics like Accuracy, Precision, Recall, and F1-Score. Select the best-performing model.
7. Visualization: Create an interactive dashboard in Power BI to present the key findings and predictions to stakeholders.
Data Extraction
The “Blinkit Sales Dataset” is acquired directly from the Kaggle repository. To ensure a professional and reproducible workflow, manually downloading the files is not done.
Instead, we will perform the following steps:
- Automate the Process: We will write a Python script that utilizes the official Kaggle API to connect to the source and download the dataset.
- Ensure Reproducibility: This scripted approach guarantees that the data extraction process is consistent and can be easily re-run by any team member or reviewer.
- Prepare for Analysis: The script will handle the unzipping of the downloaded files and load the data directly into a Pandas DataFrame, making it immediately available for the next phase of our project.
- Notebook: Customer Segmentation Notebook
Schema / Data Dictionary
This data dictionary was created after inspecting the dataset.

	Feature Name	Data Type	Description	PK (Yes/No)
0	order_id	int64	Unique identifier for each order	Yes
1	customer_id	int64	Unique identifier for each customer	No
2	order_date	object	The timestamp when the order was placed	No
3	promised_delivery_time	object	The time informed to the customer for completi...	No
4	actual_delivery_time	object	The actual time when the order was delivered	No
5	delivery_status	object	The delivery status of the order	No
6	order_total	float64	The total price of the order (in Rupees)	No
7	payment_method	object	The payment method used by the customer	No
8	delivery_partner_id	int64	Unique identifier representing the delivery pa...	No
9	store_id	int64	Unique identifier for the store that fulfilled...	No

Source: Data Dictionary

Phase 2

Customer Segmentation Modelling

	order_id	customer_id	order_date	promised_delivery_time	actual_delivery_time	delivery_status	order_total	payment_method	delivery_partner_id	store_id
0	1961864118	30065862	2024-07-17 08:34:01	2024-07-17 08:52:01	2024-07-17 08:47:01	On Time	3197.07	Cash	63230	4771
1	1549769649	9573071	2024-05-28 13:14:29	2024-05-28 13:25:29	2024-05-28 13:27:29	On Time	976.55	Cash	14983	7534
2	9185164487	45477575	2024-09-23 13:07:12	2024-09-23 13:25:12	2024-09-23 13:29:12	On Time	839.05	UPI	39859	9886
3	9644738826	88067569	2023-11-24 16:16:56	2023-11-24 16:34:56	2023-11-24 16:33:56	On Time	440.23	Card	61497	7917
4	5427684290	83298567	2023-11-20 05:00:39	2023-11-20 05:17:39	2023-11-20 05:18:39	On Time	2526.68	Cash	84315	2741
5	3265154092	43367112	2023-03-18 16:29:51	2023-03-18 16:49:51	2023-03-18 16:48:51	On Time	3161.43	UPI	554	3442
6	4898355547	13284996	2023-04-16 18:50:37	2023-04-16 19:01:37	2023-04-16 19:02:37	On Time	956.40	Card	14630	1318
7	6568151549	88866835	2024-03-31 06:26:48	2024-03-31 06:37:48	2024-03-31 06:39:48	On Time	905.47	Cash	67714	115
8	6006693867	24496983	2023-07-13 23:49:36	2023-07-14 00:02:36	2023-07-14 00:05:36	On Time	1371.17	Card	91362	9021
9	374186990	52215833	2023-08-09 01:17:30	2023-08-09 01:37:30	2023-08-09 01:44:30	Slightly Delayed	1601.19	Wallet	77203	7955

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   order_id                5000 non-null   int64  
 1   customer_id             5000 non-null   int64  
 2   order_date              5000 non-null   object 
 3   promised_delivery_time  5000 non-null   object 
 4   actual_delivery_time    5000 non-null   object 
 5   delivery_status         5000 non-null   object 
 6   order_total             5000 non-null   float64
 7   payment_method          5000 non-null   object 
 8   delivery_partner_id     5000 non-null   int64  
 9   store_id                5000 non-null   int64  
dtypes: float64(1), int64(4), object(5)
memory usage: 390.8+ KB

order_id                  0
customer_id               0
order_date                0
promised_delivery_time    0
actual_delivery_time      0
delivery_status           0
order_total               0
payment_method            0
delivery_partner_id       0
store_id                  0
dtype: int64

	order_id	customer_id	order_total	delivery_partner_id	store_id
count	5.000000e+03	5.000000e+03	5000.00000	5000.000000	5000.000000
mean	5.029129e+09	5.009685e+07	2201.86170	50050.318200	4999.689000
std	2.863533e+09	2.919082e+07	1303.02438	28802.276922	2886.089242
min	6.046500e+04	3.181300e+04	13.25000	43.000000	1.000000
25%	2.531421e+09	2.404314e+07	1086.21500	24928.500000	2509.250000
50%	5.074378e+09	4.997808e+07	2100.69000	50262.500000	4987.000000
75%	7.488579e+09	7.621215e+07	3156.88250	74478.250000	7500.750000
max	9.998298e+09	9.989390e+07	6721.46000	99968.000000	9995.000000

	0
order_id	int64
customer_id	int64
order_date	object
promised_delivery_time	object
actual_delivery_time	object
delivery_status	object
order_total	float64
payment_method	object
delivery_partner_id	int64
store_id	int64

dtype: object

'Shape of DataFrame:'

(5000, 10)

'Data Types:'

order_id                    int64
customer_id                 int64
order_date                 object
promised_delivery_time     object
actual_delivery_time       object
delivery_status            object
order_total               float64
payment_method             object
delivery_partner_id         int64
store_id                    int64
dtype: object

'First 5 Rows:'

	order_id	customer_id	order_date	promised_delivery_time	actual_delivery_time	delivery_status	order_total	payment_method	delivery_partner_id	store_id
0	1961864118	30065862	2024-07-17 08:34:01	2024-07-17 08:52:01	2024-07-17 08:47:01	On Time	3197.07	Cash	63230	4771
1	1549769649	9573071	2024-05-28 13:14:29	2024-05-28 13:25:29	2024-05-28 13:27:29	On Time	976.55	Cash	14983	7534
2	9185164487	45477575	2024-09-23 13:07:12	2024-09-23 13:25:12	2024-09-23 13:29:12	On Time	839.05	UPI	39859	9886
3	9644738826	88067569	2023-11-24 16:16:56	2023-11-24 16:34:56	2023-11-24 16:33:56	On Time	440.23	Card	61497	7917
4	5427684290	83298567	2023-11-20 05:00:39	2023-11-20 05:17:39	2023-11-20 05:18:39	On Time	2526.68	Cash	84315	2741

Missing Values:
order_id                  0
customer_id               0
order_date                0
promised_delivery_time    0
actual_delivery_time      0
delivery_status           0
order_total               0
payment_method            0
delivery_partner_id       0
store_id                  0
dtype: int64

order_id                  0
customer_id               0
order_date                0
promised_delivery_time    0
actual_delivery_time      0
delivery_status           0
order_total               0
payment_method            0
delivery_partner_id       0
store_id                  0
dtype: int64

Number of Duplicated Rows: 0

Descriptive Statistics:

	order_id	customer_id	order_date	promised_delivery_time	actual_delivery_time	delivery_status	order_total	payment_method	delivery_partner_id	store_id
count	5.000000e+03	5.000000e+03	5000	5000	5000	5000	5000.00000	5000	5000.000000	5000.000000
unique	NaN	NaN	5000	4999	5000	3	NaN	4	NaN	NaN
top	NaN	NaN	2024-07-17 08:34:01	2024-10-13 14:06:50	2024-07-17 08:47:01	On Time	NaN	Card	NaN	NaN
freq	NaN	NaN	1	2	1	3470	NaN	1285	NaN	NaN
mean	5.029129e+09	5.009685e+07	NaN	NaN	NaN	NaN	2201.86170	NaN	50050.318200	4999.689000
std	2.863533e+09	2.919082e+07	NaN	NaN	NaN	NaN	1303.02438	NaN	28802.276922	2886.089242
min	6.046500e+04	3.181300e+04	NaN	NaN	NaN	NaN	13.25000	NaN	43.000000	1.000000
25%	2.531421e+09	2.404314e+07	NaN	NaN	NaN	NaN	1086.21500	NaN	24928.500000	2509.250000
50%	5.074378e+09	4.997808e+07	NaN	NaN	NaN	NaN	2100.69000	NaN	50262.500000	4987.000000
75%	7.488579e+09	7.621215e+07	NaN	NaN	NaN	NaN	3156.88250	NaN	74478.250000	7500.750000
max	9.998298e+09	9.989390e+07	NaN	NaN	NaN	NaN	6721.46000	NaN	99968.000000	9995.000000

Categorical Column Summary:

	order_date	promised_delivery_time	actual_delivery_time	delivery_status	payment_method
count	5000	5000	5000	5000	5000
unique	5000	4999	5000	3	4
top	2024-07-17 08:34:01	2024-10-13 14:06:50	2024-07-17 08:47:01	On Time	Card
freq	1	2	1	3470	1285


 Rows with Invalid (≤0) Order Total:
 Empty DataFrame
Columns: [order_id, customer_id, order_date, promised_delivery_time, actual_delivery_time, delivery_status, order_total, payment_method, delivery_partner_id, store_id]
Index: []


Rows with Invalid Delivery Times (Before Order Date):
         order_id  customer_id      order_date promised_delivery_time  \
1281  2148242015     39093069  30-04-23 23:48    2023-01-05 00:04:00   
1563  7821942520     84468124  12-02-24 23:57    2024-02-13 00:12:00   
2933  3368317718     70867736  31-05-24 23:59    2024-01-06 00:17:00   

     actual_delivery_time        delivery_status  order_total payment_method  \
1281  2023-01-05 00:33:00  Significantly Delayed      3239.66            UPI   
1563  2024-02-13 00:13:00                On Time      2732.91         Wallet   
2933  2024-01-06 00:12:00                On Time       914.88            UPI   

      delivery_partner_id  store_id  
1281                13676      9904  
1563                11136      5805  
2933                96055       522

Unique Value Counts per Column:
order_id                  5000
customer_id               2172
order_date                5000
promised_delivery_time    4999
actual_delivery_time      5000
delivery_status              3
order_total               4550
payment_method               4
delivery_partner_id       5000
store_id                  5000
dtype: int64

References

Vaghasiya, Akshit. 2025. “Blinkit Sales Dataset.” https://www.kaggle.com/datasets/akxiit/blinkit-sales-dataset.