Customer Segmentation Model

Authors
Affiliation

Utkarsh Tripathi (2025EM1100146)

BITS Pilani Digital

Juwaria Qadri (2025EM1100132)

BITS Pilani Digital

Mohammed Omar (2025EM1100235)

BITS Pilani Digital

Merin Ann Cherian (2025EM1100211)

BITS Pilani Digital

Phase 1

Segmenting Blinkit customers based on their spending behaviour and delivery experience to identify distinct customer clusters (high, medium, and low spending) for targeted marketing strategies and increasing sales

Proposal

  1. Project Statement
    Across customer-centric businesses, one of the major goals for the organization is being able to predict the needs of its customers and how to best cater to their needs while maintaining high profitability. However, to make the solution feasible, the way forward would be to create groups of consumers that share similar spending patterns, allowing the businesses to create targeted marketing campaigns, improve customer satisfaction, demand-oriented product development, and therefore improve profitability.

  2. Business Goal
    The objective of this project is to develop a machine learning model that can create suitable and usable clusters/segments given the profile and purchases of the customer. The model should then be able to predict, within bounds of acceptable error, which segment a customer is likely to belong to. The model should allow stakeholders to make informed and data-backed decisions on product demand, campaign successes and customer retention.

  3. Data Source
    We will use the “Blinkit Sales Dataset” for this project. This dataset provides detailed information on product sales, visibility, item types, and outlet performance, making it ideal for performing sales data analysis and gaining insights into business trends. The dataset is well-structured and suitable for data preprocessing, exploratory data analysis, and predictive modeling tasks.

    • Source Platform: Kaggle
    • Dataset: Blinkit Sales Data (Vaghasiya 2025)
  4. Tools and Technology
    Following languages and libraries are planned to be used for this project. More libraries may be used and every non trivial library shall be updated.

    • Python
      • Core Libraries
        • Data Manipulation: Pandas, NumPy
        • ML: scikit-learn
        • Data Visualization & Storytelling: Matplotlib, Seaborn
      • Development Environment: VS Code, Google Colab, GitHub
    • Dashboard: Power BI
  5. Project Workflow
    The project is to follow a standard data science development lifecycle,

    1. Data Acquisition: Fetch the dataset from Kaggle using its API.
    2. Preprocessing: Handle missing values (if any), encode categorical variables, and
      check for data inconsistencies.
    3. EDA: Analyze features to understand their relationship with attrition using statistical summaries and visualizations.
    4. Feature Engineering: Create new features from existing ones if necessary to improve model performance.
    5. Modeling: Train several clustering models (e.g., K Means Clustering, PCA, Decision Trees).
    6. Evaluation: Assess model performance using metrics like Accuracy, Precision, Recall, and F1-Score. Select the best-performing model.
    7. Visualization: Create an interactive dashboard in Power BI to present the key findings and predictions to stakeholders.
  6. Data Extraction
    The “Blinkit Sales Dataset” is acquired directly from the Kaggle repository. To ensure a professional and reproducible workflow, manually downloading the files is not done.
    Instead, we will perform the following steps:

    • Automate the Process: We will write a Python script that utilizes the official Kaggle API to connect to the source and download the dataset.
    • Ensure Reproducibility: This scripted approach guarantees that the data extraction process is consistent and can be easily re-run by any team member or reviewer.
    • Prepare for Analysis: The script will handle the unzipping of the downloaded files and load the data directly into a Pandas DataFrame, making it immediately available for the next phase of our project.
    • Notebook: Customer Segmentation Notebook
  7. Schema / Data Dictionary
    This data dictionary was created after inspecting the dataset.

Feature Name Data Type Description PK (Yes/No)
0 order_id int64 Unique identifier for each order Yes
1 customer_id int64 Unique identifier for each customer No
2 order_date object The timestamp when the order was placed No
3 promised_delivery_time object The time informed to the customer for completi... No
4 actual_delivery_time object The actual time when the order was delivered No
5 delivery_status object The delivery status of the order No
6 order_total float64 The total price of the order (in Rupees) No
7 payment_method object The payment method used by the customer No
8 delivery_partner_id int64 Unique identifier representing the delivery pa... No
9 store_id int64 Unique identifier for the store that fulfilled... No
Source: Data Dictionary

Phase 2

Open In Colab

Customer Segmentation Modelling

order_id customer_id order_date promised_delivery_time actual_delivery_time delivery_status order_total payment_method delivery_partner_id store_id
0 1961864118 30065862 2024-07-17 08:34:01 2024-07-17 08:52:01 2024-07-17 08:47:01 On Time 3197.07 Cash 63230 4771
1 1549769649 9573071 2024-05-28 13:14:29 2024-05-28 13:25:29 2024-05-28 13:27:29 On Time 976.55 Cash 14983 7534
2 9185164487 45477575 2024-09-23 13:07:12 2024-09-23 13:25:12 2024-09-23 13:29:12 On Time 839.05 UPI 39859 9886
3 9644738826 88067569 2023-11-24 16:16:56 2023-11-24 16:34:56 2023-11-24 16:33:56 On Time 440.23 Card 61497 7917
4 5427684290 83298567 2023-11-20 05:00:39 2023-11-20 05:17:39 2023-11-20 05:18:39 On Time 2526.68 Cash 84315 2741
5 3265154092 43367112 2023-03-18 16:29:51 2023-03-18 16:49:51 2023-03-18 16:48:51 On Time 3161.43 UPI 554 3442
6 4898355547 13284996 2023-04-16 18:50:37 2023-04-16 19:01:37 2023-04-16 19:02:37 On Time 956.40 Card 14630 1318
7 6568151549 88866835 2024-03-31 06:26:48 2024-03-31 06:37:48 2024-03-31 06:39:48 On Time 905.47 Cash 67714 115
8 6006693867 24496983 2023-07-13 23:49:36 2023-07-14 00:02:36 2023-07-14 00:05:36 On Time 1371.17 Card 91362 9021
9 374186990 52215833 2023-08-09 01:17:30 2023-08-09 01:37:30 2023-08-09 01:44:30 Slightly Delayed 1601.19 Wallet 77203 7955
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   order_id                5000 non-null   int64  
 1   customer_id             5000 non-null   int64  
 2   order_date              5000 non-null   object 
 3   promised_delivery_time  5000 non-null   object 
 4   actual_delivery_time    5000 non-null   object 
 5   delivery_status         5000 non-null   object 
 6   order_total             5000 non-null   float64
 7   payment_method          5000 non-null   object 
 8   delivery_partner_id     5000 non-null   int64  
 9   store_id                5000 non-null   int64  
dtypes: float64(1), int64(4), object(5)
memory usage: 390.8+ KB
order_id                  0
customer_id               0
order_date                0
promised_delivery_time    0
actual_delivery_time      0
delivery_status           0
order_total               0
payment_method            0
delivery_partner_id       0
store_id                  0
dtype: int64
order_id customer_id order_total delivery_partner_id store_id
count 5.000000e+03 5.000000e+03 5000.00000 5000.000000 5000.000000
mean 5.029129e+09 5.009685e+07 2201.86170 50050.318200 4999.689000
std 2.863533e+09 2.919082e+07 1303.02438 28802.276922 2886.089242
min 6.046500e+04 3.181300e+04 13.25000 43.000000 1.000000
25% 2.531421e+09 2.404314e+07 1086.21500 24928.500000 2509.250000
50% 5.074378e+09 4.997808e+07 2100.69000 50262.500000 4987.000000
75% 7.488579e+09 7.621215e+07 3156.88250 74478.250000 7500.750000
max 9.998298e+09 9.989390e+07 6721.46000 99968.000000 9995.000000
0
order_id int64
customer_id int64
order_date object
promised_delivery_time object
actual_delivery_time object
delivery_status object
order_total float64
payment_method object
delivery_partner_id int64
store_id int64

'Shape of DataFrame:'
(5000, 10)
'Data Types:'
order_id                    int64
customer_id                 int64
order_date                 object
promised_delivery_time     object
actual_delivery_time       object
delivery_status            object
order_total               float64
payment_method             object
delivery_partner_id         int64
store_id                    int64
dtype: object
'First 5 Rows:'
order_id customer_id order_date promised_delivery_time actual_delivery_time delivery_status order_total payment_method delivery_partner_id store_id
0 1961864118 30065862 2024-07-17 08:34:01 2024-07-17 08:52:01 2024-07-17 08:47:01 On Time 3197.07 Cash 63230 4771
1 1549769649 9573071 2024-05-28 13:14:29 2024-05-28 13:25:29 2024-05-28 13:27:29 On Time 976.55 Cash 14983 7534
2 9185164487 45477575 2024-09-23 13:07:12 2024-09-23 13:25:12 2024-09-23 13:29:12 On Time 839.05 UPI 39859 9886
3 9644738826 88067569 2023-11-24 16:16:56 2023-11-24 16:34:56 2023-11-24 16:33:56 On Time 440.23 Card 61497 7917
4 5427684290 83298567 2023-11-20 05:00:39 2023-11-20 05:17:39 2023-11-20 05:18:39 On Time 2526.68 Cash 84315 2741
Missing Values:
order_id                  0
customer_id               0
order_date                0
promised_delivery_time    0
actual_delivery_time      0
delivery_status           0
order_total               0
payment_method            0
delivery_partner_id       0
store_id                  0
dtype: int64
order_id                  0
customer_id               0
order_date                0
promised_delivery_time    0
actual_delivery_time      0
delivery_status           0
order_total               0
payment_method            0
delivery_partner_id       0
store_id                  0
dtype: int64
Number of Duplicated Rows: 0
Descriptive Statistics:
order_id customer_id order_date promised_delivery_time actual_delivery_time delivery_status order_total payment_method delivery_partner_id store_id
count 5.000000e+03 5.000000e+03 5000 5000 5000 5000 5000.00000 5000 5000.000000 5000.000000
unique NaN NaN 5000 4999 5000 3 NaN 4 NaN NaN
top NaN NaN 2024-07-17 08:34:01 2024-10-13 14:06:50 2024-07-17 08:47:01 On Time NaN Card NaN NaN
freq NaN NaN 1 2 1 3470 NaN 1285 NaN NaN
mean 5.029129e+09 5.009685e+07 NaN NaN NaN NaN 2201.86170 NaN 50050.318200 4999.689000
std 2.863533e+09 2.919082e+07 NaN NaN NaN NaN 1303.02438 NaN 28802.276922 2886.089242
min 6.046500e+04 3.181300e+04 NaN NaN NaN NaN 13.25000 NaN 43.000000 1.000000
25% 2.531421e+09 2.404314e+07 NaN NaN NaN NaN 1086.21500 NaN 24928.500000 2509.250000
50% 5.074378e+09 4.997808e+07 NaN NaN NaN NaN 2100.69000 NaN 50262.500000 4987.000000
75% 7.488579e+09 7.621215e+07 NaN NaN NaN NaN 3156.88250 NaN 74478.250000 7500.750000
max 9.998298e+09 9.989390e+07 NaN NaN NaN NaN 6721.46000 NaN 99968.000000 9995.000000
Categorical Column Summary:
order_date promised_delivery_time actual_delivery_time delivery_status payment_method
count 5000 5000 5000 5000 5000
unique 5000 4999 5000 3 4
top 2024-07-17 08:34:01 2024-10-13 14:06:50 2024-07-17 08:47:01 On Time Card
freq 1 2 1 3470 1285

 Rows with Invalid (≤0) Order Total:
 Empty DataFrame
Columns: [order_id, customer_id, order_date, promised_delivery_time, actual_delivery_time, delivery_status, order_total, payment_method, delivery_partner_id, store_id]
Index: []

Rows with Invalid Delivery Times (Before Order Date):
         order_id  customer_id      order_date promised_delivery_time  \
1281  2148242015     39093069  30-04-23 23:48    2023-01-05 00:04:00   
1563  7821942520     84468124  12-02-24 23:57    2024-02-13 00:12:00   
2933  3368317718     70867736  31-05-24 23:59    2024-01-06 00:17:00   

     actual_delivery_time        delivery_status  order_total payment_method  \
1281  2023-01-05 00:33:00  Significantly Delayed      3239.66            UPI   
1563  2024-02-13 00:13:00                On Time      2732.91         Wallet   
2933  2024-01-06 00:12:00                On Time       914.88            UPI   

      delivery_partner_id  store_id  
1281                13676      9904  
1563                11136      5805  
2933                96055       522  
Unique Value Counts per Column:
order_id                  5000
customer_id               2172
order_date                5000
promised_delivery_time    4999
actual_delivery_time      5000
delivery_status              3
order_total               4550
payment_method               4
delivery_partner_id       5000
store_id                  5000
dtype: int64

References

Vaghasiya, Akshit. 2025. “Blinkit Sales Dataset.” https://www.kaggle.com/datasets/akxiit/blinkit-sales-dataset.