The Cost of a Minute: How Delivery Delays Define Customer Value

Customer Segmentation Analysis on the Blinkit Sales Dataset

Authors: Utkarsh Tripathi, Juwaria Qadri, Mohammed Omar, Merin Ann Cherian

Affiliation: BITS Pilani

Published

December 12, 2025

1. Project Overview

The notebook is divided into the following components:

  • Data Loading: Importing the dataset.
  • Data Validation: Initial checks for structure and quality.
  • Data Cleaning: Preprocessing and handling missing values.
  • EDA: Exploratory Data Analysis to uncover patterns.
  • Feature Engineering: Creating new features for modelling.
  • Cluster Modelling: Applying K-Means clustering.
  • Basic Inferences: Deriving insights from the clusters.
Code
# Library imports
# Data Analytics
import pandas as pd
import numpy as np
# Data Viz
import seaborn as sns
import matplotlib.pyplot as plt
# scikit-learn
from sklearn.preprocessing import StandardScaler, RobustScaler, MinMaxScaler, OneHotEncoder
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

2. Data Loading

Loading the CSV dataset into a pandas DataFrame.

# Dataset loading

input_path = "./blinkit_orders.csv"
df = pd.read_csv(input_path)

3. Data Validation

Basic Data Exploration & Summary

The following functions are used to inspect the dataset structure and summary statistics:

  • basic_exploration(df): Prints shape, top rows, info, null counts, unique values, and duplicates.
  • summarize_df(df): Provides descriptive statistics and checks for anomalies (e.g., negative order totals).

Data Validation Functions
def basic_exploration(df):
    print("Shape of DataFrame:")
    print(df.shape)

    # Display top 5 rows
    print("Top 5 Rows:")
    display(df.head())

    # Basic info about the data
    print("Basic Information:")
    print(df.info())

    # Count null values and describe statistics
    print("Count of Null Values:")
    print(df.isnull().sum())

    print("Unique Value Counts per Column:", df.nunique(), sep="\n")

    print("Number of Duplicated Rows:", df.duplicated().sum())

def summarize_df(df):
    # Checking for anomalies
    display(df.describe(include='all'))

    # Categorical column summary
    display(df.describe(include=['object']))

    if 'order_total' in df.columns:
        invalid_order_total = df[df['order_total'] <= 0]
        print("Rows with Invalid (≤0) Order Total:")
        display(invalid_order_total)

basic_exploration(df)
summarize_df(df)
Shape of DataFrame:
(5000, 10)
Top 5 Rows:
order_id customer_id order_date promised_delivery_time actual_delivery_time delivery_status order_total payment_method delivery_partner_id store_id
0 1961864118 30065862 2024-07-17 08:34:01 2024-07-17 08:52:01 2024-07-17 08:47:01 On Time 3197.07 Cash 63230 4771
1 1549769649 9573071 2024-05-28 13:14:29 2024-05-28 13:25:29 2024-05-28 13:27:29 On Time 976.55 Cash 14983 7534
2 9185164487 45477575 2024-09-23 13:07:12 2024-09-23 13:25:12 2024-09-23 13:29:12 On Time 839.05 UPI 39859 9886
3 9644738826 88067569 2023-11-24 16:16:56 2023-11-24 16:34:56 2023-11-24 16:33:56 On Time 440.23 Card 61497 7917
4 5427684290 83298567 2023-11-20 05:00:39 2023-11-20 05:17:39 2023-11-20 05:18:39 On Time 2526.68 Cash 84315 2741
Basic Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 10 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   order_id                5000 non-null   int64  
 1   customer_id             5000 non-null   int64  
 2   order_date              5000 non-null   object 
 3   promised_delivery_time  5000 non-null   object 
 4   actual_delivery_time    5000 non-null   object 
 5   delivery_status         5000 non-null   object 
 6   order_total             5000 non-null   float64
 7   payment_method          5000 non-null   object 
 8   delivery_partner_id     5000 non-null   int64  
 9   store_id                5000 non-null   int64  
dtypes: float64(1), int64(4), object(5)
memory usage: 390.8+ KB
None
Count of Null Values:
order_id                  0
customer_id               0
order_date                0
promised_delivery_time    0
actual_delivery_time      0
delivery_status           0
order_total               0
payment_method            0
delivery_partner_id       0
store_id                  0
dtype: int64
Unique Value Counts per Column:
order_id                  5000
customer_id               2172
order_date                5000
promised_delivery_time    4999
actual_delivery_time      5000
delivery_status              3
order_total               4550
payment_method               4
delivery_partner_id       5000
store_id                  5000
dtype: int64
Number of Duplicated Rows: 0
order_id customer_id order_date promised_delivery_time actual_delivery_time delivery_status order_total payment_method delivery_partner_id store_id
count 5.000000e+03 5.000000e+03 5000 5000 5000 5000 5000.00000 5000 5000.000000 5000.000000
unique NaN NaN 5000 4999 5000 3 NaN 4 NaN NaN
top NaN NaN 2023-08-23 12:04:18 2024-10-13 14:06:50 2023-08-23 12:21:18 On Time NaN Card NaN NaN
freq NaN NaN 1 2 1 3470 NaN 1285 NaN NaN
mean 5.029129e+09 5.009685e+07 NaN NaN NaN NaN 2201.86170 NaN 50050.318200 4999.689000
std 2.863533e+09 2.919082e+07 NaN NaN NaN NaN 1303.02438 NaN 28802.276922 2886.089242
min 6.046500e+04 3.181300e+04 NaN NaN NaN NaN 13.25000 NaN 43.000000 1.000000
25% 2.531421e+09 2.404314e+07 NaN NaN NaN NaN 1086.21500 NaN 24928.500000 2509.250000
50% 5.074378e+09 4.997808e+07 NaN NaN NaN NaN 2100.69000 NaN 50262.500000 4987.000000
75% 7.488579e+09 7.621215e+07 NaN NaN NaN NaN 3156.88250 NaN 74478.250000 7500.750000
max 9.998298e+09 9.989390e+07 NaN NaN NaN NaN 6721.46000 NaN 99968.000000 9995.000000
order_date promised_delivery_time actual_delivery_time delivery_status payment_method
count 5000 5000 5000 5000 5000
unique 5000 4999 5000 3 4
top 2023-08-23 12:04:18 2024-10-13 14:06:50 2023-08-23 12:21:18 On Time Card
freq 1 2 1 3470 1285
Rows with Invalid (≤0) Order Total:
order_id customer_id order_date promised_delivery_time actual_delivery_time delivery_status order_total payment_method delivery_partner_id store_id

4. Data Cleaning

The data cleaning process involves several key steps to ensure data quality:

  1. Drop Irrelevant Columns: Removing columns that do not contribute to analysis.
  2. Remove High Missing Value Columns: Dropping features with >50% missing data.
  3. Impute Missing Values:
    • Numerical: Median
    • Categorical: Mode
    • Others: Forward Fill
  4. Remove Duplicates: Eliminating identical rows.
  5. Handle Outliers: Capping order_total using IQR (Interquartile Range).
  6. Date Conversions: Parsing date columns to datetime objects.
  7. Feature Creation: Calculating delivery_delay_minutes.
  8. Type Conversion: Converting object columns to categorical types for efficiency.
Data Cleaning Functions
def drop_irrelevant_columns(df, irrelevant_cols):
    """Drop columns with irrelevant information."""
    for col in irrelevant_cols:
        if col in df.columns:
            df.drop(columns=[col], inplace=True)

def drop_high_missing_value_columns(df, threshold=0.5):
    """Remove columns with missing values above a certain threshold."""
    cols_to_drop = [col for col in df.columns if df[col].isnull().mean() >= threshold]
    df.drop(columns=cols_to_drop, inplace=True)

def impute_missing_values(df):
    """Impute missing values based on the data type of each column."""
    for col in df.columns:
        if df[col].dtype in ['float64', 'int64']:
            df[col] = df[col].fillna(df[col].median())
        elif df[col].dtype == 'object':
            df[col] = df[col].fillna(df[col].mode()[0])
        else:
            df[col] = df[col].ffill()

def remove_duplicates(df):
    """Remove duplicate rows."""
    before = df.shape[0]
    df.drop_duplicates(inplace=True)
    print(f"Removed {before - df.shape[0]} duplicate rows")

def handle_outliers(df, column='order_total'):
    """Handle outliers using IQR capping."""
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    df.loc[df[column] < lower_bound, column] = lower_bound
    df.loc[df[column] > upper_bound, column] = upper_bound

def convert_to_datetime(df, columns):
    """Convert date/time columns to datetime."""
    for col in columns:
        if col in df.columns:
            df[col] = pd.to_datetime(df[col], dayfirst=False, errors='coerce')

def calculate_delivery_delay(df):
    """Calculate delivery delay (in minutes)."""
    df['delivery_delay_minutes'] = (df['actual_delivery_time'] -
                                    df['promised_delivery_time']).dt.total_seconds() / 60

def convert_categorical_like_columns(df, columns):
    """Convert categorical-like columns to category type."""
    for col in columns:
        if col in df.columns:
            df[col] = df[col].astype('category')

def save_cleaned_dataset(df, filename=None):
    """Report cleaning results and optionally save the cleaned dataset to a CSV file."""
    if filename is not None:
        df.to_csv(filename, index=False)
    print("Data cleaning completed successfully!")
    print("Final Data Shape:", df.shape)
    print("Data Types:", df.dtypes, sep='\n')
    return df

Execution of Data Cleaning

We apply the defined cleaning pipeline to our dataframe.

Code
# List of irrelevant columns (example placeholder)
irrelevant_cols = ['irrelevant_column']

# Columns with high missing values threshold
threshold_high_missing_values = 0.5

# Categorical-like columns to convert
categorical_like_columns = ['payment_method', 'delivery_status']

drop_irrelevant_columns(df, irrelevant_cols)
drop_high_missing_value_columns(df, threshold=threshold_high_missing_values)
impute_missing_values(df)
remove_duplicates(df)
handle_outliers(df)
convert_to_datetime(df, columns=['order_date','actual_delivery_time', 'promised_delivery_time'])
calculate_delivery_delay(df)
convert_categorical_like_columns(df, categorical_like_columns)

# Save the cleaned dataset
df_cleaned = save_cleaned_dataset(df)
Removed 0 duplicate rows
Data cleaning completed successfully!
Final Data Shape: (5000, 11)
Data Types:
order_id                           int64
customer_id                        int64
order_date                datetime64[ns]
promised_delivery_time    datetime64[ns]
actual_delivery_time      datetime64[ns]
delivery_status                 category
order_total                      float64
payment_method                  category
delivery_partner_id                int64
store_id                           int64
delivery_delay_minutes           float64
dtype: object

5. Exploratory Data Analysis (EDA)

We explore the dataset to understand key distributions, relationships, and customer behaviors.

Statistical Summary

We begin with a statistical overview of the dataset columns.

  • Numerical Columns: Descriptive stats for order totals and delivery delays.
  • Correlation: Checking the relationship between spending and delays.
  • Categorical Columns: Analyzing distributions of payment methods and delivery status.
  • Store Performance: Comparing delivery performance across stores.
  • Daily Trends: Aggregating sales over time.
EDA Helper Functions
def print_numerical_columns(df):
    """Print numerical columns with descriptive stats."""
    print("Numerical Columns")
    print("Descriptive stats", df[['order_total', 'delivery_delay_minutes']].describe(), sep="\n")
    print("\nDistribution of Spending", df[['order_total']].quantile([0.25, 0.5, 0.75, 0.95]), sep="\n")

def print_correlation(df):
    """Print correlation between numerical columns."""
    print("Correlation between order total and delivery delay in minutes",
          df[['order_total', 'delivery_delay_minutes']].corr(), sep="\n")

def print_categorical_columns(df):
    """Print categorical columns with distribution overview."""
    print("\nCategorical Columns")
    print("Payment Methods Distribution", df['payment_method'].value_counts(), sep="\n")
    print("Delivery Status Distribution",
          (df['delivery_status'].value_counts(normalize=True) * 100).round(2), sep="\n")

def filter_significantly_delayed_orders(df):
    """Filter and print significantly delayed orders."""
    significantly_delayed_orders = df[df['delivery_status'] == 'Significantly Delayed']
    count_delayed = significantly_delayed_orders.shape[0]
    print("Significantly delayed orders:", sep='\n')
    display(significantly_delayed_orders.head(5))
    display(count_delayed)

def generate_customer_summary(df):
    """Generate a summary table for each customer."""
    customer_summary = df.groupby('customer_id').agg({
        'order_total': 'sum',
        'order_id': 'count',
        'delivery_delay_minutes': 'mean'
    }).rename(columns={
        'order_total': 'total_spent',
        'order_id': 'order_count',
        'delivery_delay_minutes': 'avg_delivery_delay'
    }).sort_values(by='order_count', ascending=False)
    print("Summary Table for Each Customer:", sep='\n')
    display(customer_summary.head(10))

def generate_payment_method_summary(df):
    """Generate a summary table for payment methods."""
    payment_summary_df = df.groupby('payment_method', observed=True).agg({
        'order_total': 'mean',
        'order_id': 'count'
    }).rename(columns={
        'order_total': 'avg_order_value',
        'order_id': 'num_orders'
    })
    print("Payment Method Summary:")
    display(payment_summary_df)

def store_wise_order_total_by_delivery_status(df):
    """Generate a pivot table for order total by delivery status per store."""
    pivot_table = pd.pivot_table(df,
                                 index='store_id',
                                 columns='delivery_status',
                                 values='order_total',
                                 aggfunc='sum',
                                 fill_value=0,
                                 observed=False)
    print("Store-Wise Order Total by Delivery Status:")
    display(pivot_table)

def daily_sales_patterns(df):
    """Generate daily sales patterns."""
    daily_sales_df = df.set_index('order_date').resample('D').agg({
        'order_total': 'sum',
        'order_id': 'count'
    }).rename(columns={
        'order_total': 'total_sales',
        'order_id': 'num_orders'
    }).sort_index(ascending=False)  # single-level datetime index; level=1 would raise
    print("Daily Sales Patterns:")
    display(daily_sales_df.head(10))
    return daily_sales_df
Code
print_numerical_columns(df_cleaned)
print_correlation(df_cleaned)
print_categorical_columns(df_cleaned)
filter_significantly_delayed_orders(df_cleaned)
generate_customer_summary(df_cleaned)
generate_payment_method_summary(df_cleaned)
store_wise_order_total_by_delivery_status(df_cleaned)
daily_sales_patterns_df = daily_sales_patterns(df_cleaned)
Numerical Columns
Descriptive stats
       order_total  delivery_delay_minutes
count  5000.000000             5000.000000
mean   2201.674720                4.443000
std    1302.416242                8.063929
min      13.250000               -5.000000
25%    1086.215000               -1.000000
50%    2100.690000                2.000000
75%    3156.882500                8.000000
max    6262.883750               30.000000

Distribution of Spending
      order_total
0.25    1086.2150
0.50    2100.6900
0.75    3156.8825
0.95    4452.7910
Correlation between order total and delivery delay in minutes
                        order_total  delivery_delay_minutes
order_total                1.000000               -0.002591
delivery_delay_minutes    -0.002591                1.000000

Categorical Columns
Payment Methods Distribution
payment_method
Card      1285
Cash      1257
Wallet    1244
UPI       1214
Name: count, dtype: int64
Delivery Status Distribution
delivery_status
On Time                  69.40
Slightly Delayed         20.74
Significantly Delayed     9.86
Name: proportion, dtype: float64
Significantly delayed orders:
order_id customer_id order_date promised_delivery_time actual_delivery_time delivery_status order_total payment_method delivery_partner_id store_id delivery_delay_minutes
23 6905293278 97259564 2024-06-16 10:23:23 2024-06-16 10:36:23 2024-06-16 11:02:23 Significantly Delayed 1404.68 Cash 53868 4187 26.0
31 9073572548 97111069 2023-05-02 23:13:42 2023-05-02 23:25:42 2023-05-02 23:47:42 Significantly Delayed 1091.89 Card 28575 6542 22.0
62 2149289255 57579640 2024-11-01 15:00:29 2024-11-01 15:20:29 2024-11-01 15:39:29 Significantly Delayed 2687.31 Cash 25251 7315 19.0
70 2122832513 55028958 2024-02-17 19:48:14 2024-02-17 19:58:14 2024-02-17 20:18:14 Significantly Delayed 2037.94 UPI 39396 5245 20.0
76 5127172657 32293017 2024-06-11 18:11:31 2024-06-11 18:30:31 2024-06-11 18:50:31 Significantly Delayed 2185.50 Cash 80034 288 20.0
493
Summary Table for Each Customer:
total_spent order_count avg_delivery_delay
customer_id
77869660 19052.94 9 5.666667
17805991 18409.90 8 4.625000
8791577 19028.36 8 7.750000
93018527 9521.09 7 0.142857
12832151 17178.65 7 6.000000
75213636 10038.08 7 1.142857
10562528 11050.11 7 2.714286
13604883 14002.59 7 3.000000
21701991 11648.60 7 5.571429
25128143 17638.83 7 5.428571
Payment Method Summary:
avg_order_value num_orders
payment_method
Card 2229.496449 1285
Cash 2204.028632 1257
UPI 2189.458323 1214
Wallet 2182.479317 1244
Store-Wise Order Total by Delivery Status:
delivery_status On Time Significantly Delayed Slightly Delayed
store_id
1 1172.69 0.00 0.00
2 22.14 0.00 0.00
9 1599.11 0.00 0.00
12 1639.04 0.00 0.00
14 0.00 704.19 0.00
... ... ... ...
9984 3327.67 0.00 0.00
9987 1052.23 0.00 0.00
9989 0.00 3706.10 0.00
9991 2650.68 0.00 0.00
9995 0.00 0.00 2989.97

5000 rows × 3 columns

Daily Sales Patterns:
total_sales num_orders
order_date
2024-11-04 13820.12 4
2024-11-03 28761.95 12
2024-11-02 10794.55 5
2024-11-01 14781.03 9
2024-10-31 25689.95 8
2024-10-30 23835.14 13
2024-10-29 11223.88 7
2024-10-28 17221.29 8
2024-10-27 11797.07 5
2024-10-26 23652.49 11

Visualisation

We perform detailed visualisation to validate our statistical findings.

  • Univariate Analysis: Histograms and boxplots for individual features.
  • Bivariate & Multivariate: Scatter plots to explore interaction effects (e.g., Order Total vs Delivery Delay).
  • Outlier Detection: Visualizing and removing extreme outliers in delivery delays.
Visualization Functions
def univariate_numeric_features(df):
    """Univariate analysis for numeric features."""
    # Order Total Distribution
    sns.histplot(df['order_total'], kde=True, bins=50)
    plt.title('Order Total Distribution')
    plt.xlabel('Order Total (₹)')
    plt.ylabel('Number of Orders')
    plt.show()

    # Delivery Delay Distribution
    sns.histplot(df['delivery_delay_minutes'], bins=10)
    plt.title('Delivery Delay Distribution (in minutes)')
    plt.xlabel('Delivery Delay (minutes)')
    plt.ylabel('Number of Orders')
    plt.show()

def univariate_categorical_features(df):
    """Univariate analysis for categorical features."""
    # Delivery Status Breakdown
    delivery_counts = df['delivery_status'].value_counts()
    plt.figure(figsize=(6, 6))
    plt.pie(delivery_counts, labels=delivery_counts.index, autopct='%1.1f%%', colors=['green', 'orange', 'red'])
    plt.title("Delivery Status Breakdown")
    plt.show()

    # Payment Method Count
    sns.countplot(x='payment_method', data=df)
    plt.title('Payment Method Count')
    plt.xlabel('Payment Method')
    plt.ylabel('Number of Orders')
    plt.show()

def univariate_customer_metrics(df):
    """Univariate analysis for customer metrics."""
    # Customer Frequency Distribution
    order_frequency_dist = df.groupby('customer_id').agg(
        total_orders=('order_id', 'count'),  # Total orders
        total_spent=('order_total', 'sum'),  # Lifetime value
        avg_order_value=('order_total', 'mean')  # Average order value
    )['total_orders'].value_counts().sort_index()

    print(order_frequency_dist)

    plt.figure(figsize=(8, 5))
    plt.bar(order_frequency_dist.index, order_frequency_dist.values, color='skyblue')
    plt.xlabel("Number of Orders")
    plt.ylabel("Number of Customers")
    plt.title("Customer Order Frequency Distribution")
    plt.show()

def univariate_one_time_vs_repeat_customers(df):
    """Univariate analysis for one-time vs repeat customers."""
    customer_order_counts = df.groupby('customer_id').size()
    print(f"One-time customers: {customer_order_counts[customer_order_counts == 1].count()}")
    print(f"Repeat customers: {customer_order_counts[customer_order_counts > 1].count()}")

def bivariate_analysis(df):
    """Bivariate analysis."""
    # Compare order amounts for On Time vs. Delayed deliveries
    plt.figure(figsize=(8, 5))
    sns.boxplot(x='delivery_status', y='order_total', data=df)
    plt.title("Order Total by Delivery Status")
    plt.xlabel("Delivery Status")
    plt.ylabel("Order Total (₹)")
    plt.show()

    # Customer Spending Patterns
    customer_metrics = df.groupby('customer_id').agg(
        total_orders=('order_id', 'count'),
        total_spent=('order_total', 'sum'),
        avg_order_value=('order_total', 'mean')
    )

    sns.scatterplot(
        data=customer_metrics,
        x='total_orders',
        y='avg_order_value',
        hue='total_orders',
        palette='coolwarm',
        alpha=0.7
    )
    plt.title("Average Order Value vs. Total Orders")
    plt.xlabel("Total Orders")
    plt.ylabel("Average Order Value (₹)")
    plt.show()

    # Aggregate total spend per customer and visualize
    customer_spending = df.groupby('customer_id')['order_total'].sum()

    sns.boxplot(x=customer_spending)
    plt.title("Customer Total Spending Distribution")
    plt.xlabel("Total Spend (₹)")
    plt.show()

    plt.figure(figsize=(8, 5))
    plt.hist(customer_spending, bins=20, color='skyblue', edgecolor='black')
    plt.title("Customer Spending Distribution")
    plt.xlabel("Total Spend (₹)")
    plt.ylabel("Number of Customers")
    plt.show()

def multivariate_analysis(df):
    """Multivariate analysis."""
    # Relationship Between Order Total and Delivery Delay
    plt.figure(figsize=(12, 5))
    sns.scatterplot(
        x='order_total',
        y='delivery_delay_minutes',
        hue='payment_method',
        data=df,
        alpha=0.7,
        palette='pastel'
    )
    plt.title("Order Total vs Delivery Delay by Payment Method")
    plt.xlabel("Order Total (₹)")
    plt.ylabel("Delivery Delay (minutes)")
    plt.legend(title="Payment Method")
    plt.show()

def boxplot_delivery_delays(df):
    """Plot a boxplot of delivery delays."""
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df['delivery_delay_minutes'], color='green')
    plt.title("Boxplot of Delivery Delays")
    plt.xlabel("Delivery Delay (minutes)")
    plt.show()

def violin_plot_delivery_delays(df):
    """Plot a violin plot of delivery delays."""
    plt.figure(figsize=(6, 4))
    sns.violinplot(x=df['delivery_delay_minutes'], color='lightcoral')
    plt.title("Violin Plot of Delivery Delays")
    plt.xlabel("Delivery Delay (minutes)")
    plt.show()

def remove_outliers_delivery_delays(df):
    """Remove outliers from delivery delays and plot the boxplot."""
    # Calculate Q1, Q3, and IQR
    Q1 = df['delivery_delay_minutes'].quantile(0.25)
    Q3 = df['delivery_delay_minutes'].quantile(0.75)
    IQR = Q3 - Q1

    # Filter out outliers
    df_cleaned = df[(df['delivery_delay_minutes'] >= (Q1 - 1.5 * IQR)) &
                    (df['delivery_delay_minutes'] <= (Q3 + 1.5 * IQR))]

    # Plot boxplot without outliers
    plt.figure(figsize=(6, 4))
    sns.boxplot(x=df_cleaned['delivery_delay_minutes'], color='lightgreen')
    plt.title("Boxplot of Delivery Delays (Outliers Removed)")
    plt.xlabel("Delivery Delay (minutes)")
    plt.show()
    return df_cleaned
Code
univariate_numeric_features(df_cleaned)
univariate_categorical_features(df_cleaned)
univariate_customer_metrics(df_cleaned)
univariate_one_time_vs_repeat_customers(df_cleaned)
bivariate_analysis(df_cleaned)
multivariate_analysis(df_cleaned)
boxplot_delivery_delays(df_cleaned)
violin_plot_delivery_delays(df_cleaned)

total_orders
1    680
2    686
3    451
4    234
5     82
6     28
7      8
8      2
9      1
Name: count, dtype: int64

One-time customers: 680
Repeat customers: 1492

The delivery delay data shows significant outliers. We apply IQR filtering to clean this distribution.

Code
df_cleaned = remove_outliers_delivery_delays(df_cleaned)

# Save the cleaned dataset again after outlier removal
# df_cleaned = save_cleaned_dataset(df_cleaned, 'cleaned_data.csv')

Final Data Inspection

Code
df_cleaned.head(5)
order_id customer_id order_date promised_delivery_time actual_delivery_time delivery_status order_total payment_method delivery_partner_id store_id delivery_delay_minutes
0 1961864118 30065862 2024-07-17 08:34:01 2024-07-17 08:52:01 2024-07-17 08:47:01 On Time 3197.07 Cash 63230 4771 -5.0
1 1549769649 9573071 2024-05-28 13:14:29 2024-05-28 13:25:29 2024-05-28 13:27:29 On Time 976.55 Cash 14983 7534 2.0
2 9185164487 45477575 2024-09-23 13:07:12 2024-09-23 13:25:12 2024-09-23 13:29:12 On Time 839.05 UPI 39859 9886 4.0
3 9644738826 88067569 2023-11-24 16:16:56 2023-11-24 16:34:56 2023-11-24 16:33:56 On Time 440.23 Card 61497 7917 -1.0
4 5427684290 83298567 2023-11-20 05:00:39 2023-11-20 05:17:39 2023-11-20 05:18:39 On Time 2526.68 Cash 84315 2741 1.0
Code
df_cleaned['delivery_delay_minutes'].describe()
count    4703.000000
mean        3.099086
std         6.188961
min        -5.000000
25%        -2.000000
50%         2.000000
75%         6.000000
max        21.000000
Name: delivery_delay_minutes, dtype: float64

6. Feature Engineering

We transform raw transactional data into customer-centric features to prepare for clustering.

Key Transformation Steps

  1. Aggregation: Grouping by customer_id to calculate metrics:
    • avg_order_total: Mean value of orders.
    • total_spent: Sum of all orders.
    • num_orders: Count of orders.
    • avg_delay: Mean delivery delay.
    • most_used_payment: Mode of payment method.
    • most_delivery_status: Mode of delivery status.
  2. Encoding: One-hot encoding categorical variables (most_used_payment, most_delivery_status).
  3. Scaling & Normalization:
    • StandardScaler: For avg_order_total.
    • Log Transform + StandardScaler: For total_spent to handle skewness.
    • RobustScaler: For avg_delay to handle outliers.
    • MinMaxScaler: For num_orders, scaling to the range [-1, 1].
Feature Engineering Functions
def aggregate_customer_data(df_cleaned):
    # Aggregating customer data
    customer_df = df_cleaned.groupby('customer_id').agg({
        'order_total': ['mean', 'sum', 'count'],
        'delivery_delay_minutes': 'mean',
        'payment_method': lambda x: x.mode()[0],
        'delivery_status': lambda x: x.mode()[0]
    }).reset_index()

    # Renaming columns
    customer_df.columns = [
        'customer_id', 'avg_order_total', 'total_spent', 'num_orders',
        'avg_delay', 'most_used_payment', 'most_delivery_status'
    ]
    return customer_df

def one_hot_encode_categorical_features(df):
    # One-hot encoding categorical features separately
    df_temp = df.copy()
    one_hot = pd.get_dummies(df_temp[['most_used_payment', 'most_delivery_status']], prefix=['payment', 'status'])
    df_final = pd.concat([df_temp.drop(columns=['most_used_payment', 'most_delivery_status']), one_hot], axis=1)
    return df_final

def handle_numerical_features(df):
    # Standardize avg_order_total
    scaler = StandardScaler()
    df['avg_order_total'] = scaler.fit_transform(df[['avg_order_total']])

    # Log transformation for total_spent to fix skewness
    df['total_spent'] = np.log1p(df['total_spent'])
    scaler = StandardScaler()
    df['total_spent'] = scaler.fit_transform(df[['total_spent']])

    # Robust scaling for avg_delay
    scaler = RobustScaler()
    df['avg_delay'] = scaler.fit_transform(df[['avg_delay']])

    # MinMax scaling for num_orders
    scaler = MinMaxScaler(feature_range=(-1, 1))
    df['num_orders'] = scaler.fit_transform(df[['num_orders']])

    return df

def process_customer_data(df_cleaned):
    customer_df = aggregate_customer_data(df_cleaned)
    customer_df_original = customer_df.copy()  # keep an unscaled copy for later cluster profiling
    one_hot_encoded_df = one_hot_encode_categorical_features(customer_df)
    processed_df = handle_numerical_features(one_hot_encoded_df)

    # Visualizations
    plot_correlation_matrix(processed_df)
    plot_histogram(processed_df['avg_order_total'])
    plot_histogram(processed_df['total_spent'])
    plot_histogram(processed_df['avg_delay'])
    plot_histogram(processed_df['num_orders'])

    return processed_df, customer_df_original

def plot_correlation_matrix(df):
    plt.figure(figsize=(10, 8))
    sns.heatmap(df.corr(), cmap='coolwarm', annot=False)
    plt.title("Feature Correlation Matrix")
    plt.show()

def plot_histogram(feature):
    plt.hist(feature, bins=150)
    plt.xlabel(feature.name)
    plt.ylabel('Frequency')
    plt.title(f'Histogram of {feature.name}')
    plt.show()

Execution

We process the data and visualize the distribution of updated features.

Code
customer_df, customer_df_original = process_customer_data(df_cleaned)

# Final Scaling adjustments (redundant but included for consistency with original script)
scaler = RobustScaler()
customer_df['avg_delay'] = scaler.fit_transform(customer_df[['avg_delay']])

scaler = MinMaxScaler(feature_range=(-1, 1))
customer_df['num_orders'] = scaler.fit_transform(customer_df[['num_orders']])
customer_df[customer_df.select_dtypes('bool').columns] = customer_df.select_dtypes('bool').astype(int)

Dataset Finalization

We remove identifiers and apply a final standard scaling across all features. K-Means relies on Euclidean distances, so leaving features on different scales would let the larger-scaled ones dominate the clustering.

Code
customer_df = customer_df.drop(['customer_id'], axis=1)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(customer_df)

# Save processed data
# customer_df.to_csv('customer_df_processed.csv')

# Preview
customer_df.head(5)
avg_order_total total_spent num_orders avg_delay payment_Card payment_Cash payment_UPI payment_Wallet status_On Time status_Significantly Delayed status_Slightly Delayed
0 0.663119 0.513699 -0.714286 0.909091 0 0 1 0 1 0 0
1 0.418142 0.865678 -0.428571 0.515152 1 0 0 0 1 0 0
2 1.482011 1.572998 -0.142857 0.954545 0 0 1 0 0 0 1
3 0.394492 0.402101 -0.714286 0.454545 0 1 0 0 1 0 0
4 -0.144778 0.137538 -0.714286 2.181818 1 0 0 0 0 0 1

7. Cluster Modelling

We employ K-Means Clustering to segment customers. To determine the optimal number of segments (k), we evaluate several candidate values of k against three standard cluster-quality metrics.

Methodology

  1. Iterative Clustering: Run K-Means for k from 2 to 5.

  2. Dimensionality Reduction: Use PCA (Principal Component Analysis) to visualize high-dimensional clusters in 2D.

  3. Evaluation Metrics: To rigorously assess cluster quality, we analyze the evolution of three distinct metrics as k varies:

    • Silhouette Score: Measures the mean distance between a sample and all other points in the same cluster (\(a\)) versus the mean distance to points in the nearest neighboring cluster (\(b\)). The score is given by \((b - a) / \max(a, b)\) (see the worked sketch after this list).
      • Evolution: We look for a peak. A decrease as k increases suggests that clusters are becoming less distinct or that natural clusters are being artificially split.
    • Davies-Bouldin Index: Represents the average similarity between each cluster and its most similar one, where similarity is the ratio of within-cluster scatter to between-cluster separation.
      • Evolution: We look for a minimum. A rising trend indicates that new clusters are overlapping more or becoming less compact relative to their separation.
    • Calinski-Harabasz Score: Also known as the Variance Ratio Criterion, it is the ratio of between-cluster dispersion to within-cluster dispersion, so higher values indicate denser, better-separated clusters.
      • Evolution: We look for a peak. A drop implies that the gain in definition from adding another cluster does not justify the increased complexity (variance).
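
To make these definitions concrete, here is a minimal sketch on a hypothetical two-cluster toy dataset (not the Blinkit data) that computes the silhouette value of one sample by hand and checks it, along with the other two metrics, against scikit-learn:

Code
# Toy example (hypothetical data): silhouette by hand vs scikit-learn
import numpy as np
from sklearn.metrics import (silhouette_score, silhouette_samples,
                             davies_bouldin_score, calinski_harabasz_score)

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])

# For sample 0: a = mean distance within its own cluster,
#               b = mean distance to the nearest other cluster
a = np.linalg.norm(X[0] - X[1])
b = np.mean([np.linalg.norm(X[0] - X[i]) for i in (2, 3)])
print("manual s(0):", round((b - a) / max(a, b), 4))
print("sklearn s(0):", round(silhouette_samples(X, labels)[0], 4))  # should match

print("Silhouette:", round(silhouette_score(X, labels), 4))                # higher is better
print("Davies-Bouldin:", round(davies_bouldin_score(X, labels), 4))        # lower is better
print("Calinski-Harabasz:", round(calinski_harabasz_score(X, labels), 4))  # higher is better
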
Clustering Functions
def perform_kmeans_clustering(X_scaled, k_values):
    """
    Execute kmeans clustering algorithm and log the results.
    """
    silhouette_scores = []
    dbi_scores = []
    ch_scores = []
    cluster_data = pd.DataFrame()

    for k in k_values:
        # Perform KMeans clustering
        kmeans = KMeans(n_clusters=k, random_state=42)

        # PCA projection, used for 2-D visualization only
        pca = PCA(n_components=2)
        pca_data = pca.fit_transform(X_scaled)
        clusters = kmeans.fit_predict(X_scaled)  # cluster on the scaled features, not the PCA projection

        # Create a DataFrame for visualization and export
        pca_df = pd.DataFrame(pca_data, columns=['PC1', 'PC2'])
        pca_df['cluster'] = clusters
        pca_df['k'] = k

        # Plot the PCA-transformed data with clusters
        plt.figure(figsize=(8, 6))
        sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='cluster', palette='Set2')
        plt.title(f"Customer Clusters Visualized with PCA (k={k})")
        plt.show()

        # Calculate scores
        silhouette_scores.append(silhouette_score(X_scaled, clusters))
        dbi_scores.append(davies_bouldin_score(X_scaled, clusters))
        ch_scores.append(calinski_harabasz_score(X_scaled, clusters))

        # Append the current cluster data to the overall DataFrame
        if cluster_data.empty:
            cluster_data = pca_df
        else:
            cluster_data = pd.concat([cluster_data, pca_df], ignore_index=True)

    score_dict = {
        'Silhouette Scores': silhouette_scores,
        'Davies-Bouldin Scores': dbi_scores,
        'Calinski-Harabasz Scores': ch_scores
    }

    return score_dict, cluster_data

def export_model_data(cluster_data, filename):
    cluster_data.to_csv(filename, index=False)
    print(f"Model data exported successfully to {filename}")

Hyperparameter Tuning (Finding optimal k)

We test k values from 2 to 5 to find the “elbow” or best-performing configuration (an inertia-based elbow sketch follows the code below).

Code
k_values = list(range(2, 6))
scoreset, cluster_data = perform_kmeans_clustering(X_scaled, k_values)
# export_model_data(cluster_data, "kmeans_clusters.csv")
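
The “elbow” itself is traditionally read from the inertia (within-cluster sum of squares) curve, which the three metrics above do not cover; the following minimal sketch adds it, reusing X_scaled and k_values from above:

Code
# Optional elbow check: K-Means inertia (within-cluster sum of squares) per k
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, random_state=42)
    inertias.append(km.fit(X_scaled).inertia_)

plt.figure(figsize=(8, 5))
plt.plot(k_values, inertias, marker='o')
plt.xticks(k_values)
plt.xlabel("Number of Clusters (k)")
plt.ylabel("Inertia (WCSS)")
plt.title("Elbow Curve")
plt.grid(True)
plt.show()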

Performance Evaluation

We analyse the change in metrics as k increases.

Code
# Plotting scores
for metric_name, scores in scoreset.items():
    plt.figure(figsize=(8,5))
    plt.plot(k_values, scores, marker='o', linestyle='-', color='blue')
    plt.xticks(k_values)
    plt.xlabel("Number of Clusters (k)")
    plt.ylabel(metric_name)
    plt.title(f"{metric_name} vs Number of Clusters")
    plt.grid(True)
    plt.show()

# Save scores
score_df = pd.DataFrame(scoreset, index=k_values)
# score_df.to_csv("scores.csv", index=True, index_label='k_value')

Selection of Best K

Based on the evaluation metrics:

  • Silhouette Score: Highest at k=2.
  • Davies-Bouldin Score: Lowest at k=2.
  • Calinski-Harabasz Score: Highest at k=2.
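
These readings can be checked directly against the score table assembled in the previous code block:

Code
# Inspect the metric values per k; the best k should stand out in each column
print(score_df.round(3))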

Conclusion: We proceed with k=2 as the optimal number of clusters.

8. Basic Inferences (K=2)

We now finalize the model with 2 clusters and analyze the characteristics of each customer segment.

Code
# Final Model Execution
kmeans = KMeans(n_clusters=2, random_state=42)
clusters = kmeans.fit_predict(X_scaled)

# Add back to original dataframes
customer_df['cluster'] = clusters
customer_df_original['cluster'] = clusters

# PCA for final visualization
pca = PCA(n_components=2)
pca_data = pca.fit_transform(X_scaled)
pca_df = pd.DataFrame(pca_data, columns=['PC1', 'PC2'])
pca_df['cluster'] = clusters

plt.figure(figsize=(8,6))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='cluster', palette='Set2')
plt.title("Final Customer Clusters (k=2)")
plt.show()

Cluster Profiling

We examine the profile of the two clusters.

Code
print("Cluster Counts:")
print(customer_df_original['cluster'].value_counts())
Cluster Counts:
cluster
1    1574
0     556
Name: count, dtype: int64

Numeric Feature Summary

Code
numeric_cols = ['avg_order_total', 'total_spent', 'num_orders', 'avg_delay']
summary = customer_df_original.groupby('cluster')[numeric_cols].mean()
print(summary)
         avg_order_total  total_spent  num_orders  avg_delay
cluster                                                     
0            2184.672628  4927.497896    2.269784   3.042971
1            2194.013664  4838.751811    2.186150   3.184366

Categorical Feature Summary

Code
categorical_cols = ['most_used_payment', 'most_delivery_status']

for col in categorical_cols:
    print(f"Distribution of {col} by cluster:")
    print(customer_df_original.groupby('cluster')[col].value_counts(normalize=True).rename("percentage") * 100)
    print("\n")
Distribution of most_used_payment by cluster:
cluster  most_used_payment
0        Cash                 100.000000
         Card                   0.000000
         UPI                    0.000000
         Wallet                 0.000000
1        Card                  51.715375
         UPI                   26.365947
         Wallet                21.918679
         Cash                   0.000000
Name: percentage, dtype: float64


Distribution of most_delivery_status by cluster:
cluster  most_delivery_status 
0        On Time                  85.611511
         Slightly Delayed         12.050360
         Significantly Delayed     2.338129
1        On Time                  84.625159
         Slightly Delayed         13.214740
         Significantly Delayed     2.160102
Name: percentage, dtype: float64

Final Interpretation

Based on the profiles, the split is driven almost entirely by preferred payment method; differences in spending, order frequency, and delay between the two segments are marginal:

  • Cluster 0: “Cash-First Customers” - Pay exclusively in Cash, with slightly higher total spend and a slightly lower average delay.
  • Cluster 1: “Digital-Payment Customers” - Pay by Card, UPI, or Wallet, with an otherwise near-identical profile.
Code
# Mapping cluster names
label_map = {
    0: "Cash-First Customers",
    1: "Digital-Payment Customers"
}

customer_df_original['cluster'] = customer_df_original['cluster'].map(label_map)
pca_df['cluster'] = pca_df['cluster'].map(label_map)

# Save result
# customer_df_original.to_csv('results.csv', index=False)

# Final Visual
plt.figure(figsize=(10,6))
sns.scatterplot(data=pca_df, x='PC1', y='PC2', hue='cluster', palette='Set2')
plt.title("Customer Segments")
plt.show()

Conclusion

The k=2 segmentation separates customers almost entirely by preferred payment method: one segment pays exclusively in Cash, the other by Card, UPI, or Wallet. Spending, order frequency, and delivery performance are nearly identical across the two segments, and the correlation between order total and delivery delay is effectively zero (-0.003). On this dataset, then, delivery delays do not meaningfully differentiate customer value: they appear to affect all customer segments roughly equally, suggesting that improving delivery performance is a system-wide opportunity rather than a segment-specific one.