Overview

This project uses machine learning to predict product sales and to identify which product features are associated with higher sales. I worked through a full data science and machine learning process, demonstrating that I can:

  • define a focused machine learning research question,
  • clean and prepare a dataset for analysis,
  • explore patterns in the data with summary statistics and visualizations,
  • build and compare multiple machine learning models,
  • and interpret the results honestly, including the limitations of the dataset.

Dataset

  • Source: https://www.kaggle.com/datasets/fahmidachowdhury/e-commerce-sales-analysis?resource=download
  • Size: 1,000 rows with product information and monthly sales columns
  • Description: This dataset contains product-level information including category, price, review score, review count, and sales for 12 months. I created a new target variable called annual_sales by summing the monthly sales columns.
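The annual_sales target described above can be sketched as a row-wise sum over the monthly columns. This is a minimal illustration on a toy frame, not the project's actual notebook code: the `sales_month_` column prefix and the helper name `add_annual_sales` are assumptions, and the toy frame has only three monthly columns to keep it short.

```python
import pandas as pd


def add_annual_sales(df, prefix="sales_month_"):
    """Return a copy of df with an annual_sales column.

    The prefix is an assumed naming scheme for the monthly sales columns;
    adjust it to match the real column names in the dataset.
    """
    monthly_cols = [c for c in df.columns if c.startswith(prefix)]
    out = df.copy()
    # Sum across the monthly columns for each product (row).
    out["annual_sales"] = out[monthly_cols].sum(axis=1)
    return out


# Toy data for illustration only (not taken from the Kaggle dataset).
toy = pd.DataFrame({
    "price": [10.0, 25.5],
    "sales_month_1": [100, 50],
    "sales_month_2": [120, 60],
    "sales_month_3": [80, 40],
})
toy = add_annual_sales(toy)
print(toy[["price", "annual_sales"]])
```

Selecting the columns by prefix avoids hard-coding twelve names, but it will silently pick up any other column that happens to share the prefix, so it is worth checking `monthly_cols` before trusting the sum.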

Methods

  • Data cleaning and preprocessing with Pandas
  • Exploratory data analysis with Pandas, Matplotlib, and Seaborn
  • Feature encoding and scaling with scikit-learn
  • Machine learning models:
    • Linear Regression
    • Random Forest Regressor
    • Gradient Boosting Regressor
  • Model evaluation using:
    • MAE
    • RMSE
    • R²
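The encoding, scaling, and model comparison steps listed above could be wired together roughly as follows. This is a sketch under stated assumptions rather than the project's actual code: the data is synthetic and random, the column names (`category`, `price`, `review_score`, `review_count`, `annual_sales`) are assumed from the dataset description, and the split ratio and model hyperparameters are arbitrary choices.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data (assumption): the target is drawn independently
# of the features, so no model is expected to predict it well.
rng = np.random.default_rng(0)
n = 200
data = pd.DataFrame({
    "category": rng.choice(["Toys", "Books", "Health"], size=n),
    "price": rng.uniform(5, 500, size=n),
    "review_score": rng.uniform(1, 5, size=n),
    "review_count": rng.integers(0, 1000, size=n),
    "annual_sales": rng.uniform(1000, 12000, size=n),
})

X = data.drop(columns="annual_sales")
y = data["annual_sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)


def make_preprocess():
    """One-hot encode the categorical column and scale the numeric ones."""
    return ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
        ("num", StandardScaler(), ["price", "review_score", "review_count"]),
    ])


models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingRegressor(random_state=42),
}

results = {}
for name, model in models.items():
    # Fitting the preprocessing inside the pipeline keeps test data
    # out of the encoder and scaler statistics.
    pipe = Pipeline([("prep", make_preprocess()), ("model", model)])
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    results[name] = {
        "MAE": mean_absolute_error(y_test, pred),
        "RMSE": float(np.sqrt(mean_squared_error(y_test, pred))),
    }

for name, scores in results.items():
    print(f"{name}: MAE={scores['MAE']:.1f}, RMSE={scores['RMSE']:.1f}")
```

Wrapping each model in its own pipeline means all three are evaluated on the same split with the same preprocessing, which keeps the MAE and RMSE values directly comparable.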

Full Essay & Code

Results

This project found that the available variables had very weak predictive power for annual sales. The best model was Linear Regression, but its R² value was still slightly below 0, meaning it performed marginally worse than simply predicting the mean annual sales for every product. None of the models predicted sales well.
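To make the negative R² concrete, the following sketch computes MAE, RMSE, and R² by hand on made-up numbers. The values are illustrative assumptions, not results from this project. R² compares a model's squared error with that of a baseline that always predicts the mean, so the baseline scores exactly 0 and any model with larger squared error scores below 0.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up values for illustration only.
y_true = [100.0, 200.0, 300.0, 400.0]
mean_baseline = [250.0] * 4          # always predict the mean of y_true
poor_model = [300.0, 250.0, 200.0, 350.0]

print("baseline R2:", r2(y_true, mean_baseline))
print("poor model R2:", round(r2(y_true, poor_model), 3))
print("poor model MAE:", mae(y_true, poor_model))
print("poor model RMSE:", round(rmse(y_true, poor_model), 2))
```

Here the poor model's squared error is 10% larger than the baseline's, which gives an R² of about -0.1: a value "slightly below 0" in the same sense as the result reported above.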

This suggests that variables like price, category, review score, and review count were not enough on their own to explain sales in this dataset. More useful predictors might include advertising, discounts, brand strength, seasonality, or customer demand.

[Figure: Distribution of annual sales]

[Figure: Actual vs Predicted Annual Sales]