Scalable Movie Recommendation Engine (Netflix Prize)

Distributed Collaborative Filtering on 100M+ Ratings.

Stack: Python, PyTorch, SQL, NumPy, Pandas, Surprise Library.

Project Overview

  • The Challenge: Build a recommendation system on the Netflix Prize dataset (480K users, 17K movies, 100M+ ratings) despite extreme data sparsity (only about 1.2% of the user-movie matrix is observed) and the scalability constraints of a dataset this size.
  • Key Objectives: Minimize RMSE through optimized matrix factorization and ensemble modeling.
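As a quick sanity check on the figures above, the sketch below recomputes the matrix density from the rounded dataset sizes and defines the RMSE metric the project optimizes. The function names `density` and `rmse` are illustrative, not taken from the project code, and the inputs are the rounded counts quoted above rather than exact dataset totals.

```python
import numpy as np

# Rounded sizes quoted in the overview (approximate, not exact counts).
N_USERS = 480_000
N_MOVIES = 17_000
N_RATINGS = 100_000_000


def density(n_ratings, n_users, n_items):
    """Fraction of the user-item matrix that holds an observed rating."""
    return n_ratings / (n_users * n_items)


def rmse(y_true, y_pred):
    """Root mean squared error between observed and predicted ratings."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


print(f"density = {density(N_RATINGS, N_USERS, N_MOVIES):.2%}")  # density = 1.23%
```

With the rounded counts the density comes out at roughly 1.2%, which matches the sparsity figure quoted above.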

Technical Implementation

Exploratory Data Analysis (EDA)

Analyzed the rating distributions and found that 10% of users provided 50% of all ratings. Because this small group of power users dominates the training signal, their ratings required specific weighting.
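One way to measure this kind of concentration is sketched below: count ratings per user, sort users by activity, and compute the share of ratings held by the top slice. This is a minimal NumPy illustration on toy data; `top_user_share` is a hypothetical helper, not the project's actual EDA code, and the toy numbers are not from the Netflix dataset.

```python
import numpy as np


def top_user_share(user_ids, top_fraction=0.10):
    """Share of all ratings contributed by the most active `top_fraction` of users."""
    _, counts = np.unique(np.asarray(user_ids), return_counts=True)
    counts = np.sort(counts)[::-1]  # most active users first
    n_top = max(1, int(np.ceil(top_fraction * counts.size)))
    return counts[:n_top].sum() / counts.sum()


# Toy data: user 0 rates 50 times, users 1..9 rate 5 times each.
user_ids = np.concatenate([np.zeros(50, dtype=int), np.repeat(np.arange(1, 10), 5)])
share = top_user_share(user_ids)
print(f"top 10% of users hold {share:.1%} of ratings")
```

In this toy example one user out of ten accounts for 50 of the 95 ratings, which is the same kind of skew the analysis reports.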

Algorithms Implemented

  • Baseline Models: Global Mean, User/Item Mean, and Regression-based approaches.
  • Collaborative Filtering: KNN (User-User & Item-Item) and SVD (Singular Value Decomposition).
  • Advanced Deep Learning: Restricted Boltzmann Machines (RBM) to capture non-linear latent features in the rating data.
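To make the baseline family above concrete, here is a minimal sketch of a mean-based predictor: global mean plus a damped item offset and a damped user offset. The `damping` constant, the function names, and the toy ratings are assumptions for illustration; the project's actual baselines may differ.

```python
import numpy as np


def fit_baseline(users, items, ratings, n_users, n_items, damping=5.0):
    """Fit global mean plus damped item and user offsets."""
    mu = ratings.mean()
    # Item offset: mean residual per item, shrunk toward 0 for rarely rated items.
    item_cnt = np.bincount(items, minlength=n_items)
    item_sum = np.bincount(items, weights=ratings - mu, minlength=n_items)
    b_i = item_sum / (item_cnt + damping)
    # User offset: what remains after removing the global mean and item offset.
    resid = ratings - mu - b_i[items]
    user_cnt = np.bincount(users, minlength=n_users)
    user_sum = np.bincount(users, weights=resid, minlength=n_users)
    b_u = user_sum / (user_cnt + damping)
    return mu, b_u, b_i


def predict_baseline(mu, b_u, b_i, users, items):
    """Predict ratings, clipped to the 1-5 star range."""
    return np.clip(mu + b_u[users] + b_i[items], 1.0, 5.0)


# Toy (user, item, rating) triples; user 3 and item 2 have no ratings.
users = np.array([0, 0, 1, 1, 2])
items = np.array([0, 1, 0, 1, 0])
ratings = np.array([5.0, 3.0, 4.0, 2.0, 5.0])
mu, b_u, b_i = fit_baseline(users, items, ratings, n_users=4, n_items=3)
preds = predict_baseline(mu, b_u, b_i, users, items)
```

Users or items with no ratings get a zero offset, so the prediction falls back to the global mean. The KNN, SVD, and RBM models are then judged by how much they improve on a baseline of this kind.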

Performance Metrics

Achieved a significant reduction in RMSE, using 5-fold cross-validation to check that the models generalize to unseen ratings.
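The 5-fold procedure can be sketched as follows: shuffle the ratings, split them into five folds, fit on four, and score RMSE on the held-out fold. For brevity this sketch scores only a global-mean predictor; the helper names and the synthetic ratings are assumptions, and the RMSE it produces says nothing about the project's actual results.

```python
import numpy as np


def kfold_indices(n_samples, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs that partition a shuffled index range."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for k in range(n_splits):
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != k])
        yield train_idx, folds[k]


def cv_rmse_global_mean(ratings, n_splits=5, seed=0):
    """Mean and std of held-out RMSE for a global-mean predictor."""
    ratings = np.asarray(ratings, dtype=float)
    scores = []
    for train_idx, test_idx in kfold_indices(len(ratings), n_splits, seed):
        mu = ratings[train_idx].mean()  # fit on the training folds only
        err = ratings[test_idx] - mu
        scores.append(np.sqrt(np.mean(err ** 2)))
    return float(np.mean(scores)), float(np.std(scores))


# Synthetic 1-5 star ratings, for illustration only.
toy_ratings = np.random.default_rng(42).integers(1, 6, size=200)
mean_rmse, std_rmse = cv_rmse_global_mean(toy_ratings)
```

Reporting the spread across folds alongside the mean gives a rough sense of how stable the score is.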

Engineering Highlights

  • Data Pipeline: Handled large-scale data preprocessing, including cleaning, normalization, and SQL-based query optimization for distributed datasets.
  • Optimization: Implemented Stochastic Gradient Descent (SGD) and Alternating Least Squares (ALS) for matrix factorization.
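The two optimizers named above can be sketched on a toy problem as follows. Both fit user factors P and item factors Q to mean-centered ratings with L2 regularization: SGD updates one rating at a time, while ALS alternates closed-form ridge solves for each user and each item. All hyperparameters, function names, and the toy data are assumptions for illustration, not the project's settings.

```python
import numpy as np


def sgd_mf(users, items, resid, n_users, n_items, k=4, lr=0.05, reg=0.02,
           n_epochs=300, seed=0):
    """Matrix factorization by SGD on mean-centered ratings `resid`."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0.0, 0.1, (n_users, k))
    Q = rng.normal(0.0, 0.1, (n_items, k))
    for _ in range(n_epochs):
        for idx in rng.permutation(len(resid)):
            u, i = users[idx], items[idx]
            err = resid[idx] - P[u] @ Q[i]
            p_old = P[u].copy()  # use the pre-update user vector for Q's step
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q


def als_mf(users, items, resid, n_users, n_items, k=4, reg=0.1, n_iters=20, seed=0):
    """Matrix factorization by ALS: alternate ridge solves for P and Q."""
    rng = np.random.default_rng(seed)
    P = rng.normal(0.0, 0.1, (n_users, k))
    Q = rng.normal(0.0, 0.1, (n_items, k))
    ridge = reg * np.eye(k)
    for _ in range(n_iters):
        for u in range(n_users):
            mask = users == u
            if mask.any():  # users without ratings keep their current vector
                Qu = Q[items[mask]]
                P[u] = np.linalg.solve(Qu.T @ Qu + ridge, Qu.T @ resid[mask])
        for i in range(n_items):
            mask = items == i
            if mask.any():
                Pi = P[users[mask]]
                Q[i] = np.linalg.solve(Pi.T @ Pi + ridge, Pi.T @ resid[mask])
    return P, Q


def train_rmse(P, Q, users, items, resid):
    pred = np.sum(P[users] * Q[items], axis=1)
    return float(np.sqrt(np.mean((resid - pred) ** 2)))


# Toy (user, item, rating) triples, for illustration only.
mf_users = np.array([0, 0, 1, 1, 2, 2, 3])
mf_items = np.array([0, 1, 0, 2, 1, 2, 0])
mf_ratings = np.array([5.0, 3.0, 4.0, 1.0, 2.0, 1.0, 5.0])
mf_resid = mf_ratings - mf_ratings.mean()

base_rmse = float(np.sqrt(np.mean(mf_resid ** 2)))
sgd_rmse = train_rmse(*sgd_mf(mf_users, mf_items, mf_resid, 4, 3), mf_users, mf_items, mf_resid)
als_rmse = train_rmse(*als_mf(mf_users, mf_items, mf_resid, 4, 3), mf_users, mf_items, mf_resid)
```

SGD is cheap per update and easy to extend with bias terms, while each ALS step is an independent least-squares problem per user or item, which is why ALS is often the choice when the work has to be parallelized.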