Scalable Movie Recommendation Engine (Netflix Prize)
Distributed Collaborative Filtering on 100M+ Ratings.
Stack: Python, PyTorch, SQL, NumPy, Pandas, Surprise Library.
Project Overview
- The Challenge: Build a recommendation system on the Netflix Prize dataset (480K users, 17K movies, 100M+ ratings) while handling extreme data sparsity (only about 1.2% of user-movie pairs are rated) and scalability constraints.
- Key Objectives: Minimize RMSE (root mean squared error) through optimized matrix factorization and ensemble modeling.
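As a quick check on the sparsity figure, the density follows directly from the rounded dataset sizes quoted above. This is a minimal sketch; the function name `matrix_density` is illustrative and not taken from the project code.

```python
def matrix_density(n_users, n_items, n_ratings):
    """Fraction of user-item cells that hold an observed rating."""
    return n_ratings / (n_users * n_items)

# Rounded figures from the overview: 480K users, 17K movies, 100M ratings.
density = matrix_density(480_000, 17_000, 100_000_000)
print(f"density = {density:.2%}, unrated = {1 - density:.2%}")
```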
Technical Implementation
Exploratory Data Analysis (EDA)
Analyzed the rating distributions in depth and found that 10% of users provided 50% of the ratings, which called for specific weighting of these power users.
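The concentration statistic above can be measured with a short helper. The sketch below uses plain Python and hypothetical toy data; `top_user_share` is an illustrative name, and on the full dataset the same count would more likely be done with Pandas or SQL.

```python
from collections import Counter

def top_user_share(user_ids, top_fraction=0.10):
    """Share of all ratings contributed by the most active `top_fraction` of users."""
    counts = sorted(Counter(user_ids).values(), reverse=True)
    n_top = max(1, round(len(counts) * top_fraction))
    return sum(counts[:n_top]) / sum(counts)

# Hypothetical toy log: one power user ("a") with 9 ratings, nine users with 1 each.
toy_user_ids = ["a"] * 9 + list("bcdefghij")
share = top_user_share(toy_user_ids)
print(f"top 10% of users wrote {share:.0%} of the ratings")
```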
Algorithms Implemented
- Baseline Models: Global Mean, User/Item Mean, and Regression-based approaches.
- Collaborative Filtering: KNN (User-User & Item-Item) and SVD (Singular Value Decomposition).
- Advanced Deep Learning: Restricted Boltzmann Machines (RBM) to capture non-linear latent features in the rating data.
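To make the item-item KNN idea concrete, here is a minimal sketch with an item-mean baseline as the fallback. It assumes cosine similarity over co-rated users and a small `k`; the project's actual similarity measure and neighbourhood size are not stated, and all function names are illustrative.

```python
from collections import defaultdict
from math import sqrt

def item_means(ratings):
    """Per-item mean rating and the global mean, from (user, item, rating) triples."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _, item, r in ratings:
        sums[item] += r
        counts[item] += 1
    global_mean = sum(sums.values()) / sum(counts.values())
    return {i: sums[i] / counts[i] for i in sums}, global_mean

def cosine_item_similarity(ratings, item_a, item_b):
    """Cosine similarity between two items, restricted to users who rated both."""
    by_item = defaultdict(dict)
    for user, item, r in ratings:
        by_item[item][user] = r
    common = by_item[item_a].keys() & by_item[item_b].keys()
    if not common:
        return 0.0
    dot = sum(by_item[item_a][u] * by_item[item_b][u] for u in common)
    norm_a = sqrt(sum(by_item[item_a][u] ** 2 for u in common))
    norm_b = sqrt(sum(by_item[item_b][u] ** 2 for u in common))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def predict_item_knn(ratings, user, target, k=2):
    """Similarity-weighted average of the user's ratings on the k most similar items."""
    means, global_mean = item_means(ratings)
    rated = [(item, r) for u, item, r in ratings if u == user and item != target]
    neighbours = sorted(
        ((cosine_item_similarity(ratings, target, item), r) for item, r in rated),
        reverse=True,
    )[:k]
    weight = sum(s for s, _ in neighbours)
    if weight == 0:
        # No usable neighbours: fall back to the item-mean (or global-mean) baseline.
        return means.get(target, global_mean)
    return sum(s * r for s, r in neighbours) / weight

# Hypothetical toy ratings: (user, movie, stars).
toy = [("u1", "m1", 5), ("u1", "m2", 5), ("u2", "m1", 4), ("u2", "m2", 4),
       ("u2", "m3", 1), ("u3", "m1", 2), ("u3", "m3", 5)]
knn_prediction = predict_item_knn(toy, "u1", "m3")
cold_start_prediction = predict_item_knn(toy, "u9", "m3")
```

The sketch rebuilds its lookup tables on every call, which is fine for a toy example but would need a precomputed similarity index at Netflix Prize scale.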
Performance Metrics
Achieved a significant RMSE reduction over the baseline models, using 5-fold cross-validation to estimate how well each model generalizes to held-out ratings.
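The evaluation loop can be sketched as follows. The global-mean predictor here is only a stand-in model, and the helper names are illustrative; the Surprise library ships its own cross-validation utilities, which this sketch does not use.

```python
import random
from math import sqrt

def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

def cross_validated_rmse(rating_values, k=5):
    """Mean test RMSE of a global-mean predictor (stand-in model) over k folds."""
    scores = []
    for train, test in k_fold_indices(len(rating_values), k):
        mean = sum(rating_values[i] for i in train) / len(train)
        scores.append(rmse([mean] * len(test), [rating_values[i] for i in test]))
    return sum(scores) / k
```

Swapping the stand-in for a real model only changes the two lines inside the loop: fit on the training indices, then predict the test indices.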
Engineering Highlights
- Data Pipeline: Handled large-scale data preprocessing, including cleaning, normalization, and SQL-based query optimization for distributed datasets.
- Optimization: Implemented Stochastic Gradient Descent (SGD) and Alternating Least Squares (ALS) for matrix factorization.
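The SGD variant of matrix factorization can be illustrated with a minimal sketch. It models each rating as the global mean plus a dot product of user and item factors; the hyperparameters, the toy data, and the names `train_mf_sgd` and `predict_mf` are assumptions for illustration, not the project's settings.

```python
import random
from math import sqrt

def train_mf_sgd(ratings, n_factors=2, n_epochs=200, lr=0.02, reg=0.02, seed=0):
    """Fit r_ui ~ mu + p_u . q_i by SGD on squared error with L2 regularization."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Small random initial factors, one vector per user and per item.
    P = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    Q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    order = list(ratings)
    for _ in range(n_epochs):
        rng.shuffle(order)  # visit ratings in a fresh random order each epoch
        for u, i, r in order:
            pu, qi = P[u], Q[i]
            err = r - (mu + sum(a * b for a, b in zip(pu, qi)))
            for f in range(n_factors):
                pu_f, qi_f = pu[f], qi[f]  # keep old values so both updates use them
                pu[f] += lr * (err * qi_f - reg * pu_f)
                qi[f] += lr * (err * pu_f - reg * qi_f)
    return mu, P, Q

def predict_mf(mu, P, Q, user, item):
    """Predicted rating; unseen users or items fall back to the global mean."""
    if user not in P or item not in Q:
        return mu
    return mu + sum(a * b for a, b in zip(P[user], Q[item]))

# Hypothetical toy data: users a, b like movies x, y; users c, d like w, z.
toy_ratings = [(u, i, 5.0 if (u in "ab") == (i in "xy") else 1.0)
               for u in "abcd" for i in "wxyz"]
mu, P, Q = train_mf_sgd(toy_ratings)
train_rmse = sqrt(sum((r - predict_mf(mu, P, Q, u, i)) ** 2
                      for u, i, r in toy_ratings) / len(toy_ratings))
```

ALS optimizes the same kind of regularized objective but alternates between closed-form least-squares solves, holding the item factors fixed while updating the user factors and then the reverse. Each half-step is independent per user or per item, which makes it easy to parallelize.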