MicroLens-1M Multimodal CTR Prediction with Custom CLIP Embeddings
This project provides a complete end-to-end solution for the MicroLens-1M Multimodal CTR (Click-Through Rate) Prediction task from the WWW 2025 Multimodal Recommendation Challenge. It extracts custom multimodal item embeddings using OpenAI’s CLIP (ViT-B/32) by fusing image and text features, then integrates these into a Transformer + DCNv2 ranking model built with FuxiCTR. The solution achieves strong validation performance: AUC = 0.8705 and LogLoss = 0.955 on the provided valid set and 0.9026 on the public leaderboard (WWW 2025) with only 1 training epoch. The dataset is available on Kaggle at yummyooo123/www2025-mmctr-data. Below, I detail the project’s components, pipeline, setup instructions, and key implementation insights.
Available on GitHub at ammarlouah/Multimodal-CTR-Prediction-Challenge
Project Overview
The project focuses on multimodal CTR prediction using the MicroLens-1M dataset, addressing the challenge by:
- Multimodal Embedding Extraction: Generating custom 128-dimensional embeddings by fusing image and text (title + tags) features with a frozen CLIP model.
- CTR Model Training: Injecting these embeddings into a custom Transformer_DCN model that combines sequence modeling, feature interactions via DCNv2, and an MLP tower.
The pipeline includes embedding extraction, data enhancement, model training, and prediction generation, creating a robust solution for multimodal recommendation tasks.
Video Demonstration
To showcase the project’s workflow and results, here’s a video walkthrough:
End-to-End Pipeline Demonstration:
Dataset Structure
The project uses the MicroLens_1M_MMCTR dataset from Kaggle (yummyooo123/www2025-mmctr-data), organized as follows:
- `MicroLens_1M_x1/`
  - `train.parquet`: Training data.
  - `valid.parquet`: Validation data.
  - `test.parquet`: Test data for predictions.
  - `item_info.parquet`: Item metadata including titles and tags.
- `item_images/item_images/*.jpg`: Image files for items.
- `item_emb.parquet`: Pre-computed embeddings (not used; we generate custom ones).
- `item_feature.parquet`: Additional item features.
- `item_seq.parquet`: Item sequences.
- `README`: Dataset documentation.
Functionality
Multimodal Embedding Extraction
- Loads a frozen CLIP (ViT-B/32) model to extract separate image and text features.
- Fuses features using learned attention-weighted projections into a 128-dimensional space.
- Handles cases with missing images by falling back to text-only embeddings.
- Processes all 91,718 items and saves the results as `custom_item_embeddings.parquet`.
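To make the fusion step concrete, here is a minimal sketch using the Hugging Face `transformers` CLIP API. The frozen ViT-B/32 backbone, 128-dim output, attention-weighted projection, and text-only fallback come from the description above; the module and function names (`AttentionFusion`, `embed_item`) and the exact fusion layout are illustrative, not the notebook's actual code.

```python
import torch
import torch.nn as nn
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

class AttentionFusion(nn.Module):
    """Fuse CLIP image/text features into a 128-dim item embedding (illustrative layout)."""
    def __init__(self, clip_dim=512, out_dim=128):
        super().__init__()
        self.attn = nn.Linear(clip_dim, 1)        # per-modality attention score
        self.proj = nn.Linear(clip_dim, out_dim)  # projection to 128 dims

    def forward(self, img_feat, txt_feat):
        feats = torch.stack([img_feat, txt_feat], dim=1)   # (B, 2, 512)
        weights = torch.softmax(self.attn(feats), dim=1)   # (B, 2, 1)
        fused = (weights * feats).sum(dim=1)               # attention-weighted sum
        return self.proj(fused)                            # (B, 128)

# In the full pipeline the fusion weights are learned; here they are illustrative.
fusion = AttentionFusion().to(device).eval()

@torch.no_grad()
def embed_item(image_path, title_and_tags):
    text_in = processor(text=[title_and_tags], return_tensors="pt",
                        padding=True, truncation=True).to(device)
    txt_feat = clip.get_text_features(**text_in)
    if image_path is None:                                 # missing image: text-only fallback
        return fusion.proj(txt_feat)
    img_in = processor(images=Image.open(image_path).convert("RGB"),
                       return_tensors="pt").to(device)
    img_feat = clip.get_image_features(**img_in)
    return fusion(img_feat, txt_feat)
```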
Enhanced Item Information
- Merges the custom 128-dim embeddings into `item_info.parquet` as `item_emb_d128` (in list format for FuxiCTR compatibility).
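A sketch of that merge step with pandas, assuming both tables share an `item_id` key and that the raw embedding column is named `emb`; both names are assumptions, since only the output column `item_emb_d128` is fixed by the pipeline above.

```python
import pandas as pd

item_info = pd.read_parquet("MicroLens_1M_x1/item_info.parquet")
emb = pd.read_parquet("custom_item_embeddings.parquet")

# Store each 128-dim vector as a plain Python list so FuxiCTR can read the column.
emb["item_emb_d128"] = emb["emb"].apply(lambda v: [float(x) for x in v])

# "item_id" is an assumed join key; adjust to the dataset's actual key column.
enhanced = item_info.merge(emb[["item_id", "item_emb_d128"]], on="item_id", how="left")
enhanced.to_parquet("item_info_enhanced.parquet", index=False)
```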
CTR Model Training (Transformer_DCN)
- Custom model architecture:
- Transformer for sequence modeling over historical and target item embeddings.
- DCNv2 for deep feature interactions.
- MLP tower for final predictions.
- Utilizes a custom `MMCTRDataLoader` to load pre-embedded features.
- Trained with the Adam optimizer, binary cross-entropy loss, and a batch size of 128.
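The forward path of that architecture can be sketched roughly as follows; this is a simplified stand-in rather than the FuxiCTR implementation, assuming 128-dim item embeddings as input, and the head counts, layer counts, and hidden sizes are illustrative.

```python
import torch
import torch.nn as nn

class CrossNetV2(nn.Module):
    """DCNv2-style cross layers: x_{l+1} = x_0 * (W_l x_l + b_l) + x_l."""
    def __init__(self, dim, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x0):
        x = x0
        for layer in self.layers:
            x = x0 * layer(x) + x
        return x

class TransformerDCN(nn.Module):
    """Transformer over history + target embeddings, DCNv2 crosses, then an MLP tower."""
    def __init__(self, emb_dim=128, n_heads=4, n_layers=2, hidden=256):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=n_heads,
                                               dim_feedforward=hidden, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_layers)
        self.cross = CrossNetV2(dim=2 * emb_dim)
        self.mlp = nn.Sequential(nn.Linear(2 * emb_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, hist_emb, target_emb):
        # hist_emb: (B, L, 128) historical item embeddings; target_emb: (B, 128)
        seq = self.encoder(torch.cat([hist_emb, target_emb.unsqueeze(1)], dim=1))
        pooled = seq.mean(dim=1)                     # pool the sequence representation
        x = torch.cat([pooled, target_emb], dim=-1)  # (B, 256)
        logit = self.mlp(self.cross(x))
        return torch.sigmoid(logit).squeeze(-1)      # CTR probability
```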
Prediction & Submission
- Generates predictions on the test set using the trained model.
- Outputs `prediction.csv` with columns `ID` and `Task1&2` for submission.
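The submission step then reduces to scoring the test set and writing the two required columns; a sketch, continuing from the architecture sketch above and assuming a test loader that yields `(row_ids, hist_emb, target_emb)` batches, which is an illustrative interface rather than the actual `MMCTRDataLoader` signature.

```python
import pandas as pd
import torch

model.eval()
rows = []
with torch.no_grad():
    for row_ids, hist_emb, target_emb in test_loader:      # hypothetical loader interface
        scores = model(hist_emb.to(device), target_emb.to(device))
        rows.extend(zip(row_ids.tolist(), scores.cpu().tolist()))

submission = pd.DataFrame(rows, columns=["ID", "Task1&2"])  # column names required for submission
submission.to_csv("prediction.csv", index=False)
```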
Setup Instructions
Prerequisites
- OS: Any (tested on Linux via Kaggle).
- Python: 3.11+.
- Hardware: CUDA-capable GPU (e.g., Tesla T4).
- Dataset: Download from Kaggle and extract.
On Kaggle (Recommended)
- Create a new Kaggle notebook and add the dataset.
- Enable GPU T4 x2 accelerator and Internet.
- Upload and run the notebook sequentially.
Locally
- Download and extract the dataset to a local folder.
- Update paths in the notebook’s `Config` class (e.g., `DATA_ROOT`, `IMAGE_DIR`); a sketch of this class is shown after this list.
- Install dependencies:

```bash
pip install pandas pyarrow fastparquet scikit-learn tqdm
pip install git+https://github.com/shenweichen/DeepCTR-Torch.git
pip install fuxictr==2.3.7
pip install transformers==4.40.0 pillow==10.3.0 timm==0.9.16 accelerate==0.29.3
```
- Run the notebook step-by-step.
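As a reference for the path edits mentioned above, a minimal `Config` sketch might look like this; `DATA_ROOT` and `IMAGE_DIR` are named in the notebook, while the remaining attributes simply restate values quoted elsewhere in this write-up.

```python
class Config:
    # Dataset locations: update these to wherever you extracted the download.
    DATA_ROOT = "/path/to/MicroLens_1M_MMCTR"
    IMAGE_DIR = f"{DATA_ROOT}/item_images/item_images"

    # Values quoted elsewhere in this write-up.
    EMB_DIM = 128      # custom multimodal embedding size
    BATCH_SIZE = 128   # training batch size
    EPOCHS = 1         # baseline run uses a single epoch
```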
The process takes approximately 55 minutes for embedding extraction and 95 minutes for training (1 epoch) on a T4 GPU.
Implementation Details
- Embedding Pipeline: Uses `transformers` for CLIP, with custom fusion logic to combine modalities. Embeddings are saved in Parquet for efficiency.
- Model Configuration: Defined in FuxiCTR with custom hyperparameters for Transformer layers, DCNv2 crosses, and MLP dimensions.
- Data Loading: Custom dataloader ensures proper handling of pre-embedded sequences.
- Training: Single epoch for baseline; can be extended for better performance.
- Output Files: Includes enhanced item info, model checkpoints, and submission CSV.
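Put together, the single-epoch training run described above amounts to a loop along these lines; a sketch assuming a loader that yields `(hist_emb, target_emb, label)` tensors, with an illustrative learning rate, not the FuxiCTR trainer itself.

```python
import torch
import torch.nn as nn

model = TransformerDCN().to(device)                        # from the architecture sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # Adam; the learning rate is illustrative
criterion = nn.BCELoss()                                   # binary cross-entropy on CTR probabilities

model.train()
for hist_emb, target_emb, labels in train_loader:          # hypothetical loader (batch size 128)
    optimizer.zero_grad()
    preds = model(hist_emb.to(device), target_emb.to(device))
    loss = criterion(preds, labels.float().to(device))
    loss.backward()
    optimizer.step()
```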
Troubleshooting
- GPU Issues: Ensure CUDA is enabled; fallback to CPU if needed (slower).
- Missing Dependencies: Re-run pip installs if modules are not found.
- Path Errors: Double-check dataset paths in the config.
- Memory Overflows: Reduce batch size if OOM occurs during training.
License
The project follows the licensing of the underlying dataset and libraries (e.g., MIT for FuxiCTR). Refer to the Kaggle dataset for specific terms.
Contributing
Contributions are welcome! Suggestions for improvements, such as enhanced fusion techniques or model tweaks, can be discussed via issues on a forked repository or directly on Kaggle.
Contact
For questions or feedback, reach out via Kaggle or email ammarlouah9@gmail.com.
Explore the dataset and try the notebook on Kaggle at yummyooo123/www2025-mmctr-data!
Last updated: December 21, 2025