MicroLens-1M Multimodal CTR Prediction with Custom CLIP Embeddings

This project provides a complete end-to-end solution for the MicroLens-1M Multimodal CTR (Click-Through Rate) Prediction task from the WWW 2025 Multimodal Recommendation Challenge. It extracts custom multimodal item embeddings with OpenAI’s CLIP (ViT-B/32) by fusing image and text features, then feeds these embeddings into a Transformer + DCNv2 ranking model built with FuxiCTR. With only one training epoch, the solution reaches AUC = 0.8705 and LogLoss = 0.955 on the provided validation set and 0.9026 on the WWW 2025 public leaderboard. The dataset is available on Kaggle at yummyooo123/www2025-mmctr-data. Below, I detail the project’s components, pipeline, setup instructions, and key implementation insights.

Available on GitHub at ammarlouah/Multimodal-CTR-Prediction-Challenge

Project Overview

The project focuses on multimodal CTR prediction using the MicroLens-1M dataset, addressing the challenge by:

  1. Multimodal Embedding Extraction: Generating custom 128-dimensional embeddings by fusing image and text (title + tags) features with a frozen CLIP model.
  2. CTR Model Training: Injecting these embeddings into a custom Transformer_DCN model that combines sequence modeling, feature interactions via DCNv2, and an MLP tower.

The pipeline includes embedding extraction, data enhancement, model training, and prediction generation, creating a robust solution for multimodal recommendation tasks.

Video Demonstration

To showcase the project’s workflow and results, here’s a video walkthrough:

End-to-End Pipeline Demonstration:

Dataset Structure

The project uses the MicroLens_1M_MMCTR dataset from Kaggle (yummyooo123/www2025-mmctr-data), organized as follows:

  • MicroLens_1M_x1/:
    • train.parquet: Training data.
    • valid.parquet: Validation data.
    • test.parquet: Test data for predictions.
    • item_info.parquet: Item metadata including titles and tags.
  • item_images/item_images/*.jpg: Image files for items.
  • item_emb.parquet: Pre-computed embeddings (not used; we generate custom ones).
  • item_feature.parquet: Additional item features.
  • item_seq.parquet: Item sequences.
  • README: Dataset documentation.
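
As a quick sanity check, here is a minimal pandas snippet for peeking at these files; the paths simply follow the layout above:

    import pandas as pd

    # Interaction splits and item metadata, following the layout above
    train = pd.read_parquet("MicroLens_1M_x1/train.parquet")
    item_info = pd.read_parquet("MicroLens_1M_x1/item_info.parquet")

    print(train.shape, train.columns.tolist())
    print(item_info.head())  # item titles, tags, and other metadata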

Functionality

Multimodal Embedding Extraction

  • Loads a frozen CLIP (ViT-B/32) model to extract separate image and text features.
  • Fuses features using learned attention-weighted projections into a 128-dimensional space.
  • Handles cases with missing images by falling back to text-only embeddings.
  • Processes all 91,718 items and saves the results as custom_item_embeddings.parquet.
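
Below is a condensed sketch of how this extraction and fusion step can look. The attention-style weighting, layer sizes, and preprocessing details are illustrative assumptions, not the notebook’s exact code; CLIP itself stays frozen and only the fusion head would be learned.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    # Learned fusion head projecting the concatenated 512-d image and text features to 128-d.
    proj = torch.nn.Linear(512 * 2, 128).to(device)

    @torch.no_grad()
    def clip_features(image_path, text):
        inputs = processor(text=[text], images=Image.open(image_path),
                           return_tensors="pt", padding=True, truncation=True).to(device)
        img = clip.get_image_features(pixel_values=inputs["pixel_values"])
        txt = clip.get_text_features(input_ids=inputs["input_ids"],
                                     attention_mask=inputs["attention_mask"])
        return img, txt

    def fuse(img, txt):
        # Simple attention-style weighting over the two modalities before projecting to 128-d.
        w = torch.softmax(torch.stack([img.norm(dim=-1), txt.norm(dim=-1)]), dim=0)
        fused = torch.cat([w[0].unsqueeze(-1) * img, w[1].unsqueeze(-1) * txt], dim=-1)
        return proj(fused)  # shape: (1, 128)

For items without an image, the text features can be projected on their own, which corresponds to the text-only fallback described above.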

Enhanced Item Information

  • Merges the custom 128-dim embeddings into item_info.parquet as item_emb_d128 (in list format for FuxiCTR compatibility).
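
A plausible version of this merge step with pandas is sketched below; the column layout of custom_item_embeddings.parquet assumed here is not confirmed by the post:

    import numpy as np
    import pandas as pd

    item_info = pd.read_parquet("MicroLens_1M_x1/item_info.parquet")
    emb = pd.read_parquet("custom_item_embeddings.parquet")

    # Assume one row per item: an item_id column plus 128 numeric embedding columns.
    dim_cols = [c for c in emb.columns if c != "item_id"]
    emb["item_emb_d128"] = emb[dim_cols].to_numpy(dtype=np.float32).tolist()

    # Store each vector as a list so FuxiCTR can read item_emb_d128 as a dense sequence feature.
    item_info = item_info.merge(emb[["item_id", "item_emb_d128"]], on="item_id", how="left")
    item_info.to_parquet("MicroLens_1M_x1/item_info.parquet", index=False)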

CTR Model Training (Transformer_DCN)

  • Custom model architecture:
    • Transformer for sequence modeling over historical and target item embeddings.
    • DCNv2 for deep feature interactions.
    • MLP tower for final predictions.
  • Utilizes a custom MMCTRDataLoader to load pre-embedded features.
  • Trained with Adam optimizer, binary cross-entropy loss, and batch size of 128.
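
The following stripped-down PyTorch module shows the overall shape of such a model; the layer sizes, pooling choice, and cross-layer formulation are illustrative and not the exact FuxiCTR implementation:

    import torch
    import torch.nn as nn

    class TransformerDCN(nn.Module):
        def __init__(self, emb_dim=128, num_heads=4, num_cross=3, hidden=(256, 128)):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=emb_dim, nhead=num_heads, batch_first=True)
            self.seq_encoder = nn.TransformerEncoder(layer, num_layers=2)
            in_dim = emb_dim * 2  # pooled history representation + target item embedding
            # DCNv2-style cross layers: x_{l+1} = x_0 * (W_l x_l + b_l) + x_l
            self.cross = nn.ModuleList([nn.Linear(in_dim, in_dim) for _ in range(num_cross)])
            mlp, d = [], in_dim
            for h in hidden:
                mlp += [nn.Linear(d, h), nn.ReLU()]
                d = h
            self.mlp = nn.Sequential(*mlp, nn.Linear(d, 1))

        def forward(self, hist_emb, target_emb):
            # hist_emb: (B, L, 128) pre-computed item embeddings; target_emb: (B, 128)
            seq = self.seq_encoder(hist_emb).mean(dim=1)   # pool the encoded history
            x0 = torch.cat([seq, target_emb], dim=-1)
            x = x0
            for w in self.cross:
                x = x0 * w(x) + x                          # explicit feature crossing
            return torch.sigmoid(self.mlp(x)).squeeze(-1)  # CTR estimate in [0, 1]

Training then follows the settings above: torch.optim.Adam, binary cross-entropy on the sigmoid output, and mini-batches of 128 drawn from the custom MMCTRDataLoader.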

Prediction & Submission

  • Generates predictions on the test set using the trained model.
  • Outputs prediction.csv with columns ID and Task1&2 for submission.
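
A minimal sketch of this step, assuming the model, device, and a test-time dataloader from the sketches above; the ID values here are placeholders and in practice should match the identifiers expected by the challenge submission:

    import pandas as pd
    import torch

    model.eval()
    scores = []
    with torch.no_grad():
        for hist_emb, target_emb in test_loader:  # hypothetical loader over test.parquet
            scores.append(model(hist_emb.to(device), target_emb.to(device)).cpu())
    scores = torch.cat(scores).numpy()

    # Column names match the submission format described above.
    submission = pd.DataFrame({"ID": range(1, len(scores) + 1), "Task1&2": scores})
    submission.to_csv("prediction.csv", index=False)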

Setup Instructions

Prerequisites

  • OS: Any (tested on Linux via Kaggle).
  • Python: 3.11+.
  • Hardware: CUDA-capable GPU (e.g., Tesla T4).
  • Dataset: Download from Kaggle and extract.

On Kaggle

  1. Create a new Kaggle notebook and add the dataset.
  2. Enable the GPU T4 x2 accelerator and Internet access.
  3. Upload the notebook and run it sequentially.

Locally

  1. Download and extract the dataset to a local folder.
  2. Update paths in the notebook’s Config class (e.g., DATA_ROOT, IMAGE_DIR); a sketch of such a class follows this list.
  3. Install dependencies:
    
    pip install pandas pyarrow fastparquet scikit-learn tqdm
    pip install git+https://github.com/shenweichen/DeepCTR-Torch.git
    pip install fuxictr==2.3.7
    pip install transformers==4.40.0 pillow==10.3.0 timm==0.9.16 accelerate==0.29.3
    
  4. Run the notebook step-by-step.
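
For reference, a Config class of the kind step 2 mentions might look like this; only DATA_ROOT and IMAGE_DIR are named in the post, so the remaining fields and all paths are placeholders:

    class Config:
        # Point these at wherever you extracted the Kaggle dataset.
        DATA_ROOT = "/path/to/MicroLens_1M_MMCTR"            # folder containing the parquet files
        IMAGE_DIR = f"{DATA_ROOT}/item_images/item_images"   # *.jpg item images
        OUTPUT_DIR = "./outputs"                             # embeddings, checkpoints, prediction.csv
        BATCH_SIZE = 128                                     # matches the training setup above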

The process takes approximately 55 minutes for embedding extraction and 95 minutes for training (1 epoch) on a T4 GPU.

Implementation Details

  • Embedding Pipeline: Uses transformers for CLIP, with custom fusion logic to combine modalities. Embeddings are saved in Parquet for efficiency.
  • Model Configuration: Defined in FuxiCTR with custom hyperparameters for Transformer layers, DCNv2 crosses, and MLP dimensions.
  • Data Loading: Custom dataloader ensures proper handling of pre-embedded sequences.
  • Training: Single epoch for baseline; can be extended for better performance.
  • Output Files: Includes enhanced item info, model checkpoints, and submission CSV.
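
For illustration, the hyperparameter set this refers to might look like the dict below; only the 128-d embedding size, the batch size of 128, and the single epoch are stated in the post, the rest are assumptions:

    # Illustrative hyperparameters; values marked "assumption" are not from the post.
    hparams = {
        "emb_dim": 128,                   # custom CLIP-fused item embedding size
        "transformer_layers": 2,          # sequence encoder depth (assumption)
        "num_heads": 4,                   # attention heads (assumption)
        "dcn_cross_layers": 3,            # DCNv2 crossing depth (assumption)
        "mlp_hidden_units": [256, 128],   # MLP tower widths (assumption)
        "batch_size": 128,
        "epochs": 1,
        "optimizer": "adam",
        "loss": "binary_crossentropy",
    }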

Troubleshooting

  • GPU Issues: Ensure CUDA is enabled; fall back to CPU if needed (slower).
  • Missing Dependencies: Re-run pip installs if modules are not found.
  • Path Errors: Double-check dataset paths in the config.
  • Memory Overflows: Reduce batch size if OOM occurs during training.

License

The project follows the licensing of the underlying dataset and libraries (e.g., MIT for FuxiCTR). Refer to the Kaggle dataset for specific terms.

Contributing

Contributions are welcome! Suggestions for improvements, such as better fusion techniques or model tweaks, can be discussed via issues on the GitHub repository or directly on Kaggle.

Contact

For questions or feedback, reach out via Kaggle or email ammarlouah9@gmail.com.

Explore the dataset and try the notebook on Kaggle at yummyooo123/www2025-mmctr-data!

Last updated: December 21, 2025

This post is licensed under CC BY 4.0 by the author.