# LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation

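Yushi Lan<sup>1</sup>, Fangzhou Hong<sup>1</sup>, Shuai Yang<sup>2</sup>, Shangchen Zhou<sup>1</sup>, Xuyi Meng<sup>1</sup>, Xingang Pan<sup>1</sup>, Bo Dai<sup>3</sup>, Chen Change Loy<sup>1</sup>

<sup>1</sup> S-Lab, Nanyang Technological University;
<sup>2</sup> Wangxuan Institute of Computer Technology, Peking University;
<sup>3</sup> Shanghai Artificial Intelligence Laboratory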

LN3Diff is a feedforward 3D diffusion model that creates a high-quality 3D object mesh from text within 8 V100-seconds.

Example prompts: "A standing hund." "An UFO space aircraft." "A sailboat with mast." "An 18th century cannon." "A blue plastic chair."

For more visual results, check out our project page :page_with_curl:

This repository contains the official implementation of LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation.

---

[Project Page][arXiv]

## :mega: Updates

[03/2024] Initial release.

[04/2024] Inference and training code for Objaverse, ShapeNet and FFHQ is released, including the pre-trained models and training datasets.

## :dromedary_camel: TODO

- [x] Release the inference and training code.
- [x] Release the pre-trained checkpoints of ShapeNet and FFHQ.
- [x] Release the pre-trained checkpoints of the T23D Objaverse model trained on the 30K+ instances dataset.
- [x] Release the stage-1 VAE of Objaverse trained on the 80K+ instances dataset.
- [ ] Add Gradio demo.
- [ ] Polish the dataset preparation and training doc.
- [ ] Add metrics evaluation scripts and samples.
- [ ] Lint the code.
- [ ] Release the new T23D Objaverse model trained on the 80K+ instances dataset.

## :handshake: Citation

If you find our work useful for your research, please consider citing the paper:

```
@misc{lan2024ln3diff,
      title={LN3Diff: Scalable Latent Neural Fields Diffusion for Speedy 3D Generation},
      author={Yushi Lan and Fangzhou Hong and Shuai Yang and Shangchen Zhou and Xuyi Meng and Bo Dai and Xingang Pan and Chen Change Loy},
      year={2024},
      eprint={2403.12019},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## :desktop_computer: Requirements

NVIDIA GPUs are required for this project. We conduct all the training on NVIDIA V100-32GiB (ShapeNet, FFHQ) and NVIDIA A100-80GiB (Objaverse). We have tested the inference code on NVIDIA V100. We recommend using Anaconda to manage the Python environments.

The environment can be created via ```conda env create -f environment_ln3diff.yml``` and activated via ```conda activate ln3diff```. If you want to reuse your own PyTorch environment, install the following packages in it:

```
# first, check whether you have installed pytorch (>=2.0) and xformers.
conda install -c conda-forge openexr-python git
pip install openexr lpips imageio kornia opencv-python tensorboard tqdm timm ffmpeg einops beartype "imageio[ffmpeg]" blobfile ninja lmdb webdataset click torchdiffeq transformers
pip install git+https://github.com/nupurkmr9/vision-aided-gan
```

## :running_woman: Inference

### Download Models

The pretrained stage-1 VAE and stage-2 LDM can be downloaded via [OneDrive](https://entuedu-my.sharepoint.com/:f:/g/personal/yushi001_e_ntu_edu_sg/ErdRV9hCYvlBioObT1v_LZ4Bnwye3sv6p5qiVZPNhI9coQ?e=nJgp8t).

Put the downloaded checkpoints under the ```checkpoints``` folder for inference. The checkpoints directory layout should be:

    checkpoints
    ├── ffhq
    │   └── model_joint_denoise_rec_model1580000.pt
    ├── objaverse
    │   ├── model_rec1680000.pt
    │   └── model_joint_denoise_rec_model2310000.pt
    ├── shapenet
    │   ├── car
    │   │   └── model_joint_denoise_rec_model1580000.pt
    │   ├── chair
    │   │   └── model_joint_denoise_rec_model2030000.pt
    │   └── plane
    │       └── model_joint_denoise_rec_model770000.pt
    └── ...

### Inference Commands

Note that extracting the mesh requires 24GiB VRAM.

#### Stage-1 VAE 3D reconstruction

To run the (Objaverse) stage-1 VAE 3D reconstruction and extract the VAE latents for diffusion learning, please run

```bash
bash shell_scripts/final_release/inference/sample_obajverse.sh
```

The mesh extracted with marching cubes can be visualized with Blender or MeshLab.

**We have uploaded the pre-extracted VAE latents [here](https://entuedu-my.sharepoint.com/:f:/g/personal/yushi001_e_ntu_edu_sg/EnXixldDrKhDtrcuPM4vjQYBv06uY58F1mF7f7KVdZ19lQ?e=nXQNdm). The folder contains the corresponding VAE latents (with shape 32x32x12) of 76K G-buffer Objaverse objects. Feel free to use them in your own task.**

For more G-buffer Objaverse examples, download the [demo data](https://entuedu-my.sharepoint.com/:f:/g/personal/yushi001_e_ntu_edu_sg/EoyzVJbMyBhLoKFJbbsq6bYBi1paLwQxIDjTkO1KjI4b1g?e=sJc3rQ).

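The checkpoint layout listed under Download Models can be checked before launching any inference script. The following is a minimal sketch using only the Python standard library; the function name `missing_checkpoints` is hypothetical and not part of this repository, and the file names are copied from the layout above.

```python
from pathlib import Path

# Relative paths copied from the layout listed under "Download Models".
EXPECTED_CHECKPOINTS = [
    "ffhq/model_joint_denoise_rec_model1580000.pt",
    "objaverse/model_rec1680000.pt",
    "objaverse/model_joint_denoise_rec_model2310000.pt",
    "shapenet/car/model_joint_denoise_rec_model1580000.pt",
    "shapenet/chair/model_joint_denoise_rec_model2030000.pt",
    "shapenet/plane/model_joint_denoise_rec_model770000.pt",
]


def missing_checkpoints(root="checkpoints"):
    """Return the expected checkpoint paths that are not files under root.

    Hypothetical helper for illustration; it is not part of the LN3Diff code.
    """
    root = Path(root)
    return [rel for rel in EXPECTED_CHECKPOINTS if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoints()
    if missing:
        print("Missing checkpoints:")
        for rel in missing:
            print("  " + rel)
    else:
        print("All expected checkpoints found.")
```

If you only downloaded a subset of the models (for example, only Objaverse), trim `EXPECTED_CHECKPOINTS` accordingly.
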
#### Stage-2 Text-to-3D

We train the 3D latent diffusion model on top of the latents extracted in stage 1.
In the inference scripts below, set ```--export_mesh True``` to extract a mesh from the generated tri-plane. To change the text prompt, set the ```prompt``` variable. For unconditional sampling, set the classifier-free guidance (CFG) scale ```unconditional_guidance_scale=0```. Feel free to tune the CFG scale to trade off diversity and fidelity.

Note that the diffusion sampling batch size is set to ```4```, which costs around 16GiB VRAM. The mesh extraction of a single instance costs 24GiB VRAM.

For text-to-3D on Objaverse, run

```bash
bash shell_scripts/final_release/inference/sample_obajverse.sh
```

For text-to-3D on ShapeNet, run one of the following commands, which conduct T23D on car, chair and plane, respectively:

```bash
bash shell_scripts/final_release/inference/sample_shapenet_car_t23d.sh
```

```bash
bash shell_scripts/final_release/inference/sample_shapenet_chair_t23d.sh
```

```bash
bash shell_scripts/final_release/inference/sample_shapenet_plane_t23d.sh
```

For text-to-3D on FFHQ, run

```bash
bash shell_scripts/final_release/inference/sample_ffhq_t23d.sh
```

## :running_woman: Training

### Dataset

For Objaverse, we use the renderings provided by [G-buffer Objaverse](https://aigc3d.github.io/gobjaverse/). A demo subset for stage-1 VAE reconstruction can be downloaded from [here](https://entuedu-my.sharepoint.com/:u:/g/personal/yushi001_e_ntu_edu_sg/Eb6LX2x-EgJLpiHbhRxsN9ABnEaSyjG-tsVBcUr_dQ5dnQ?e=JXWQo1). Note that for Objaverse training, we pre-process the raw data into [wds-dataset](https://github.com/webdataset/webdataset) shards for fast and flexible loading. The sample shard data can be found [here](https://entuedu-my.sharepoint.com/:f:/g/personal/yushi001_e_ntu_edu_sg/ErtZQgnEH5ZItDqdUaiVbJgBe4nhZveJemQRqDW6Xwp7Zg?e=Zqt6Ss).

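WebDataset shards are ordinary tar archives in which files that share a name up to the first dot of the basename form one sample. To see what a downloaded shard contains before wiring it into a data loader, a sketch like the one below may help. It uses only Python's `tarfile` module; the helper name `list_shard_samples` and the shard path in the usage note are hypothetical, and the field names stored in the shards of this repository are not documented here, so the script only lists whatever it finds.

```python
import tarfile


def list_shard_samples(shard_path):
    """Map each sample key to the list of field extensions found in a tar shard.

    Follows the WebDataset grouping convention: the key is the member path up
    to the first dot of its basename, and the rest is the field extension.
    Hypothetical helper for illustration; not part of the LN3Diff code.
    """
    samples = {}
    with tarfile.open(shard_path) as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories and other non-file entries
            directory, _, basename = member.name.rpartition("/")
            stem, _, extension = basename.partition(".")
            if not extension:
                continue  # members without an extension carry no field name
            key = f"{directory}/{stem}" if directory else stem
            samples.setdefault(key, []).append(extension)
    return samples
```

For example, `list_shard_samples("path/to/shard.tar")` (with the path replaced by one of the downloaded sample shards) returns a dictionary whose keys are sample names and whose values are the fields stored for each sample.
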
For ShapeNet, we render our own data with foreground masks for training, which can be downloaded from [here](https://entuedu-my.sharepoint.com/:f:/g/personal/yushi001_e_ntu_edu_sg/EijBXIC_bUNOo0L3wnJKRqoBCqVnhhT_BReYRc1tc_0lrA?e=VQwWOZ). For training, we convert the raw data to LMDB for faster data loading. The pre-processed LMDB file can be downloaded from [here](https://entuedu-my.sharepoint.com/:f:/g/personal/yushi001_e_ntu_edu_sg/Ev7L8Als8K9JtLtj1G23Cc0BTNDbhCQPadxNLLVS7mV2FQ?e=C5woyE).

For FFHQ, we use the pre-processed dataset from [EG3D](https://github.com/NVlabs/eg3d) and compress it into LMDB, which can also be found in the OneDrive link above.

### Training Commands

Coming soon.

## :newspaper_roll: License

Distributed under the S-Lab License. See `LICENSE` for more information.

## Contact

If you have any questions, please feel free to contact us via `lanyushi15@gmail.com` or GitHub issues.