What is the methodology used to measure the carbon footprint of training Llama 3.1?

#34 opened by mrchrisadams

Hi there,

I have a question about the methodology used to work out the training carbon footprint for Llama 3.1. Here's the quote from the main README for the page:

Training Greenhouse Gas Emissions Estimated total location-based greenhouse gas emissions were 11,390 tons CO2eq for training. Since 2020, Meta has maintained net zero greenhouse gas emissions in its global operations and matched 100% of its electricity use with renewable energy, therefore the total market-based greenhouse gas emissions for training were 0 tons CO2eq.

[table]
The methodology used to determine training energy use and greenhouse gas emissions can be found here. Since Meta is openly releasing these models, the training energy use and greenhouse gas emissions will not be incurred by others.

Source: meta-llama/Meta-Llama-3.1-8B · Hugging Face by @huggingface
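
As an aside, my rough mental model of the location-based vs market-based distinction in that quote is something like the sketch below. All figures are made up for illustration, and this is just my reading of GHG Protocol scope 2 accounting, not Meta's actual method.

```python
# Rough sketch of location-based vs market-based scope 2 accounting.
# All numbers below are hypothetical, chosen only to illustrate the mechanics.

energy_mwh = 1_000                    # hypothetical training energy use
grid_intensity = 0.40                 # tCO2eq per MWh of the local grid mix (assumed)
renewable_matched_fraction = 1.0      # 100% matched with renewable energy purchases

# Location-based: energy multiplied by the average intensity of the grid it ran on.
location_based = energy_mwh * grid_intensity

# Market-based: contractual instruments (RECs/PPAs) are netted off, so 100%
# renewable matching drives the figure to zero.
market_based = energy_mwh * grid_intensity * (1 - renewable_matched_fraction)

print(f"location-based: {location_based:.0f} tCO2eq, market-based: {market_based:.0f} tCO2eq")
```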

The linked paper is The Carbon Footprint of Machine Learning Training Will Plateau, Then Shrink, a paper on arXiv that appears to be about Google, not Meta.

If it is indeed the correct paper, I'm guessing this screenshot might show the methodology used:

[Screenshot 2024-08-21 at 14.23.48.png]

I was expecting to see a bit more detail along the lines of the paper about the BLOOM model, called Estimating the Carbon Footprint of BLOOM, a 176B Parameter Language Model.

Is that the correct link? Thanks

OK, @bgamazay gave me a pointer to the Llama 2 paper, which has some more detail.

2.2.1 Training Hardware & Carbon Footprint

Training Hardware. We pretrained our models on Meta’s Research Super Cluster (RSC) (Lee and Sengupta, 2022) as well as internal production clusters. Both clusters use NVIDIA A100s. There are two key differences between the two clusters, with the first being the type of interconnect available: RSC uses NVIDIA Quantum InfiniBand while our production cluster is equipped with a RoCE (RDMA over converged Ethernet) solution based on commodity ethernet Switches. Both of these solutions interconnect 200 Gbps end-points. The second difference is the per-GPU power consumption cap — RSC uses 400W while our production cluster uses 350W. With this two-cluster setup, we were able to compare the suitability of these different types of interconnect for large scale training. RoCE (which is a more affordable, commercial interconnect network) can scale almost as well as expensive Infiniband up to 2000 GPUs, which makes pretraining even more democratizable.

[Screenshot 2024-08-21 at 17.35.09.png]

Carbon Footprint of Pretraining.
Following preceding research (Bender et al., 2021a; Patterson et al., 2021; Wu et al., 2022; Dodge et al., 2022) and using power consumption estimates of GPU devices and carbon efficiency, we aim to calculate the carbon emissions resulting from the pretraining of Llama 2 models. The actual power usage of a GPU is dependent on its utilization and is likely to vary from the Thermal Design Power (TDP) that we employ as an estimation for GPU power. It is important to note that our calculations do not account for further power demands, such as those from interconnect or non-GPU server power consumption, nor from datacenter cooling systems. Additionally, the carbon output related to the production of AI hardware, like GPUs, could add to the overall carbon footprint as suggested by Gupta et al. (2022b,a).

Table 2 summarizes the carbon emission for pretraining the Llama 2 family of models. A cumulative of 3.3M GPU hours of computation was performed on hardware of type A100-80GB (TDP of 400W or 350W). We estimate the total emissions for training to be 539 tCO2eq, of which 100% were directly offset by Meta’s sustainability program.∗∗ Our open release strategy also means that these pretraining costs will not need to be incurred by other companies, saving more global resources.

I guess the approach would likely be along the same lines, but the actual kit used is likely to be different, right?
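
For my own sanity check, here's a rough back-of-envelope version of that calculation, using the 3.3M GPU hours and the 400W TDP from the quote above. The grid carbon intensity is my own assumption rather than a number from the paper, so treat this as a sketch of the method as I understand it, not Meta's actual figures.

```python
# Back-of-envelope version of the Llama 2 calculation as I read it:
# emissions ≈ GPU-hours × TDP × grid carbon intensity.
# The 3.3M GPU-hours and 400W TDP are from the quoted paper; the carbon
# intensity is my own assumption, not a figure Meta has published.

gpu_hours = 3_300_000          # cumulative A100-80GB hours (from the paper)
tdp_kw = 0.400                 # per-GPU power cap used as the power estimate
carbon_intensity = 0.40        # assumed kgCO2eq per kWh (my guess)

energy_kwh = gpu_hours * tdp_kw
emissions_tco2eq = energy_kwh * carbon_intensity / 1000  # kg -> tonnes

print(f"~{energy_kwh / 1e6:.2f} GWh, ~{emissions_tco2eq:.0f} tCO2eq")
# With these assumptions it lands around ~530 tCO2eq, in the same ballpark
# as the 539 tCO2eq reported in Table 2 of the Llama 2 paper.
```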
