---
license: apache-2.0
language:
- multilingual
library_name: transformers
tags:
- climate
---


Multilingual version of [CatastroBERT](https://huggingface.co/epfl-dhlab/CatastroBERT).



# CatastroBERT: a model for extreme weather event detection in French text

This model aims to facilitate the detection of paragraphs or articles relevant to extreme weather events
in French text. It is based on the [camembert-base](https://huggingface.co/camembert-base) model and was trained on manually annotated data (article summaries) from the Gazette de Lausanne archives collected by [impresso](https://impresso-project.ch/).

<div align=center>
    <img src="bert_illustration.png" width="500" height="500" />
</div>

## Model Description

- **Developed by:** Lucas Nicolas
- **Language(s) (NLP):** French
- **Finetuned from model:** [camembert-base](https://huggingface.co/camembert-base) (RoBERTa checkpoint)

- **Repository:** Check the [CatastroBERT](https://github.com/dh-epfl-students/dhlab-CatastroBERT) GitHub page for more usage examples and information.

## Usage

### In Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "epfl-dhlab/CatastroBERT"
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the tokenizer and the fine-tuned classifier
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.to(device)  # keep the model on the same device as the inputs
model.eval()      # disable dropout for inference

def predict(text):
    # Tokenize the text, truncating anything beyond 512 tokens
    inputs = tokenizer(
        text,
        add_special_tokens=True,
        padding=True,
        max_length=512,
        truncation=True,
        return_tensors="pt",
    )

    # Move the input tensors to the model's device
    ids = inputs["input_ids"].to(device)
    mask = inputs["attention_mask"].to(device)

    # Get predictions
    with torch.no_grad():
        outputs = model(input_ids=ids, attention_mask=mask)
        logits = outputs.logits

    # Apply sigmoid function to get probabilities
    probs = torch.sigmoid(logits).cpu().numpy()

    # Return the probability of the class (1)
    return probs[0][0]

# Example usage
text = "Un violent ouragan du sud-ouest est passé cette nuit sur Lausanne."
print(f"Prediction: {predict(text)}")
```
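Since inputs longer than 512 tokens are truncated, a long article may be easier to handle paragraph by paragraph. The following is a minimal sketch of that idea, not part of the original model card: `flag_paragraphs`, `dummy_score`, the blank-line paragraph split, and the 0.5 threshold are all illustrative assumptions. The stand-in scorer only exists so the snippet runs without downloading the model; in practice you would pass the `predict` function defined above.

```python
from typing import Callable, List, Tuple


def flag_paragraphs(
    article: str,
    score_fn: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, not taken from the model card
) -> List[Tuple[str, float]]:
    """Return (paragraph, score) pairs whose score reaches the threshold."""
    # Split on blank lines and drop empty fragments
    paragraphs = [p.strip() for p in article.split("\n\n") if p.strip()]
    flagged = []
    for paragraph in paragraphs:
        score = float(score_fn(paragraph))
        if score >= threshold:
            flagged.append((paragraph, score))
    return flagged


# Hypothetical stand-in scorer so the sketch runs offline;
# replace it with `predict` to use the real model.
def dummy_score(text: str) -> float:
    return 0.9 if "ouragan" in text else 0.1


article = (
    "Un violent ouragan du sud-ouest est passé cette nuit sur Lausanne.\n\n"
    "Le conseil communal s'est réuni hier soir."
)
for paragraph, score in flag_paragraphs(article, dummy_score):
    print(f"{score:.2f}  {paragraph}")
```

The threshold trades recall for precision; it is worth tuning on your own annotated sample rather than relying on the default used here.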

## Training Data

This model was trained on a manually annotated dataset of article summaries curated from the Gazette de Lausanne archives collected by the [impresso](https://impresso-project.ch/) project. The dataset is composed of 4500 article summaries, of which 3500 were used for training and 1000 for validation.
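The card gives only the split sizes (roughly 78% training, 22% validation) and does not say how the split was made. As an illustration only, the sketch below shows one way to produce a reproducible split of that size over item indices; the shuffling strategy, the `split_indices` helper, and the seed are assumptions, not the author's procedure.

```python
import random

# Sizes stated in the model card
N_TOTAL, N_TRAIN = 4500, 3500


def split_indices(n_total: int, n_train: int, seed: int = 42):
    """Shuffle indices deterministically and cut them into train/validation."""
    indices = list(range(n_total))
    random.Random(seed).shuffle(indices)  # the seed is an arbitrary choice
    return indices[:n_train], indices[n_train:]


train_idx, val_idx = split_indices(N_TOTAL, N_TRAIN)
print(len(train_idx), len(val_idx))
print(f"validation share: {len(val_idx) / N_TOTAL:.1%}")
```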

## Environmental Impact

Carbon emissions estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** RTX 3090
- **Hours used:** 26
- **Carbon Emitted:** 0.07 kg CO2