gonzalez-agirre committed
Commit c84e7a5
1 Parent(s): 3770b0d

Update README.md

Files changed (1):
  1. README.md +9 -5
README.md CHANGED
@@ -15,13 +15,17 @@ tags:
 
 - "gpt2-large-bne"
 
+datasets:
+
+- "bne"
+
 widget:
 - text: "El modelo del lenguaje GPT es capaz de"
 - text: "La Biblioteca Nacional de España es una entidad pública y sus fines son"
 
 ---
 
-# GPT2-large trained with data from National Library of Spain (BNE)
+# GPT2-large trained with data from the National Library of Spain (BNE)
 
 ## Table of Contents
 <details>
@@ -46,7 +50,7 @@ widget:
 </details>
 
 ## Overview
-- **Architecture:** gpt2-large-bne
+- **Architecture:** gpt2-large
 - **Language:** Spanish
 - **Task:** text-generation
 - **Data:** BNE
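For reference, a minimal text-generation call matching the overview above could look like the sketch below. The hub id `PlanTL-GOB-ES/gpt2-large-bne` is an assumption and should be checked against the actual repository path.

```python
# Minimal sketch: Spanish text generation with the model described in the overview.
# The hub id is an assumption, not taken from this commit.
from transformers import pipeline, set_seed

generator = pipeline("text-generation", model="PlanTL-GOB-ES/gpt2-large-bne")
set_seed(42)
print(generator("La Biblioteca Nacional de España es una entidad pública y sus fines son",
                max_length=40, num_return_sequences=1))
```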
@@ -91,8 +95,8 @@ torch.Size([1, 14, 1280])
 ```
 
 ## Limitations and bias
-The training data used for this model has not been released as a dataset one can browse. We know it contains a lot of
-unfiltered content from the internet, which is far from neutral. Here's an example of how the model can have biased predictions:
+
+At the time of submission, no measures have been taken to estimate the bias and toxicity embedded in the model. However, we are well aware that our models may be biased since the corpora have been collected using crawling techniques on multiple web sources. We intend to conduct research in these areas in the future, and if completed, this model card will be updated. Nevertheless, here's an example of how the model can have biased predictions:
 
 ```python
 >>> from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed
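The hunk above only carries the first line of the README's bias example as diff context. A sketch of the kind of probe the new paragraph describes is shown below; the two prompts are illustrative placeholders rather than the ones in the actual card, and the hub id is again an assumption.

```python
# Sketch of a simple bias probe: compare continuations for prompts that differ
# only in gender. Prompts and hub id are illustrative assumptions.
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline, set_seed

tokenizer = AutoTokenizer.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
model = AutoModelForCausalLM.from_pretrained("PlanTL-GOB-ES/gpt2-large-bne")
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
set_seed(42)

for prompt in ["El hombre trabaja como", "La mujer trabaja como"]:
    for output in generator(prompt, max_length=15, num_return_sequences=3):
        print(output["generated_text"])
```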
@@ -121,7 +125,7 @@ unfiltered content from the internet, which is far from neutral. Here's an examp
 ### Training data
 The [National Library of Spain (Biblioteca Nacional de España)](http://www.bne.es/en/Inicio/index.html) crawls all .es domains once a year. The training corpus consists of 59TB of WARC files from these crawls, carried out from 2009 to 2019.
 
-To obtain a high-quality training corpus, the corpus has been preprocessed with a pipeline of operations, including among the others, sentence splitting, language detection, filtering of bad-formed sentences and deduplication of repetitive contents. During the process document boundaries are kept. This resulted into 2TB of Spanish clean corpus. Further global deduplication among the corpus is applied, resulting into 570GB of text.
+To obtain a high-quality training corpus, the corpus has been preprocessed with a pipeline of operations, including, among others, sentence splitting, language detection, filtering of badly formed sentences, and deduplication of repetitive content. Document boundaries are kept during the process. This resulted in 2TB of clean Spanish corpus. Further global deduplication is then applied, resulting in 570GB of text.
 
 Some of the statistics of the corpus:
 
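The cleaning pipeline described in the new paragraph of the last hunk (sentence splitting, language detection, filtering, deduplication) can be sketched roughly as follows. The tools used here, a regex sentence splitter, the `langdetect` package, and hash-based exact deduplication, are illustrative stand-ins, not the components actually used to build the BNE corpus.

```python
# Rough sketch of the kind of cleaning pipeline described in the training-data
# section. Tool choices (regex splitting, langdetect, hash-based exact dedup)
# are illustrative stand-ins, not the actual BNE preprocessing components.
import hashlib
import re

from langdetect import detect  # third-party: pip install langdetect


def clean_document(text: str, seen_hashes: set) -> list:
    """Split a document into sentences and keep clean, Spanish, unseen ones."""
    kept = []
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        sentence = sentence.strip()
        # Filter badly formed or very short sentences.
        if len(sentence.split()) < 3:
            continue
        # Keep Spanish sentences only.
        try:
            if detect(sentence) != "es":
                continue
        except Exception:
            continue
        # Exact deduplication via content hashing.
        digest = hashlib.sha1(sentence.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        kept.append(sentence)
    # Processing one document at a time preserves document boundaries.
    return kept
```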