omkarenator committed on
Commit a1a7dfb • 1 Parent(s): e9be071

deploy at 2024-09-08 21:29:30.504038

Files changed (4)
  1. Dockerfile +10 -0
  2. main.py +187 -0
  3. requirements.txt +1 -0
  4. style.css +65 -0
Dockerfile ADDED
@@ -0,0 +1,10 @@
+ FROM python:3.10
+ WORKDIR /code
+ COPY --link --chown=1000 . .
+ RUN mkdir -p /tmp/cache/
+ RUN chmod a+rwx -R /tmp/cache/
+ ENV HF_HUB_CACHE=HF_HOME
+ RUN pip install --no-cache-dir -r requirements.txt
+
+ ENV PYTHONUNBUFFERED=1 PORT=7860
+ CMD ["python", "main.py"]
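Note: PORT=7860 matches the port a Hugging Face Docker Space routes traffic to. The commit does not show whether fasthtml's serve() reads this variable itself, so the snippet below is only an illustrative sketch of how a process could consume it; the fallback value is an assumption that mirrors the Dockerfile.

# Illustrative sketch (not part of this commit): reading the PORT variable
# exported by the Dockerfile, falling back to the same 7860 default.
import os

port = int(os.environ.get("PORT", "7860"))
print(f"App should bind to 0.0.0.0:{port}")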
main.py ADDED
@@ -0,0 +1,187 @@
+ from fasthtml_hf import setup_hf_backup
+ from fasthtml.common import *
+
+ app, rt = fast_app()
+
+
+ @rt("/")
+ def get():
+     return Html(
+         Head(
+             Meta(charset="UTF-8"),
+             Meta(name="viewport", content="width=device-width, initial-scale=1.0"),
+             Title("Simple Blog Post"),
+             Link(rel="stylesheet", href="style.css"),
+         ),
+         Body(
+             Div(
+                 Aside(
+                     H2("Table of Contents"),
+                     Ul(
+                         Li(A("Introduction", href="#section1")),
+                         Li(A("Background", href="#section2")),
+                         Li(A("Main Content", href="#section3")),
+                         Li(A("Conclusion", href="#section4")),
+                     ),
+                     cls="toc",
+                 ),
+                 Main(
+                     H1("Simple Blog Post"),
+                     Section(
+                         H2("Introduction"),
+                         P("""We are excited to introduce TxT360, a
+                         large-scale, comprehensive, and fully transparent
+                         dataset designed for Large Language Model (LLM)
+                         pre-training. TxT360 is engineered to strike a
+                         balance between the quantity and quality of
+                         pre-training data, pushing the limit on both
+                         fronts. This comprehensive dataset encompasses both
+                         expansive web-based data and highly curated data
+                         sources, making it one of the most robust LLM
+                         pre-training corpora available today. Our web data
+                         component includes 99 snapshots from Common Crawl,
+                         amassing 5.7 trillion tokens and occupying 11 TB of
+                         disk space in jsonl.gz format. On the curated side,
+                         TxT360 integrates one of the most extensive
+                         collections of high-quality sources across multiple
+                         domains, ensuring diverse and rich content; these
+                         curated sources span 14 sources across 10
+                         domains. To maintain the highest quality, we
+                         meticulously pre-processed the web data to filter
+                         out low-quality content and conducted thorough
+                         reviews of the curated sources. This process not
+                         only unified their formats but also identified and
+                         rectified any anomalies. Not only do we 100%
+                         open-source our processing scripts, but we also
+                         release the details of our data reviews, revealing
+                         the decision-making processes behind data selection
+                         and quality assurance. This level of transparency
+                         allows researchers and practitioners to fully
+                         understand the dataset's composition and make
+                         informed decisions when using TxT360 for training.
+                         Additionally, TxT360 includes detailed
+                         documentation and analysis of the data, covering
+                         distribution statistics, domain coverage, and the
+                         processing pipeline, which helps users navigate and
+                         utilize the dataset effectively. Overall, TxT360
+                         represents a significant step forward in the
+                         availability and transparency of large-scale
+                         training data for language models, setting a new
+                         standard for dataset quality and openness."""),
+                         id="section1",
+                     ),
+                     Section(
+                         H2("Background"),
+                         P(
+                             """The quality and size of a pre-training dataset
+                             play a crucial role in the performance of large
+                             language models (LLMs). The community has
+                             introduced a variety of datasets for this purpose,
+                             including purely web-based datasets like RefinedWeb
+                             [1], RedPajama-Data-V2 [2], DCLM [3], and
+                             FineWeb [4], as well as comprehensive datasets
+                             derived from multiple highly curated data sources
+                             such as The Pile [5], RedPajama-Data-V1 [6], and
+                             Dolma [7]. It is commonly known that web-based
+                             datasets provide a vast quantity of data, while
+                             highly curated multi-source datasets consistently
+                             deliver high quality and diversity, both critical
+                             for effective LLM pre-training. However, despite
+                             the advancements in both types of data, each type
+                             of dataset has its limitations. For instance, the
+                             processing scripts for the web dataset RefinedWeb,
+                             known for its high quality, are not public, and
+                             only about 10% of the entire dataset has been
+                             disclosed. Conversely, the web component of
+                             existing highly curated multi-source datasets is
+                             relatively small compared to purely web-based
+                             datasets, limiting their coverage and diversity
+                             compared to the scale of information from the
+                             internet. By integrating the extensive reach of
+                             web data with the exceptional quality of curated
+                             sources, TxT360 is crafted to meet and surpass the
+                             rigorous standards required for state-of-the-art
+                             LLM pre-training."""
+                         ),
+                         id="section2",
+                     ),
+                     Section(
+                         H2("Main Content"),
+                         P(
+
+                             """The performance of a large language model (LLM)
+                             depends heavily on the quality and size of its
+                             pretraining dataset. However, the pretraining
+                             datasets for state-of-the-art open LLMs like Llama
+                             3 and Mixtral are not publicly available and very
+                             little is known about how they were created.
+                             Reading time: 45 min. For the best reading
+                             experience, we recommend not using a mobile phone.
+                             Recently, we released 🍷 FineWeb, a new,
+                             large-scale (15-trillion tokens, 44TB disk space)
+                             dataset for LLM pretraining. FineWeb is derived
+                             from 96 CommonCrawl snapshots and produces
+                             better-performing LLMs than other open pretraining
+                             datasets. To bring more clarity to machine learning
+                             and advance the open understanding of how to train
+                             good quality large language models, we carefully
+                             documented and ablated all of the design choices
+                             used in FineWeb, including in-depth investigations
+                             of deduplication and filtering strategies. The
+                             present long-form report is a deep dive into how to
+                             create a large and high-quality web-scale dataset
+                             for LLM pretraining. The dataset itself, 🍷
+                             FineWeb, is available here. We are extremely
+                             thankful to the whole distill.pub team (Christopher
+                             Olah, Shan Carter, Ludwig Schubert in particular)
+                             for creating the template on which we based this
+                             blog post. Thanks also for inspiring us with
+                             exquisitely crafted articles and blog posts. In
+                             this report we also introduce 📚 FineWeb-Edu, a
+                             subset of FineWeb constructed using scalable
+                             automated high-quality annotations for educational
+                             value, and which outperforms all openly accessible
+                             web datasets on a number of educational benchmarks
+                             such as MMLU, ARC, and OpenBookQA. 📚 FineWeb-Edu
+                             is available in two sizes/filtering levels: 1.3
+                             trillion (very high educational content) and 5.4
+                             trillion (high educational content) tokens (all
+                             tokens measured with the GPT-2 tokenizer). You can
+                             download it here. Both datasets are released under
+                             the permissive ODC-By 1.0 license. TLDR: This blog
+                             covers a discussion on processing and evaluating
+                             data quality at scale, the 🍷 FineWeb recipe
+                             (listing and explaining all of our design choices),
+                             and the process followed to create its 📚
+                             FineWeb-Edu subset."""
+
+                         ),
+                         id="section3",
+                     ),
+                     Section(
+                         H2("Conclusion"),
+                         P("""This is the conclusion section where we
+                         summarize the key points discussed in the blog post
+                         and provide final thoughts.
+
+
+
+
+
+
+
+                         """
+                         ),
+                         id="section4",
+                     ),
+                     cls="content",
+                 ),
+                 cls="container",
+             )
+         ),
+         lang="en",
+     )
+
+
+ setup_hf_backup(app)
+ serve()
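Note: the four Section(...) blocks in main.py share the same (heading, paragraph, anchor id) shape, so they could be generated from data instead of repeated inline. The sketch below is not part of the commit; the component names come from the same `from fasthtml.common import *` used in main.py, while `make_section`, `SECTIONS`, and the shortened body strings are hypothetical.

# Illustrative sketch (not in the commit): building the repeated Section blocks
# from a list of (title, body, anchor_id) tuples.
from fasthtml.common import *

def make_section(title, body, anchor_id):
    # Mirrors the pattern used in main.py: Section(H2(...), P(...), id=...)
    return Section(H2(title), P(body), id=anchor_id)

SECTIONS = [
    ("Introduction", "We are excited to introduce TxT360 ...", "section1"),
    ("Background", "The quality and size of a pre-training dataset ...", "section2"),
    ("Main Content", "The performance of a large language model (LLM) ...", "section3"),
    ("Conclusion", "This is the conclusion section ...", "section4"),
]

# The Main(...) call could then be written as:
# Main(H1("Simple Blog Post"), *[make_section(*s) for s in SECTIONS], cls="content")

The table-of-contents links could be driven by the same SECTIONS list, which would keep the headings and the #section anchors in sync.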
requirements.txt ADDED
@@ -0,0 +1 @@
+ fasthtml-hf
style.css ADDED
@@ -0,0 +1,65 @@
+ body {
+     font-family: Arial, sans-serif;
+     margin: 0;
+     padding: 0;
+     display: flex;
+ }
+
+ .container {
+     display: flex;
+     width: 100%;
+ }
+
+ .toc {
+     width: 20%;
+     background-color: #f4f4f4;
+     padding: 20px;
+     box-shadow: 2px 0 5px rgba(0,0,0,0.1);
+     position: fixed;
+     height: 100%;
+     overflow-y: auto;
+ }
+
+ .toc h2 {
+     font-size: 1.5em;
+     margin-bottom: 10px;
+ }
+
+ .toc ul {
+     list-style-type: none;
+     padding: 0;
+ }
+
+ .toc ul li {
+     margin-bottom: 10px;
+ }
+
+ .toc ul li a {
+     text-decoration: none;
+     color: #333;
+ }
+
+ .toc ul li a:hover {
+     text-decoration: underline;
+ }
+
+ .content {
+     margin-left: 30%;
+     margin-right: 10%;
+     padding: 30px;
+     width: 80%;
+ }
+
+ .content h1 {
+     font-size: 2em;
+     margin-bottom: 20px;
+ }
+
+ .content section {
+     margin-bottom: 40px;
+ }
+
+ .content section h2 {
+     font-size: 1.5em;
+     margin-bottom: 10px;
+ }