omkarenator committed • a1a7dfb
1 Parent(s): e9be071
deploy at 2024-09-08 21:29:30.504038
Browse files:
- Dockerfile +10 -0
- main.py +187 -0
- requirements.txt +1 -0
- style.css +65 -0
Dockerfile
ADDED
@@ -0,0 +1,10 @@
+FROM python:3.10
+WORKDIR /code
+COPY --link --chown=1000 . .
+RUN mkdir -p /tmp/cache/
+RUN chmod a+rwx -R /tmp/cache/
+ENV HF_HUB_CACHE=HF_HOME
+RUN pip install --no-cache-dir -r requirements.txt
+
+ENV PYTHONUNBUFFERED=1 PORT=7860
+CMD ["python", "main.py"]
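Note on the ENV HF_HUB_CACHE=HF_HOME line: it assigns the literal string "HF_HOME" to HF_HUB_CACHE rather than a path, so the /tmp/cache/ directory prepared by the two RUN steps above is never actually used as the hub cache. The intent was presumably something like ENV HF_HOME=/tmp/cache/ or ENV HF_HUB_CACHE=/tmp/cache/, pointing the Hugging Face cache at the world-writable directory the image just created.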
main.py
ADDED
@@ -0,0 +1,187 @@
+from fasthtml_hf import setup_hf_backup
+from fasthtml.common import *
+
+app, rt = fast_app()
+
+
+@rt("/")
+def get():
+    return Html(
+        Head(
+            Meta(charset="UTF-8"),
+            Meta(name="viewport", content="width=device-width, initial-scale=1.0"),
+            Title("Simple Blog Post"),
+            Link(rel="stylesheet", href="style.css"),
+        ),
+        Body(
+            Div(
+                Aside(
+                    H2("Table of Contents"),
+                    Ul(
+                        Li(A("Introduction", href="#section1")),
+                        Li(A("Background", href="#section2")),
+                        Li(A("Main Content", href="#section3")),
+                        Li(A("Conclusion", href="#section4")),
+                    ),
+                    cls="toc",
+                ),
+                Main(
+                    H1("Simple Blog Post"),
+                    Section(
+                        H2("Introduction"),
+                        P("""We are excited to introduce TxT360, a
+                        large-scale, comprehensive, and fully transparent
+                        dataset designed for Large Language Model (LLM)
+                        pre-training. TxT360 is engineered to strike a
+                        balance between the quantity and quality of
+                        pre-training data, pushing the limit on both
+                        fronts. This comprehensive dataset encompasses both
+                        expansive web-based data and highly curated data
+                        sources, making it one of the most robust LLM
+                        pre-training corpora available today. Our web data
+                        component includes 99 snapshots from Common Crawl,
+                        amassing 5.7 trillion tokens and occupying 11 TB of
+                        disk space in jsonl.gz format. On the curated side,
+                        TxT360 integrates one of the most extensive
+                        collections of high-quality sources across multiple
+                        domains, ensuring diverse and rich content,
+                        referred to as curated sources: 14 sources across
+                        10 domains. To maintain the highest quality, we
+                        meticulously pre-processed the web data to filter
+                        out low-quality content and conducted thorough
+                        reviews of the curated sources. This process not
+                        only unified their formats but also identified and
+                        rectified any anomalies. Not only do we 100%
+                        open-source our processing scripts, but we also
+                        release the details of our data reviews, revealing
+                        the decision-making processes behind data selection
+                        and quality assurance. This level of transparency
+                        allows researchers and practitioners to fully
+                        understand the dataset's composition and make
+                        informed decisions when using TxT360 for training.
+                        Additionally, TxT360 includes detailed
+                        documentation and analysis of the data, covering
+                        distribution statistics, domain coverage, and the
+                        processing pipeline, which helps users navigate and
+                        utilize the dataset effectively. Overall, TxT360
+                        represents a significant step forward in the
+                        availability and transparency of large-scale
+                        training data for language models, setting a new
+                        standard for dataset quality and openness."""),
+                        id="section1",
+                    ),
+                    Section(
+                        H2("Background"),
+                        P("""The quality and size of a pre-training dataset
+                        play a crucial role in the performance of large
+                        language models (LLMs). The community has
+                        introduced a variety of datasets for this purpose,
+                        including purely web-based datasets like RefinedWeb
+                        [1], RedPajama-Data-V2 [2], DCLM [3], and
+                        FineWeb [4], as well as comprehensive datasets
+                        derived from multiple highly-curated data sources
+                        such as The Pile [5], RedPajama-Data-V1 [6], and
+                        Dolma [7]. It is commonly known that web-based
+                        datasets provide a vast quantity of data, while
+                        highly-curated multi-source datasets consistently
+                        deliver high quality and diversity, both critical
+                        for effective LLM pre-training. However, despite
+                        the advancements in both types of data, each type
+                        of dataset has its limitations. For instance, the
+                        processing scripts for the web dataset RefinedWeb,
+                        known for its high quality, are not public, and
+                        only about 10% of the entire dataset has been
+                        disclosed. Conversely, the web component of
+                        existing highly-curated multi-source datasets is
+                        relatively small compared to purely web-based
+                        datasets, limiting their coverage and diversity
+                        compared to the scale of information from the
+                        internet. By integrating the extensive reach of
+                        web data with the exceptional quality of curated
+                        sources, TxT360 is crafted to meet and surpass the
+                        rigorous standards required for state-of-the-art
+                        LLM pre-training."""),
+                        id="section2",
+                    ),
+                    Section(
+                        H2("Main Content"),
+                        P("""The performance of a large language model (LLM)
+                        depends heavily on the quality and size of its
+                        pretraining dataset. However, the pretraining
+                        datasets for state-of-the-art open LLMs like Llama
+                        3 and Mixtral are not publicly available, and very
+                        little is known about how they were created.
+                        Reading time: 45 min. For the best reading
+                        experience, we recommend not using a mobile phone.
+                        Recently, we released 🍷 FineWeb, a new,
+                        large-scale (15-trillion tokens, 44TB disk space)
+                        dataset for LLM pretraining. FineWeb is derived
+                        from 96 CommonCrawl snapshots and produces
+                        better-performing LLMs than other open pretraining
+                        datasets. To bring more clarity to machine learning
+                        and advance the open understanding of how to train
+                        good-quality large language models, we carefully
+                        documented and ablated all of the design choices
+                        used in FineWeb, including in-depth investigations
+                        of deduplication and filtering strategies. The
+                        present long-form report is a deep dive into how to
+                        create a large and high-quality web-scale dataset
+                        for LLM pretraining. The dataset itself, 🍷
+                        FineWeb, is available here. We are extremely
+                        thankful to the whole distill.pub team (Christopher
+                        Olah, Shan Carter, and Ludwig Schubert in
+                        particular) for creating the template on which we
+                        based this blog post, and for inspiring us with
+                        their exquisitely crafted articles and blog posts.
+                        In this report we also introduce 📚 FineWeb-Edu, a
+                        subset of FineWeb constructed using scalable
+                        automated high-quality annotations for educational
+                        value, which outperforms all openly accessible
+                        web datasets on a number of educational benchmarks
+                        such as MMLU, ARC, and OpenBookQA. 📚 FineWeb-Edu
+                        is available in two sizes/filtering levels: 1.3
+                        trillion (very high educational content) and 5.4
+                        trillion (high educational content) tokens (all
+                        tokens are measured with the GPT2 tokenizer). You
+                        can download it here. Both datasets are released
+                        under the permissive ODC-By 1.0 license. TLDR: This
+                        blog covers a discussion on processing and
+                        evaluating data quality at scale, the 🍷 FineWeb
+                        recipe (listing and explaining all of our design
+                        choices), and the process followed to create its
+                        📚 FineWeb-Edu subset."""),
+                        id="section3",
+                    ),
+                    Section(
+                        H2("Conclusion"),
+                        P("""This is the conclusion section where we
+                        summarize the key points discussed in the blog post
+                        and provide final thoughts."""),
+                        id="section4",
+                    ),
+                    cls="content",
+                ),
+                cls="container",
+            )
+        ),
+        lang="en",
+    )
+
+
+setup_hf_backup(app)
+serve()
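The four TOC entries and the four Section blocks in main.py repeat the same anchor/id pairing by hand. A minimal refactoring sketch (the names SECTIONS, make_toc, and make_sections are illustrative only, not part of this commit) would derive both from one list so the #sectionN anchors can never drift out of sync with the section ids:

# Sketch only -- not part of the committed file.
from fasthtml.common import *

SECTIONS = [
    ("Introduction", "Intro text goes here..."),
    ("Background", "Background text goes here..."),
    ("Main Content", "Main text goes here..."),
    ("Conclusion", "Concluding text goes here..."),
]

def make_toc(sections):
    # One linked TOC entry per section, using the same #sectionN
    # anchors that make_sections() assigns as ids below.
    return Aside(
        H2("Table of Contents"),
        Ul(*[Li(A(title, href=f"#section{i}"))
             for i, (title, _) in enumerate(sections, start=1)]),
        cls="toc",
    )

def make_sections(sections):
    # One Section per (title, body) pair, id matching the TOC anchor.
    return [Section(H2(title), P(body), id=f"section{i}")
            for i, (title, body) in enumerate(sections, start=1)]

get() could then build its Div from make_toc(SECTIONS) and Main(H1("Simple Blog Post"), *make_sections(SECTIONS), cls="content") with no duplicated anchor strings.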
requirements.txt
ADDED
@@ -0,0 +1 @@
+fasthtml-hf
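The only requirement is fasthtml-hf, which provides the setup_hf_backup helper imported in main.py; presumably it pulls in the fasthtml package transitively, since nothing here installs the fasthtml.common module directly.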
style.css
ADDED
@@ -0,0 +1,65 @@
+body {
+    font-family: Arial, sans-serif;
+    margin: 0;
+    padding: 0;
+    display: flex;
+}
+
+.container {
+    display: flex;
+    width: 100%;
+}
+
+.toc {
+    width: 20%;
+    background-color: #f4f4f4;
+    padding: 20px;
+    box-shadow: 2px 0 5px rgba(0,0,0,0.1);
+    position: fixed;
+    height: 100%;
+    overflow-y: auto;
+}
+
+.toc h2 {
+    font-size: 1.5em;
+    margin-bottom: 10px;
+}
+
+.toc ul {
+    list-style-type: none;
+    padding: 0;
+}
+
+.toc ul li {
+    margin-bottom: 10px;
+}
+
+.toc ul li a {
+    text-decoration: none;
+    color: #333;
+}
+
+.toc ul li a:hover {
+    text-decoration: underline;
+}
+
+.content {
+    margin-left: 30%;
+    margin-right: 10%;
+    padding: 30px;
+    width: 80%;
+}
+
+.content h1 {
+    font-size: 2em;
+    margin-bottom: 20px;
+}
+
+.content section {
+    margin-bottom: 40px;
+}
+
+.content section h2 {
+    font-size: 1.5em;
+    margin-bottom: 10px;
+}