Spaces: LLM360/TxT360

victormiller committed on Oct 4

Commit 2f958f8 • 1 Parent(s): 48d8ec3

Update web.py

Files changed (1)

web.py +13 -6

@@ -242,6 +242,7 @@ attrs.fraction_of_characters_in_duplicate_lines = sum(
 def web_data():
     return Div(
         Div(
         H2("Common Crawl Snapshot Processing"),
         H3("What This Section Contains"),
@@ -287,6 +288,8 @@ def web_data():
             margin-bottom: 15px
         """,
         ),
         H3("TxT360 CommonCrawl Filtering vs Other Pretraining Datasets"),
         P("The following section provides explicit details covering the reasoning and decisions behind each of the filters we applied. The table below provides a high-level comparison of TxT360's filtering compared to other commonly used pretraining datasets."),
         table_div_filter_data,
@@ -325,8 +328,9 @@ def web_data():
  #       P("Following C4, we remove any page where the phrase “lorem ipsum” appears since some pages have placeholder “lorem ipsum” text."),
-        H2("Stage 1: Document Preparation"),
         P(B("Text Extraction: "), """
@@ -486,8 +490,9 @@ def web_data():
             """,
         ),
-        H2("2. Line-Level Removal"),
         P("""
         Before filtering low-quality documents, we perform the line-level removal to remove low-quality lines.
         This ensured that computing quality signals would align with the final kept texts.
@@ -599,8 +604,9 @@ def web_data():
             margin-bottom: 15px
             """,
         ),
-        H2("3. Document-Level Filtering"),
         P("""
         In this section, we introduce each quality signal used to filter out low-quality documents.
         """),
@@ -1660,4 +1666,5 @@ def web_data():
             margin-bottom: 15px
             """,
         ),
     )

 def web_data():
     return Div(
+        Section(
         Div(
         H2("Common Crawl Snapshot Processing"),
         H3("What This Section Contains"),
             margin-bottom: 15px
         """,
         ),
+        id="section1",),
+        Section(
         H3("TxT360 CommonCrawl Filtering vs Other Pretraining Datasets"),
         P("The following section provides explicit details covering the reasoning and decisions behind each of the filters we applied. The table below provides a high-level comparison of TxT360's filtering compared to other commonly used pretraining datasets."),
         table_div_filter_data,
  #       P("Following C4, we remove any page where the phrase “lorem ipsum” appears since some pages have placeholder “lorem ipsum” text."),
+        id="section2",),
+        Section(
+        H2("Document Preparation"),
         P(B("Text Extraction: "), """
             """,
         ),
+        id="section3",),
+        Section(
+        H2("Line-Level Removal"),
         P("""
         Before filtering low-quality documents, we perform the line-level removal to remove low-quality lines.
         This ensured that computing quality signals would align with the final kept texts.
             margin-bottom: 15px
             """,
         ),
+        id="section4",),
+        Section(
+        H2("Document-Level Filtering"),
         P("""
         In this section, we introduce each quality signal used to filter out low-quality documents.
         """),
             margin-bottom: 15px
             """,
         ),
+        id="section5",),
     )
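
The change above wraps each part of `web_data()` in a `Section(...)` and gives it an `id` of `section1` through `section5`, presumably so that each part can be targeted by in-page anchors or styling (the commit itself does not say why). The sketch below illustrates that pattern with the standard library only. The `tag` and `numbered_sections` helpers are hypothetical stand-ins written for this example, not part of `web.py` or of the component library it uses. Note that, as in the diff, the `id` keyword comes after the positional children.

```python
from html import escape


def tag(name, *children, **attrs):
    """Render a minimal HTML element from already-rendered child strings.

    Hypothetical stand-in for components such as Section, H2, or Div.
    """
    attr_text = "".join(f' {key}="{escape(str(value))}"' for key, value in attrs.items())
    return f"<{name}{attr_text}>{''.join(children)}</{name}>"


def numbered_sections(blocks):
    """Wrap each group of children in <section id="sectionN">, counting from 1."""
    return [
        tag("section", *children, id=f"section{index}")
        for index, children in enumerate(blocks, start=1)
    ]


# Headings taken from the diff; the real sections contain much more content.
blocks = [
    [tag("h2", "Common Crawl Snapshot Processing")],
    [tag("h3", "TxT360 CommonCrawl Filtering vs Other Pretraining Datasets")],
    [tag("h2", "Document Preparation")],
    [tag("h2", "Line-Level Removal")],
    [tag("h2", "Document-Level Filtering")],
]

sections = numbered_sections(blocks)

# Assumed use case: anchor links that jump to each section by its id.
links = [tag("a", f"Section {i}", href=f"#section{i}") for i in range(1, len(sections) + 1)]
page = tag("div", *links, *sections)
print(page)
```

Generating the ids from the position in the list keeps them consistent if a section is added or reordered, whereas the hand-written ids in the commit have to be renumbered manually.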