Commit 4749869 by phyloforfun (1 parent: bd72568)
add mammal prompt, fix bug
- .gitignore +1 -0
- custom_prompts/FMNH_mammals.yaml +179 -0
- requirements.txt +0 -0
.gitignore
CHANGED
@@ -17,6 +17,7 @@ vouchervision/LLM_MistralAI_Azure_endpoints.py
 !/custom_prompts/SLTPvB_long.yaml
 !/custom_prompts/SLTPvB_medium.yaml
 !/custom_prompts/SLTPvB_short.yaml
+!/custom_prompts/FMNH_mammals.yaml
 
 # Dirs
 custom_prompts_deprecated/
custom_prompts/FMNH_mammals.yaml
ADDED
@@ -0,0 +1,179 @@
+prompt_author: Will Weaver, Kendall Fitzgerald
+prompt_author_institution: University of Michigan, Field Museum of Natural History
+prompt_name: FMNH_mammals_test6
+prompt_version: v-6
+prompt_description: Prompt developed by the University of Michigan. Adapted from SLTPvM.
+  SLTPvB prompts all have standardized column headers (fields) that were chosen due
+  to their reliability and prevalence in herbarium records. All field descriptions
+  are based on the official Darwin Core guidelines. SLTPvB_long - The most verbose
+  prompt option. Descriptions closely follow DwC guides. Detailed rules for the LLM
+  to follow. Works best with double or triple OCR to increase attention back to the
+  OCR (select 'use both OCR models' or 'handwritten + printed' along with trOCR).
+  SLTPvB_medium - Shorter version of _long. SLTPvB_short - The least verbose possible
+  prompt while still providing rules and DwC descriptions.
+LLM: General Purpose
+instructions: 1. Refactor the unstructured OCR text into a dictionary based on the
+  JSON structure outlined below. 2. Map the unstructured OCR text to the appropriate
+  JSON key and populate the field given the user-defined rules. 3. JSON key values
+  are permitted to remain empty strings if the corresponding information is not found
+  in the unstructured OCR text. 4. Duplicate dictionary fields are not allowed. 5.
+  Ensure all JSON keys are in camel case. 6. Ensure new JSON field values follow sentence
+  case capitalization. 7. Ensure all key-value pairs in the JSON dictionary strictly
+  adhere to the format and data types specified in the template. 8. Ensure output
+  JSON string is valid JSON format. It should not have trailing commas or unquoted
+  keys. 9. Only return a JSON dictionary represented as a string. You should not explain
+  your answer.
+json_formatting_instructions: This section provides rules for formatting each JSON
+  value organized by the JSON key.
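Review note: instructions 3, 4, 8 and 9 above can be checked mechanically once the model replies. The following is a minimal sketch of such a check, not code from this repository; the function name `parse_llm_output` and the `expected_keys` argument are hypothetical and would come from the prompt's `rules` section.

```python
import json


def parse_llm_output(raw: str, expected_keys: list[str]) -> dict:
    """Parse a model reply and enforce instructions 3, 4, 8 and 9 (hypothetical helper)."""

    def reject_duplicates(pairs):
        # Instruction 4: duplicate dictionary fields are not allowed.
        seen = {}
        for key, value in pairs:
            if key in seen:
                raise ValueError(f"duplicate key: {key}")
            seen[key] = value
        return seen

    # Instructions 8 and 9: the reply must be valid JSON and nothing else.
    # json.loads raises json.JSONDecodeError (a ValueError subclass) on
    # trailing commas, unquoted keys, or surrounding explanation text.
    record = json.loads(raw, object_pairs_hook=reject_duplicates)
    if not isinstance(record, dict):
        raise ValueError("top-level JSON value must be an object")

    unexpected = sorted(set(record) - set(expected_keys))
    if unexpected:
        raise ValueError(f"unexpected keys: {unexpected}")

    # Instruction 3: fields not found in the OCR text stay empty strings.
    return {key: record.get(key, "") for key in expected_keys}
```

Rejecting duplicates through `object_pairs_hook` matters because `json.loads` otherwise keeps the last value for a repeated key without raising.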
+rules:
+  catalogNumber: Barcode identifier, typically a number with at least 6 digits, but
+    fewer than 30 digits.
+  scientificName: The scientific name of the taxon including genus, specific epithet,
+    and any lower classifications. Occasionally, the genus or specific epithet will
+    be crossed out with pen or pencil and the correct genus or specific epithet name will
+    be written above it. In this case, use the text written above the crossed-out
+    text.
+  genus: Taxonomic determination to genus. Genus must be capitalized. If genus is
+    not present use the taxonomic family name followed by the word 'indet'. Occasionally,
+    the genus name will be crossed out with pen or pencil and the correct genus name
+    will be written above it. In this case, use the name written above the crossed-out
+    name.
+  specificEpithet: The name of the species epithet of the scientificName. Only include
+    the species epithet. Occasionally, the specific epithet name will be crossed out
+    with pen or pencil and the correct specific epithet name will be written above
+    it. In this case, use the name written above the crossed-out name.
+  speciesNameAuthorship: The authorship information for the scientificName formatted
+    according to the conventions of the applicable Darwin Core nomenclatural code.
+  collectedBy: A comma separated list of names of people, groups, or organizations
+    responsible for observing, recording, collecting, or presenting the original specimen.
+    The primary collector or observer should be listed first.
+  collectorNumber: An identifier given to the occurrence at the time it was recorded,
+    the specimen collector's number. It is often written vertically on the edge of
+    the paper tag, with a line separating it from other information. It is often written
+    in the y-axis orientation while the rest of the numbers, data and text are written
+    in the x-axis orientation. It is sometimes written next to the sex symbol or next
+    to the collector name or initials.
+  identifiedBy: A comma separated list of names of people, groups, or organizations
+    who assigned the taxon to the subject organism. This is not the specimen collector.
+  verbatimCollectionDate: The verbatim original representation of the date and time
+    information for when the specimen was collected. Date of collection exactly as
+    it appears on the label. Do not change the format or correct typos.
+  collectionDate: Date the specimen was collected formatted as year-month-day, YYYY-MM-DD.
+    If specific components of the date are unknown, they should be replaced with zeros.
+    Use 0000-00-00 if the entire date is unknown, YYYY-00-00 if only the year is known,
+    and YYYY-MM-00 if year and month are known but day is not.
+  collectionDateEnd: If a range of collection dates is provided, this is the later
+    end date while collectionDate is the beginning date. Use the same formatting as
+    for collectionDate.
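Review note: the zero-filling convention in the `collectionDate` and `collectionDateEnd` rules can be illustrated with a short sketch. The helper names below are hypothetical and are not part of this commit; they only show the format the rule asks the model to produce and a way to check it afterwards.

```python
import re
from typing import Optional

# Shape required by the rule: four digits, two digits, two digits.
_DATE_SHAPE = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def format_partial_date(year: Optional[int] = None,
                        month: Optional[int] = None,
                        day: Optional[int] = None) -> str:
    """Render a possibly incomplete date, replacing unknown parts with zeros."""
    return f"{year or 0:04d}-{month or 0:02d}-{day or 0:02d}"


def has_collection_date_shape(value: str) -> bool:
    """Check only the YYYY-MM-DD shape; zeros are allowed for unknown parts."""
    return bool(_DATE_SHAPE.match(value))
```

A standard date parser would reject values such as `1932-00-00`, which is why the check above tests only the shape of the string.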
+  occurrenceRemarks: Verbatim text describing the specimen's geographic location. Text
+    describing the appearance of the specimen. A statement about the presence or absence
+    of a taxon at the collection location. Text describing the significance of the
+    specimen, such as a specific expedition or notable collection. Description of
+    mammal features such as size, color, wellbeing, molting pattern, smell and any
+    other distinguishing morphological or physiological characteristics.
+  habitat: Verbatim category or description of the habitat in which the specimen collection
+    event occurred.
+  country: The name of the country or major administrative unit in which the specimen
+    was originally collected.
+  stateProvince: The name of the next smaller administrative region than country (state,
+    province, canton, department, region, etc.) in which the specimen was originally
+    collected.
+  county: The full, unabbreviated name of the next smaller administrative region than
+    stateProvince (county, shire, department, parish etc.) in which the specimen was
+    originally collected.
+  locality: Description of geographic location, landscape, landmarks, regional features,
+    nearby places, municipality, city, or any contextual information aiding in pinpointing
+    the exact origin or location of the specimen.
+  verbatimCoordinates: Verbatim location coordinates as they appear on the label.
+    Do not convert formats. Possible coordinate types include [Lat, Long, UTM, TRS].
+  decimalLatitude: Latitude decimal coordinate. Correct and convert the verbatim location
+    coordinates to conform with the decimal degrees GPS coordinate format.
+  decimalLongitude: Longitude decimal coordinate. Correct and convert the verbatim
+    location coordinates to conform with the decimal degrees GPS coordinate format.
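Review note: the `decimalLatitude` and `decimalLongitude` rules ask the model to convert verbatim coordinates to decimal degrees. A minimal sketch of that conversion for the degrees-minutes-seconds case follows, which may be useful for spot-checking model output. The function name is hypothetical, and the sketch assumes a hemisphere letter follows the value; UTM and TRS coordinates need a different conversion and are not handled.

```python
import re

# Degrees, optional minutes, optional seconds, then a hemisphere letter.
_DMS = re.compile(
    r"(\d+(?:\.\d+)?)\s*°\s*"
    r"(?:(\d+(?:\.\d+)?)\s*['′]\s*)?"
    r'(?:(\d+(?:\.\d+)?)\s*["″]\s*)?'
    r"([NSEW])",
    re.IGNORECASE,
)


def dms_to_decimal(text: str) -> float:
    """Convert one degrees-minutes-seconds coordinate to signed decimal degrees."""
    match = _DMS.search(text)
    if match is None:
        raise ValueError(f"no DMS coordinate found in {text!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = float(degrees) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
    # South latitudes and west longitudes are negative in decimal degrees.
    if hemisphere.upper() in ("S", "W"):
        value = -value
    return round(value, 6)
```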
+  elevationUnits: Use m if the final elevation is reported in meters. Use ft if the
+    final elevation is in feet. Units should match elevation.
+  measurementsTL: The total length of the animal from snout to tip of the tail. This
+    is usually a 3 digit number. It is the first number in a string of 3 or 4 measurement
+    numbers that are usually separated by dashes, commas or spaces or are sometimes
+    written vertically in the same order. This total length measurement will be the
+    largest number in the series of 3 or 4 measurement numbers.
+  measurementsTV: The length of the tail vertebrae of the animal from the first tail
+    vertebra to the last tail vertebra. This is usually a 1 to 3 digit number.
+    It is the second number in a string of 3 or 4 measurement
+    numbers that are usually separated by dashes, commas or spaces or are sometimes
+    written vertically in the same order.
+  measurementsHF: The length of the hindfoot of the animal with claw (H.F. cu) from
+    the ankle to the tip of the longest claw. This usually has at least 2 digits
+    and at most 3 digits. It is the third number in a string of 3 or 4
+    measurement numbers that are usually separated by dashes, commas or spaces or
+    are sometimes written vertically in the same order.
+  measurementsEAR: The length of the ear of the animal. This is usually a 1 to 3 digit
+    number. It is usually the fourth number in a string of 3 or 4 measurement numbers
+    that are usually separated by dashes, commas or spaces or are sometimes written
+    vertically in the same order.
+  measurementsWEIGHT: The weight of the animal. This is usually a 1 to 3 digit number.
+    It is sometimes preceded by an equal sign and/or followed by the letter g which
+    stands for the unit of grams. It is sometimes followed or preceded by the letters
+    lbs for the unit of pounds.
+  catalogNumberFMNH: Barcode identifier, typically a number with at least 3 digits,
+    but fewer than 8 digits. It is typically preceded by or near the words Field Museum,
+    FM, FMNH, or CNMH.
+  collectionMethod: Mammals are sometimes intentionally caught by collectors, brought
+    to collectors as roadkill or brought to collectors after being killed as a pest.
+    Text description may include description of how the animal was killed, for example
+    as roadkill or in a trap or by a hunter. Record that information verbatim here.
+  measurementsTLunits: Use mm if the Total Length is recorded in millimeters. Use
+    in if the Total Length is recorded in inches. Units should match measurementsTVunits
+    and measurementsHFunits and measurementsEARunits.
+  measurementsTVunits: Use mm if the Tail Length is recorded in millimeters. Use in
+    if the Tail Length is recorded in inches. Units should match measurementsTLunits
+    and measurementsHFunits and measurementsEARunits.
+  measurementsHFunits: Use mm if the hindfoot length is recorded in millimeters. Use
+    in if the hindfoot length is recorded in inches. Units should match measurementsTVunits
+    and measurementsTLunits and measurementsEARunits.
+  measurementsEARunits: Use mm if the ear length is recorded in millimeters. Use in
+    if the ear length is recorded in inches. Units should match measurementsTVunits
+    and measurementsTLunits and measurementsHFunits.
+  measurementsWEIGHTunits: Use g if the weight is recorded in grams. Use lbs
+    if the weight is recorded in pounds.
+  elevation: Elevation or altitude in meters or feet.
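Review note: the measurement rules above describe a positional convention: total length, tail, hindfoot, then optionally ear, with a weight that may follow an equal sign or carry a `g` or `lbs` suffix. The sketch below shows one way to split such a string into the prompt's field names. It is an illustration under assumptions, not the project's parser: the function name is hypothetical, values are kept as strings, and a weight is recognized only when a unit suffix is present.

```python
import re

_WEIGHT = re.compile(r"=?\s*(\d+(?:\.\d+)?)\s*(g|lbs)\b", re.IGNORECASE)
_NUMBER = re.compile(r"\d+(?:\.\d+)?")
_LENGTH_FIELDS = ("measurementsTL", "measurementsTV", "measurementsHF", "measurementsEAR")


def parse_measurements(text: str) -> dict:
    """Split a tag string such as '215-98-27-15=45g' into the prompt's fields."""
    result = {name: "" for name in _LENGTH_FIELDS}
    result["measurementsWEIGHT"] = ""
    result["measurementsWEIGHTunits"] = ""

    # Pull the weight out first so it is not mistaken for a length.
    weight = _WEIGHT.search(text)
    if weight:
        result["measurementsWEIGHT"] = weight.group(1)
        result["measurementsWEIGHTunits"] = weight.group(2).lower()
        text = text[:weight.start()] + text[weight.end():]

    # The rules expect a series of 3 or 4 numbers; anything else is left empty.
    numbers = _NUMBER.findall(text)
    if len(numbers) in (3, 4):
        for name, value in zip(_LENGTH_FIELDS, numbers):
            result[name] = value
    return result
```

Because the rules state that total length is the largest number in the series, a caller could additionally flag any result where `measurementsTL` is not the maximum of the parsed lengths.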
+mapping:
+  TAXONOMY:
+  - catalogNumber
+  - scientificName
+  - genus
+  - specificEpithet
+  - speciesNameAuthorship
+  - collectedBy
+  - collectorNumber
+  - identifiedBy
+  - catalogNumberFMNH
+  GEOGRAPHY:
+  - country
+  - stateProvince
+  - county
+  - locality
+  - verbatimCoordinates
+  - decimalLatitude
+  - decimalLongitude
+  - elevationUnits
+  - elevation
+  COLLECTING:
+  - verbatimCollectionDate
+  - collectionDate
+  - collectionDateEnd
+  - habitat
+  - occurrenceRemarks
+  - collectionMethod
+  LOCALITY: []
+  MISC:
+  - measurementsTL
+  - measurementsTV
+  - measurementsEAR
+  - measurementsHF
+  - measurementsWEIGHT
+  - measurementsTLunits
+  - measurementsTVunits
+  - measurementsHFunits
+  - measurementsEARunits
+  - measurementsWEIGHTunits
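Review note: every field named under `mapping` should have a matching entry under `rules`, and each rule should appear in exactly one mapping group. A minimal sketch of that consistency check follows. The function name is hypothetical, and the sketch takes plain dictionaries so it runs on the standard library alone; in practice the two dictionaries would come from parsing the prompt file with a YAML library.

```python
def check_prompt_mapping(rules: dict, mapping: dict) -> dict:
    """Report fields that are unmapped, mapped without a rule, or mapped twice."""
    mapped = [field for fields in mapping.values() for field in fields]
    return {
        "unmapped_rules": sorted(set(rules) - set(mapped)),
        "mapped_without_rule": sorted(set(mapped) - set(rules)),
        "mapped_twice": sorted({f for f in mapped if mapped.count(f) > 1}),
    }


# Small excerpt in the same shape as the prompt file, with one deliberate gap.
example_rules = {"genus": "...", "country": "...", "elevation": "..."}
example_mapping = {
    "TAXONOMY": ["genus"],
    "GEOGRAPHY": ["country"],
    "LOCALITY": [],
}
report = check_prompt_mapping(example_rules, example_mapping)
print(report["unmapped_rules"])
```

An empty group such as `LOCALITY: []` contributes no fields, so it passes the check without special handling.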
requirements.txt
CHANGED
Binary files a/requirements.txt and b/requirements.txt differ