How to convert data to the format required by the model for training on a custom dataset

#7
by Kushrjain - opened

I need to convert my NER-tagged dataset to the 4-column CoNLL format. Could you provide a script to do this?
Thank you

did you solve this?

You can start with this code. It assumes each line of the input file contains an IP address (v4 or v6) wrapped in inline tags such as <B-IPV4>...</B-IPV4>, and it converts the file into a simple two-column CoNLL format (token and tag).
'''
import re

# Matches one inline tag such as <B-IPV4>192.168.0.1</B-IPV4>.
TAG_RE = re.compile(r'<B-(IPV[46])>(.*?)</B-\1>')


def convert_to_conll(input_file, output_file):
    with open(input_file, 'r', encoding='utf-8') as f:
        lines = f.readlines()

    with open(output_file, 'w', encoding='utf-8') as f:
        for line in lines:
            line = line.strip().rstrip(',')
            if not line:
                continue
            # Only the first tagged IP address on a line is handled.
            match = TAG_RE.search(line)
            ip_type, ip_address = match.groups() if match else (None, None)
            if match:
                # Replace the tagged span with the bare IP address as its own token.
                line = line.replace(match.group(0), f' {ip_address} ')
            for token in line.split():
                if match and token == ip_address:  # the IP address
                    f.write(f"{token} B-{ip_type}\n")
                else:
                    f.write(f"{token} O\n")
            # A blank line separates sentences.
            f.write("\n")


convert_to_conll('input.txt', 'output.conll')
'''
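
The script above writes two columns (token and tag). If your training code expects the four-column CoNLL-2003 layout (token, POS tag, chunk tag, NER tag), one option is to fill the two middle columns with a placeholder. This is only a sketch: it assumes your training script ignores the POS and chunk columns, which I have not checked for this model. The function name `to_conll_4col` and the `-X-` placeholder are illustrative choices, not part of any library.

```python
def to_conll_4col(sentences, placeholder="-X-"):
    """Render sentences as four-column CoNLL text.

    sentences: a list of sentences, each a list of (token, ner_tag) pairs.
    placeholder: value written to the POS and chunk columns (an assumption).
    """
    blocks = []
    for sentence in sentences:
        rows = [
            f"{token} {placeholder} {placeholder} {tag}"
            for token, tag in sentence
        ]
        blocks.append("\n".join(rows))
    # Sentences are separated by a blank line; the text ends with a newline.
    return "\n\n".join(blocks) + "\n"


# Hypothetical example data, two short sentences.
example = [
    [("192.168.0.1", "B-IPV4"), ("is", "O"), ("down", "O")],
    [("ping", "O"), ("::1", "B-IPV6")],
]
print(to_conll_4col(example))
```

If you have real POS or chunk tags, you would write them instead of the placeholder; the column order stays the same.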
