Formatting unstructured data using OpenAI API and Python

OpenAI provides a simple API for its awesome products, including ChatGPT. One of the cases OpenAI can be helpful with is formatting data that is either poorly formatted or has a format that can’t be easily parsed.

Let’s suppose we have the following piece of text data:

John Doe's phone is: +123435667656, he's a CEO of the company.
His vice president is Samantha Doe (sam@company.com, +323438562342).
Seniour engineer is Daniel Brown (dan@company.com).
Chief of Marketing is Mira Clasko, can be contacted by m@company.com.

We can ask OpenAI to create CSV from this data with certain columns. We’ll use the simple prompt so AI knows how to structure the data:

import openai

def as_csv(data, columns):
  openai.api_key = '<YOUR-API_KEY>'
  prompt = 'Format given text in CSV with header and following columns: ' + ', '.join(columns) + '. Return only CSV.' + "\n\n"

  return openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt + data,
    max_tokens=3900,
    temperature=0
  )['choices'][0]['text'].strip()

data = open('unformatted.txt').read()
csv = as_csv(data, ['person name', 'position', 'phone', 'email'])

print(csv)
Name, Phone, Position, Email
John Doe, +123435667656, CEO, 
Samantha Doe, +323438562342, Vice President, sam@company.com
Daniel Brown, , Seniour Engineer, dan@company.com
Mira Clasko, , Chief of Marketing, m@company.com

Awesome, we now have CSV we can work with instead of unstructured text. But let’s do some improvements to make sure we get what we expect.

CSV quoting, delimiters, and empty values rules

By tweaking our prompt, we can ensure the output CSV is well formatted and follows our requirements:

...
  prompt = """Format given text in CSV 
  (doublequoted, delimited by comma, use "N/A" for empty values)
  with with header and following columns: """ + ','.join(columns) + '. Return only CSV.' + "\n\n"
...
"Name","Position","Phone","Email"
"John Doe","CEO","+123435667656","N/A"
"Samantha Doe","Vice President","+323438562342","sam@company.com"
"Daniel Brown","Seniour Engineer","N/A","dan@company.com"
"Mira Clasko","Chief of Marketing","N/A","m@company.com"

Formatting specific values

We can also add instructions to convert specific columns to a certain standard:

...
  prompt = """Format given text in CSV
  (doublequoted, delimited by comma, use "N/A" for empty values, prettify phone numbers with parenthesis, use last name then first name for person name column)
  with with header and following columns:""" + ','.join(columns) + '. Return only CSV.' + "\n\n"
...
"Name","Position","Phone","Email"
"Doe, John","CEO","(+123) 435-667-656","N/A"
"Doe, Samantha","Vice President","(+323) 438-562-342","sam@company.com"
"Brown, Daniel","Seniour Engineer","N/A","dan@company.com"
"Clasko, Mira","Chief of Marketing","N/A","m@company.com"

Cute! Keep in mind that designing prompts is very important when working with AI. Be as strict and detailed as possible about what kind of formatting you expect to get.

Further reading

Published a year ago in #machinelearning about #python and #openai by Denys Golotiuk

Edit this article on Github
Denys Golotiuk in 2024

I'm a database architect, focused on OLAP databases for storing & processing large amounts of data in an efficient manner. Let me know if you have questions or need help.

Contact me via golotyuk@gmail.com or mrcrypster @ github