Starting with messy csv files
I cannot be the only one who downloads a csv file thinking it will be clean data and then opens it in Excel to find merged headers, blank columns for no reason, half the rows shifted by one cell, and sometimes even random emoji characters. I had a csv from an e-commerce site last week that had product names crammed into the description column, and my automation completely broke because Zapier could not match the right field. At this point I copy the file into a folder I call burn pile and then start slowly fixing it.
If you are opening csv files with a text editor you will see that the data is just separated by commas. But depending on where you exported it from sometimes it uses semicolons instead of commas or quotes around everything or double line breaks between rows. For beginners it helps to just open it with something like VS Code or Notepad and stare at the structure before touching it. If you see weird spacing or broken rows that is a sign you will want to normalize it before feeding it into GPT or Zapier or any automation tool.
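If you want a quicker way to check what delimiter the file actually uses before touching anything, Python's built-in csv.Sniffer can guess it from a sample. This is just a rough sketch and the file name is a placeholder:

    import csv

    # Peek at a sample of the file to guess the delimiter before parsing anything.
    # "export.csv" is just a placeholder for whatever file you downloaded.
    with open("export.csv", newline="", encoding="utf-8-sig") as f:
        sample = f.read(4096)
        dialect = csv.Sniffer().sniff(sample, delimiters=",;\t")
        print("delimiter looks like:", repr(dialect.delimiter))
        f.seek(0)
        for i, row in enumerate(csv.reader(f, dialect)):
            print(row)  # eyeball the first few rows for shifted cells
            if i >= 4:
                break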
Using GPT to clean broken headers
My first trick with GPT is to literally paste the first ten lines of the csv into ChatGPT and say something like can you rewrite these headers so that they are clean lowercase words with underscores instead of spaces. What comes back is usually a neat list like product_id product_name product_description. I paste that back into my csv file as the first row and suddenly Zapier or Make recognizes the fields correctly.
The funny thing is if you do not provide enough sample rows GPT will make up extra headers that do not exist. I had it invent a discount_percentage column for a file that did not have one. So the safer way is always to paste the raw headers even if they look ugly. GPT has no problem mapping them to cleaner names if you are explicit.
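If you want to sanity check what GPT hands back (or skip the back and forth for simple cases), the same lowercase-and-underscores rule is only a few lines of plain Python. The example headers here are made up:

    import re

    # Made up messy headers -- swap in the real first row from your file.
    raw_headers = ["Product ID", "Product Name ", "Product  Description", "Price ($)"]

    def clean_header(header: str) -> str:
        header = header.strip().lower()
        header = re.sub(r"[^a-z0-9]+", "_", header)  # spaces and symbols become underscores
        return header.strip("_")

    print([clean_header(h) for h in raw_headers])
    # ['product_id', 'product_name', 'product_description', 'price']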
Dealing with extra blank rows
In one file I imported from Google Forms there were like twenty blank rows before the real data started. Every single automation I ran treated the file as empty because the first row was not a header row but just empty cells. I now tell GPT please remove any completely blank rows before processing the data. It actually outputs a corrected csv without those ghost lines and that alone has saved me from hours of backtracking.
Blank rows are deadly if you are feeding csvs into Airtable via automation. Airtable will happily create twenty empty records that do not show up until later when you realize your table is filled with nothing but blank fields. The fix is so simple but easy to miss.
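If you would rather not even send the blank rows to GPT, a tiny Python pass can strip them first. Rough sketch, file names are placeholders:

    import csv

    # Drop any row where every cell is empty or just whitespace.
    with open("raw.csv", newline="", encoding="utf-8") as src, \
         open("no_blanks.csv", "w", newline="", encoding="utf-8") as dst:
        writer = csv.writer(dst)
        for row in csv.reader(src):
            if any(cell.strip() for cell in row):
                writer.writerow(row)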
Splitting columns properly
The other recurring problem is columns that contain multiple values jammed together. Example product sizes inside one cell reading Small Medium Large. If you try to filter by size in Google Data Studio or Airtable it will never work because they are stuck as one big string. GPT is surprisingly good at splitting those into separate rows or separate columns.
I usually paste a portion of the csv and tell it to normalize these values into one size per row. That means instead of one row with Small Medium Large it creates three rows, each with one size but the same product id. This actually fixed a Shopify import where inventory syncs were failing. The only caveat is you really need to double check, because sometimes GPT will split on spaces that were part of the product name. It is more reliable if you can tell it what delimiter the site used, like commas or slashes.
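Here is roughly what that one-size-per-row split looks like if you do it in Python with pandas instead of GPT. The column names and the slash delimiter are assumptions, so swap in whatever your export actually uses:

    import pandas as pd

    # Toy example: sizes jammed into one cell per product.
    df = pd.DataFrame({
        "product_id": [101, 102],
        "sizes": ["Small/Medium/Large", "Small/Large"],
    })

    # Split on the delimiter the source actually used (slashes here),
    # then explode so each size gets its own row with the same product_id.
    df["size"] = df["sizes"].str.split("/")
    df = df.explode("size").drop(columns=["sizes"])
    print(df)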
Converting data types correctly
I had a situation where prices were exported with dollar signs and commas like $1,200. Zapier read those as text instead of numbers, so any math operation broke instantly. GPT can be told to convert the price column into plain numbers without currency symbols, commas, or text. The beauty is you do not have to manually strip characters. I have it regenerate the full column in numeric format and then I paste it back.
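If you want a deterministic version of that cleanup instead of asking GPT, a tiny Python helper does the same thing. Just a sketch:

    # Strip currency symbols and thousands separators so "$1,200" becomes 1200.0.
    def to_number(value: str):
        cleaned = value.replace("$", "").replace(",", "").strip()
        return float(cleaned) if cleaned else None

    print(to_number("$1,200"))  # 1200.0
    print(to_number(" $45 "))   # 45.0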
Sometimes dates are the worst offender. You think you have month day year format but some rows show up as day month year depending on the form answer. I once had a Google Sheet where half the birthdays looked fine but others imported a century off. In these cases I literally dump forty sample rows into GPT and ask it to standardize dates into YYYYMMDD format. That gives me something consistent to work with. 🙂
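If the formats are at least known, you can also do the date cleanup in code and only send the weird leftovers to GPT. This sketch tries a list of formats in order and flags anything it cannot parse, and the format list itself is a guess you would adjust:

    from datetime import datetime

    # Formats to try, in order. Ambiguous dates like 12/08/2023 get taken by
    # the first format that matches, which is exactly why I still spot check.
    FORMATS = ["%m/%d/%Y", "%d/%m/%Y", "%Y-%m-%d"]

    def to_yyyymmdd(raw: str):
        for fmt in FORMATS:
            try:
                return datetime.strptime(raw.strip(), fmt).strftime("%Y%m%d")
            except ValueError:
                continue
        return None  # could not parse -- flag this row for manual review or GPT

    print(to_yyyymmdd("12/08/2023"))  # '20231208'
    print(to_yyyymmdd("2023-12-08"))  # '20231208'
    print(to_yyyymmdd("August 12"))   # None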
Automating cleaning inside a workflow
Doing this manually every time is a nightmare. What I figured out is you can use an OpenAI step inside Zapier or Make to automatically clean new csv files uploaded to a folder. For example I set up a zap watching a Google Drive folder called new_csv. When a file appears it sends the first two hundred lines to GPT with an instruction to fix headers, remove blank rows, normalize dates, and output a cleaned csv. Then it saves the cleaned result into another folder called clean_csv. From there the next zap grabs that file and loads it into Airtable.
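If you ever want to run that same cleaning step as a plain script instead of a Zapier action, here is a rough sketch using the OpenAI Python SDK. The model name, prompt wording, and folder paths are my assumptions, not anything official:

    from openai import OpenAI

    client = OpenAI()  # expects OPENAI_API_KEY in your environment

    def clean_csv_text(raw_csv: str) -> str:
        # Ask the model to apply the same cleanup rules I use in the zap.
        response = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption -- use whatever model you actually have
            messages=[{
                "role": "user",
                "content": (
                    "Fix the headers so they are lowercase with underscores, remove "
                    "completely blank rows, standardize dates to YYYYMMDD, keep the "
                    "columns in the same order, and return only the cleaned csv.\n\n"
                    + raw_csv
                ),
            }],
        )
        return response.choices[0].message.content

    # Only send a slice of the file; big files need chunking (see below).
    with open("new_csv/export.csv", encoding="utf-8") as f:
        first_lines = "".join(f.readlines()[:200])

    with open("clean_csv/export.csv", "w", encoding="utf-8") as f:
        f.write(clean_csv_text(first_lines))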
The tricky part is file size. GPT input is limited so you cannot feed it a giant file. My hack was to split the file into parts using Python or even small zaps that chunk lines. It is not perfect, but if you only get weekly exports with a few thousand rows it works fine. I once tried this with a log file that had fifty thousand rows and it died instantly. So yes, sometimes you learn the hard way.
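The chunking itself does not need anything clever. Here is a sketch of splitting a big export into smaller files, keeping the header on every chunk so each piece still makes sense on its own; the file names and chunk size are placeholders:

    import csv

    CHUNK_SIZE = 200  # data rows per chunk -- tune to whatever fits in one GPT call

    def write_chunk(part: int, header, rows):
        # Repeat the header so every chunk stands alone.
        with open(f"chunk_{part}.csv", "w", newline="", encoding="utf-8") as out:
            writer = csv.writer(out)
            writer.writerow(header)
            writer.writerows(rows)

    with open("big_export.csv", newline="", encoding="utf-8") as f:
        reader = csv.reader(f)
        header = next(reader)
        rows, part = [], 0
        for row in reader:
            rows.append(row)
            if len(rows) == CHUNK_SIZE:
                write_chunk(part, header, rows)
                rows, part = [], part + 1
        if rows:
            write_chunk(part, header, rows)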
Visualizing what you fixed
One thing I kept running into is after GPT cleans the data I forget what actually changed. So I started making manual comparison tables. Something super basic like:
Before            After
----------------  ------------
Prod Name         product_name
Price $1,200      1200
Date 12 08 2023   20231208
Just staring at this made me feel better that the chaos got normalized. It also made it easier to troubleshoot when a zap failed later because I had a clear record of what I asked GPT to do.
I know it sounds silly but even a plain text table like that can prevent a late night breakdown when you swear the automation changed something behind your back.
When everything breaks again
Last week my carefully built workflow blew up because the exporter on the source website renamed the column Title to Product Title. That tiny change cascaded into Zapier fields disappearing. I had to redo the GPT cleaning step because it was looking for a Title column that no longer existed. ¯\_(ツ)_/¯
I have accepted that every workflow eventually breaks for mysterious reasons. But at least having GPT handle the cleaning gives me a quick way to re-stabilize things without going row by row in Excel. If you are starting out, just be ready to constantly adjust prompts whenever the source shifts even slightly. It is messy but much less painful than manual data surgery…
Simple experiments that helped
If you are nervous about feeding whole files into GPT start with tiny chunks. Paste ten lines at a time, test the output, then increase. I learned GPT sometimes rearranges columns if I give it too much freedom so I always lock down the order by saying keep the columns in the same order. Once you get comfortable you can automate the process. 😛
If you want a tool outside GPT for checking the csv structure quickly the website csvlint.io is useful. It spots weird encodings or mismatched column numbers quickly. Otherwise I just rely on GPT plus a lot of manual double checking.
And that is basically how I keep patching this system every week, though it will probably explode again next month when some field name changes once more.