
An OpenAI trainer explains how coders can use the LLM

Post-training lead Michelle Pokrass explains how coders can use the latest GPT model.

IT Brew Q&A series featuring Michelle Pokrass. (Credit: Michelle Pokrass)

You thought training your puppy to not eat your socks was hard.

Michelle Pokrass, post-training lead at OpenAI, has the challenge of fine-tuning the reward mechanisms that steer the company’s large language models (LLMs) to the most desirable answers.

OpenAI’s latest model, GPT-4.1, has features geared toward programmers, meant to ensure they’re getting helpful, usable code and not half-eaten socks.

The AI lab announced on May 14 that GPT-4.1 would be incorporated into paid tiers of ChatGPT.

“If you want to use 4.1 in ChatGPT, I think it’s a great fit for developers who are asking questions about code or want the model to write some code for them,” Pokrass told us.

An August 2024 survey from GitHub found that more than 97% of its 2,000 respondents had used AI programming tools at work, even as skepticism remains about LLMs’ ability to build production-ready code.

According to OpenAI’s April 2025 post announcing GPT-4.1’s availability in the API, SWE-bench Verified, a benchmark of LLMs’ ability to solve real-world software issues sourced from GitHub, rated GPT-4.1 at 54.6% accuracy, an improvement of 21.4 percentage points over GPT-4o and 26.6 points over GPT-4.5.

“This reflects improvements in model ability to explore a code repository, finish a task, and produce code that both runs and passes tests,” the April 14 announcement read.

(An SWE-bench leaderboard, which does not feature the GPT models, shows a top score of 65.8% among verified open-source systems, held by GitHub’s OpenHands.)

Pokrass talked to us about the best coding questions for their latest LLM—and how to ask them.

Responses have been edited for length and clarity.

What is a coding question that might come up?

Something that you can use it for is asking it to produce a full app. You can say, “Create a to-do list app. I want the icons to animate in this way, and I want it connected to a back-end [server]. And I want this whole thing built.”
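
In practice, a one-shot request like that might look like the following through OpenAI’s Python SDK. This is a minimal sketch: the prompt wording is our expansion of Pokrass’s example, not an OpenAI-provided template, and using `gpt-4.1` assumes you have API access to the model.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# One self-contained "build the whole app" request, per Pokrass's example.
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{
        "role": "user",
        "content": (
            "Create a to-do list web app. Animate the task icons with a "
            "short fade-in, connect the front end to a back-end server "
            "that persists tasks, and return every file needed to run it."
        ),
    }],
)
print(response.choices[0].message.content)
```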

How much is reliant upon tweaks to the prompt? I imagine it doesn’t come out exactly how you would want it right away.

We put out a prompting guide for GPT-4.1, and that’s based on our experience with the model and how to get the best results. We have some specific recommendations there about how to structure your instructions, how to define your tools, and how to get the model to keep going. But in general, 4.1 is a very literal model. It follows instructions quite prescriptively, and so we recommend that people have as much specificity as possible when prompting 4.1.
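
One way to apply that advice is to spell the rules out in a sectioned system message and keep the user message short. The sketch below is our own illustration of that kind of specificity, not a template taken from the prompting guide itself:

```python
from openai import OpenAI

client = OpenAI()

# GPT-4.1 follows instructions literally, so state format, constraints,
# and "keep going" behavior explicitly rather than leaving them implied.
system = """# Instructions
- Write Python 3.11 using only the standard library.
- Keep working until the task is fully solved before ending your turn.

# Output format
- One fenced code block, then a one-paragraph explanation."""

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": system},
        {"role": "user", "content": "Write a CLI that deduplicates lines in a file."},
    ],
)
print(response.choices[0].message.content)
```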

You mentioned you’ve used this to make “toy apps.” What toy apps have you created with this?

In the past, I’ve built some software to help split expenses on my credit card, and that’s something you might have spent a day or two working on. Now you can just ask one of our models and it’ll produce a working website for you within minutes.

What would that prompt look like, for something like splitting expenses?

You’d probably want to paste in the format of the credit card statement and then write out very clearly: Create a web app which should have three buttons on every transaction [to say “mine,” “yours,” or “joint” very clearly]…then you click the “next” button and it produces this final result. Basically, you need to be a bit of a designer to design the ideal application, but then the models can take it from there and produce the code for it.
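
Assembled as an actual API call, that prompt might look like this. It’s a sketch: the CSV columns and button labels are hypothetical stand-ins for whatever format your statement exports.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical statement format, pasted in so the model knows the input shape.
statement_sample = """Date,Description,Amount
2025-05-01,GROCERY MART,82.17
2025-05-03,STREAMFLIX,15.99"""

prompt = f"""Create a single-page web app for splitting credit card expenses.

Input CSV format:
{statement_sample}

Requirements:
- Show each transaction as a row with three buttons: "mine", "yours", "joint".
- A "next" button at the bottom computes each person's total (joint split 50/50).
- Plain HTML/CSS/JavaScript in one file, no build step."""

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```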

Can you help me support an argument that the output is accurate, and that coders can use this reliably?

It’s super dependent on the use case…Using a model to assist with software engineering works really well if you understand the spec of what you want to build and you can write tests around it, or even use the model to generate tests. I think one workflow that works well is: Describe what you want, have the model generate some tests for you, look at the tests, and see if they match your expectations. Then, you can feel more confident about asking it to produce the final code, because you’ve kind of validated your expected behavior.
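
That tests-first loop can be scripted as two passes with a human review in between. This is a sketch of the workflow Pokrass describes; the spec text and function name are hypothetical examples.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical spec for the code you want.
spec = ("A function split_expenses(rows) taking (description, amount, owner) "
        "tuples, where owner is 'mine', 'yours', or 'joint'; return each "
        "person's total, with 'joint' amounts split evenly.")

# Pass 1: have the model write tests, then read them yourself.
tests = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user",
               "content": f"Write pytest tests for this spec:\n{spec}"}],
).choices[0].message.content
print(tests)
input("Do the tests match your expectations? Press Enter to continue. ")

# Pass 2: only after approving the tests, ask for the implementation.
code = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user",
               "content": f"Implement this spec so the tests pass.\n"
                          f"Spec:\n{spec}\n\nTests:\n{tests}"}],
).choices[0].message.content
print(code)
```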

How would you ease any concerns from coders who might think that these kinds of capabilities might automate them out of the job?

I think software engineering is one of the first industries to experience a lot of AI-driven productivity gains. What every software engineer can accomplish has just increased a lot…I think they’ll just be able to get more done.
