OpenAI has released a new suite of AI models—GPT-4.1, 4.1 Mini, and 4.1 Nano—designed to support real-world software development tasks.
Highlights
These models, available exclusively through OpenAI’s API, offer an extended context window of up to 1 million tokens, equivalent to approximately 750,000 words, enabling more comprehensive processing of large codebases and documents.
The GPT-4.1 release comes amid intensifying competition in the long-context AI space, with companies like Google and Anthropic also launching advanced models tailored for software development.
Google’s Gemini 2.5 Pro and Anthropic’s Claude 3.7 Sonnet, for instance, have demonstrated strong benchmark performances. OpenAI’s latest models reflect an ongoing strategy to enhance coding capabilities while addressing the growing demand for AI-powered software tools.
OpenAI describes GPT-4.1 as part of its broader initiative to develop AI agents capable of managing end-to-end software workflows.
These agents are expected to eventually perform tasks such as application development, quality assurance, bug fixing, and technical documentation generation with reduced reliance on human input.
Improvements in Instruction Handling and Code Accuracy
The GPT-4.1 models have been refined based on feedback from developers, aiming to improve instruction adherence, consistency in output format, and performance in real-world environments.
Notable enhancements include improved handling of diff formats, repository exploration, and generation of unit tests.
These refinements contribute to more efficient and reliable development processes, particularly for tools like Aider, where precision in diff handling directly affects cost and latency.
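To illustrate why diff handling matters in practice, the minimal sketch below asks the model to respond only with a unified diff through OpenAI's Chat Completions API, so a tool can apply the change without paying for a full-file rewrite. The model identifier, the example file, and the prompt wording are assumptions for illustration, not an excerpt from OpenAI's documentation.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask the model to return changes as a unified diff rather than a full file,
# which is how tools like Aider keep output size, cost, and latency down.
response = client.chat.completions.create(
    model="gpt-4.1",  # assumed model identifier; check OpenAI's model list
    messages=[
        {
            "role": "system",
            "content": (
                "You are a coding assistant. Respond only with a unified diff "
                "(---/+++ headers and @@ hunks). Do not add commentary."
            ),
        },
        {
            "role": "user",
            "content": (
                "Rename the function parse_config to load_config in config.py:\n\n"
                "def parse_config(path):\n"
                "    return json.load(open(path))\n"
            ),
        },
    ],
)

print(response.choices[0].message.content)
```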
According to benchmark results on SWE-Bench Verified—a human-validated benchmark for software engineering tasks—GPT-4.1 scored between 52% and 54.6%.
While this places it ahead of OpenAI’s previous models, such as GPT-4o and GPT-4o Mini, it trails competitors like Gemini 2.5 Pro and Claude 3.7 Sonnet, which scored 63.8% and 62.3%, respectively.
Context Retrieval and Long-Term Memory Performance
All three models in the GPT-4.1 lineup successfully passed OpenAI’s “Needle in a Haystack” test, retrieving information from within 1 million-token-long contexts.
This extended context capability is designed to assist developers working with large repositories or complex documentation, improving the model’s ability to maintain coherence over long sessions.
However, internal evaluations indicate that performance can decline with larger context sizes. OpenAI’s MRCR test showed that accuracy dropped from 84% at 8,000 tokens to 50% at the 1 million-token level.
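For readers unfamiliar with this style of evaluation, the sketch below shows the basic idea behind a needle-in-a-haystack trial: hide a fact at a random position in long filler text and check whether the model can retrieve it. This is a simplified illustration, not OpenAI's actual harness or the MRCR test; the model identifier and the rough filler-to-token ratio are assumptions.

```python
import random
from openai import OpenAI

client = OpenAI()

def needle_in_haystack_trial(context_tokens: int = 8_000) -> bool:
    """One simplified needle-in-a-haystack trial: bury a fact in filler text
    and test whether the model retrieves it from a long context."""
    needle = "The access code for the archive room is 4419."
    # Each filler sentence is roughly 10 tokens, so this approximates the target size.
    filler = "The sky was clear and the meeting ran long. " * (context_tokens // 10)
    pos = random.randint(0, len(filler))
    haystack = filler[:pos] + needle + filler[pos:]

    response = client.chat.completions.create(
        model="gpt-4.1",  # assumed model identifier
        messages=[
            {
                "role": "user",
                "content": haystack + "\n\nWhat is the access code for the archive room?",
            },
        ],
    )
    return "4419" in response.choices[0].message.content

print(needle_in_haystack_trial())
```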
The models also interpret instructions more literally than their predecessors, which may require users to craft prompts with greater specificity.
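A hypothetical contrast, not guidance from OpenAI, of what that extra specificity can look like: an older prompt that leans on implicit conventions versus one that spells out the framework, file layout, and output constraints.

```python
# Implicit prompt: earlier models often inferred the intended output format.
vague = [{"role": "user", "content": "Write a unit test for the add() function."}]

# Explicit prompt: a more literal model benefits from stated constraints.
explicit = [
    {
        "role": "system",
        "content": "Respond with a single pytest test file and no explanation.",
    },
    {
        "role": "user",
        "content": (
            "Write pytest unit tests for add(a, b) in calculator.py. "
            "Cover positive numbers, negative numbers, and zero. "
            "Name the file test_calculator.py."
        ),
    },
]
# Either list would be passed as the `messages` argument to chat.completions.create().
```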
Model Variants and Pricing
OpenAI has positioned the GPT-4.1 series to accommodate various performance and budget needs:
- GPT-4.1: Offers the most advanced capabilities and supports the full 1-million-token context window.
- GPT-4.1 Mini: Balances performance with cost-effectiveness, suited for standard development tasks.
- GPT-4.1 Nano: Optimized for speed and affordability, ideal for lightweight or real-time applications.
In terms of pricing, GPT-4.1 Nano is OpenAI’s most affordable option, priced at $0.10 per million input tokens and $0.40 per million output tokens.
GPT-4.1 Mini is available at $0.40 per million input tokens and $1.60 per million output tokens, while the full GPT-4.1 model is priced at $2 per million input tokens and $8 per million output tokens. Notably, GPT-4.1 is approximately 26% less expensive than GPT-4o.
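To make the price gaps concrete, the short sketch below computes what a single hypothetical request would cost at the listed rates; the token counts are illustrative, and the dictionary keys are only labels, not confirmed API model identifiers.

```python
# Prices are dollars per million tokens, as listed above.
PRICES = {
    "gpt-4.1":      {"input": 2.00, "output": 8.00},
    "gpt-4.1-mini": {"input": 0.40, "output": 1.60},
    "gpt-4.1-nano": {"input": 0.10, "output": 0.40},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in dollars for one request at the published per-million-token rates."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: 100,000 input tokens (a large codebase excerpt) and 5,000 output tokens.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 100_000, 5_000):.4f}")
# gpt-4.1: $0.2400, gpt-4.1-mini: $0.0480, gpt-4.1-nano: $0.0120
```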
Availability
The GPT-4.1 models are currently available only via OpenAI’s API and are not integrated into ChatGPT.
Alongside the launch, OpenAI announced that the GPT-4.5 Preview will be retired from the API on July 14, 2025. The change is intended to streamline the company’s offerings and guide users toward the latest model infrastructure.
OpenAI has indicated plans to expand the multimodal capabilities of GPT-4.1 to support tasks involving image, video, and audio data. This aligns with the company’s broader vision of building versatile AI systems capable of working across different data types.
In a separate benchmark using the Video-MME test, GPT-4.1 achieved 72% accuracy in the “long, no subtitles” category, suggesting potential for extended applications beyond code generation.
While GPT-4.1 models show notable improvements in instruction-following, coding accuracy, and context handling, OpenAI acknowledges that challenges remain. Context degradation, literal interpretation of prompts, and cost considerations are ongoing areas of optimization.