
Apple study shows fine-tuned open-source model beating GPT-5 on UI design
- 20somethingmedia
- Feb 6
- 2 min read
Summary
• Apple researchers found that training AI with designers' natural feedback, such as sketches and hands-on edits, produces far better UI generators than traditional rating systems.
• The team collected 1,460 annotations from professional designers, then fine-tuned the open-source Qwen2.5-Coder model to outperform OpenAI's GPT-5 in human evaluations.
• Apple released the trained models on GitHub, suggesting high-quality expert feedback can help smaller AI models surpass larger proprietary ones.
The study shows that professional designers can train AI models to create superior user interfaces using their everyday tools, such as sketches and edits, far surpassing traditional feedback methods.
This approach challenges the standard AI training playbook and positions smaller open-source models to outpace giants like OpenAI's GPT-5.
The Experiment's Core Setup
Researchers enlisted 21 seasoned designers with 2 to over 30 years of experience in UI/UX, product, and service design.
They generated 1,460 annotations via four methods: pairwise rankings, natural language comments, visually-grounded sketches, and direct revisions in design software.
Independent evaluators validated these annotations, finding 76.1% agreement on revisions and 63.6% on sketches, compared with just 49.2% for rankings, roughly chance level.
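Those agreement percentages are presumably simple match rates between an evaluator's preferred option and the preference implied by the designer's annotation. A minimal illustration (not the study's actual evaluation code):

```python
def agreement_rate(evaluator_choices, annotation_labels):
    """Fraction of items where an independent evaluator's pick matches
    the preference implied by the designer's annotation."""
    matches = sum(1 for e, a in zip(evaluator_choices, annotation_labels) if e == a)
    return matches / len(evaluator_choices)
```

By this measure, pairwise rankings landing near 50% means evaluators agreed with them roughly as often as a coin flip would.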
Native Feedback Wins Big
Traditional RLHF relies on simplistic likes, dislikes, or rankings, which the study deems misaligned with designers' nuanced workflows.
Instead, "native" methods like sketching (averaging 42 characters of text) and hands-on edits captured rich, tacit knowledge, producing preference pairs with far less evaluator disagreement.
Revisions took the longest at 3.45 minutes each versus 12 seconds for rankings, but sketches offered a sweet spot for efficiency and quality.
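One way to picture how a hands-on edit becomes training data: the designer's revision implicitly says "prefer this over what the model produced," yielding a preference pair without any explicit rating. The field names below are hypothetical, not from Apple's release:

```python
from dataclasses import dataclass

@dataclass
class DesignerAnnotation:
    prompt: str       # UI generation request given to the model
    original_ui: str  # the model's generated UI code
    revised_ui: str   # designer's direct edit (or sketch-guided revision)
    method: str       # "revision", "sketch", "comment", or "ranking"

def to_preference_pair(ann: DesignerAnnotation) -> dict:
    """A hands-on edit implicitly prefers the revised UI over the
    original, producing a (chosen, rejected) pair for reward modeling."""
    return {"prompt": ann.prompt,
            "chosen": ann.revised_ui,
            "rejected": ann.original_ui}
```

Pairs in this (prompt, chosen, rejected) shape are the standard input format for preference-based fine-tuning.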
Small Model Tops GPT-5
Using this data, the team trained a reward model and fine-tuned Qwen2.5-Coder (later variants like Qwen3-Coder) via ORPO optimization.
In arena-style human evaluations, these models beat baselines, including GPT-5, on UI quality.
This builds on Apple's prior UICoder work, showing that a modest amount of high-quality feedback lets smaller open-source models rival proprietary behemoths.
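For context, ORPO (odds-ratio preference optimization) augments the usual fine-tuning loss with a term that pushes the model's odds of producing the chosen response above its odds of producing the rejected one. A plain-Python sketch of that odds-ratio term (illustrative only; Apple's released training code may differ):

```python
import math

def orpo_odds_ratio_loss(logp_chosen: float, logp_rejected: float) -> float:
    """Odds-ratio term of the ORPO objective (added to the standard SFT
    loss, scaled by a weight lambda). Inputs are length-normalized
    sequence log-probabilities of the chosen and rejected completions."""
    def log_odds(logp: float) -> float:
        # odds(y) = P(y) / (1 - P(y)), computed in log space
        return logp - math.log(1.0 - math.exp(logp))
    ratio = log_odds(logp_chosen) - log_odds(logp_rejected)
    # -log sigmoid(ratio): small when the chosen response has higher odds
    return -math.log(1.0 / (1.0 + math.exp(-ratio)))
```

When the two completions are equally likely the term sits at log 2 ≈ 0.693; it shrinks as the chosen (designer-preferred) UI becomes more probable than the rejected one.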
Broader Impact on AI Training
The findings upend RLHF norms by prioritizing "rich rationale" from real workflows over crude signals.
The authors note that subjectivity introduces variance, yet native feedback consistently yields better designs.
Apple released the models on GitHub under "ml-rldf," inviting community use.
Why It Matters Now
As AI tools infiltrate design pipelines, this method could democratize top-tier UI generation for smaller teams.
It highlights a shift: future AI alignment may lean on expert-native inputs, blending human creativity with machine scale.