Have you used PaddleOCR? I'm surprised they're claiming SOTA without comparing against Amazon Textract or Azure doc intelligence (LayoutLM v3 under the hood, as far as I know).
I've played around with doc recognition quite a bit, and as far as I can tell those two are best-in-class.
This comes back to the SLM vs LLM debate (sizes in relative terms), where an SLM can be optimised for a specific task, and out-perform an LLM. But it's not worth it (time, effort) for most tasks unless 1. they are very sensitive to precision or 2. it is ultra-high volume.
Just coming out of founding one of the first LLM fine tuning startups - Lamini - I disagree
Our thesis was that fine tuning would be easier than deep learning for users to adopt because it was starting from a very capable base LLM rather than starting from scratch
However, our main finding with over 20 deployments was that LLM fine tuning is no easier to use than deep learning
The current market situation is that ML engineers who are good enough at deep learning to master fine tuning can found their own AI startup or join Anthropic/OpenAI. They are underpaid building LLM solutions. Expert teams building Claude, GPT, and Qwen will out compete most users who try fine tuning on their own.
RAG, prompt engineering, inference time compute, agents, memory, and SLMs are much easier to use and go very far for most new solutions
They will hire anyone who can produce a model better than GPT5, which is the bar for fine tuning
Otherwise, you should just use gpt5
Preparing a few thousands training examples and pressing fine tune can improve the base LLM in a few situations, but it also can make the LLM worse at other tasks in hard to understand ways that only show up in production because you didn’t build evals that are good enough to catch them. It also has all of the failure modes of deep learning. There is a reason why deep learning training never took off like LLMs did despite many attempts at building startups around it.
It’s quite easy to produce a model that’s better than GPT-5 at arbitrarily small tasks. As of right now, GPT-5 can’t classify a dog by breed based on good photos for all but the most common breeds, which is like an AI-101 project.
Try doing a head to head comparison using all LLM tricks available including prompt engineering, rag, reasoning, inference time compute, multiple agents, tools, etc
Then try the same thing using fine tuning. See which one wins.
I’d be interested to be proven wrong
I think it is easy for strong ML teams to fall into this trap because they themselves can get fine tuning to work well. Trying to scale it to a broader market is where it fell apart for us.
This is not to say that no one can do it. There were users who produced good models. The problem we had was where to consistently find these users who were willing to pay for infrastructure.
I’m glad we tried it, but I personally think it is beating a dead horse/llama to try it today
I mean, at the point where you’re writing tools to assist it, we are no longer comparing the performance of 2 LLMs. You’re taking a solution that requires a small amount of expertise, and replacing it with another solution that requires more expertise, and costs more. The question is not “can fine tuning alone do better than every other trick in the book plus a SOTA LLM plus infinite time and money?” The question is: “is fine tuning useful?”
What models did you try to find tune? Were the models at the time even good enough to fine tune? Did they suffer from catastrophic forgetting?
We have a lot of more capable open source models now. And my guess is that if you designed models specifically for being fine tuned, they could escape many of the last generation pitfalls.
Companies would love to own their own models instead of renting from a company that seeks to replace them.
We used the best models available and went from the Pythia/gpt2 to Deepseek generations.
One annoying part was switching to new and better models that came out literally every week.
I don’t think it substantially changes anything. If anything I think the release of more advanced models like qwen-next makes things like fp4, moe, and reasoning tokens an even higher barrier of entry.
Fine-tuning is a good technique to have in a toolbox, but in reality, it is feasible only in some use cases. On one hand, many NLP tasks are already easy enough for LLMs to have near perfect accuracy and fine tuning is not needed. On the other hand, really complex tasks are really difficult to fine-tune and clevem data collection might be pretty expensive. Fine-tuning can help with the use cases somewhere in the middle, not too simple, not too complex, feasible for data collection, etc.
An example I just found worked very well with fine-tuning: I wanted to extract any frame that contained a full-screen presentation slide from a various videos I've archived, only when it's full-screen, and also not capture videos, and some other constraints.
Naturally I reached for CLIP+ViT which got me a ~60% success rate out of the box. Then based on that, I created a tiny training script that read `dataset/{slide,no_slide}` and trained a new head based on that. After adding ~100 samples of each, the success rate landed at 95% which was good enough to call it done, and circle back to iterate once I have more data.
I ended up with a 2.2K large "head_weights.safetensors" that increased the accuracy by ~35% which felt really nice.
Lots of caveats here in the following statement: if your application is not fully leaning in to frontier model capabilities, you are probably building a previous generation product.
> Finally, companies may have reached the ceiling of what can be achieved with prompting alone. Some want models that know their vocabulary, their tone, their taxonomy, and their compliance rules.
Together with speed and const, this is from my point of view this is the only "case" for the return of fine-tuning here. And this can be managed by context management.
With growing context sizes, first RAG replaced fine-tuning and later even RAG was replaced by just a good-enough prompt preparation for more and more usage pattern.
Sure, speed and costs are important drivers. But like with FPGAs vs. CPUs or GPUs, the development costs and delivery time for high-performance solutions, eliminate the benefit most the time.
This website loads at impressive speeds (from Europe)! Rarely seen anything more snappy. Dynamic loading of content as you scroll, small compressed images without looking like it (webp). Well crafted!
There is growing emphasis on efficiency as more companies adopt and scale with LLMs in their products.
Developers might be fine paying GPT-5-Super-AGI-Thinking-Max prices to use the very best models in Cursors, but (despite what some may think about Silicon Valley), businesses do care about efficiency.
And if you can fine-tune an 8b-parameter Llama model on GPT-5 data in < 48 hours and save $100k/mo, you're going to take that opportunity.
The OpenAI fine-tuning api is pretty good - you need to label an evaluation benchmark anyway to systematically iterate on prompts and context, and it’s often creates good results if you give it a 50-100 examples, either beating frontier models or allowing a far cheaper and faster model to catch up.
It requires no local gpus, just creating a json and posting to OpenAI
Fine tuning was never really hard to do locally if you had the hardware. What I’d like to read in an article like this is more details into why they’re making a comeback.
Why would you choose a model where the trained in priors don't match your use case? Also, keep in mind that RL'd in behavior includes things like reasoning and how to answer questions correctly, so you're literally taking smart models and making them dumber by doing SFT. To top it off, SFT only produces really good results when you have traces that closely model the actual behavior you're trying to get the model to display. If you're just trying to fine tune in a knowledge base, a well tuned RAG setup + better prompts win every time.
Because you need a solution for your problem and the available tools are what they are and nothing else and you don't have enough resources to train your own model.
A couple of examples I have seen recently which makes me agree with OP:
- PaddleOCR, a 0.9B model that reaches SOTA accuracy across text, tables, formulas, charts & handwriting. [0]
- A 3B and 8B model which performs HTML to json extraction at GPT-5 level accuracy at 40-80x less cost, and faster inference. [1]
I think it makes sense to fine tune when you're optimizing for a specific task.
[0] https://huggingface.co/papers/2510.14528
[1] https://www.reddit.com/r/LocalLLaMA/comments/1o8m0ti/we_buil...
Have you used PaddleOCR? I'm surprised they're claiming SOTA without comparing against Amazon Textract or Azure doc intelligence (LayoutLM v3 under the hood, as far as I know).
I've played around with doc recognition quite a bit, and as far as I can tell those two are best-in-class.
This comes back to the SLM vs LLM debate (sizes in relative terms), where an SLM can be optimised for a specific task, and out-perform an LLM. But it's not worth it (time, effort) for most tasks unless 1. they are very sensitive to precision or 2. it is ultra-high volume.
Just coming out of founding one of the first LLM fine tuning startups - Lamini - I disagree
Our thesis was that fine tuning would be easier than deep learning for users to adopt because it was starting from a very capable base LLM rather than starting from scratch
However, our main finding with over 20 deployments was that LLM fine tuning is no easier to use than deep learning
The current market situation is that ML engineers who are good enough at deep learning to master fine tuning can found their own AI startup or join Anthropic/OpenAI. They are underpaid building LLM solutions. Expert teams building Claude, GPT, and Qwen will out compete most users who try fine tuning on their own.
RAG, prompt engineering, inference time compute, agents, memory, and SLMs are much easier to use and go very far for most new solutions
Will Anthropic/OpenAI really hire anyone who can fine-tune an LLM?
They will hire anyone who can produce a model better than GPT5, which is the bar for fine tuning
Otherwise, you should just use gpt5
Preparing a few thousands training examples and pressing fine tune can improve the base LLM in a few situations, but it also can make the LLM worse at other tasks in hard to understand ways that only show up in production because you didn’t build evals that are good enough to catch them. It also has all of the failure modes of deep learning. There is a reason why deep learning training never took off like LLMs did despite many attempts at building startups around it.
Andrej karpathy has a rant about it that captures some of the failure modes of fine tuning - https://karpathy.github.io/2019/04/25/recipe/
It’s quite easy to produce a model that’s better than GPT-5 at arbitrarily small tasks. As of right now, GPT-5 can’t classify a dog by breed based on good photos for all but the most common breeds, which is like an AI-101 project.
Try doing a head to head comparison using all LLM tricks available including prompt engineering, rag, reasoning, inference time compute, multiple agents, tools, etc
Then try the same thing using fine tuning. See which one wins.
I’d be interested to be proven wrong
I think it is easy for strong ML teams to fall into this trap because they themselves can get fine tuning to work well. Trying to scale it to a broader market is where it fell apart for us.
This is not to say that no one can do it. There were users who produced good models. The problem we had was where to consistently find these users who were willing to pay for infrastructure.
I’m glad we tried it, but I personally think it is beating a dead horse/llama to try it today
I mean, at the point where you’re writing tools to assist it, we are no longer comparing the performance of 2 LLMs. You’re taking a solution that requires a small amount of expertise, and replacing it with another solution that requires more expertise, and costs more. The question is not “can fine tuning alone do better than every other trick in the book plus a SOTA LLM plus infinite time and money?” The question is: “is fine tuning useful?”
Fair didn’t seem to matter to users who just wanted to build solutions with reasonable time and budget
If your customers can't fine tune, do it for them instead.
How can you hire enough people to scale that while making the economics work?
Why would they join you rather than founding their own company?
I think you are saying to go after the very high end of the market.
That’s fair, one market segment of this is sometimes called sovereign compute.
Another common model that I have seen is to become the deepmind for one very large and important customer.
I think this works.
> How can you hire enough people to scale that while making the economics work?
Pick the right customers.
> Why would they join you rather than founding their own company?
The network effects of having enough resources in one place. For having other teams deal with the training data, infrastructure, deployment, etc.
[dead]
What models did you try to find tune? Were the models at the time even good enough to fine tune? Did they suffer from catastrophic forgetting?
We have a lot of more capable open source models now. And my guess is that if you designed models specifically for being fine tuned, they could escape many of the last generation pitfalls.
Companies would love to own their own models instead of renting from a company that seeks to replace them.
We used the best models available and went from the Pythia/gpt2 to Deepseek generations.
One annoying part was switching to new and better models that came out literally every week.
I don’t think it substantially changes anything. If anything I think the release of more advanced models like qwen-next makes things like fp4, moe, and reasoning tokens an even higher barrier of entry.
Fine-tuning is a good technique to have in a toolbox, but in reality, it is feasible only in some use cases. On one hand, many NLP tasks are already easy enough for LLMs to have near perfect accuracy and fine tuning is not needed. On the other hand, really complex tasks are really difficult to fine-tune and clevem data collection might be pretty expensive. Fine-tuning can help with the use cases somewhere in the middle, not too simple, not too complex, feasible for data collection, etc.
>Fine-tuning is a good technique to have in a toolbox, but in reality, it is feasible only in some use cases.
Yes, 100s of housands of them
What would you say is an example of one of those “middle” tasks it can help with?
An example I just found worked very well with fine-tuning: I wanted to extract any frame that contained a full-screen presentation slide from a various videos I've archived, only when it's full-screen, and also not capture videos, and some other constraints.
Naturally I reached for CLIP+ViT which got me a ~60% success rate out of the box. Then based on that, I created a tiny training script that read `dataset/{slide,no_slide}` and trained a new head based on that. After adding ~100 samples of each, the success rate landed at 95% which was good enough to call it done, and circle back to iterate once I have more data.
I ended up with a 2.2K large "head_weights.safetensors" that increased the accuracy by ~35% which felt really nice.
Lots of caveats here in the following statement: if your application is not fully leaning in to frontier model capabilities, you are probably building a previous generation product.
> Finally, companies may have reached the ceiling of what can be achieved with prompting alone. Some want models that know their vocabulary, their tone, their taxonomy, and their compliance rules.
Together with speed and const, this is from my point of view this is the only "case" for the return of fine-tuning here. And this can be managed by context management.
With growing context sizes, first RAG replaced fine-tuning and later even RAG was replaced by just a good-enough prompt preparation for more and more usage pattern.
Sure, speed and costs are important drivers. But like with FPGAs vs. CPUs or GPUs, the development costs and delivery time for high-performance solutions, eliminate the benefit most the time.
This website loads at impressive speeds (from Europe)! Rarely seen anything more snappy. Dynamic loading of content as you scroll, small compressed images without looking like it (webp). Well crafted!
Magic of a CDN? Plus avoiding JS probably. Haven't checked source though.
Creator of inference.net / schematron here.
There is growing emphasis on efficiency as more companies adopt and scale with LLMs in their products.
Developers might be fine paying GPT-5-Super-AGI-Thinking-Max prices to use the very best models in Cursors, but (despite what some may think about Silicon Valley), businesses do care about efficiency.
And if you can fine-tune an 8b-parameter Llama model on GPT-5 data in < 48 hours and save $100k/mo, you're going to take that opportunity.
I wrote about this recently as well: https://madiator.substack.com/p/finetuning-is-so-back
And here I am thinking we'd be discussing the teleological argument.
The OpenAI fine-tuning api is pretty good - you need to label an evaluation benchmark anyway to systematically iterate on prompts and context, and it’s often creates good results if you give it a 50-100 examples, either beating frontier models or allowing a far cheaper and faster model to catch up.
It requires no local gpus, just creating a json and posting to OpenAI
https://platform.openai.com/docs/guides/model-optimization
They don't offer it for GPT-5 series, as a result much of the time fine-tuning Gemini 2.5-Flash is a better deal.
Fine tuning was never really hard to do locally if you had the hardware. What I’d like to read in an article like this is more details into why they’re making a comeback.
Curious to hear others’ thoughts on this
Which minimum hardware spec would qualify as making this not really hard to do locally?
Return? Did it run away?
I don't think anyone thought fine tuning was dead.
For some of us fine-tuning is a constant activity...
Fine tuning by pretraining over a RL tuned model is dumb AF. RL task tuning works quite well.
You may have no choice in how the model you are fine tuning was trained, and may have no interest in verticals it was RL tuned for.
In any case, platforms like tinker.ai support both SFT and RL.
Why would you choose a model where the trained in priors don't match your use case? Also, keep in mind that RL'd in behavior includes things like reasoning and how to answer questions correctly, so you're literally taking smart models and making them dumber by doing SFT. To top it off, SFT only produces really good results when you have traces that closely model the actual behavior you're trying to get the model to display. If you're just trying to fine tune in a knowledge base, a well tuned RAG setup + better prompts win every time.
Because you need a solution for your problem and the available tools are what they are and nothing else and you don't have enough resources to train your own model.