It does seem like a step change in token efficiency, though based on the earlier Artificial Analysis reporting it's also quite the cost lottery, and I'm not sure I'm comfortable with that.
This doesn't seem to be controlling for the number of turns in any way. Am I missing something?
Stronger models needing fewer turns to achieve a task feels like a prime source of efficiency gains for agentic coding, more so than individual responses being shorter.
They also don't mention what their sample size is, or anything about the distribution of input and response lengths.
It'd be interesting if the author had actually plotted the data, so we could see the distributions and judge whether the analysis holds water.
A density plot of the input lengths using ggplot2 geom_density, with colour and fill mapped to model, alpha 0.1, and an appropriate bandwidth adjustment, would show whether the input distributions look similar across the two models; the same plot of the output-length distributions, faceted by input-length bins, would tell us whether those match as well (a sketch follows below).
Edit: Or even a faceted plot using input bins of output length/input length.
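For concreteness, here's a minimal ggplot2 sketch of those plots. Everything in it is assumed: a hypothetical data frame `df` with `model`, `input_len`, and `output_len` columns, one row per request, filled with placeholder random data purely so it runs.

    library(ggplot2)

    # Placeholder data, purely to make the sketch runnable; substitute the
    # real per-request logs (model, input tokens, output tokens).
    set.seed(1)
    df <- data.frame(
      model      = rep(c("model_a", "model_b"), each = 500),
      input_len  = rlnorm(1000, meanlog = 7, sdlog = 1),
      output_len = rlnorm(1000, meanlog = 6, sdlog = 1)
    )
    df$input_bin <- cut(df$input_len, breaks = 5)  # input-length bins for faceting

    # Do the input-length distributions look similar across models?
    ggplot(df, aes(input_len, colour = model, fill = model)) +
      geom_density(alpha = 0.1, adjust = 1.5)

    # Output-length distributions, faceted by input-length bin
    ggplot(df, aes(output_len, colour = model, fill = model)) +
      geom_density(alpha = 0.1, adjust = 1.5) +
      facet_wrap(~ input_bin)

    # Per the edit: output/input ratio, faceted the same way
    ggplot(df, aes(output_len / input_len, colour = model, fill = model)) +
      geom_density(alpha = 0.1, adjust = 1.5) +
      facet_wrap(~ input_bin)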
I feel that recent iterations of LLMs haven't delivered an intuitively obvious qualitative leap. Have they hit a bottleneck period this quickly?
Has any enterprising hacker here yet graphed price vs "output" over time since 2023, taking "quality" into account?
That's got to be a very tricky analysis given how subjective quality is. But I'm sure there are people trying to pin it down.
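The plotting itself would be the easy part; the hard part is the quality score. A rough sketch, assuming a hypothetical `models` table with release dates, blended prices per million tokens, and some composite benchmark score standing in for quality (all values below are illustrative placeholders, not real figures):

    library(ggplot2)

    # Placeholder rows, purely illustrative; real prices and scores
    # would have to be collected per model release since 2023.
    models <- data.frame(
      name         = c("model_a", "model_b", "model_c"),
      release_date = as.Date(c("2023-03-01", "2024-05-01", "2025-02-01")),
      usd_per_mtok = c(30, 10, 3),   # hypothetical blended price
      bench_score  = c(60, 75, 85)   # hypothetical composite "quality"
    )

    ggplot(models, aes(release_date, usd_per_mtok)) +
      geom_point(aes(size = bench_score)) +
      geom_text(aes(label = name), vjust = -1, size = 3) +
      scale_y_log10() +   # prices span orders of magnitude
      labs(x = NULL, y = "USD per million tokens (log scale)",
           size = "benchmark score")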
Anything that compares proprietary models will be badly miscalibrated and may not be indicative; there have been too many silent model changes in both the chat products and the API, where providers didn't say a word until the change became too noticeable.
Quality would be performance against a given set of benchmarks, I assume?
There are multiple open-weight models you can run on a pretty standard computer at home that match the quality of GPT-4. I guess that would also change the equation.