robotswantdata a day ago

Very interesting.

The way I understand this works is that the researchers found a clever architectural hack to stop AI from hoarding memory when reading long documents.

Normally, when an AI transcribes a 100-page PDF, it tries to remember every single word it has already ingested. This short-term memory (the KV cache) grows linearly, O(N), until the model runs out of VRAM and crashes (or hits a cap). To avoid this, developers are forced to build janky code that chops PDFs into individual pages, processes them one by one, and glues the text back together.
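
For a sense of scale, here is a back-of-the-envelope sketch of that linear growth. The model dimensions (32 layers, 8 KV heads, head size 128, 2-byte values) are hypothetical placeholders, not numbers from the paper:

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for num_tokens tokens.

    All default dimensions are hypothetical, chosen only for illustration.
    """
    # Factor 2: one key vector and one value vector per token, layer and head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


for tokens in (1_000, 10_000, 100_000):
    mib = kv_cache_bytes(tokens) / 2**20
    print(f"{tokens:>7} tokens -> {mib:,.1f} MiB")
```

With these made-up dimensions each token costs 128 KiB of cache, so 100,000 tokens need roughly 12.2 GiB before any weights or activations are counted.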

Unlimited OCR uses Reference Sliding Window Attention (R-SWA) to split the AI's focus into two paths:

Global Reference: The AI keeps full, uncompromised sight of the original document image so it never loses context.

Local Generation: The AI restricts its memory of its own typed text to a tight, moving window (like the last 128 words) and safely forgets the rest.
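
A toy sketch of that split, reconstructed from the description above rather than taken from the paper's code; the function names and the tiny sizes are made up for illustration:

```python
def rswa_mask(num_ref, num_gen, window):
    """Boolean attention mask for the generated tokens.

    Row i is the i-th generated token. Columns are the reference (image)
    tokens followed by the generated tokens. True means "may attend".
    """
    mask = []
    for i in range(num_gen):
        # Global reference: every generated token sees the whole document.
        ref_part = [True] * num_ref
        # Local generation: causal, and limited to the last `window` tokens
        # (the current token counts as one of them).
        gen_part = [i - window < j <= i for j in range(num_gen)]
        mask.append(ref_part + gen_part)
    return mask


def cached_entries(num_ref, step, window):
    """Entries that must stay cached while producing token `step` (0-based)."""
    return num_ref + min(step + 1, window)


for row in rswa_mask(num_ref=3, num_gen=5, window=2):
    print("".join("x" if allowed else "." for allowed in row))
```

In this sketch the cache for generated text stops growing once `window` tokens have been produced, so memory is bounded by the reference size plus the window instead of the full output length.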

Will be very interesting for local AI and can’t wait to see what the community builds and extends with it!

  • _puk a day ago

    This hits a sweet spot I think for conversations too. I've been playing (for quite a while) with ways to encapsulate long-running conversations.

    You have the overriding context, facts that don't change very often at all. The participants' names, their backgrounds etc.

    Then you have some very fine grained facts (what they ate for breakfast this morning) which might be useful right now, but are irrelevant outside of a general trend over the longer term.

    When trying to reconstruct a conversation you really need to find the right balance without pulling in everything that has ever been discussed.

    This definitely is worth further investigation.

    • ewild a day ago

      This sounds like we are trying to add an LSTM into a transformer

      • htrp a day ago

        Sepp would like a word

    • jeena 21 hours ago

      I tried to do that for very long translations, I had a sliding window, I had a memory for the important things to keep it consistent, a loop for repairs etc. https://jeena.net/loop-engineering

      But for some reason the local models I used back then (that was almost 2 years ago) weren't good enough, so none of my optimizations did anything good for the translation quality.

    • timwis a day ago

      Can you say more about how this applies to long-running conversations? I've been thinking about them as well, but can't wrap my head around how this would be better than (or even different to) standard compaction.

      • dominotw a day ago

        standard compaction doesn't really distinguish between long-term and short-term ephemeral facts?

        • timwis a day ago

          Forgive me if I'm being naive, but can't you just tweak the compaction prompt to differentiate? Presumably that's what you would do in the separate prompt anyway, right?

  • storywatch a day ago

    Haven't read the full paper, but the local generation window is a little small, especially since image inputs are especially token heavy. Depending on where the local attention layer is located, it would be nicer if it were bigger, e.g. 4096 words at least.

  • MattRogish a day ago

    I do OCR of images, and that's exactly what I do. I take one big image and slice it into many smaller ones, and send those to the LLM. Perfect every time, unlike using the whole image which resulted in hot garbage.

    • freefaler a day ago

      It works with relatively good scans. When there are bad/skewed scans, and especially something with many label/value pairs that aren't nicely tucked inside sentences, the more context you have, the more you can find the correct words and fix the errors.

      There is a whole class of tricky documents. A decent (if you ignore the marketing bias) post about this problem can be found here:

      https://getomni.ai/blog/ocr-benchmark

    • ryanisnan a day ago

      How do you know where to slice an image? What if you slice an image mid-word?

      • vrc 17 hours ago

        Paddle-VL and GLM-OCR do this by using PP-DocLayoutv3 as their "detector/slicer" and then just batch the OCR on the clips to do pretty darn well at a tiny size.

        A lower-tech version is to use a good detector and XY-cut or just a naive Y-cut or orientation-away cut to slice up the page. But if you're doing that you're getting closer and closer to DocVLM style OCR+low res image. Been playing around with something like this using the new PPOCRv6 which itself punches well above most traditional OCRs and is multi-language without the hassle of language detection and dict-loading for rec.

      • MattRogish a day ago

        I calculate* the appropriate overlap and the slicer overlaps a certain amount of the previous slice. There is some post-processing assembly required, but it's trivial.

        [*] SWAG line height, trial and error to figure out the right amount of overlap given LLM error rates, etc.
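
        A minimal sketch of that kind of overlapping slicer, assuming horizontal strips of fixed height. The function name and the example numbers are hypothetical; in practice the overlap would come from the estimated line height and trial and error, as described above:

        ```python
        def slice_bounds(image_height, slice_height, overlap):
            """Return (top, bottom) pixel rows for strips that overlap vertically."""
            if not 0 <= overlap < slice_height:
                raise ValueError("overlap must be >= 0 and smaller than slice_height")
            step = slice_height - overlap  # how far each new strip advances
            bounds = []
            top = 0
            while True:
                bottom = min(top + slice_height, image_height)
                bounds.append((top, bottom))
                if bottom == image_height:
                    break  # the last strip reached the bottom edge
                top += step
            return bounds


        print(slice_bounds(image_height=1000, slice_height=400, overlap=100))
        ```

        Each strip repeats the last `overlap` rows of the previous one, so a line of text cut by one boundary appears whole in the neighbouring strip and the duplicate can be dropped during reassembly.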

        • ryanisnan a day ago

          Interesting. Do you have a uniform data set? E.g. documents of a specific type that you know consistently have similar formats, or is this training something you need to do per-document?

          • MattRogish 20 hours ago

            We have some broad shapes - it’s a finite set of “things that are interesting to us” and the dataset is bounded. It’s not “Google Image Search”. But it is kinda like “we have a giant pile of PDFs, pictures, etc. and the user wishes to run an arbitrary query on them and extract the information they want.” Ex: “I need to know $something about the data embedded in the corpus, that look like Excel data with line charts describing some particular class of metric that are to the left of gray dogs and are about $something_else earlier in the document.”

            Gemini has a very specific mode where it has been trained on making boxes normalized to a 1000x1000 grid (https://docs.cloud.google.com/gemini-enterprise-agent-platfo...) and in our experience this “just works” AND is very fast on 3.5 and 3.1 models without needing much thinking (so it is not terrifically expensive).

            (BTW A+++ gold star, triple thumbs up, give-this-person-a-bonus to whoever did that magic: it basically made this task tractable for us. When we first found it nobody else had anything like it - it’s worked so well I haven’t felt any need to look.)

            So we say, “Hey Gemini draw box_2d […] around #{things we are interested in}” and then it is pretty easy to then go - ok if this is here and that is there, let’s slice the image in this particular way, making sure to overlap by some amount because the boxes are fuzzy, then send the chunks to a thing that turns it into JSON, then we use something like edge detection to reconstruct the whole from the parts. (Squint and it looks like whole genome shotgun sequencing)
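
            A small sketch of the box-to-crop step. It assumes each `box_2d` is ordered `[ymin, xmin, ymax, xmax]` on the 0-1000 grid (worth checking against the API docs), and the `pad` parameter is a hypothetical stand-in for the "overlap by some amount because the boxes are fuzzy" part:

            ```python
            def box_to_pixels(box_2d, width, height, pad=0.0):
                """Map a box on a 0-1000 grid to a padded pixel crop.

                Assumes box_2d is [ymin, xmin, ymax, xmax]; returns
                (left, top, right, bottom) clamped to the image.
                """
                ymin, xmin, ymax, xmax = box_2d
                pad_x = (xmax - xmin) * pad  # grow the box by a fraction of its size
                pad_y = (ymax - ymin) * pad
                left = max(0, round((xmin - pad_x) / 1000 * width))
                top = max(0, round((ymin - pad_y) / 1000 * height))
                right = min(width, round((xmax + pad_x) / 1000 * width))
                bottom = min(height, round((ymax + pad_y) / 1000 * height))
                return left, top, right, bottom


            print(box_to_pixels([100, 200, 300, 600], width=2000, height=3000, pad=0.1))
            ```

            The returned tuple is in the (left, upper, right, lower) order that Pillow's `Image.crop` expects, so it can be passed straight to a cropping step.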

  • ranger_danger a day ago

    I thought all the major LLM tools already supported sliding window attention?

    • krackers 14 hours ago

      I mean, sliding window attention is the most basic way of getting a long context window. For the OCR case it seems like it should be even simpler, since you don't even need the "sliding" portion: unless I'm missing something, you don't need to retain anything about the previous pages to OCR a new page, so you could just pick a short context window and restart from scratch each time. [^1]

      Were people really trying to do OCR with vanilla attention?

      [^1] Although maybe I guess looking at their demo, tables that span multiple pages might be a use-case for having some look back.

  • d675 a day ago

    See, leetcode is useful. As I do this leetcode grind, I’ve been learning why techniques exist / how they’re used irl. Lots of interesting stuff there

    • ai_fry_ur_brain a day ago

      Who said it wasn't useful? Don't listen to those people.

      • Xevion a day ago

        People who are applying to jobs and are tested with LeetCode problems to assess their skill level, despite the two not really being correlated or relevant for the position

        • galbar a day ago

          As someone that gets very annoyed when having to do LeetCode in interviews...

          Knowing algorithms, data structures and their memory and time complexities is very relevant for SWE. I've had teammates that didn't understand them and everything was fine until when it wasn't (scaling and performance issues).

          Or, as I put it to a teammate: "Would you rather review the PR of someone that understands the difference between a set and a list or the PR of someone who doesn't?". This was after we interviewed a candidate with ~15 YoE, on paper, that didn't know the difference.

          • elliottcarlson a day ago

            > Knowing algorithms, data structures and their memory and time complexities is very relevant for SWE

            Agree with this; however knowing how to roll your own BFS/LRU/etc isn't -- in that case I'd rather review the PR of someone who understands how to leverage tested and known implementations than the PR of someone who decided to roll their own.

        • ai_fry_ur_brain a day ago

          Who cares if the leetcode question doesn't relate to the job itself? It shows whether or not the person is willing to put in the work and gives you a glimpse into their ability to reason about hard problems.

      • d675 20 hours ago

        just the level of questions being asked seems to be high idk, just passed round 1 for big tech. Not feeling great about the rest.

        main comment was a bit tongue in cheek

peatmoss a day ago

I recently bought a tablet for sheet music, mostly to replace a stack of jazz "Real Books" at jam sessions. And the phone camera scans I made are okay, but fixed in size and have a lot of artifacts. And it would be great to transpose on the fly for e.g. Bb or Eb instruments, but being a scan this is obviously not possible.
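
To illustrate why a symbolic version makes this easy, here is a minimal sketch that transposes concert-pitch chord symbols for Bb and Eb instruments. The names are hypothetical, and it makes simplifying assumptions: roots are always respelled with flats, octaves are ignored, and rarer spellings (Cb, E#) and slash-chord bass notes are not handled:

```python
NOTES_FLAT = ["C", "Db", "D", "Eb", "E", "F", "Gb", "G", "Ab", "A", "Bb", "B"]
SHARP_TO_FLAT = {"C#": "Db", "D#": "Eb", "F#": "Gb", "G#": "Ab", "A#": "Bb"}

# Semitones added to the concert pitch class to get the written one:
# Bb instruments read a major second higher, Eb instruments a major sixth higher.
WRITTEN_OFFSET = {"C": 0, "Bb": 2, "Eb": 9}


def transpose_chord(chord, instrument):
    """Transpose a concert-pitch chord symbol such as 'F#m7' for an instrument."""
    # The root is one letter plus an optional accidental; the rest is the quality.
    root_len = 2 if len(chord) > 1 and chord[1] in "#b" else 1
    root, quality = chord[:root_len], chord[root_len:]
    root = SHARP_TO_FLAT.get(root, root)
    index = (NOTES_FLAT.index(root) + WRITTEN_OFFSET[instrument]) % 12
    return NOTES_FLAT[index] + quality


changes = ["Dm7", "G7", "Cmaj7"]
print([transpose_chord(c, "Bb") for c in changes])
print([transpose_chord(c, "Eb") for c in changes])
```

A real tool would also need key-aware enharmonic spelling and would have to move the notes on the staff, which is exactly the information a scan does not expose.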

I got digging into the state of optical music recognition and came away concluding that music is basically a greenfield for AI wherever you look. Optical music recognition is pretty terrible. AI understanding of music theory is terrible (actually looking at music that is; LLMs do okay at text descriptions of theory concepts where you can imagine some online texts making it in).

I think the issue is that we still don't have great digital formats that encode the dots on paper that musicians read. Music notation is pretty rich. MIDI doesn't capture all of what's needed for symbolic understanding, because it was mostly made for capturing aspects relevant for playback or performance. MusicXML seems to be the closest for a digital format that encodes the information a musician would want, but there aren't great corpora of training data that would connect a MusicXML representation to sheet music images or to audio. I think that's because MusicXML falls short of encoding enough information to engrave music. Tools like MuseScore need to track a bunch of layout information that isn't encodable in MusicXML. Lilypond format is less verbose than MusicXML and contains a bit more information that is useful to the score creators, but most people don't create sheet music in Lilypond. (As an aside, Lilypond bums me out with the state of jazz fonts. I hate looking at "legit" scores in a jazz context.)

I realize this is mildly off topic, but every time I see people making incremental gains on OCR, which to my mind is pretty good, I am reminded of how abysmal OMR is.

  • kwon-young a day ago

    So, the format used by musicologists and music researchers is the MEI format: https://music-encoding.org/ for which the reference engraver is Verovio: https://www.verovio.org/index.xhtml Note that Verovio is able to engrave in SVG format while keeping a maximum of information from the original MEI score, meaning that you can extract enough metadata to create an actual detection dataset for a deep learning model. This is my horrible hacked-up script that will create a COCO dataset from a set of scores engraved with Verovio: https://github.com/kwon-young/music/blob/main/svg2pl.py I have published a synthetic music score dataset from this: https://www.kaggle.com/datasets/kwonyoungchoi/trompa-coco/da... If anyone wants to try and fit a detector on top, they are welcome :)

    OMR is so neglected because most people vastly underestimate the difficulty of the task. It has a specific blend of the most extreme shapes combined with an extremely complicated graphical grammar...

    • peatmoss a day ago

      Thank you for this! Both MEI format and the Verovio engraver are news to me. I will check them out.

      My first thought was whether MEI format is being added to MuseScore (the sheet music editor I use these days). It looks like it is: https://music-encoding.org/musescore-doc/

      As a somewhat related aside, now that the MuseScore people own Hal Leonard and seem to be pushing integration with their cloud subscription service, I wonder if they'll see some of these directions as potentially competing with them. I don't think there's anyone who wouldn't love a transposable clean digital version of their Real Books... and if Hal Leonard is in the business of selling Real Books, I can see where good OMR might be a problem for them. I guess piracy of scanned versions is already rampant, so maybe it's a wash.

  • indiv0 a day ago

    > music is basically a greenfield for AI wherever you look

    AIN'T THAT THE TRUTH.

    My girlfriend is studying musicology and she has some physical disabilities that make it difficult for her to write things down sometimes. So I try to help her by writing some AI-powered TTS/OCR/etc. apps here and there. It becomes painfully obvious that music was never considered an important part of any AI training dataset, anywhere.

    These days, I'm pleasantly surprised by how well Opus 4.8 understands/explains music theory (as you said). But ask him to transcribe/OCR/OMR some sheet music and he'll confidently give you the MusicXML/Lilypond equivalent of "2 + 2 = horse".

    I really hope this ignored area will be swept up with the rest of the rising AI wave, but it's still criminally undervalued.

    • mejutoco a day ago

      > how well Opus 4.8 understands [...] and he'll confidently

      I always think of the nun character against AI in Mrs Davis:

      > "Don't give it a name. No one calls Facebook Doug. No one calls Twitter Mary Lou. No one calls them anything, because no one uses them anymore. They use it, and it's not a person. It's code." - Mrs Davis

      • indiv0 19 hours ago

        You're not wrong, but if I'm talking to a Chinese Room, I'm still going to use pronouns and all sorts of meatbag-specific language. It doesn't matter if there really is a real person on the other end or not -- it's easier for me to just default to the assumption that there is. Monkey brain gonna anthropomorphize.

        On the other hand if I try to talk to Facebook, all he says in response is "200 OK".

        • mejutoco 13 hours ago

          I am not saying it is wrong to call Opus "he". I am saying it reminds me of that show, that is all. No further intention from my side.

      • mft_ a day ago

        Eh, humans give names to and/or anthropomorphise lots of things. My partner names all of her cars and bikes; I don't. Isn't it more rational to feel some sort of connection and anthropomorphise a tool with which you can at least have an intelligent conversation, than a simple machine?

    • peatmoss a day ago

      I recently left a job where I was working with open data producers / providers across a lot of domains. A lot of data is produced and released for free by governments and nonprofits because it's either directly part of the mission, or it's a natural byproduct of the organization's mission. Occasionally, you'd have really great datasets come out of industry / commercial organizations because the data were a byproduct and didn't create a scenario where a data release would create opportunity for competition.

      I've been thinking about what kind of organization could be self-sustaining and also produce good music AI training data as a natural byproduct. An ideal arrangement would be something that provided some incentive or benefit to musicians in exchange for their recorded interpretation of sheet music. Soundslice, mentioned by another user, seems to do that. They let both teachers and students upload recordings of music that has been turned into MusicXML. The recordings, paired to those snippets of sheet music, have to be a gold mine, assuming they have enough users. If they aren't already working on stem separation and automatic transcription, they probably should be. Still, my hope would be to figure out some kind of sustainable model where that dataset could be created and released for open model development...

      As a domain, I see AI in music as a boon to human creativity. I am very much a novice jazz improvisor, and a passable amateur technician on the trombone. Human instructors can do a lot for me, but there's a lot that is "grinding it out" repetition, where I think AI could be a huge aid. I heard Sam Harris on a podcast recently talk about his bullishness on the humanities (paraphrasing: people don't care if a human reads their MRI if detection is good, but people probably do care that a human wrote the novel they're reading).

      Music might even be a better example of the irreplaceability of people. While some people might bop along to a tune composed by Suno on the radio, live music is just so much more enjoyable for me. And even better than listening to a live show played by masters, is playing together with friends. To the extent that AI can patiently help us learn the skills to express our own creativity, I'm here for it!

  • singpolyma3 a day ago

    What about sheet music typesetting formats like https://abcnotation.com/ ?

    • peatmoss a day ago

      I forgot to mention ABC. I have seen a few LLMs look at that. There was a model / paper published a couple years back called ChatMusician that built around it.

      With the caveat that I'm not terribly fluent in ABC, it seems to me that simple things are simple, but hard things seem to be nearly pathological. And (again, maybe a lapse in my understanding) it seems like there may be a fair number of concepts that are impossible to convey in ABC?

      Lastly, if I understand correctly, ABC got its start and is mostly popular as a simplified format for church songbooks. I'd imagine that would, uh, influence the training corpora towards sounding a bit... church songbooky.

      EDIT: I may have been overly dismissive of ABC on first glance. It does seem like people have extended it quite a bit, and that it's at least, in theory, capable of encoding most of what I'd expect. And it's human readable, which is a benefit. Though, readability does take a stiff penalty the more richness you add (e.g. dynamics, articulations, stacked notes, etc)

      • necubi a day ago

        ABC was originally designed for European folk music [0], not church music. The corpus as a result is largely fiddle tunes, particularly Irish (see for example https://thesession.org).

        ABC started very simple, because most of the performance information for folk music isn't written down, it's inferred by the player according to the idiom of that particular tradition. As usage of ABC has grown it's gotten more powerful but still falls far short of formats designed for western classical, like MusicXML or Lilypond.

        [0] https://abcnotation.com/history

  • elasticdog a day ago

    For just chord analysis, there's "Harte notation", which is meant to be unambiguous representation of the notes (https://ismir2005.ismir.net/proceedings/1080.pdf). That obviously doesn't get you all of the additional information necessary for engraving and full representation of the music, but there are research datasets available using it like https://github.com/smashub/choco. I've also used the https://github.com/MarkGotham/When-in-Rome dataset for some analysis work, but again that's not 100% what you're looking for.

    You might like the "iReal Pro" app for the replacement and transposition of jazz standards on your tablet. It's pretty great for that use case versus camera scans.

  • mcbetz a day ago

    I follow the music OCR space, and the only really good solution so far is Soundslice. You scan, review some edge cases, and get really good results. It's a paid service by a small company, very worthy of support!

    • peatmoss a day ago

      I just signed up for a trial, and uploaded a messy Real Book scan. It did very well! It missed the coda markings, but then again the directive in the Real Book was nonstandard. I guess that's a case where a multimodal model might have been able to read the text ("after solos, D.C. al coda") and do something smarter.

  • genxy a day ago

    Create a benchmark for this problem that researchers can easily run and the problem will solve itself.

  • WhitneyLand a day ago

    “there aren't great corpora of training data that would connect a MusicXML representation to sheet music images or to audio”

    It may not be necessary…a lot of the training pairs/data for this could probably be procedurally created via code.

    Would be pretty fun to work on and see it come to life.

    • peatmoss a day ago

      I'd imagine that rendered audio that just used midi voices (even high quality "Real Instruments" midi voices) would be pretty brittle for e.g. stem separation or automatic transcription. In a best case, I think you'd start with a clean digital representation, render sheet music imagery, and then have lots of recordings by a bunch of real instrumentalists playing the same music.

      On the topic of stem separation, I've wondered about creating a quasi-synthetic dataset by taking chunks of recordings by real musicians playing them back in a real space in various combinations and recording the resulting analog-blended cacophony. Could repeat in various environments like cathedrals, basement bars, etc for realism :-)

  • aidenn0 a day ago

    As someone who has never looked at a jazz score, can you share an example of how jazz sheet music would benefit from different fonts?

    • peatmoss a day ago

      It's just an entrenched aesthetic preference. Jazz fonts (fonts in this context refers both to the words and the music symbols) tend to be quite heavy with thick lines. I've heard that the thick hand-written style was originally to make charts more readable in dimly lit clubs, but with tablets and such, that's an anachronism now.

      You can look at samples of Hal Leonard's Real Book(s) on their website to get a sense of what it looks like. Again, just an aesthetic preference, but one I and many others hold nonetheless.

      • elasticdog a day ago

        I also don't love the conventional handwritten aesthetic you often see for jazz fonts. For a project I've been working on, I ended up pulling the handful of chord symbol glyphs out of MuseScore's Leland Text font and adjusting them for use in the UI since I couldn't find a suitable option out there.

  • ramses0 a day ago

    So I made a comment a while back about lilypond: https://news.ycombinator.com/item?id=46148831

    A salient extract:

    ...but why is it so complicated? A novice interpretation of "music" is "a bunch of notes!" ... my amateur interpretation of "music" is "layers of notes".

    You can either spam 100 notes in a row, or you effectively end up with:

        melody   = [ a, b, [c+d], e, ... ]
        bassline = [ b, _, b,     _, ... ]
        music = melody + bassline
        score = [
           "a bunch of helper text",
           + melody,
           + bassline,
           + page_size, etc...
        ]
    
    ...so Lilypond basically made "Tex4Music", and the format serves a few dual purposes...[snip]

KitN a day ago

"We would like to thank Deepseek-OCR, Deepseek-OCR-2, PaddleOCR for their valuable models and ideas."

Class Act.

  • gcr a day ago

    I don’t understand the shade being thrown?

    • nickspacek a day ago

      It's the opposite of shade, unless GP is being sarcastic. "Class act" is normally a compliment, and in the context here it sounds to me like they're congratulating Baidu/the researchers for being transparent about where their ideas came from.

      • pbhjpbhj a day ago

        To be fair, I think I see "[real] class act" almost always used sarcastically.

        • squidbeak a day ago

          I've never seen it used that way.

          Any compliment can be repurposed as sarcasm, but it's obscenely cynical to immediately assume a compliment is sarcastic - instead of just a compliment. And by the way, there's no 'real' in the poster's message.

          • pbhjpbhj a day ago

            Yet you responded to my comment in the most cynical way possible. I was excusing the misunderstanding -- I assumed that the parent might only have seen it used cynically as that is a charitable way to interpret the apparent miscommunication and is quite possible. You shit on my comment and then told me doing such things is "cynical". Imagine a pipe before the closing square-bracket if you wish -- the standard editorial convention as I learnt it was that square-bracketed terms may be present or not (in English Language prose).

            Cheers.

            • gcr 10 hours ago

              Now you understand why I thought something written in good faith might be interpreted as shade. Happens all the time.

janpeuker a day ago

Paper under https://arxiv.org/abs/2606.23050

(As a side note, I do OCR locally as a small RAG for citations I read in books and also chunk input, but merely to save RAM - interesting that this natural approach also works in a streaming model)

lacoolj a day ago

This looks more promising than what Mistral just launched (coincidence?????? i think not.)

This approach feels like it could be used for image gen as well (in some combination). Read/view image, start drawing image using illustrator/inkscape/etc (or just SVG), then fill in with what was missed after

jbarrow 18 hours ago

I'm always glad to see more multi-page work in VLM-based OCR. Especially single-pass. One of the few other multi-page papers from recently, MinerU-Popo, treats fixing up multi-page outputs as a post-processing correction step (https://arxiv.org/abs/2605.24973). Interesting to see the drop-off in quality as you up page count, though.

I also think the attention approach (always attend to the image/prefix, with a sliding window for local context) is neat!

I do wish they updated their comparison table to include more recent work (that scores marginally better on OmniDocBench), like dots.mocr.

  • vrc 17 hours ago

    What are your thoughts on the detector --> VLM pipelines, and if there's ever a world where a small LM or LM augmented detector can be efficient enough to play a role as router. I ask because I recognize you from your handle and am very familiar with your work in the doc+detector space.

arboles a day ago

I'm going to sound like I live under a rock, but what is the true reason companies open-source genuinely good software?

Shouldn't Baidu (or Google) hoard it for themselves to extract the value in a way the competition isn't able to imitate?

  • SirYandi a day ago

    Some people working in big companies believe in the ideals of open source and convince their employers to allow open sourcing a project.

    Employers get prestige (useful for the hiring funnel) and sometimes strategically disrupt competitors (e.g. Meta releasing Llama)

  • jerrygenser 21 hours ago

    Releasing open source models can drive revenue away from the US AI labs. This can help China win by depriving those labs of revenue for further investment in winning the long-term race.

pmarreck a day ago

my attempts at using AI to do OCR have always resulted in invented artifacts, which is not production feasible. does this suffer from that as well?

A simple example is words that are supposed to be in other languages being automatically translated to English, which ruins the effect

  • pbhjpbhj a day ago

    You almost don't want [super-]word level ML (ie word-pair/phrase/sentence/document/corpus level).

    In transcription, you want near certainty, or you want marking that the word could not be read with certainty - yes, context lets you guess, but you want - for some OCR - to know when it's a guess based on other than the letters in order forming a word.

    Example, in a census document on familysearch.com the transcriber "corrected" a name as Joseph. The literal letters in the handwritten document spell Josepth ... and sure enough that's a local variant spelling (Eire).

    In another document the writer has used "Joh" as an abbreviation, a [human, I assume] transcriber put that as John ... which is most likely, but happens to be wrong.

    Sometimes you care that it's guessed, sometimes you want just the best guess.
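
    A minimal sketch of the "mark what was guessed" idea, assuming the OCR engine exposes a per-word confidence. The (text, confidence) pairs, the threshold and the `[word?]` marker are all invented for illustration:

    ```python
    def render_transcript(tokens, threshold=0.9):
        """Join (text, confidence) pairs, flagging words below the threshold."""
        parts = []
        for text, confidence in tokens:
            if confidence >= threshold:
                parts.append(text)
            else:
                # Keep the literal reading but make the uncertainty visible.
                parts.append(f"[{text}?]")
        return " ".join(parts)


    # Hypothetical engine output: the literal letters plus a confidence score.
    tokens = [("Josepth", 0.97), ("son", 0.99), ("of", 0.99), ("Joh", 0.41)]
    print(render_transcript(tokens))
    ```

    Setting `threshold=0.0` gives the plain best guess, which covers the case where you only want the most likely reading.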

    • messe a day ago

      > Eire

      A nitpick, because it's often a dogwhistle: but almost nobody in Ireland calls it that when speaking English. And that's still incorrect in Irish, the correct spelling is Éire.

      • pbhjpbhj a day ago

        By saying it's a dogwhistle are you saying that not adding the correct diacritics is considered racist by Irish people? If I change the rest of the sentence to Na Gaeilge will that be better.

        • messe a day ago

          No, I'm saying that it's associated with a certain outdated and bigoted attitude toward the Irish.

          Using Éire in English, would be seen as odd. You wouldn't say Deutschland or Danmark.

          > If I change the rest of the sentence to Na Gaeilge will that be better.

          No. And you've used the genitive instead of nominative there, so I have some doubts that you could.

          • pmarreck 3 hours ago

            Not OC but couldn't help commenting here because I think this is a problem of subjectivity vs. intent and the ambiguity introduced by text.

            1) I am of German descent so I'd definitely use Deutschland to appear fancy or play with words when speaking of Germany, without any bias implied or meant.

            2) The problem with believing in dogwhistles (whether they exist or not, and I know they do, but bear with me) is that the "perceived dogwhistle surface area" increases in proportion to your belief in the prevalence of dogwhistles. In other words, the more firmly you are looking for "plausibly deniable" racist terms, the more you will find terms that were actually intended to be innocent, to be offensive, and the more upset you will be in the world, AND the more annoyed people will get with you if they are not subscribed to the whole "we must avoid any possible term that could remotely be misconstrued as a plausibly-deniable dogwhistle for fear of offending someone" worldview.

            I would have absolutely used Éire but in a friendly way, and you're saying it would be perceived as a dogwhistle. Best to clarify what the person who typed it meant, before jumping to conclusions, sir. Not everyone is interested in filling their mind with extra rules just to cater to others' insecurities.

            Lastly, your comment violates the https://en.wikipedia.org/wiki/Principle_of_charity , which is a good principle for everyone to maintain.

  • drakmo a day ago

    If I wanted to achieve 100% recognition results I would combine this method with an image model recreating the original document from the transcribed text and matching the layout. One can do that by using all but the page or paragraph from the document you want to recreate (to avoid recreating the exact passage under test from the image artifact directly). After reconstructing you can do an optical comparison that specifically matches misaligned characters and find the errors. Rinse and repeat. Expensive but it would guarantee 100% recognition.

  • peterderivaz a day ago

    I've been trying out this model on a 4090 to transcribe a Japanese grammar pdf (written in English with lots of Japanese examples) and it seems to be working very well from the small parts I have double checked. The output contains both the kanji/hiragana and English as appropriate without attempting any translation.

    It has converted about 200 pages in an hour.

  • aliljet a day ago

    I'm curious about this. What models/tools have you been using?

manipalite a day ago

Whatever happened to Reducto? It was very promising 12-15 months ago.

gettingoverit a day ago

How does it compare against Finereader? Comparisons against transformer-based OCRs don't really tell us anything. The last time I checked, neither of them were of "OCR this legal document" quality.

overflowy a day ago

What are the requirements for running this locally?

piterrro a day ago

Can someone explain how this is different from feeding the VLM one page at a time?

alansaber a day ago

We've invented chunking? We are so back.

shevy-java a day ago

Is this an academic paper that is published in year xyz, but in +5 years nobody will remember it anymore?

ramon156 a day ago

I love that the entire goal is to push Deepseek OCR further. The west can learn greatly from these companies

Oras a day ago

OCR has been solved long time ago with vision models. Solutions are consistent, reliable, and stable. What is the point of reinventing the wheel?

I would definitely understand post processing, like extracting data, answering question .. etc, but why re-doing the OCR engine itself?

  • joss82 a day ago

    I've been working on Parseur for the last 10 years, and OCR has not been solved yet, let me tell you.

    OCR still sucks in 2026. Hopefully this might improve the situation but I haven't tested it yet.

  • chpatrick a day ago

    It absolutely hasn't been solved, it's just got pretty decent in recent years.

    • malfist a day ago

      Pretty decent might be quite the stretch. I'd term it almost acceptable, but only if you're using commercial solutions like Amazon's Textract. Doing it with open source tools is, at best, extremely painful and vaguely accurate.

      • chpatrick a day ago

        PaddleOCR (also from Baidu) is pretty damn good actually.

        • __rito__ a day ago

          I have shipped with PaddleOCR to prod. Works pretty well. (Usage limited to printed documents in the Anglosphere.) Runs fully offline, on CPU.

  • gettingoverit a day ago

    Is it? I've never seen a single OCR that would replace a human just typing it by hand.

    What if the goal is something actually useful, such as converting a scientific paper PDF back to LaTeX that renders into a pixel-perfect copy? What about converting tables from electronics datasheets into computer-readable form? I wouldn't even expect it in the next decade.

    • SyneRyder a day ago

      I've had success with vision models & OCR, saved me many hours / days / weeks of typing work.

      Last year I finally OCR'd many hundreds of pages of my father's old writings. I found that feeding it to Claude Sonnet 4.x via API gave me results that were perfect. No corrections required. So perfect, that Claude was reading along with the story, and actually pointed out a continuity error in the story where an incorrect character was reciting dialog. Claude asked if it should transcribe exactly as is or if I would like Claude to correct the continuity error.

      Claude also correctly OCR'd some handwriting that was in the margins of the documents. Sonnet came very close to transcribing a Word Sleuth puzzle, but that was where I hit the limits of its capability at the time.

      Mistral OCR was also good (and actually what I started with), but it wasn't quite as good as Claude. And when it was wrong, Mistral could be frighteningly wrong - one API call must have failed, the model must have been presented with a pure black / null image, and I got back a "transcription" that described neverending darkness. It read like something the Woodsman would have broadcast in Twin Peaks S3E8. That poor model.

      Tables from electronics datasheets might be okay; I think I've had success with OCR of technical manuals with tables for 80s synthesizer hardware. But I admit my use cases don't cross over into transcriptions of equations or graphs.

  • sscaryterry a day ago

    Detecting characters almost, layout no.

    • wongarsu a day ago

      Exactly my experience. If you try to OCR hand-filled forms with a fixed structure, traditional OCR models are great. Vision-llms can improve a bit on character recognition, but at the cost of harder-to-detect failure modes.

      But if you are trying to ingest diverse documents with headings, multi-column layouts, headers and footers, ad space in the middle of your text, etc, vision-llms are a giant step forward. But you need the context of the previous page to make good decisions about the current page, which is where things quickly get janky (or slow, if you choose the naive approach)

      Vision-llms also seem to deal much better with variance in scripts. Cursive, random Japanese in the middle of the text, weird math symbols, handwriting from three centuries ago, all "just works" without you even having to remember that this can happen
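
Editor's sketch of the "context of the previous page" idea above: a minimal loop that hands each page's transcription call the tail of the previous page's text. The `transcribe_page` callable and the `context_chars` parameter are hypothetical placeholders for whatever model call you actually use; nothing here is taken from a specific library.

```python
from typing import Callable, List


def transcribe_with_context(
    pages: List[bytes],
    transcribe_page: Callable[[bytes, str], str],  # hypothetical model call
    context_chars: int = 500,                      # assumed context budget
) -> List[str]:
    """Transcribe pages in order, passing the tail of the previous
    page's text to the next call as context."""
    results: List[str] = []
    previous_tail = ""
    for page in pages:
        text = transcribe_page(page, previous_tail)
        results.append(text)
        # Keep only the last `context_chars` characters so the context
        # stays bounded no matter how long the document is.
        previous_tail = text[-context_chars:] if context_chars > 0 else ""
    return results
```

This keeps the per-call cost roughly constant, at the price of dropping anything that happened more than one page back, which is the trade-off the comment calls janky.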

  • ljouhet a day ago

    Real question: what tool do you use? (for long/complex documents with tables, code, maths)

    - marker (with --force-ocr) gives me the best results

    - Mistral OCR (seems really great, but I never managed to get it to work)

    - Mathpix (tried a long time ago)

    - docling (gives me garbage, I must use it wrong)

    - Unlimited OCR (will try it)

    - ???

    • Oras a day ago

      - Azure Document Intelligence (has an option to return markdown too including headers and footers).

      - AWS Textract

      • badlibrarian a day ago

        Exactly. They're both very expensive and prone to surprising you. Sometimes in a good way, sometimes in a bad way. I'd rate them 85%, but you have to run a test because they both fail in different ways on the 15%.
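
Editor's sketch: if two engines really do fail on different parts of a document, one cheap test is to diff their outputs and only review the spans where they disagree. The example below uses Python's standard `difflib` on whitespace-split words; the function name and the sample strings are made up for illustration, and agreement between two engines does not prove either is correct.

```python
import difflib
from typing import List, Tuple


def disagreements(text_a: str, text_b: str) -> List[Tuple[str, str]]:
    """Return (engine A words, engine B words) for every span where
    the two transcriptions differ."""
    words_a, words_b = text_a.split(), text_b.split()
    matcher = difflib.SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    flagged = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            flagged.append((" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return flagged


if __name__ == "__main__":
    # Made-up outputs from two hypothetical engines for the same line.
    print(disagreements("Total due 1O0 EUR", "Total due 100 EUR"))
```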

    • ai_fry_ur_brain a day ago

      poma-ai has really great chunking techniques that chunk the document based on the document structure/hierarchy.

      We use it on 200-page IEEE standards that are notoriously complex, filled with tables and diagrams. Highly recommend.

  • vulture916 a day ago

    I haven't done much long-run OCR, so unsure of the current state, but it would seem they overcome this (from their paper):

    "A widely held view is that employing a large language model (LLM) as the decoder allows the model to leverage the prior distribution of language, leading to improved OCR performance. However, the downside is equally evident: as the output sequence lengthens, the accumulated KV cache drives up memory consumption and progressively slows down generation."

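Editor's sketch of the memory growth the quoted passage describes. The formula is the usual back-of-the-envelope estimate for a transformer decoder (one key and one value vector per layer, per KV head, per cached token). The model dimensions below are hypothetical and not taken from the paper, and the 128-token window is just the figure mentioned upthread. Only generated output tokens are counted; any cached image/reference tokens would come on top.

```python
from typing import Optional


def kv_cache_bytes(tokens: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size; the leading 2 is keys plus values."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


def cached_tokens(generated: int, window: Optional[int] = None) -> int:
    """Tokens kept in cache: all of them, or at most `window` of them."""
    return generated if window is None else min(generated, window)


if __name__ == "__main__":
    # Hypothetical decoder: 24 layers, 8 KV heads, head_dim 128, fp16.
    dims = dict(layers=24, kv_heads=8, head_dim=128)
    for generated in (1_000, 100_000):
        full = kv_cache_bytes(cached_tokens(generated), **dims)
        windowed = kv_cache_bytes(cached_tokens(generated, window=128), **dims)
        print(f"{generated} tokens: full={full / 2**20:.1f} MiB, "
              f"windowed={windowed / 2**20:.1f} MiB")
```

With a full cache the figure scales linearly with output length, while the windowed figure stays flat once the window is full.
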
  • cannonpalms a day ago

    I guess, in theory, the prior distribution of language would allow for improved performance in some cases, especially where input quality is low.

    • ta988 a day ago

      This is already used in OCR, tesseract uses that.

  • Aboutplants a day ago

    lol nope it hasn’t been solved. I deal with this constantly and we still have a longggg ways to go

  • mamcx a day ago

    Aside: what is the best tool for reading receipts/bank statements/invoices?

  • ta988 a day ago

    Cost, throughput, latency...

    • Oras a day ago

      Traditional OCR is faster, cheaper, and much more reliable than LLMs

      • j16sdiz a day ago

        If you consider non-English scripts, traditional OCR is not more reliable.

        CJK scripts have lots of characters and a high confusion rate.

        Arabic scripts are complex and have lots of morphs.

        Vietnamese has easily confused diacritics.

        Thai has lots of non-standard fonts.

      • ta988 a day ago

        I don't think that's a universal statement that applies to every kind of document and language. Mistral OCR is able to do things no "traditional" OCR was ever able to.

  • mschuster91 a day ago

    > I would definitely understand post processing, like extracting data, answering question .. etc, but why re-doing the OCR engine itself?

    Well... the idea seems to be (as far as I understand it, at least) that optical errors and artifacts can now be compensated for, since the OCR engine is now context-aware.

    Say, for example, a chemical with some random long-ass name. It's not going to be in a word correction database, but a context-aware engine (ideally, one that has been supplemented with chemistry data) can now correct "bad" reads of the chemical's name.

    Of course, there remains the issue of how to prevent the infamous Xerox bug [1]...

    [1] https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

  • JohnKemeny a day ago

    OCR has definitely not "been solved long time ago", what are you talking about?

    In your opinion, what is SOTA here?