A radiologist reading a CT scan, an agronomist estimating crop yield from a drone photo, and a city planner tallying traffic from a rooftop camera have never shared a tool. Each has leaned on a different specialist system, tuned to its own image type and counting conventions. A new research model called Count Anything tries to collapse those three workflows into one: hand it a sentence, and it counts the named objects in almost any kind of picture.
According to a writeup of the release published today on the-decoder.com, Count Anything accepts a free-text prompt such as "count the red blood cells" or "count the parked cars" and returns a tally along with visual markers on the image itself. The same writeup says the model is built on top of Meta's earlier image-segmentation system, a foundation model that researchers have used to build a wave of general-purpose vision tools since Meta open-sourced it. Neither the paper nor the code release is in hand at the time of writing, so those provenance details still need a primary-source check.
The mechanism the writeup describes is a deliberate two-counter design. For large, well-separated objects, the model draws a rectangle around each one, what researchers call a bounding box. For small, dense targets that would be hard to box cleanly, it places a single dot on each instance, a point marker. A final merge step compares the two outputs and, when both counters flag the same object, keeps the higher-confidence prediction. The point of the merge is to avoid double counting the same target just because two methods happened to land on it.
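To make the merge step concrete, here is a minimal Python sketch of the idea, not the model's actual code. The writeup does not say how Count Anything decides that a box and a point refer to the same object, so this sketch assumes the simplest rule: a point that falls inside a box is treated as a duplicate of it. The names Box, Point, contains, and merge_detections are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class Box:
    """Hypothetical box detection: corner coordinates plus a confidence score."""
    x1: float
    y1: float
    x2: float
    y2: float
    score: float


@dataclass(frozen=True)
class Point:
    """Hypothetical point detection: one dot plus a confidence score."""
    x: float
    y: float
    score: float


def contains(box: Box, pt: Point) -> bool:
    # Assumed matching rule: a point inside a box marks the same object.
    return box.x1 <= pt.x <= box.x2 and box.y1 <= pt.y <= box.y2


def merge_detections(boxes: List[Box], points: List[Point]) -> List[Union[Box, Point]]:
    """Pair each box with at most one point and keep the more confident of the two."""
    kept: List[Union[Box, Point]] = []
    used = set()  # indices of points already paired with a box

    # Visit the most confident boxes first so they get first pick of points.
    for box in sorted(boxes, key=lambda b: b.score, reverse=True):
        candidates = [
            i for i, p in enumerate(points)
            if i not in used and contains(box, p)
        ]
        if not candidates:
            kept.append(box)  # only the box counter saw this object
            continue
        best = max(candidates, key=lambda i: points[i].score)
        used.add(best)
        # Both counters flagged the object: keep the higher-confidence one.
        kept.append(box if box.score >= points[best].score else points[best])

    # Points that matched no box were seen only by the point counter.
    kept.extend(p for i, p in enumerate(points) if i not in used)
    return kept


boxes = [Box(0, 0, 10, 10, 0.9), Box(20, 20, 30, 30, 0.4)]
points = [Point(5, 5, 0.6), Point(25, 25, 0.8), Point(50, 50, 0.7)]
merged = merge_detections(boxes, points)
print(len(merged))  # 3: one box, one point that beat its box, one unmatched point
```

Without the merge, the two counters would report five detections for three objects. The one-to-one pairing is a deliberate simplification: a large box drawn over a dense cluster would absorb only one of the points inside it, which is one reason a real system would likely need a stricter matching rule than point-in-box.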
To train the system, the team assembled a custom dataset they call CLOC, the writeup says, covering a wide range of image types and counting scenarios. The same article reports that Count Anything outperforms many existing counting systems on standard benchmarks, though it is also candid about where the model fails. The model still struggles with ambiguous terms, where a user prompt could map to several reasonable object categories, and with extremely dense scenes where targets overlap heavily. Those limits come from the launch writeup itself, not from external testing.
Those limits matter because the use cases the writeup highlights (quantifying cells in a medical scan, estimating agricultural yield, counting vehicles in a traffic feed) are exactly the situations where a wrong count is more than a rounding error. The constructive case for the model is not that it replaces the radiologist or the agronomist. It is that it gives a non-expert a fast first pass, a rough enumeration a specialist can audit, in domains where the only counting option today is to install and learn yet another bespoke system.
What to watch next: whether the underlying paper, once it circulates, confirms the benchmark gaps and the Meta basis reported in the secondary coverage, and whether the open release ships the CLOC training set or only the model weights. Both will shape how quickly independent groups can stress-test the system on the ambiguous-label and dense-scene failure modes the launch writeup already names.