OpenEye: A Parameterizable Open-Source DNN Accelerator, Measured on a Xilinx FPGA
A new preprint out of FH Dortmund and the University of Duisburg-Essen puts a working, parameterizable, sparsity-aware DNN inference accelerator on the table. It is released as open-source hardware under the Solderpad Hardware License v2.1, with the RTL, the Python-to-RTL toolchain, and the verification harness all in one GitHub repository. The accompanying arXiv paper (2606.01450), authored by Denis Lebold and Hendrik Wöhrle, has been accepted to the ARCS 2026 proceedings; together with the repository, it is the primary material available today.
What concretely exists
OpenEye is a hierarchical cluster-of-processing-elements architecture with a streaming dataflow. The default configuration is a 2×8 grid of clusters, each cluster holding 12 processing elements, for 192 PEs in total. Inside each cluster, the dataflow is row-stationary, the same family Chen et al. used in EyerissV2 (arXiv:1807.07928), but with one important difference: the knobs that EyerissV2 had to fix in a fabricated ASIC are exposed in OpenEye as top-level Verilog parameters. You don't edit the RTL to resize the array; you set CLUSTER_COLUMNS, CLUSTER_ROWS, NUM_GLB_WGHT, NUM_GLB_PSUM, IACT_PER_PE, WGHT_PER_PE, PSUM_PER_PE, SERIAL_MACS vs. PARALLEL_MACS, and SPARSITY_EN, then resynthesize. That turns array size, buffering, and sparsity support into design-space variables rather than fixed properties of the hardware.
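The PE-count arithmetic above can be sketched in a few lines of Python. This is only an illustrative model, not part of the OpenEye toolchain: `CLUSTER_ROWS` and `CLUSTER_COLUMNS` are parameter names from the article, but which of them is 2 and which is 8 is an assumption, and `pes_per_cluster` is a hypothetical name for the 12 PEs per cluster.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpenEyeConfig:
    # Mirrors CLUSTER_ROWS / CLUSTER_COLUMNS; the 2-vs-8 orientation is assumed.
    cluster_rows: int = 2
    cluster_columns: int = 8
    # Hypothetical name: the article gives 12 PEs per cluster but no parameter for it.
    pes_per_cluster: int = 12

    @property
    def num_clusters(self) -> int:
        return self.cluster_rows * self.cluster_columns

    @property
    def num_pes(self) -> int:
        return self.num_clusters * self.pes_per_cluster


default = OpenEyeConfig()
print(default.num_clusters, default.num_pes)  # 16 192

# Doubling one grid dimension doubles the PE count.
doubled = OpenEyeConfig(cluster_rows=4)
print(doubled.num_pes)  # 384
```

A model like this is mainly useful for enumerating design points before committing to a synthesis run.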
The compute path is INT8 activations and weights with a 20-bit partial-sum accumulator, and structural sparsity is supported on both input activations and weights — meaning zeros are skipped at the dataflow level rather than via runtime bookkeeping. The on-chip pooling block does 2×2 max-pooling, and the host interface is a 64-bit DMA with an AXI-stream-style handshake.
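As a rough behavioral sketch of that compute path, the Python below accumulates INT8 products into a 20-bit partial sum and skips any pair with a zero operand. It assumes signed INT8 operands and a signed two's-complement accumulator that saturates on overflow; the article does not say whether the hardware saturates or wraps, and `sparse_mac` is a hypothetical name, not a function from the repository.

```python
INT8_MIN, INT8_MAX = -128, 127
PSUM_BITS = 20
PSUM_MIN = -(1 << (PSUM_BITS - 1))
PSUM_MAX = (1 << (PSUM_BITS - 1)) - 1


def sparse_mac(iacts, weights):
    """Accumulate int8 products, skipping pairs that contain a zero.

    Returns (partial_sum, macs_issued). Saturation is an assumption.
    """
    psum = 0
    issued = 0
    for a, w in zip(iacts, weights):
        if not (INT8_MIN <= a <= INT8_MAX and INT8_MIN <= w <= INT8_MAX):
            raise ValueError("operands must fit in signed int8")
        if a == 0 or w == 0:
            continue  # a zero operand contributes nothing, so no MAC is issued
        psum = max(PSUM_MIN, min(PSUM_MAX, psum + a * w))
        issued += 1
    return psum, issued


# Worst-case headroom: the largest product magnitude is (-128) * (-128) = 16384.
worst_product = INT8_MIN * INT8_MIN
safe_accumulations = PSUM_MAX // worst_product

print(sparse_mac([3, 0, -5, 7], [2, 9, 0, -4]))  # (-22, 2)
print(safe_accumulations)  # 31
```

Under these assumptions, a 20-bit signed accumulator is guaranteed to hold 31 worst-case products without overflow; real workloads rarely hit that bound, but it is worth knowing when sizing reductions per partial sum.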
Two builds are checked into the OpenEye repository. OpenEye_Parallel.v is the bare compute core, written as portable ASIC-style RTL. OpenEye_FPGA.v is the ~3,050-line top-level wrapper that adds the DMA burst interfaces and BRAM buffers you need to actually deploy on a Xilinx FPGA.
The toolchain, end to end
The Python side takes a TFLite or Keras model, runs per-filter affine quantization, compiles each layer into DMA streams, drives the RTL simulation, and verifies the outputs against the reference. Verification is done with CocoTB, which is the practical choice if you want to keep the testbench in Python rather than SystemVerilog/UVM. From a researcher's perspective, this means a single Python script can take a small CNN all the way from a Keras model to bit-exact results in simulation.
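To make the per-filter affine quantization step concrete, here is a minimal pure-Python sketch of the textbook formula q = round(w / scale) + zero_point, with one scale and zero point per filter. It is not the OpenEye toolchain's code; the function names and the flat-list representation of a filter are assumptions made for illustration.

```python
def quantize_per_filter(filters, qmin=-128, qmax=127):
    """Quantize each filter (a flat list of floats) with its own scale and zero point.

    Returns a list of (q_values, scale, zero_point) tuples, one per filter.
    """
    results = []
    for weights in filters:
        # Include 0.0 in the range so that real zero maps to an exact integer.
        lo = min(min(weights), 0.0)
        hi = max(max(weights), 0.0)
        # An all-zero filter has zero range; fall back to scale 1.0.
        scale = (hi - lo) / (qmax - qmin) or 1.0
        zero_point = int(round(qmin - lo / scale))
        q = [max(qmin, min(qmax, int(round(w / scale)) + zero_point))
             for w in weights]
        results.append((q, scale, zero_point))
    return results


def dequantize(q, scale, zero_point):
    """Map quantized integers back to approximate real values."""
    return [(v - zero_point) * scale for v in q]


filters = [[-1.0, 0.0, 0.5, 1.0], [0.0, 0.02, 0.04]]
for q, scale, zp in quantize_per_filter(filters):
    print(q, scale, zp, dequantize(q, scale, zp))
```

Giving each filter its own scale keeps a filter with small weights from being crushed by the range of a filter with large ones, which is the usual motivation for per-filter schemes.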
The scaling result
The headline result, according to the arXiv preprint, is that routing and interconnect overhead scales near-linearly with the number of PEs as the cluster grid is enlarged: the cost of adding more PEs is roughly proportional to the number added, not quadratic in the array size. Multiple cluster/PE configurations were synthesized and measured on a Xilinx ZU19EG FPGA, and the authors report favorable performance and resource trade-offs across those design points. This is the falsifiable claim the rest of the paper hangs on: a parameterizable PE array that does not fall off a cliff as you scale it.
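One way to test a near-linear scaling claim against your own synthesis reports is to fit the slope of log(cost) against log(PE count): a slope near 1 indicates linear growth, a slope near 2 quadratic. The sketch below does that with an ordinary least-squares fit. The function name is hypothetical, and the sample numbers are synthetic placeholders chosen to exercise the code; they are not measurements from the paper.

```python
import math


def scaling_exponent(points):
    """Least-squares slope of log(cost) versus log(num_pes).

    `points` is a sequence of (num_pes, cost) pairs with positive values.
    """
    points = list(points)
    if len(points) < 2:
        raise ValueError("need at least two design points")
    xs = [math.log(n) for n, _ in points]
    ys = [math.log(c) for _, c in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("design points must differ in PE count")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic placeholder data, NOT results from the preprint.
pe_counts = (96, 192, 384, 768)
linear_like = [(n, 50 * n + 1000) for n in pe_counts]  # fixed cost + per-PE cost
quadratic = [(n, n * n) for n in pe_counts]

print(round(scaling_exponent(linear_like), 2))
print(round(scaling_exponent(quadratic), 2))
```

Feeding in LUT, register, or routing-resource counts from your own runs at several grid sizes gives a single number to compare against the authors' characterization.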
What is not demonstrated
It is worth being precise about what is and is not in the source material.
This is an FPGA accelerator, not a fabricated ASIC. There is no silicon result, no tape-out, and no claim of one.
The measurements reported in the paper are on a single Xilinx device (ZU19EG). The preprint provides no results on Intel, Lattice, Microchip, or any other non-Xilinx part.
The paper is an arXiv preprint submitted on 31 May 2026. It has been accepted to the ARCS 2026 proceedings, but the camera-ready and any subsequent peer-reviewed corrections are not yet out.
There is no public adoption, deployment, or ecosystem metric — no customer, no production traffic, no SDK maturity claim. OpenEye is a research artifact handed over from FH Dortmund to the UDE Electronic Components and Circuits group in March 2025.
The "near-linear scaling of routing and interconnect overhead" is the authors' own characterization based on their own synthesis runs. The preprint reports no independent third-party benchmark of the released RTL.
What you can actually do with it
If you are an open-hardware researcher, a graduate student, or an FPGA engineer, the design gives you several concrete starting points:
Study the dataflow and buffering choices against a published baseline. The [OpenEye_Parallel.v](https://github.com/Learning-Chips-Lab/OpenEye) core is small enough to read end-to-end.
Replicate the results on the same ZU19EG board, or on a smaller Zynq part, and check whether the scaling behavior holds on your tool version.
Extend the layer set. The current implementation covers convolutional, fully connected, and 2×2 max-pool layers. Adding depthwise convolutions, transposed convolutions, or different activation functions is a bounded piece of work against the existing layer compiler.
Port it. The DMA wrapper is the part most tightly coupled to Xilinx IP; replacing it with an Intel/OpenCL or Lattice host interface is a tractable engineering task.
Use it as a baseline. There are other open DNN accelerators in the wild, such as Gemmini and NVDLA, but OpenEye's combination of row-stationary dataflow, native sparsity, and parameterizable cluster/PE counts is what distinguishes it as a point of comparison.
Bottom line
The arXiv preprint and the GitHub repository are the source of truth, and both are public. OpenEye is a working, modifiable, parameterizable DNN accelerator for FPGAs with measured scaling results on a real device — not a whitepaper, not a roadmap, not a fabricated chip. Read the paper, clone the repo, and decide for yourself whether the dataflow choices hold up.