Ubitium’s universal processing architecture supports several operation modes and reconfigures itself at runtime: when an application needs workload-optimised processing, the processor adapts on the fly, replacing separate CPUs, GPUs, DSPs, and FPGAs. This technology explainer details the different operation modes.
Why it matters
Modern embedded systems suffer from rapidly increasing complexity at every level. A patchwork of heterogeneous, specialized cores and processors is assembled to keep pace with ever-changing software and algorithmic requirements. Ubitium’s universal compute architecture enables workload-agnostic computation, reducing complexity and cost while shortening time-to-market.
The Technical Details
Ubitium’s design is a coarse-grained reconfigurable architecture (CGRA): an array of word-level processing elements (PEs) that can be reconfigured to match a task. Unlike FPGAs (fine-grained logic, synthesis/bitstreams) and earlier CGRAs (specialized toolchains), Ubitium processors are programmed like ordinary microprocessors. They reconfigure at runtime using standard RISC-V instructions and register/operand dependencies.
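The key point is that ordinary register operands already encode a dataflow graph the hardware can exploit. As a hedged illustration (a pure-Python sketch, not Ubitium's toolchain or hardware; the three-address instruction format and register names are invented for this example), here is how read-after-write dependencies fall out of a plain instruction stream:

```python
# Illustrative sketch: register operands in an ordinary instruction
# stream already form a dataflow graph. Instruction format is
# hypothetical: (destination register, tuple of source registers).

def build_dataflow_graph(instructions):
    """Map each instruction index to the indices of earlier
    instructions producing its sources (read-after-write edges)."""
    last_writer = {}   # register name -> index of its last producer
    deps = {}
    for i, (dest, srcs) in enumerate(instructions):
        deps[i] = {last_writer[r] for r in srcs if r in last_writer}
        last_writer[dest] = i
    return deps

# a = x + y; b = x * 2; c = a + b  (note: b does not depend on a)
prog = [("a", ("x", "y")), ("b", ("x",)), ("c", ("a", "b"))]
graph = build_dataflow_graph(prog)   # {0: set(), 1: set(), 2: {0, 1}}
```

Because instructions 0 and 1 have no edge between them, a dependency-driven fabric is free to run them concurrently, with no synthesis step or special toolchain involved.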
Universal Processing Array
At the core of Ubitium’s architecture is the Universal Processing Array (UPA), a 16 x 16 grid of PEs without conventional CPU structures such as reservation stations, reorder buffers, or register files. The processor is built around a novel execution model that combines three complementary modes of operation.
Out-of-Order / O-mode The UPA behaves like a CPU: the front end issues instructions in program order; PEs execute them out of order as operands become available (implicit, data-driven scheduling). This avoids the usual wakeup/select machinery of classic OoO designs, improving area/energy and exposing more instruction-level parallelism. Multiple independent instruction streams (threads) can share the array without duplicating core structures, increasing thread-level parallelism.
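The effect of operand-driven firing can be sketched in a few lines of Python. This is an illustrative timing model under simplifying assumptions (fixed per-instruction latencies, unlimited PEs, no structural hazards), not a description of Ubitium's actual microarchitecture:

```python
# Toy timing model of data-driven scheduling: instructions issue in
# program order but "fire" as soon as their source operands are
# ready, so independent work overlaps automatically.

def simulate_dataflow(instructions):
    """instructions: list of (dest, srcs, latency).
    Returns the cycle at which each result becomes available,
    assuming unlimited execution resources."""
    ready = {}      # register -> cycle at which its value is ready
    finish = []
    for dest, srcs, latency in instructions:
        start = max((ready.get(r, 0) for r in srcs), default=0)
        done = start + latency
        ready[dest] = done
        finish.append(done)
    return finish

# Two independent 3-cycle ops overlap; the final add waits for both.
prog = [("a", ("x",), 3), ("b", ("y",), 3), ("c", ("a", "b"), 1)]
times = simulate_dataflow(prog)   # [3, 3, 4], not [3, 6, 7]
```

In a strictly serial machine the same three instructions would take 7 cycles; with operand-driven execution the two independent operations complete in parallel and only the dependent add serializes.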
Loop Acceleration / L-mode Loops are detected and rolled out spatially across the array as instruction graphs. The same fabric then runs the loop as a data-flow pipeline (SIMD/MIMD as needed), reusing context across iterations for high throughput and low energy per operation. On loop exit, the array returns to O-mode immediately (no flush/stall) so PEs become available for general execution.
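The throughput argument for spatial loop mapping follows classic pipeline timing. As a hedged sketch (illustrative numbers, not Ubitium measurements; assumes one iteration can enter the pipeline per cycle with no stalls):

```python
# Why spatial loop mapping pays off: once a loop body is laid out as
# pipeline stages across the array, a new iteration can enter every
# cycle, so total cycles approach (iterations + depth) rather than
# (iterations * depth). Purely illustrative arithmetic.

def pipelined_cycles(iterations, stage_depth, initiation_interval=1):
    """Fill the pipeline once, then retire one iteration per
    initiation interval."""
    return stage_depth + (iterations - 1) * initiation_interval

def sequential_cycles(iterations, stage_depth):
    """Each iteration runs all stages before the next one starts."""
    return iterations * stage_depth

seq = sequential_cycles(1000, 4)    # 4000 cycles
pipe = pipelined_cycles(1000, 4)    # 1003 cycles
```

For a 1000-iteration loop with a 4-stage body, the pipelined form approaches one result per cycle, roughly a 4x throughput gain in this toy model, and the gain grows with pipeline depth.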
Thread Acceleration / S-mode The array executes SIMT kernels (GPU-like). PEs are grouped into SM-style units; many scalar threads run the same program with different indices. This covers per-element tensor ops and reductions without moving data to a separate device.
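The SIMT execution style of S-mode can be sketched in plain Python: every "thread" runs the same kernel with a different index. This is a minimal conceptual model (the launch helper and kernel are invented for illustration; real SIMT hardware runs the threads in lockstep groups rather than a Python loop):

```python
# Minimal SIMT-flavoured sketch: the same kernel runs once per
# thread index, the way S-mode runs many scalar threads over the
# same program with different indices. Names are illustrative.

def launch(kernel, n_threads, *buffers):
    """Run the kernel once per thread index, GPU-grid style."""
    for tid in range(n_threads):
        kernel(tid, *buffers)

def saxpy_kernel(tid, a, x, y, out):
    # Per-element tensor op: each thread handles exactly one index.
    out[tid] = a * x[tid] + y[tid]

x = [1.0, 2.0, 3.0]
y = [10.0, 20.0, 30.0]
out = [0.0] * 3
launch(lambda tid, *bufs: saxpy_kernel(tid, 2.0, *bufs), 3, x, y, out)
# out is now [12.0, 24.0, 36.0]
```

On a conventional system this kernel would be offloaded to a GPU; on a single fabric the same data stays in one address space, so no host-device copy precedes or follows the launch.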
O-mode handles irregular control paths with low latency; L-mode sustains deterministic throughput for structured kernels; S-mode provides wide data parallelism. All three live on one fabric and one address space.
System-level effects
Ubitium’s Universal RISC-V Processor is designed to deliver tangible benefits where it matters most: latency, efficiency, simplicity, and cost.
Latency In heterogeneous designs, every offload adds buffering, bus transfers, driver calls, and queueing. Collapsing the pipeline onto one fabric removes those hand-offs; stage boundaries become intra-fabric synchronizations. Deadlines become shorter and more predictable (important for audio frame budgets, radar CPIs, and symbol-level comms).
Efficiency Energy wasted on host-device copies disappears. Pipelines stay resident: sequential control in O-mode, streaming loops in L-mode, and data-parallel inference in S-mode. Developers can update DSP kernels or AI models in software without new accelerators.
Simplicity One processor, one address space, one toolchain (RISC-V). Teams debug and profile in a single environment instead of stitching DSP/GPU/CPU stacks together. Fewer integration edges and shorter development time.
Cost Fewer chips, smaller boards, less external memory bandwidth, and less glue software. The savings apply to BOM and engineering effort – from prototype through maintenance.
The Takeaway
Ubitium’s Universal RISC-V Processor runs control code, streaming signal processing, and SIMT workloads on a single execution fabric. It reduces system complexity, lowers cost, and accelerates development by eliminating device boundaries. The same software toolchain applies regardless of workload; when algorithms change, the hardware does not become obsolete.