
Local AI inference crossed a threshold this month. AMD's first-party Ryzen AI Halo desktop opened pre-orders in June 2026 at $3,999, built on the same processor platform that powers a lunchbox-sized Chinese mini PC already available for roughly half the price. The GMKtec EVO-X2, built around AMD's Ryzen AI Max+ 395 chip, is the first x86 machine capable of loading a 235-billion-parameter model into a single unified memory pool, and for developers paying $440 a month in cloud inference subscriptions, the hardware pays for itself in under a year.
That calculation is real, but it is not the whole picture. Understanding what this chip actually does — and where it stops — requires a brief look at the architecture behind it.
Every major bottleneck in consumer AI inference since 2020 has pointed to the same problem: discrete graphics cards run out of video RAM before they can load the models that matter. An Nvidia RTX 4090 — a $1,600 card — tops out at 24 gigabytes of VRAM. A 70-billion-parameter model at Q4 quantization requires roughly 40 gigabytes. The model does not fit. The developer pays for cloud access instead.
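The capacity problem comes down to simple arithmetic, sketched below. The 4.5 bits-per-weight average for Q4-style quantization is an assumption chosen for illustration (it lands close to the roughly 40 gigabytes quoted above), and the function names are hypothetical. The estimate covers weights only and ignores KV cache and runtime overhead.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


def fits(params_billion: float, bits_per_weight: float, memory_gb: float) -> bool:
    """True if the weights alone fit; KV cache and runtime overhead are ignored."""
    return weights_gb(params_billion, bits_per_weight) <= memory_gb


# Assumption: Q4-style quantization averages about 4.5 bits per weight.
Q4_BITS = 4.5

size = weights_gb(70, Q4_BITS)
print(f"70B at ~{Q4_BITS} bits/weight: {size:.1f} GB of weights")
print("Fits in 24 GB of VRAM:", fits(70, Q4_BITS, 24))
```

Real model files vary with the quantization scheme and context length, so this is a lower bound on what the runtime actually needs.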
The AMD Ryzen AI Max+ 395 solves this with a different architecture. Built on TSMC's 4-nanometer process, it integrates 16 Zen 5 CPU cores, an RDNA 3.5 integrated GPU with 40 compute units, and an XDNA 2 neural processing unit in a single package, all sharing one pool of 128 gigabytes of LPDDR5x-8000 memory. Because the CPU and GPU draw from the same physical memory pool over an on-package interconnect, there is no PCIe bus to cross and no VRAM ceiling imposed by a discrete card's onboard GDDR6X. The GPU, identified in software as gfx1151, can address up to 96 gigabytes of that pool as effective VRAM, four times the capacity of an RTX 4090.
Memory bandwidth on this chip reaches approximately 256 gigabytes per second in theory; real-world tests on Linux with ROCm enabled measure around 215 gigabytes per second. That bandwidth figure matters more than raw compute for the workload in question: token generation in large language models is bound by memory bandwidth, not compute. Each forward pass reads the active model weights from memory to produce one token, so the faster the memory bus moves data, the faster the chip produces tokens.
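The theoretical figure can be reproduced from the memory specification. The sketch below assumes a 256-bit memory bus, which the article does not state; the 8000 MT/s transfer rate comes from the LPDDR5x-8000 designation, and the function name is hypothetical.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: float, bus_width_bits: int) -> float:
    """Peak bandwidth in GB/s: transfers per second times bytes per transfer."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer / 1e9


# Assumption: 256-bit bus. 8000 MT/s is taken from "LPDDR5x-8000".
theoretical = peak_bandwidth_gb_s(8000, 256)
measured = 215.0  # GB/s, the real-world figure quoted above

print(f"Theoretical peak: {theoretical:.0f} GB/s")
print(f"Measured share of peak: {measured / theoretical:.0%}")
```

Under these assumptions the measured figure is about 84 percent of the theoretical peak.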
One important caveat for developers planning a Linux deployment: unlocking full GPU performance requires installing AMD's ROCm software stack and configuring the GPU's device identifier to gfx1151. Without this step, Ollama and llama.cpp default to CPU-only inference, cutting throughput roughly in half. The setup is well-documented and takes an afternoon, but it is not zero effort.
An equally important clarification on the chip's published TOPS figure: the XDNA 2 NPU's 50-plus peak TOPS rating does not apply to large language model inference. As of mid-2026, mainstream inference stacks including Ollama, llama.cpp, and LM Studio route LLM workloads to the GPU, not the NPU. The NPU accelerates fixed-function tasks such as video upscaling and image classification. Buyers should not choose a configuration based on NPU TOPS for LLM use.
The GMKtec EVO-X2 is the most widely available implementation of the Ryzen AI Max+ 395 in a mini PC chassis. On the 128-gigabyte configuration, it runs Qwen3-235B at approximately 11 tokens per second. That speed is possible because Qwen3-235B is a Mixture-of-Experts model: it has 235 billion total parameters, but each forward pass activates only about 22 billion of them. The chip moves the active weights, not the full model, and the token rate reflects that smaller working set rather than the model's headline parameter count.
On dense models — architectures where every parameter participates in every forward pass — the performance picture changes. A 70-billion-parameter dense model such as Llama 3.3 70B runs at approximately 5 tokens per second on this chip. That is usable for development and single-user interactive sessions; it is not fast enough for serving multiple simultaneous users or for multi-agent pipelines where dozens of agents are waiting on responses concurrently. For high-concurrency workloads, a discrete GPU with high memory bandwidth — despite its smaller VRAM ceiling — remains the better architectural choice.
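A rough bandwidth-bound ceiling shows why dense and Mixture-of-Experts models behave so differently on the same chip. This is a back-of-envelope sketch, not a benchmark: it assumes every active weight is read once per token, uses the 215 gigabytes per second measured figure, and assumes about 4.5 bits per weight for Q4-style quantization. The function name is hypothetical.

```python
def token_rate_ceiling(bandwidth_gb_s: float,
                       active_params_billion: float,
                       bits_per_weight: float) -> float:
    """Upper bound on tokens/s if each active weight is read once per token."""
    gb_read_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH = 215.0  # GB/s, measured figure from the article
Q4_BITS = 4.5      # assumption: average bits per weight at Q4

dense_70b = token_rate_ceiling(BANDWIDTH, 70, Q4_BITS)  # all 70B weights active
moe_22b = token_rate_ceiling(BANDWIDTH, 22, Q4_BITS)    # ~22B active of 235B

print(f"70B dense ceiling:      {dense_70b:.1f} tokens/s")
print(f"22B-active MoE ceiling: {moe_22b:.1f} tokens/s")
```

The dense ceiling of about 5.5 tokens per second sits close to the reported 5. The Mixture-of-Experts ceiling of about 17 is well above the reported 11, which is expected: the estimate ignores KV-cache reads, expert routing, and compute, and the quantization behind the reported figure is not stated.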
DeepSeek V3 is a different case: with 671 billion total parameters, it needs roughly 335 gigabytes for weights alone at 4 bits per weight, so it does not fit in the 128-gigabyte system at Q4. Smaller models in the 7-billion to 13-billion parameter range run at 30 to 45 tokens per second, fast enough that inference feels instant for most interactive use cases.
AMD's own Ryzen AI Halo, which opened pre-orders in June 2026 at $3,999, runs the same underlying chip. The first-party system is positioned for enterprise developers and comes with AMD's curated software stack and direct support. For a buyer who needs the simplest possible path to a working local inference environment, the Ryzen AI Halo is the cleaner option.
For a buyer comfortable with Linux configuration and willing to spend an afternoon on ROCm setup, more than 30 third-party mini PCs built on the same Ryzen AI Max+ 395 chip are now on the market, with newer entrants priced at roughly $2,399 for 128-gigabyte configurations. The EVO-X2 was the first to market with this chip and established the performance baseline; Corsair, Framework, and Beelink have since joined.
Apple's M4 Mac Studio and Mac Mini remain strong alternatives for developers who prioritize ecosystem integration and polished software support. Apple's unified memory architecture preceded AMD's x86 implementation, and the M4 Max's memory bandwidth, at up to 546 gigabytes per second, is roughly double that of the Ryzen AI Max+ 395. Apple has, however, discontinued configurations above 96 gigabytes across its Mac Studio and Mac Mini lines, which means AMD's platform is currently the only consumer option for running models that require more than 96 gigabytes of addressable GPU memory.
The throughput gap for high-concurrency workloads is real. An Nvidia RTX 4090 delivers approximately 1 terabyte per second of memory bandwidth, roughly four times what the Ryzen AI Max+ 395 provides. For workloads requiring many simultaneous users or high-frequency multi-agent calls, a discrete GPU with 24 gigabytes of VRAM, running a model quantized to fit, may outperform a unified-memory system in tokens per second. The core question is this: does your workload need to fit a very large model into memory, or does it need to serve a very large number of requests quickly? If capacity is the constraint, the Ryzen AI Max+ 395 wins; if throughput is the constraint, a discrete GPU does.
A developer currently subscribed to Claude Code Max at $200 per month, ChatGPT Pro at $200 per month, Cursor at $20 per month, and Gemini Advanced at $20 per month is spending $440 per month — $5,280 per year — before any API overage charges, team seats, or enterprise tiers.
The EVO-X2 in its 128-gigabyte configuration is priced at roughly $1,500 to $1,800 depending on storage configuration and retailer. At $440 per month in avoided subscriptions, that price is recovered in roughly three and a half to four months. After that point, inference costs the price of electricity: approximately $9 to $12 per month at Performance Mode's 140-watt draw, depending on local electricity rates.
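The payback arithmetic can be checked directly from the quoted figures. In the sketch below, the subscription prices, hardware prices, and 140-watt draw come from the article; continuous 24-hour operation and the $0.09 to $0.12 per kilowatt-hour rates are assumptions chosen because they reproduce the quoted electricity range. Function names are hypothetical.

```python
SUBSCRIPTIONS = {
    "Claude Code Max": 200,
    "ChatGPT Pro": 200,
    "Cursor": 20,
    "Gemini Advanced": 20,
}


def payback_months(hardware_price: float, monthly_savings: float,
                   monthly_running_cost: float = 0.0) -> float:
    """Months until avoided subscription spend covers the hardware price."""
    net_savings = monthly_savings - monthly_running_cost
    if net_savings <= 0:
        raise ValueError("running cost meets or exceeds savings; no payback")
    return hardware_price / net_savings


def monthly_electricity_cost(watts: float, usd_per_kwh: float,
                             hours_per_day: float = 24.0, days: int = 30) -> float:
    """Electricity cost for one month at a constant power draw."""
    return watts / 1000 * hours_per_day * days * usd_per_kwh


monthly_spend = sum(SUBSCRIPTIONS.values())  # 440
for price in (1500, 1800, 3999):
    print(f"${price}: {payback_months(price, monthly_spend):.1f} months")

# Assumption: the machine runs at 140 W around the clock.
for rate in (0.09, 0.12):
    cost = monthly_electricity_cost(140, rate)
    print(f"${rate:.2f}/kWh: ${cost:.2f} per month")
```

Passing the electricity cost as `monthly_running_cost` lengthens the payback only slightly, since about $10 per month is small next to $440.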
The workflow requires no exotic tooling: install Ollama, pull the model, and point any API-compatible tool to localhost. The interface is functionally identical to cloud-hosted inference. No data leaves the machine. No rate limits apply during long refactoring sessions.
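As an illustration of pointing a tool at localhost, the sketch below builds a request for Ollama's `/api/generate` endpoint on its default port, 11434, using only the Python standard library. The model tag in the usage comment is an example and must already have been pulled; the helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local port


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 300.0) -> str:
    """Send the prompt to the local server and return the generated text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Usage, assuming Ollama is running and the model has been pulled:
# print(generate("llama3.3:70b", "Explain unified memory in one sentence."))
```

Because the request targets localhost, the prompt and the response never leave the machine.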
For developers who need guaranteed privacy for code or proprietary data, local inference is structurally different from cloud inference regardless of the economics. Cloud inference sends prompts to a third-party server; local inference never does.
The honest boundary on this calculation: it assumes a developer who is genuinely substituting local inference for cloud API consumption at the stated monthly rate. Developers who use frontier proprietary models — GPT-4o, Claude Opus 4, Gemini Ultra — for tasks where frontier capability is the point will still pay for cloud access. Open-weight models at the 70-billion to 235-billion parameter range are highly capable but are not identical to closed frontier models. The ROI case is strongest for developers who already use open-weight models through cloud APIs and are paying for convenience rather than unique capability.
GMKtec unveiled the EVO-X3 at AMD AI Developer Day in Shanghai on May 19, 2026. The successor runs the same Ryzen AI Max+ 395 chip but adds a native OCuLink port — the high-bandwidth interface that allows connecting an external discrete GPU without the throughput penalty of USB4. A second EVO-X3 variant, planned for later in 2026, will carry AMD's upcoming Ryzen AI Max+ PRO 495 with 192 gigabytes of unified memory, which would push the VRAM ceiling significantly beyond what any Apple Silicon configuration currently offers.
AMD confirmed the Ryzen AI Max PRO 400 series on May 20, 2026, with OEM systems from HP and Lenovo expected in Q3 2026. Developers purchasing an EVO-X2 today are buying into a known capability at a well-established price point, against the option of waiting for a successor with a higher memory ceiling at a price and date not yet confirmed.
GMKtec is a private technology company headquartered in Shenzhen, China, founded in 2019. As a Chinese company, it is subject to China's National Intelligence Law of 2017. Article 7 of that law states that all organizations and citizens shall support, assist, and cooperate with national intelligence efforts in accordance with law. Separately, China's Cybersecurity Law of 2017 and Data Security Law of 2021 require companies operating in China to cooperate with government data requests, including data collected outside China's borders.
These are legal conditions governing the manufacturer as a company operating under Chinese law. They are not allegations and they are not contested claims. GMKtec has not been named in any FCC Covered List proceeding, and no independent security audit has confirmed a backdoor or surveillance capability in any EVO-X2 unit. The EVO-X2 holds standard FCC Part 15.247 equipment authorization.
What the legal framework means in practice depends on the device's configuration. The EVO-X2 ships with Windows 11 Pro preinstalled, which includes Microsoft's own telemetry regardless of the machine's country of manufacture. Running Linux, which is well-supported and is the recommended configuration for ROCm-based AI inference, removes Microsoft's operating system-level telemetry. The EVO-X2's AI inference workloads do not require network access; the model runs locally and sends no data to external servers.
Practical steps for buyers concerned about supply-chain risk include performing a clean operating system installation before first use, preferring Linux over Windows for AI inference workloads, and network-segmenting the device from sensitive enterprise data in corporate environments. No mitigation fully addresses the structural legal framework governing any manufacturer subject to Chinese intelligence cooperation law. Developers handling regulated or proprietary data should consult their organization's security team before deployment.
Can the AMD Ryzen AI Max+ 395 run 70-billion-parameter models locally without a discrete GPU?
Yes. The Ryzen AI Max+ 395 can address up to 96 gigabytes of its shared 128-gigabyte memory pool as effective VRAM, enough to load a full 70-billion-parameter model at Q4 quantization, which requires roughly 40 gigabytes. On Linux with ROCm enabled, a 70B dense model runs at approximately 5 tokens per second, which is adequate for single-user development and interactive sessions. High-concurrency multi-user workloads may still favor a discrete Nvidia GPU despite its smaller VRAM capacity, because discrete GPUs deliver substantially higher memory bandwidth.
Is running AI locally cheaper than paying for cloud subscriptions?
For developers spending $200 to $440 per month on cloud inference subscriptions, the math favors local hardware within a year. A 128-gigabyte EVO-X2 at $1,500 to $1,800 pays for itself against $440 per month in roughly three and a half to four months. The calculation holds for developers substituting open-weight model inference for cloud API use; it does not apply to use cases that specifically require frontier proprietary models such as Claude Opus or GPT-4o, which are not released as open weights.
What are the security and privacy implications of a mini PC from a Chinese manufacturer?
GMKtec is subject to China's National Intelligence Law of 2017, which legally obligates Chinese organizations to cooperate with state intelligence requests. No independent security audit has confirmed a backdoor or surveillance capability in any EVO-X2 unit, and GMKtec has not been named in any FCC Covered List proceeding. The meaningful risk mitigation steps are: install a clean operating system before use, prefer Linux for AI workloads, and network-segment the device from sensitive enterprise data. Developers handling regulated or proprietary data should consult their organization's security team before deployment.
What is the difference between the EVO-X2 and AMD's own Ryzen AI Halo?
Both run the same AMD Ryzen AI Max+ 395 chip with 128 gigabytes of unified LPDDR5x-8000 memory. The AMD Ryzen AI Halo first-party system opened pre-orders in June 2026 at $3,999 and is positioned for enterprise developers, with AMD's curated software environment and direct support. Third-party implementations including the EVO-X2 are available at lower prices — from roughly $1,500 for the EVO-X2 to $2,399 for newer entrants — but require more manual software configuration to reach full GPU performance on Linux. GMKtec's upcoming EVO-X3 will add a native OCuLink port for external GPU expansion, with a higher-memory variant carrying AMD's next-generation Ryzen AI Max+ PRO 495 chip expected later in 2026.
