Detailed setup and reference for running flutter_gemma on macOS, Windows, and
Linux. Desktop platforms run LiteRT-LM directly via dart:ffi: no
Kotlin/JVM gRPC server, no Java required, no separate process, no IPC overhead.
Engine startup is ~2 s instead of ~10–15 s.
Desktop is served exclusively by the flutter_gemma_litertlm package; see
Installation and Packages.
Architecture#
┌────────────────────────────────────────────────────┐
│  Flutter Desktop App                               │
│                                                    │
│  ┌──────────────────────────────────────────────┐  │
│  │  FlutterGemmaDesktop (lib/desktop/)          │  │
│  │          ↓                                   │  │
│  │  LiteRtLmFfiClient (lib/core/ffi/)           │  │
│  │          ↓ dart:ffi                          │  │
│  │  ───────────────────────────────────         │  │
│  │  libLiteRtLm.{dylib,dll,so}                  │  │
│  │    + libLiteRt.{dylib,dll,so}                │  │
│  │    + libLiteRtMetalAccelerator.dylib (macOS) │  │
│  │    + libLiteRtWebGpuAccelerator.{dll,so}     │  │
│  │    + dxil.dll + dxcompiler.dll (Windows GPU) │  │
│  └──────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────┘
Native libraries are fetched at build time by the package's hook/build.dart
from the GitHub release, SHA256-verified, and bundled by Flutter Native Assets
into the application bundle. The Dart FFI layer is shared with mobile: Android
and iOS use the same LiteRtLmFfiClient against the same C API. Only the dynamic
library loading sequence differs per platform.
Supported platforms#
| Platform | Architecture | GPU backend | Vision | Audio | Notes |
|---|---|---|---|---|---|
| macOS | arm64 (Apple Silicon) | Metal | ✅ | ✅ | Vision verified on Gemma 4 + Gemma 3n via Metal |
| macOS | x86_64 | — | — | — | Not supported (Apple Silicon only) |
| Windows | x86_64 | DirectX 12 (via Dawn/WebGPU) | ✅ | ✅ | Requires VS 2019+ runtime (vcredist) for DXC |
| Windows | arm64 | — | — | — | Not supported |
| Linux | x86_64 | Vulkan (via Dawn/WebGPU) | ✅ | ✅ | glibc ≥ 2.34 (Ubuntu 22.04+, Debian 12+, RHEL 9+) |
| Linux | arm64 | Vulkan (via Dawn/WebGPU) | ✅ | ✅ | Same glibc requirement |
Requirements#
- Flutter ≥ 3.44.0
- macOS: Apple Silicon (arm64)
- Windows: 10/11 64-bit, Microsoft Visual C++ Redistributable 2019+
- Linux: glibc ≥ 2.34, libstdc++ ≥ 6.0.30 (Ubuntu 22.04+, Debian 12+, Fedora 36+, RHEL 9+)
- GPU drivers: any vendor driver with WebGPU/Vulkan/Metal/DX12 support; falls back to CPU if not available
No Java/JVM/JRE required.
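On Linux, the glibc requirement can be checked on the target machine before building. The following is a minimal sketch, assuming a glibc-based distro where `getconf GNU_LIBC_VERSION` is available and a `sort` that supports `-V`. The function names are illustrative and not part of flutter_gemma.

```shell
# Returns 0 when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Prints the host glibc version (e.g. "2.35"), or nothing if it cannot be read.
glibc_version() {
  getconf GNU_LIBC_VERSION 2>/dev/null | awk '{print $2}'
}

# Compares the host glibc against the 2.34 minimum listed above.
check_glibc() {
  required="2.34"
  found="$(glibc_version)"
  if [ -z "$found" ]; then
    echo "glibc not detected (non-glibc system?)"
    return 2
  fi
  if version_ge "$found" "$required"; then
    echo "glibc $found OK (>= $required)"
  else
    echo "glibc $found is too old (need >= $required)"
    return 1
  fi
}

# Usage: check_glibc
```

`sort -V` compares dotted versions numerically, so 2.100 ranks above 2.34, which a plain string comparison would get wrong.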
Quick Start#
import 'package:flutter_gemma/flutter_gemma.dart';

Future<void> chat() async {
  // Install model (downloads on first run, cached after).
  await FlutterGemma.installModel(
    modelType: ModelType.gemma4,
    fileType: ModelFileType.litertlm,
  ).fromNetwork(
    'https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/resolve/main/gemma-4-E2B-it.litertlm',
    token: 'hf_...',
  ).install();

  // Create model with full capabilities; keep it for the app's lifetime.
  final model = await FlutterGemma.getActiveModel(
    maxTokens: 4096,
    preferredBackend: PreferredBackend.gpu,
    supportImage: true,
    supportAudio: true,
  );

  // Each chat / conversation is a session. Sessions are cheap to create
  // and destroy; the engine is reused across them.
  final session = await model.createSession(temperature: 0.8, topK: 1);
  await session.addQueryChunk(Message(text: 'Hi!', isUser: true));
  await for (final chunk in session.getResponseAsync()) {
    print(chunk);
  }
  await session.close();
}
For the high-level chat API with history + thinking + tool calling, use
model.createChat(...) and chat.generateChatResponseAsync().
Platform-specific setup#
macOS#
Native libs are fetched and bundled automatically via Native Assets. The only
manual step is adding a post_install block to your app's macos/Podfile so the
bundled companion .frameworks get matching lib*.dylib symlinks (LiteRT-LM's
gpu_registry calls dlopen("libLiteRtMetalAccelerator.dylib") by basename and
won't find a bare framework binary on its own). Without it, engine_create
returns null on PreferredBackend.gpu and the model silently falls back to CPU.
Paste this into your macos/Podfile (replacing any existing post_install
block) and run pod install:
post_install do |installer|
  installer.pods_project.targets.each do |target|
    flutter_additional_macos_build_settings(target)
  end

  # flutter_gemma: bundle Apple accelerator dylibs as .framework bundles into
  # Contents/Frameworks/ and re-point LiteRtLm.dylib's LC_LOAD_DYLIB reference.
  installer.aggregate_targets.each do |aggregate_target|
    aggregate_target.user_targets.each do |user_target|
      phase_name = '[flutter_gemma] Setup LiteRT-LM macOS'

      # Only the app target embeds the Frameworks/ this phase patches.
      unless user_target.name == 'Runner'
        user_target.build_phases
          .select { |p| p.respond_to?(:name) && p.name == phase_name }
          .each { |p| user_target.build_phases.delete(p) }
        next
      end

      existing = user_target.shell_script_build_phases.find { |p| p.name == phase_name }
      phase = existing || user_target.new_shell_script_build_phase(phase_name)
      phase.output_paths = ['$(DERIVED_FILE_DIR)/flutter_gemma_litertlm_macos.stamp']
      phase.shell_script = <<~SHELL
        set -e
        FRAMEWORKS="${BUILT_PRODUCTS_DIR}/${PRODUCT_NAME}.app/Contents/Frameworks"
        if [ ! -d "${FRAMEWORKS}" ]; then
          exit 0
        fi
        for base in LiteRtMetalAccelerator LiteRtTopKMetalSampler GemmaModelConstraintProvider; do
          rm -f "${FRAMEWORKS}/lib${base}.dylib"
        done
        # Resolve dylib source: Native Assets cache (pub.dev), then path-dep fallbacks.
        for candidate in \
          "${HOME}/Library/Caches/flutter_gemma/native/macos_arm64" \
          "${PODS_ROOT}/../Flutter/ephemeral/.symlinks/plugins/flutter_gemma/native/litert_lm/prebuilt/macos_arm64" \
          "${SRCROOT}/../../native/litert_lm/prebuilt/macos_arm64"; do
          if [ -f "${candidate}/libGemmaModelConstraintProvider.dylib" ]; then
            PLUGIN_PREBUILT="${candidate}"
            break
          fi
        done
        if [ -z "${PLUGIN_PREBUILT:-}" ]; then
          echo "[flutter_gemma] ERROR: macOS companion dylibs not found. Run 'flutter clean && flutter pub get'."
          exit 1
        fi
        for base in GemmaModelConstraintProvider LiteRtMetalAccelerator LiteRtTopKMetalSampler; do
          src="${PLUGIN_PREBUILT}/lib${base}.dylib"
          if [ ! -f "${src}" ]; then
            echo "[flutter_gemma] WARNING: ${src} not found; runtime dlopen will fail"
            continue
          fi
          fw_dir="${FRAMEWORKS}/${base}.framework"
          mkdir -p "${fw_dir}/Versions/A/Resources"
          cp "${src}" "${fw_dir}/Versions/A/${base}"
          install_name_tool -id "@rpath/${base}.framework/Versions/A/${base}" \\
            "${fw_dir}/Versions/A/${base}" 2>/dev/null || true
          (cd "${fw_dir}" && ln -sfh A Versions/Current && ln -sfh "Versions/Current/${base}" "${base}" && ln -sfh "Versions/Current/Resources" Resources)
          cat > "${fw_dir}/Versions/A/Resources/Info.plist" <<EOF
        <?xml version="1.0" encoding="UTF-8"?>
        <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
        <plist version="1.0">
        <dict>
          <key>CFBundleExecutable</key><string>${base}</string>
          <key>CFBundleIdentifier</key><string>dev.flutterberlin.flutter_gemma.${base}</string>
          <key>CFBundleVersion</key><string>1</string>
          <key>CFBundleShortVersionString</key><string>1.0</string>
          <key>CFBundlePackageType</key><string>FMWK</string>
        </dict>
        </plist>
        EOF
        done
        LITERTLM="${FRAMEWORKS}/LiteRtLm.framework/Versions/A/LiteRtLm"
        if [ -f "${LITERTLM}" ]; then
          install_name_tool -change \\
            @rpath/libGemmaModelConstraintProvider.dylib \\
            @rpath/GemmaModelConstraintProvider.framework/Versions/A/GemmaModelConstraintProvider \\
            "${LITERTLM}" 2>/dev/null || true
          codesign --force --sign - "${LITERTLM}" 2>/dev/null || true
        fi
        mkdir -p "$(dirname "${SCRIPT_OUTPUT_FILE_0}")"
        touch "${SCRIPT_OUTPUT_FILE_0}"
      SHELL
    end
  end
end
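After a build, it can be useful to confirm that the build phase actually produced the three companion frameworks. The sketch below checks a built .app bundle for the binaries the script copies into `Contents/Frameworks/`. The function name is hypothetical and the check only looks for file presence, not for correct code signing or install names.

```shell
# Checks that the companion frameworks exist inside a built .app bundle.
# $1: path to the .app bundle.
# Prints one line per missing framework; returns 1 if any are missing.
check_companion_frameworks() {
  frameworks="$1/Contents/Frameworks"
  missing=0
  for base in GemmaModelConstraintProvider LiteRtMetalAccelerator LiteRtTopKMetalSampler; do
    if [ ! -f "${frameworks}/${base}.framework/Versions/A/${base}" ]; then
      echo "missing: ${base}.framework"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (the bundle path is an assumption; adjust to your build output):
#   check_companion_frameworks build/macos/Build/Products/Debug/MyApp.app
```

If a framework is reported missing, re-running `pod install` and rebuilding is a reasonable first step, since the phase is only attached when the Podfile block has been applied.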
The following entitlements are required for the LLM to load weights and run
inference. Add them to macos/Runner/DebugProfile.entitlements and
macos/Runner/Release.entitlements:
<key>com.apple.security.cs.disable-library-validation</key>
<true/>
<key>com.apple.security.network.client</key>
<true/>
<key>com.apple.security.app-sandbox</key>
<true/>
For large models (≥1 GB) you may also want
com.apple.developer.kernel.extended-virtual-addressing and
com.apple.developer.kernel.increased-memory-limit.
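A quick way to catch a forgotten entitlement is to scan both files for the three required keys. This is a rough sketch with a hypothetical function name: it only checks that each `<key>` is present in the XML plist, not that its value is `<true/>`.

```shell
# Prints each required entitlement key that is absent from the plist file $1.
# Prints nothing when all three keys are present.
missing_entitlements() {
  file="$1"
  for key in \
    com.apple.security.cs.disable-library-validation \
    com.apple.security.network.client \
    com.apple.security.app-sandbox; do
    grep -qF "<key>${key}</key>" "$file" || echo "$key"
  done
}

# Usage:
#   missing_entitlements macos/Runner/DebugProfile.entitlements
#   missing_entitlements macos/Runner/Release.entitlements
```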
Windows#
flutter_gemma_litertlm bundles every required DLL — no manual setup. The bundle
includes:
- LiteRtLm.dll, LiteRt.dll, libGemmaModelConstraintProvider.dll
- libLiteRtWebGpuAccelerator.dll, libLiteRtTopKWebGpuSampler.dll
- dxil.dll + dxcompiler.dll (DirectX Shader Compiler runtime, required for WebGPU/DX12 shader compilation; from microsoft/DirectXShaderCompiler v1.9.2602)
StreamProxy.dll exposes a LoadLibraryExA(LOAD_WITH_ALTERED_SEARCH_PATH) helper
that the plugin uses to pre-load libLiteRt.dll, libLiteRtWebGpuAccelerator.dll,
and libLiteRtTopKWebGpuSampler.dll before opening LiteRtLm.dll. Without this,
modern Windows DLL search order doesn't always include the application directory
for secondary LoadLibrary calls — they would fail to find the GPU accelerator
DLL and silently fall back to CPU.
End-users need the Microsoft Visual C++ Redistributable 2019+ (LLM DLLs depend
on vcruntime140.dll/msvcp140.dll). Most modern Windows 10/11 systems already
have it.
Linux#
The bundle includes:
- libLiteRtLm.so, libLiteRt.so, libGemmaModelConstraintProvider.so
- libLiteRtWebGpuAccelerator.so, libLiteRtTopKWebGpuSampler.so, libStreamProxy.so
libStreamProxy.so exposes stream_proxy_load_global (an RTLD_GLOBAL dlopen). The
plugin uses it to pre-load libLiteRt.so before libLiteRtLm.so so the WebGPU
accelerator's runtime dlsym(RTLD_DEFAULT, "LiteRt*") resolves. Without
RTLD_GLOBAL, Dart's default RTLD_LOCAL would hide the symbols.
Build dependencies:
sudo apt install clang cmake ninja-build libgtk-3-dev lld
Linux GPU uses Dawn/WebGPU on top of Vulkan, so you need a working vendor Vulkan driver. On NVIDIA install the proprietary driver; on Intel/AMD the open-source Mesa driver works on most distros.
sudo apt install vulkan-tools libvulkan1
# Plus your vendor driver, e.g. NVIDIA:
sudo apt install nvidia-driver-535-server
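To see which Vulkan devices the driver stack exposes, the `vulkaninfo` tool from vulkan-tools can be filtered down to device names. The sketch below assumes a `vulkaninfo` build that supports `--summary` and prints lines of the form `deviceName = ...`; the helper name is illustrative.

```shell
# Reads `vulkaninfo --summary` output on stdin and prints one device name per line.
vulkan_device_names() {
  sed -n 's/^[[:space:]]*deviceName[[:space:]]*=[[:space:]]*//p'
}

# Usage:
#   vulkaninfo --summary | vulkan_device_names
```

If the only entry is an llvmpipe device, Vulkan is most likely being served by Mesa's software rasterizer rather than a hardware driver, so GPU inference would not be hardware-accelerated.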
Model lifecycle#
One model, many sessions#
The recommended (and only well-supported) pattern:
// At app startup, ONCE:
final model = await FlutterGemma.getActiveModel(
  maxTokens: 4096,
  preferredBackend: PreferredBackend.gpu,
  supportImage: true,
  supportAudio: true,
);
// During app runtime, MANY TIMES:
final session = await model.createSession(...);
// ... chat, generate, etc.
await session.close(); // cheap
// At app shutdown:
await model.close();
Sessions are cheap to create/destroy. The expensive part is engine_create
(2–10 s depending on backend and model size), which happens once when the model is
first opened.
Why not "one model per chat"?#
Upstream LiteRT-LM keeps LiteRtEnvironment as a process singleton for GPU
paths. Once the env is initialized with the first model's settings (cache_dir,
backend, capabilities), those become process-fixed. Recreating the engine with
different settings causes GPU-stack conflicts (notably wgpu::Instance already set
from the WebGpu sampler binary on Linux/Windows).
The plugin avoids this by reusing the same InferenceModel when params match, and
by disabling GPU sampler preload on Linux (CPU-sampler fallback) so runtime model
swap works. To swap models at runtime, call model.close() first, then
getActiveModel(...) again. Switching backend (CPU ↔ GPU) works the same way.
Known limitations#
Per-token sampler runs on CPU on all desktop platforms#
When preferredBackend is PreferredBackend.gpu, the forward pass (prefill +
decode) runs on the GPU accelerator (Metal, DX12, Vulkan). The per-token
sampler (top-k / top-p / argmax) runs on CPU and costs roughly 1–5 ms per token,
a small share of total generation time, which is dominated by the forward pass.
- macOS, Windows: upstream libLiteRtTopKMetalSampler / libLiteRtTopKWebGpuSampler ship with incomplete C ABI exports (3 of 7 functions); the factory falls back to the CPU chain. (#1990, #2073)
- Linux: the prebuilt sampler .so holds a process-static wgpu::Instance that any second engine_create rejects. Since runtime model swap matters more than the few ms saved, the plugin doesn't preload it and lets the factory fall back to CPU.
randomSeed / temperature / topK / topP on GPU#
Sampler params are honored on CPU and GPU on all platforms that ship the patched
libLiteRtLm build (macOS, iOS, Linux, Windows, Android). This required a
two-layer downstream patch to upstream LiteRT-LM applied at build time (offered
upstream as #2080 / PR #2081).
Audio modality requires LiteRT-LM models#
Audio input only works with .litertlm models that include the audio adapter
(Gemma 3n E2B/E4B, Gemma 4 E2B/E4B). See Multimodal.
iOS Simulator: GPU disabled#
iOS Simulator's Metal has a 256 MB single-allocation cap that LLM weight tensors exceed. Use CPU on the simulator, or test on a physical iPhone for GPU validation.
Troubleshooting#
Engine create fails with no native log on Linux#
In debug builds the plugin redirects native stderr to
<tmpdir>/litertlm_native.log and dumps it via debugPrint after a failed
engine_create. In release builds stderr goes to the systemd journal / app's own
stderr.
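If the debugPrint dump is lost in other console output, the log file can be read directly. This sketch assumes that `<tmpdir>` resolves to `$TMPDIR` or, when that is unset, `/tmp`, which may not hold for every environment; the function name is hypothetical.

```shell
# Prints the last $1 lines (default 50) of the native log, if it exists.
# ASSUMPTION: the log lives under $TMPDIR, falling back to /tmp.
show_native_log() {
  lines="${1:-50}"
  log="${TMPDIR:-/tmp}/litertlm_native.log"
  if [ -f "$log" ]; then
    tail -n "$lines" "$log"
  else
    echo "no native log at $log" >&2
    return 1
  fi
}

# Usage: show_native_log 100
```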
glibc 2.38 not found on Linux#
You're hitting a stale local binary. Clear it and let hook/build.dart re-fetch
the correct glibc-2.34 binary:
rm -rf native/litert_lm/prebuilt/linux_x86_64/
flutter clean && flutter run
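To confirm which binary is the stale one, you can inspect the highest glibc symbol version a shared library references. The sketch below parses `objdump -T` output (binutils is assumed to be installed); the helper name and the example library path are illustrative.

```shell
# Reads `objdump -T` output on stdin and prints the highest GLIBC_x.y
# symbol version referenced by the library.
max_glibc_required() {
  grep -o 'GLIBC_[0-9][0-9.]*' | sed 's/^GLIBC_//' | sort -V | tail -n 1
}

# Usage (path is an assumption based on the prebuilt directory above):
#   objdump -T native/litert_lm/prebuilt/linux_x86_64/libLiteRtLm.so | max_glibc_required
```

A result above 2.34 would point to a stale or mismatched binary; the re-fetched one is expected to need no more than glibc 2.34.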
Windows GPU shaders fail to compile#
Symptom: engine_create returns null with no Dart-side error, and the app
silently falls back to CPU. Verify that dxcompiler.dll and dxil.dll are next to
your app.exe (Native Assets bundles them). If they are present but shaders still
fail to compile, check that the user has the Microsoft Visual C++
Redistributable 2019+ installed.
Model file not found#
On desktop the model is downloaded to the platform's standard "app support"
directory (see Troubleshooting → desktop storage). Use
FlutterGemma.installModel(...).fromNetwork(...).install() to download, or
.fromFile(absolutePath) if you already have it locally.
Pre-cached engine + new code = stale cache#
LiteRT-LM caches compiled GPU shaders next to the model file
(<model>.litertlm_<random>_mldrift_program_cache.bin). After upgrading the
plugin or the model, delete that file and the engine rebuilds the cache on first
run.
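Because the cache file name contains a random component, a glob on the fixed suffix is the simplest way to find it. The following sketch deletes every matching cache file under a given directory; the function name is hypothetical, and the directory to pass is wherever your model file lives.

```shell
# Deletes LiteRT-LM shader cache files under directory $1 and prints each
# removed path. Model files themselves are left untouched.
clear_shader_cache() {
  dir="$1"
  if [ ! -d "$dir" ]; then
    echo "not a directory: $dir" >&2
    return 1
  fi
  find "$dir" -type f -name '*_mldrift_program_cache.bin' -print -delete
}

# Usage: clear_shader_cache "/path/to/model/directory"
```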