Troubleshooting

Common issues — downloads, memory, iOS simulator GPU, Android minSdk, web caching, and desktop storage.

For desktop-specific problems (Linux native logs, glibc, Windows DXC, stale GPU shader cache) see Desktop Support → Troubleshooting.

Downloads

Resume isn't supported by the HuggingFace CDN. flutter_gemma uses smart retry with exponential backoff and automatic restart of interrupted downloads instead. Tune the attempt count via maxDownloadRetries in FlutterGemma.initialize(...) (default: 10).
Large downloads on Android (> 500 MB) automatically use a foreground service (which shows a notification) to bypass Android's 9-minute background execution limit. iOS uses native URLSession and needs no special handling. See Models → downloads.
Custom servers on Web must enable CORS headers. HuggingFace is already configured correctly; for Firebase Storage see the CORS configuration docs.
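As a minimal sketch of the retry setting mentioned above: it assumes FlutterGemma.initialize(...) is called once at startup and that maxDownloadRetries is the only argument your setup needs. The value 5 and the placeholder app widget are purely illustrative.

```dart
import 'package:flutter/widgets.dart';
import 'package:flutter_gemma/flutter_gemma.dart';

Future<void> main() async {
  WidgetsFlutterBinding.ensureInitialized();

  // Assumption: maxDownloadRetries is passed as a named argument, as
  // described above. The plugin default is 10; 5 is an arbitrary example.
  await FlutterGemma.initialize(maxDownloadRetries: 5);

  // Placeholder stands in for your real root widget.
  runApp(const Placeholder());
}
```

A lower value fails faster on flaky networks; a higher one tolerates more interruptions at the cost of longer waits before an error surfaces.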

Memory

iOS: ensure Runner.entitlements contains the memory entitlements and the Podfile sets platform :ios, '16.0'. See Installation → iOS.
Reduce maxTokens if you hit memory pressure — but keep it at 1024 or higher for .litertlm models (see "maxTokens vs maxOutputTokens" below). To shorten replies, use maxOutputTokens, not a smaller maxTokens.
Use smaller models (1B-2B parameters) on devices with less than 6 GB of RAM. Multimodal models (Gemma 4, Gemma3n) need 8 GB or more.
Close sessions and models when not needed; monitor usage with sizeInTokens().

maxTokens vs maxOutputTokens

maxTokens (on getActiveModel/createModel) is the context window: the total token budget shared by the input (system prompt + history + your message) and the generated output. It also determines the KV-cache size. It is not the reply length.

.litertlm models require a context window of at least 1024 tokens. Passing a smaller maxTokens (e.g. 100) used to crash with "DYNAMIC_UPDATE_SLICE failed to prepare" / "Failed to allocate tensors", even on CPU, because the failure happens at graph compile time, before the backend runs. As of 1.0.2, a too-small maxTokens is automatically clamped up to 1024 and a warning is logged.

To limit how many tokens the model generates, use maxOutputTokens on createSession/openSession/createChat/openChat instead:

final model = await FlutterGemma.getActiveModel(maxTokens: 1024); // context window
final chat = await model.createChat(maxOutputTokens: 100);        // reply cap

(maxOutputTokens is honored on .litertlm; the MediaPipe .task path has no session-level output cap and ignores it.)
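The budget arithmetic described in this section can be sketched in plain Dart. Both helpers are hypothetical illustrations, not plugin APIs, and the runtime's exact token accounting may differ. They only show how the context window, the input size, and the reply cap relate for a .litertlm model.

```dart
import 'dart:math' as math;

/// Minimum context window for .litertlm models, per the text above.
const int kMinLitertlmContext = 1024;

/// Hypothetical helper mirroring the clamp described above: a too-small
/// context window is raised to the minimum, larger values pass through.
int effectiveContextWindow(int requestedMaxTokens) =>
    math.max(requestedMaxTokens, kMinLitertlmContext);

/// Hypothetical helper: tokens left for the reply once the input has been
/// counted, optionally capped by maxOutputTokens.
int replyBudget({
  required int maxTokens,
  required int inputTokens,
  int? maxOutputTokens,
}) {
  final remaining =
      math.max(0, effectiveContextWindow(maxTokens) - inputTokens);
  return maxOutputTokens == null
      ? remaining
      : math.min(remaining, maxOutputTokens);
}

void main() {
  print(effectiveContextWindow(100)); // 1024
  print(replyBudget(maxTokens: 1024, inputTokens: 300)); // 724
  print(replyBudget(
      maxTokens: 1024, inputTokens: 300, maxOutputTokens: 100)); // 100
}
```

The last call shows why a short reply is better requested through maxOutputTokens: the context window stays large enough for the prompt and history while the reply is capped separately.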

iOS

Build issues: ensure minimum iOS version is 16.0, use static linking (use_frameworks! :linkage => :static), and clean/reinstall pods with cd ios && pod install --repo-update.
Simulator GPU disabled: the iOS Simulator's Metal implementation caps a single allocation at 256 MB, which LLM tensors exceed (e.g. Gemma 3 1B's KV cache alone is 288 MB). Use the CPU backend on the simulator, or test GPU on a physical iPhone. This is a simulator limit, not a plugin bug.

Android

.litertlm models require minSdk 30. libLiteRtLm.so depends on Bionic libc functions introduced in API 30 (pthread_cond_clockwait, sem_clockwait) that can't be shimmed on older devices. MediaPipe .task models work on lower API levels.
.litertlm / embeddings / vision are arm64-v8a only. MediaPipe text inference (.task / .bin) also runs on x86_64 and armeabi-v7a. If you only use arm64-only features, add ndk { abiFilters 'arm64-v8a' } so the Play Store doesn't offer broken APKs. See Installation → Android architecture.
GPU: add the libOpenCL.so <uses-native-library> tags to AndroidManifest.xml. See Installation → Android.

Web

GPU only. MediaPipe has no web CPU backend, so web models must run on PreferredBackend.gpu.
Mobile .task models often don't work on web — use the -web.task (MediaPipe) or .litertlm (LiteRT-LM) web variant.
Memory / cache limits:

Browser	Max Model Size	Notes
Chrome/Firefox	~2 GB	ArrayBuffer limit
Safari	~50 MB	⚠️ Not suitable

Large models (> 2 GB): use WebStorageMode.streaming (OPFS) to bypass the ~2 GB blob limit. Check support with await FlutterGemma.isStreamingSupported(). See Installation → web storage.
Storage modes: cacheApi (default, persists across restarts, < 2 GB), streaming (OPFS, large models, requires Chrome 86+/Edge 86+/Safari 15.2+), none (ephemeral, testing only).
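The choice between the storage modes above can be sketched as a small decision helper. The enum and function here are local, hypothetical stand-ins so the sketch runs without the plugin, and the 2 GiB threshold only approximates the "~2 GB" limit from the table.

```dart
/// Local stand-in for the plugin's storage modes (hypothetical; the real
/// type is WebStorageMode). The ephemeral "none" mode is left out.
enum StorageChoice { cacheApi, streaming }

/// Approximation of the ~2 GB limit quoted above.
const int kCacheApiLimitBytes = 2 * 1024 * 1024 * 1024;

/// Returns the storage mode to use, or null when the model is too large
/// for the cache and the browser cannot stream from OPFS.
StorageChoice? pickStorage({
  required int modelBytes,
  required bool streamingSupported,
}) {
  if (modelBytes < kCacheApiLimitBytes) return StorageChoice.cacheApi;
  return streamingSupported ? StorageChoice.streaming : null;
}

void main() {
  const oneGb = 1024 * 1024 * 1024;
  print(pickStorage(modelBytes: oneGb, streamingSupported: false));
  print(pickStorage(modelBytes: 3 * oneGb, streamingSupported: true));
  print(pickStorage(modelBytes: 3 * oneGb, streamingSupported: false));
}
```

In an app, streamingSupported would come from await FlutterGemma.isStreamingSupported(), and a null result is the cue to offer a smaller model instead of starting a download that cannot be stored.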

Web `.litertlm` (early preview) feature matrix

Web .litertlm inference runs Gemma .litertlm models in the browser via the upstream @litert-lm/core package (WebGPU + WASM). It is an early preview and a subset of the native path. MediaPipe .task on web is unaffected and remains fully supported.

Works on web .litertlm (GPU only): text generation (sync + streaming), multi-turn chat with history, system instruction, concurrent sessions (serialized), and large models via OPFS streaming.

Not supported on web .litertlm yet (mobile/desktop only):

❌ Vision / image input — image inputs are dropped with a debug warning.
❌ Audio input — no Audio executor config in the JS API.
❌ Thinking mode — extraContext thinking channel is not wired on web.
❌ Function calling / tool calls — not available on the web runtime.
❌ LoRA weights — loraPath throws UnsupportedError.

For full vision / audio / thinking / function calling on web today, use MediaPipe `.task` web models instead. These web `.litertlm` limits track the upstream `@litert-lm/core` early-preview API and will lift as Google extends the JS executor surface.

Desktop storage locations

Desktop builds store downloaded models outside the user's Documents/ folder, so that OneDrive, iCloud, or domain roaming-profile sync cannot corrupt the FFI memory mapping (mmap) of large .litertlm files:

Windows: %LOCALAPPDATA%\flutter_gemma\ (never OneDrive-synced)
macOS: ~/Library/Application Support/<bundle>/flutter_gemma/
Linux: ~/.local/share/<app>/flutter_gemma/

Models installed by older 0.14.x / 0.15.0 builds that still live under Documents/ keep working via a fallback read.
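When inspecting or clearing downloaded models by hand, the locations listed above can be resolved with a short helper. This is a sketch built only from the paths in this section: modelDir and its appId parameter (standing in for the <bundle> / <app> placeholder) are hypothetical, and the plugin's own resolution logic may differ.

```dart
import 'dart:io';

/// Returns the model directory listed above for the given desktop OS, or
/// null when the OS is not a desktop target or the needed variable is unset.
///
/// `appId` is a hypothetical stand-in for the <bundle> / <app> placeholder.
String? modelDir({
  required String os,
  required Map<String, String> env,
  required String appId,
}) {
  if (os == 'windows') {
    final base = env['LOCALAPPDATA'];
    return base == null ? null : '$base\\flutter_gemma\\';
  }
  final home = env['HOME'];
  if (home == null) return null;
  if (os == 'macos') {
    return '$home/Library/Application Support/$appId/flutter_gemma/';
  }
  if (os == 'linux') {
    return '$home/.local/share/$appId/flutter_gemma/';
  }
  return null;
}

void main() {
  // Platform.operatingSystem is 'windows', 'macos', or 'linux' on desktop.
  final dir = modelDir(
    os: Platform.operatingSystem,
    env: Platform.environment,
    appId: 'com.example.app', // hypothetical identifier
  );
  print(dir ?? 'No desktop model directory for this platform');
}
```

Passing the OS name and environment in as parameters keeps the helper easy to check on any machine, regardless of the platform it actually runs on.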

Multimodal

Ensure you're using a multimodal model (Gemma 4, Gemma3n E2B/E4B, FastVLM).
Set supportImage: true (and supportAudio: true for audio) when creating the model.
Check device memory — multimodal models require more RAM.
Use the GPU backend for better performance. See Multimodal.
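The checklist above might translate into a model setup like the following. Treat it as a sketch under assumptions: supportImage, supportAudio, and maxTokens come from this page, while the preferredBackend parameter name and the 4096 context size are assumptions to verify against the plugin version you use.

```dart
import 'package:flutter_gemma/flutter_gemma.dart';

Future<void> loadMultimodalModel() async {
  final model = await FlutterGemma.getActiveModel(
    maxTokens: 4096, // context window; illustrative value
    preferredBackend: PreferredBackend.gpu, // assumed parameter name
    supportImage: true, // required for image input
    supportAudio: true, // only needed for audio input
  );

  // Create a session or chat from `model` here, then release it when done.
  await model.close();
}
```

If the active model is not one of the multimodal models listed above, these flags will not add image or audio support.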

Function calling

Function calling is supported only by select models (Gemma 4, Gemma3n, Gemma 3 1B, FunctionGemma, DeepSeek, Qwen, Phi-4). Unsupported models log a warning and ignore tools — they still work for text generation. Check supportsFunctionCalls. See Function Calling.
