flutter_gemma

Troubleshooting

Common issues — downloads, memory, iOS simulator GPU, Android minSdk, web caching, and desktop storage.

Common issues and fixes. For desktop-specific problems (Linux native logs, glibc, Windows DXC, stale GPU shader cache) see Desktop Support → Troubleshooting.

Downloads

  • Resume isn't supported by the HuggingFace CDN. flutter_gemma uses smart retry with exponential backoff and automatic restart of interrupted downloads instead. Tune the attempt count via maxDownloadRetries in FlutterGemma.initialize(...) (default: 10).
  • Large downloads on Android (>500MB) automatically use a foreground service (shows a notification) to bypass Android's 9-minute background execution limit. iOS uses native URLSession and needs no special handling. See Models → downloads.
  • Custom servers on Web must enable CORS headers. HuggingFace is already configured correctly; for Firebase Storage see the CORS configuration docs.
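
The retry behaviour described above can be pictured with a small pure-Dart sketch. This is an illustration of restart-from-scratch with exponential backoff, not the plugin's implementation: the names backoffDelay and retryFromScratch, the 1-second base delay, and the 60-second cap are all assumptions made for the example.

```dart
/// Illustrative backoff schedule. The base delay and cap are assumed values,
/// not flutter_gemma's real ones.
Duration backoffDelay(
  int attempt, {
  Duration base = const Duration(seconds: 1),
  Duration cap = const Duration(seconds: 60),
}) {
  // attempt is zero-based: 0 -> base, 1 -> 2x base, 2 -> 4x base, ...
  final shift = attempt < 0 ? 0 : (attempt > 20 ? 20 : attempt);
  final delay = base * (1 << shift);
  return delay > cap ? cap : delay;
}

/// Runs [download] again from the start after each failure (no resume),
/// waiting longer between attempts, and gives up after [maxAttempts] tries.
Future<T> retryFromScratch<T>(
  Future<T> Function() download, {
  int maxAttempts = 10,
  Duration base = const Duration(seconds: 1),
}) async {
  var attempt = 0;
  while (true) {
    try {
      return await download();
    } catch (_) {
      attempt++;
      if (attempt >= maxAttempts) rethrow;
      await Future<void>.delayed(backoffDelay(attempt - 1, base: base));
    }
  }
}
```

In the plugin itself only the attempt count is exposed, through maxDownloadRetries in FlutterGemma.initialize(...); the sketch is meant only to show why a flaky connection leads to progressively longer pauses rather than a resumed transfer.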

Memory

  • iOS: ensure Runner.entitlements contains the memory entitlements and the Podfile sets platform :ios, '16.0'. See Installation → iOS.
  • Reduce maxTokens if you hit memory pressure — but keep it at 1024 or higher for .litertlm models (see "maxTokens vs maxOutputTokens" below). To shorten replies, use maxOutputTokens, not a smaller maxTokens.
  • Use smaller models (1B-2B parameters) for devices with <6GB RAM. Multimodal models (Gemma 4, Gemma3n) need 8GB+.
  • Close sessions and models when not needed; monitor usage with sizeInTokens().
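
As a rough sketch, the RAM thresholds above could be turned into a helper that picks a model class before anything is downloaded. The enum and function names are hypothetical, and treating the 6 to 8 GB range as "text-only" is an interpretation of the two stated thresholds rather than something the plugin documents.

```dart
/// Hypothetical tiers read off the thresholds listed above.
enum ModelTier { smallTextOnly, textOnly, multimodalCapable }

/// Maps total device RAM (in GB) to a suggested model tier.
ModelTier suggestTier(double deviceRamGb) {
  // Under 6 GB: stay with small 1B-2B parameter models.
  if (deviceRamGb < 6) return ModelTier.smallTextOnly;
  // Under 8 GB: below the stated bar for multimodal models.
  if (deviceRamGb < 8) return ModelTier.textOnly;
  return ModelTier.multimodalCapable;
}
```

How the RAM figure is obtained is left out on purpose; it would come from a platform-specific source outside this plugin.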

maxTokens vs maxOutputTokens

maxTokens (on getActiveModel/createModel) is the context window, which is also the KV-cache size: the total token budget shared by the input (system prompt + history + your message) and the generated output. It is not the reply length.

.litertlm models require a context window of at least 1024. Passing a smaller maxTokens (e.g. 100) used to crash with DYNAMIC_UPDATE_SLICE failed to prepare / Failed to allocate tensors (even on CPU — the failure is at graph compile, before the backend runs). As of 1.0.2 a too-small maxTokens is clamped up to 1024 automatically with a log warning.

To limit how many tokens the model generates, use maxOutputTokens on createSession/openSession/createChat/openChat instead:

final model = await FlutterGemma.getActiveModel(maxTokens: 1024); // context window
final chat = await model.createChat(maxOutputTokens: 100);        // reply cap

(maxOutputTokens is honored on .litertlm; the MediaPipe .task path has no session-level output cap and ignores it.)
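
The arithmetic behind the two settings can be sketched in pure Dart. The helper names below are hypothetical, and the assumption that the reply cap is simply subtracted from the shared window is a budgeting approximation, not a description of the runtime's internals.

```dart
/// Minimum context window for .litertlm models, as stated above.
const int minLitertlmContextWindow = 1024;

/// Mirrors the described clamp: a smaller request is raised to 1024.
int effectiveContextWindow(int requestedMaxTokens) =>
    requestedMaxTokens < minLitertlmContextWindow
        ? minLitertlmContextWindow
        : requestedMaxTokens;

/// Tokens left for the input (system prompt + history + message) once the
/// reply cap is set aside from the shared context window.
int inputBudget({required int maxTokens, required int maxOutputTokens}) {
  final window = effectiveContextWindow(maxTokens);
  final remaining = window - maxOutputTokens;
  return remaining < 0 ? 0 : remaining;
}
```

With the values from the snippet above (maxTokens: 1024, maxOutputTokens: 100), this leaves roughly 924 tokens for the prompt and history, which is why shrinking maxTokens is the wrong lever for shorter replies.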

iOS

  • Build issues: ensure minimum iOS version is 16.0, use static linking (use_frameworks! :linkage => :static), and clean/reinstall pods with cd ios && pod install --repo-update.
  • Simulator GPU disabled: the iOS Simulator's Metal implementation has a 256 MB single-allocation cap that LLM tensors exceed (e.g. Gemma 3 1B's KV cache alone is 288 MB). Use the CPU backend on the simulator, or test GPU on a physical iPhone. This is a simulator limit, not a plugin bug.

Android

  • .litertlm models require minSdk 30. libLiteRtLm.so depends on API 30+ Bionic syscalls (pthread_cond_clockwait, sem_clockwait) that can't be shimmed on older devices. MediaPipe .task models work on lower API levels.
  • .litertlm / embeddings / vision are arm64-v8a only. MediaPipe text inference (.task / .bin) also runs on x86_64 and armeabi-v7a. If you only use arm64-only features, add ndk { abiFilters 'arm64-v8a' } so the Play Store doesn't offer broken APKs. See Installation → Android architecture.
  • GPU: add the libOpenCL.so <uses-native-library> tags to AndroidManifest.xml. See Installation → Android.

Web

  • GPU only. MediaPipe has no web CPU backend, so web models must run on PreferredBackend.gpu.
  • Mobile .task models often don't work on web — use the -web.task (MediaPipe) or .litertlm (LiteRT-LM) web variant.
  • Memory / cache limits:
| Browser        | Max Model Size | Notes             |
| -------------- | -------------- | ----------------- |
| Chrome/Firefox | ~2 GB          | ArrayBuffer limit |
| Safari         | ~50 MB         | ⚠️ Not suitable   |
  • Large models (>2GB): use WebStorageMode.streaming (OPFS) to bypass the ~2 GB blob limit. Check support with await FlutterGemma.isStreamingSupported(). See Installation → web storage.
  • Storage modes: cacheApi (default, persists across restarts, <2GB), streaming (OPFS, large models, requires Chrome 86+/Edge 86+/Safari 15.2+), none (ephemeral, testing only).
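
The storage-mode guidance above amounts to a small decision rule, sketched here in pure Dart. StorageChoice is a local stand-in for the plugin's WebStorageMode values, pickStorage is a hypothetical helper, and the exact 2 GiB cut-off is an assumption standing in for the approximate "~2 GB" limit.

```dart
/// Local stand-in for the storage modes named above, plus an explicit
/// "cannot be stored" outcome.
enum StorageChoice { cacheApi, streaming, unsupported }

/// Assumed cut-off; the text above only gives an approximate ~2 GB limit.
const int twoGiB = 2 * 1024 * 1024 * 1024;

/// Picks a storage mode from the model size and OPFS streaming support.
StorageChoice pickStorage({
  required int modelBytes,
  required bool streamingSupported,
}) {
  // Models under the limit fit the default, persistent Cache API mode.
  if (modelBytes < twoGiB) return StorageChoice.cacheApi;
  // Larger models need OPFS streaming; without it they cannot be stored.
  return streamingSupported
      ? StorageChoice.streaming
      : StorageChoice.unsupported;
}
```

In an app, the streamingSupported flag would come from await FlutterGemma.isStreamingSupported(), and the result would be mapped onto the corresponding WebStorageMode value.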

Web .litertlm (early preview) feature matrix

Web .litertlm inference runs Gemma .litertlm models in the browser via the upstream @litert-lm/core package (WebGPU + WASM). It is an early preview and a subset of the native path. MediaPipe .task on web is unaffected and remains fully supported.

Works on web .litertlm: text generation (sync + streaming), multi-turn chat with history, system instruction, concurrent sessions (serialized), large models via OPFS streaming, GPU only.

Not supported on web .litertlm yet (mobile/desktop only):

  • Vision / image input: image inputs are dropped with a debug warning.
  • Audio input: no Audio executor config in the JS API.
  • Thinking mode: the extraContext thinking channel is not wired on web.
  • Function calling / tool calls: not available on the web runtime.
  • LoRA weights: loraPath throws UnsupportedError.

For full vision / audio / thinking / function calling on web today, use MediaPipe .task web models instead. These web .litertlm limits track the upstream @litert-lm/core early-preview API and will lift as Google extends the JS executor surface.
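
One way to act on this matrix is a preflight check that reports which requested features would be silently dropped or rejected on web .litertlm. The Feature enum and blockedFeatures helper below are hypothetical names for illustration; the unsupported set simply restates the list above and would need updating as the upstream API changes.

```dart
/// Hypothetical feature flags covering the matrix above.
enum Feature {
  text,
  chatHistory,
  systemInstruction,
  vision,
  audio,
  thinking,
  functionCalling,
  lora,
}

/// Features listed above as not yet supported on web .litertlm.
const Set<Feature> unsupportedOnWebLitertlm = {
  Feature.vision,
  Feature.audio,
  Feature.thinking,
  Feature.functionCalling,
  Feature.lora,
};

/// Returns the requested features that web .litertlm cannot serve,
/// in the order they were requested.
List<Feature> blockedFeatures(Set<Feature> wanted) =>
    wanted.where(unsupportedOnWebLitertlm.contains).toList();
```

An app could run this before choosing between a web .litertlm model and a MediaPipe .task web model, falling back to the latter whenever the returned list is non-empty.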

Desktop storage locations

Desktop builds store downloaded models outside the user's Documents/ folder to avoid OneDrive / iCloud / Domain-Roaming sync corrupting FFI mmap of large .litertlm files:

  • Windows: %LOCALAPPDATA%\flutter_gemma\ (never OneDrive-synced)
  • macOS: ~/Library/Application Support/<bundle>/flutter_gemma/
  • Linux: ~/.local/share/<app>/flutter_gemma/

Models installed by older 0.14.x / 0.15.0 builds that still live under Documents/ keep working via a fallback read.
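
When hunting for a downloaded model on disk, the locations above can be reconstructed with a small helper. This is a sketch under assumptions: desktopModelDir and the appId parameter (standing in for the <bundle> / <app> placeholder) are hypothetical, and the plugin may resolve these directories differently internally.

```dart
/// Builds the expected model directory from the locations listed above.
/// [os] uses the same values as Platform.operatingSystem; [env] would
/// typically be Platform.environment.
String desktopModelDir({
  required String os,
  required Map<String, String> env,
  required String appId, // hypothetical stand-in for <bundle> / <app>
}) {
  final home = env['HOME'] ?? '';
  switch (os) {
    case 'windows':
      final localAppData = env['LOCALAPPDATA'] ?? '';
      return '$localAppData\\flutter_gemma';
    case 'macos':
      return '$home/Library/Application Support/$appId/flutter_gemma';
    case 'linux':
      return '$home/.local/share/$appId/flutter_gemma';
    default:
      throw UnsupportedError('Not a desktop platform: $os');
  }
}
```

Passing the environment in as a map keeps the function free of dart:io, so it can be exercised on any platform.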

Multimodal

  • Ensure you're using a multimodal model (Gemma 4, Gemma3n E2B/E4B, FastVLM).
  • Set supportImage: true (and supportAudio: true for audio) when creating the model.
  • Check device memory — multimodal models require more RAM.
  • Use the GPU backend for better performance. See Multimodal.

Function calling

  • Function calling is supported only by select models (Gemma 4, Gemma3n, Gemma 3 1B, FunctionGemma, DeepSeek, Qwen, Phi-4). Unsupported models log a warning and ignore tools — they still work for text generation. Check supportsFunctionCalls. See Function Calling.