Game Crypto Lab Docs

Our Method

Last updated 4 months ago

We present GameCrypto, a generalizable world model that learns from a small-scale dataset of Minecraft game videos. By leveraging the prior knowledge of a pretrained video diffusion model, it can create new games in an open domain.

Our work consists of several key components and innovations:

Overview: As shown in Figure 1, GameCrypto builds upon pre-trained video generation models, extending them with a pluggable action control module. This design effectively leverages both large-scale unlabeled data and small-scale high-quality Minecraft action data.

Action Control Module: Illustrated in Figure 2, our module integrates with Diffusion Transformer blocks through distinct control mechanisms for mouse and keyboard inputs. To address granularity mismatch between action signals and frame latents, we implement group operations. A sliding window mechanism is adopted to handle delayed action effects (e.g., jump).
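The grouping and sliding-window ideas above can be sketched as follows. This is a minimal illustration under assumed shapes: the function name, the temporal compression ratio of the video tokenizer, and the window size are all hypothetical, not the actual implementation.

```python
import torch

def group_actions(actions: torch.Tensor, ratio: int, window: int) -> torch.Tensor:
    """Group per-frame action signals to match temporally compressed frame latents.

    actions: (T, A) per-frame action features (T raw frames, A action dims).
    ratio:   assumed temporal compression ratio (T frames -> T // ratio latents),
             the granularity mismatch the grouping resolves.
    window:  how many grouped steps of past actions each latent can see, so that
             delayed action effects (e.g. a jump) remain visible.
    Returns: (T // ratio, window * ratio, A) per-latent action context.
    """
    T, A = actions.shape
    L = T // ratio
    # Group raw per-frame actions so each latent gets one group of `ratio` actions.
    grouped = actions[: L * ratio].reshape(L, ratio, A)
    # Zero-pad the history so the first latents still have a full window.
    padded = torch.cat([grouped.new_zeros(window - 1, ratio, A), grouped], dim=0)
    # Sliding window: each latent attends to its own group plus prior groups.
    ctx = torch.stack([padded[i : i + window].reshape(-1, A) for i in range(L)])
    return ctx
```

In this sketch the returned context would be fed to the per-block control mechanisms (e.g. cross-attention for mouse signals, feature modulation for key presses); the exact injection method is not specified here.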

Multi-Phase Training Strategy: Figure 3 outlines our four-phase training approach for scene generalization. Starting with open-domain pretraining, followed by game-specific style learning, then action control training, and finally open-domain action-controlled generation, this strategy delivers action control capability while preserving open-domain scene generation ability.
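As a rough illustration of the four phases, the schedule can be written as a table of which component trains on which data. The module names (`style_adapter`, `action_module`) and the idea of dropping the game-style component at inference are assumptions for illustration, not the documented API.

```python
# Hypothetical phase schedule; module names are illustrative, not the paper's.
PHASES = {
    0: {"data": "open-domain videos",      # pretrain the base generator
        "train": ["diffusion_transformer"], "freeze": []},
    1: {"data": "Minecraft videos",        # learn game-specific style
        "train": ["style_adapter"],         "freeze": ["diffusion_transformer"]},
    2: {"data": "Minecraft + action logs", # learn action control in isolation
        "train": ["action_module"],
        "freeze": ["diffusion_transformer", "style_adapter"]},
    3: {"data": "inference only",          # drop style_adapter: open-domain
        "train": [],                       #  scenes with action control
        "freeze": ["diffusion_transformer", "action_module"]},
}
```

The key design point reflected here is isolation: because phase 2 freezes everything except the action module, the learned control does not become entangled with Minecraft's visual style and can transfer to open-domain scenes in phase 3.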

Autoregressive Generation: As demonstrated in Figure 4, our autoregressive generation mechanism creates continuous gameplay by using previous frames as conditions for generating new ones.
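The autoregressive rollout described above can be sketched as a simple loop; the `generate` interface, the number of conditioning frames (k+1), and the per-step action argument are assumed for illustration.

```python
def rollout(model, first_frames, actions, k, steps):
    """Autoregressive video rollout sketch (assumed interface).

    model:        anything with a generate(cond_frames, action) -> new_frames method.
    first_frames: the k+1 initial conditioning frames.
    actions:      one action input per generation step.
    Each step conditions on the latest k+1 frames to produce the next chunk,
    so gameplay can continue indefinitely.
    """
    frames = list(first_frames)
    for t in range(steps):
        cond = frames[-(k + 1):]                # latest k+1 frames as condition
        frames.extend(model.generate(cond, actions[t]))
    return frames
```

A toy usage: with a dummy `generate` that emits two "frames" continuing an integer sequence, two rollout steps extend `[0, 1]` to `[0, 1, 2, 3, 4, 5]`.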

Figure 1: GameCrypto overview. The blue section shows the pre-trained model's generation ability, while the green section shows the pluggable action control module.
Figure 2: Action Control Module architecture. It integrates into transformer blocks with separate mechanisms for mouse and keyboard inputs. Action sequences are grouped to handle temporal compression and delayed effects.
Figure 3: Multi-phase training pipeline. Phase 0: pretrain on open-domain data; Phase 1: finetune for game style; Phase 2: train action control; Phase 3: generate action-controlled open-domain content.
Figure 4: Autoregressive video generation process. Training uses (k+1) initial frames as conditions to predict remaining frames, while inference iteratively uses latest frames to generate new ones.