An important caveat from the paper:
> Moreover, we follow previous work in accelerating block breaking because learning to hold a button for hundreds of consecutive steps would be infeasible for stochastic policies, allowing us to focus on the essential challenges inherent in Minecraft.
Like all things RL, it is 99.9% about engineering the environment and rewards. As one of the authors stated elsewhere here, there is a reward for completing each of 12 steps necessary to find diamonds.
Mostly I'm tired of RL work being oversold by its authors and proponents by anthropomorphizing its behaviors. All while this "agent" cannot reliably learn to hold down a button, literally the most basic interaction of the game.
The "no free lunch" theorem. You can't start from scratch and expect your program to repeat 4 billion years of evolution collecting inductive biases useful in our corner of our Universe in a matter of hours[1].
While it's possible to bake in this particular inductive bias (repetitive actions might be useful), they decided not to (it's just not that interesting).
[1] And you certainly can't reproduce the observation selection effect in a laboratory. That is the thing that makes it possible to overcome the "no free lunch" theorem: our existence and intelligence are conditional on evolution being possible and finding the right biases.
We have to bake in inductive biases to get results. We have to incentivize behaviors useful (or interesting) to us to get useful results instead of generic exploration.
> You can't start from scratch and expect your program to repeat 4 billion years of evolution collecting inductive biases useful in our corner of our Universe in a matter of hours
Really? Minecraft's gameplay dynamics are not particularly complex... The AI here isn't learning highly complex rules about the nuances of human interaction, or learning to detect the relatively subtle differences between various four-legged creatures based on small differences in body morphology. In those cases I could see how millions of years of evolution matter, at least to give us and other animals a head start when entering the world. If the AI had to do something like that to progress in Minecraft, then I'd get why learning those complexities would be skipped over.
But in this case a human would quickly understand that holding a button creates a state which tapping a button does not, and therefore would assume this state could be useful for exploring further states. Identifying this doesn't seem particularly complex to me. If the argument is that it will take slightly longer for an AI to learn patterns in dependent states, then okay, sure. But arguing that learning that holding a button creates a new state is such a complex problem that we couldn't possibly expect an AI to learn it from scratch within a short timeframe is a very weak argument. It's just not that complex. To me this suggests that current algorithms are lacking.
> Minecraft's gameplay dynamic are not particularly complex...
I think you underestimate the complexity of going from 12288+400 changing numbers to a concept of gameplay dynamics in the first place. Or, in other words, your complexity prior is biased by experience.
It seems easy to you because you can't remember the years when you were a toddler and had to learn basic interactions with the world around you. It seems natural to an adult but it is quite complex.
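(For context, those 12288+400 numbers plausibly break down as a 64x64 RGB image plus a few hundred scalar readouts; the exact split below is my assumption, not something stated above.)

    # Assumed breakdown of the raw observation the agent sees each step.
    image_values = 64 * 64 * 3    # 12288 pixel values per frame (assumed resolution)
    scalar_values = 400           # e.g. inventory counts, health, pose (assumed)
    print(image_values + scalar_values)  # 12688 raw numbers changing every step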
Well, to be fair... I (a human) had to look it up online the first time I played as well. I was repeatedly clicking on the same tree for an entire minute before that. I even tried several different trees just in case.
In my mind, this generalizes to the same problem with other non-stochastic (deterministic) operations like logical conclusions (A => B).
I have a running bet with a friend that humans encode deterministic operations in neural networks too, while he thinks there has to be another process at play. But there might be something extra helping our neural networks learn the strong weights required for it. Or the answer is, again: "more data".
I'm not sure it's a serious caveat if the "hint" or "control" is in the manual.
Sorry, I don't quite follow what you mean?
I didn't read the manual and when I was trying to help my kid play the game I couldn't figure out how to break blocks.
"accelerating block breaking because learning to hold a button for hundreds of consecutive steps "
This is fine, and does not impact the importance of figuring out the steps.
For anybody who has done tuning on systems that run at different speeds, adjusting for the speed difference is just engineering, and it lets you get on with more important/inventive work.
Turns out that AIs are much better at playing video games if they're allowed to cheat.
"It allows AI to understand its physical environment and also to self-improve over time, without a human having to tell it exactly what to do."
In my view, the 'exactly' is crucial here. They do implicitly tell the model what to do by encoding it in the reward function:
In Minecraft, the team used a protocol that gave Dreamer a ‘plus one’ reward every time it completed one of 12 progressive steps involved in diamond collection — including creating planks and a furnace, mining iron and forging an iron pickaxe.
This is also why I think the title of the article is slightly misleading.
Key to Dreamer’s success, says Hafner, is that it builds a model of its surroundings and uses this ‘world model’ to ‘imagine’ future scenarios and guide decision-making.
Can you look at the world model, like you can look at Waymo's world model? Or is it hidden inside weights?
Machine learning with world models is very interesting, and the people doing it don't seem to say much about what the models look like. The Google manipulation work talks endlessly about the natural language user interface, but when they get to motion planning, they don't say much.
Yes, you can decode the imagined scenarios into videos and look at them. It's quite helpful during development to see what the model gets right or wrong. See Fig. 3 in the paper: https://www.nature.com/articles/s41586-025-08744-2
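(For the curious, here's a rough sketch of what decoding imagined rollouts can look like; world_model.encode/imagine/decode and policy are placeholder names, not Dreamer's actual API.)

    import numpy as np

    def imagine_video(world_model, policy, first_obs, horizon=64):
        # Roll the learned dynamics forward purely in latent space,
        # then reconstruct each imagined latent state into a frame to inspect.
        state = world_model.encode(first_obs)
        frames = []
        for _ in range(horizon):
            action = policy(state)                      # act inside the "dream"
            state = world_model.imagine(state, action)  # predicted next latent state
            frames.append(world_model.decode(state))    # decoded image for that step
        return np.stack(frames)                         # (horizon, H, W, 3) video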
I implemented an acoustic segmentation system on an FPGA recently. The whole world model was a long list of known events and states with feasible transitions, plus novel things not observed before. Basically a rather dumb state machine with a machine-learning part attached to acoustic sensors. Of course, both parts could be hidden behind weights. But the state machine was easily readable, and that was its biggest advantage.
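(A toy version of that readable transition table might look like this, with all names invented for illustration.)

    # Toy sketch: an explicit, human-readable table of feasible transitions,
    # with anything never observed before routed to a novelty state.
    FEASIBLE = {
        ("idle", "impact"): "impact_event",
        ("impact_event", "silence"): "idle",
        ("idle", "continuous_hum"): "machinery_on",
        ("machinery_on", "silence"): "idle",
    }

    def next_state(state, event):
        return FEASIBLE.get((state, event), "unknown_event")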
Why would an accounting system need acoustic sensors?
Sorry. Terrible typo. Acoustic system was cheap though.
> Can you look at the world model, like you can look at Waymo's world model? Or is it hidden inside weights?
I imagine it's the latter, and in general, we're already dealing with plenty of models with world models hidden inside their weights. That's why I'm happy to see the direction Anthropic has been taking with their interpretability research over the years.
Their papers, as well as most discussions around them, focus on issues of alignment/control, safety, and generally killing the "stochastic parrot" meme and keeping it dead - but I think it'll be even more interesting to see attempts at mapping how those large models structure their world models. I believe there are scientific and philosophical discoveries to be made in answering why these structures look the way they do.
> killing the "stochastic parrot" meme
This was clearly the goal of the "Biology of LLMs" (and ancillary) paper but I am not convinced.
They used a 'replacement model' that by their own admission could match the output of the LLM ~50% of the time, and the attribution of cognition-related labels to the model hinges entirely on the interpretation of the 'activations' seen in the replacement model.
So they created a much simpler model, that sorta kinda can do what the LLM can do in some instances, contrived some examples, observed the replacement model and labeled what it was doing very liberally.
Machine learning and the mathematics involved are quite interesting, but I don't see the need to attach neuroscience/psychology terms to these models. They are fascinating on their own terms, and modelling language can clearly be quite powerful.
But thinking that they can follow instructions and reason is the source of much misdirection. The limits of this approach should make clear that feeding text to a text-continuation program should not lead to parsing the generated text for commands and running those commands, because the tokens the model outputs are just statistically linked to the tokens fed into it. And as the model takes in more tokens from the wild, it can easily end up in situations that pose an enormous risk. Pushing the idea that they are reasoning about the input is driving all sorts of applications that, if the models were seen as statistical text-continuation programs, would clearly be recognized as a glaring risk.
Machine learning and LLMs are interesting technology that should be investigated and developed. Reasoning by induction that they are capable of more than modelling language is bad science and drives bad engineering.
I’d say it’s more like Waymo’s world model. The main actor uses a latent vector representation of the state of the game to make decisions. This latent vector at train time is meant to compress a bunch of useful information about the game. So while you can’t really understand the actual latent vector that represents state, you do know it encodes at least the state of the game.
This world-model stuff is only possible in environments that are sandboxed, i.e. where you can represent the state of the world and have a way of producing the next state given a current state and action. Things like Atari games, robot simulations, etc.
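(Concretely, "sandboxed" here amounts to having a step function of roughly this shape; a minimal gym-style sketch, not any particular library's API.)

    class SandboxEnv:
        """Anything exposing this interface can, in principle, be modelled."""

        def reset(self):
            # Return the initial observation of a fresh episode.
            raise NotImplementedError

        def step(self, action):
            # Return (next_observation, reward, done, info) for one action,
            # i.e. the "next state given a current state and action" property.
            raise NotImplementedError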
The article makes it seem like finding diamonds is some kind of super complicated logical puzzle. In reality the hardest part is knowing where to look for them and what tool you need to mine them without losing them once you find them. This was given to the AI by having it watch a video that explains it.
If you watch a guide on how to find diamonds it's really just a matter of getting an iron pickaxe, digging to the right depth and strip mining until you find some.
Hi, author here! Dreamer learns to find diamonds from scratch by interacting with the environment, without access to external data. So there are no explainer videos or internet text here.
It gets a sparse reward of +1 for each of the 12 items that lead to the diamond, so there is a lot it needs to discover by itself. Fig. 5 in the paper shows the progression: https://www.nature.com/articles/s41586-025-08744-2
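(For anyone wondering what that reward structure looks like in practice, here's a minimal sketch of a milestone-style reward wrapper; the item list and env/info API are illustrative assumptions, not the paper's actual code.)

    MILESTONES = [
        "log", "planks", "stick", "crafting_table", "wooden_pickaxe",
        "cobblestone", "stone_pickaxe", "iron_ore", "furnace",
        "iron_ingot", "iron_pickaxe", "diamond",
    ]  # assumed 12-item chain

    class MilestoneReward:
        def __init__(self, env):
            self.env = env
            self.collected = set()

        def reset(self):
            self.collected = set()
            return self.env.reset()

        def step(self, action):
            obs, _, done, info = self.env.step(action)
            reward = 0.0
            for item in MILESTONES:
                # +1 only the first time each milestone item shows up in the inventory
                if item not in self.collected and info["inventory"].get(item, 0) > 0:
                    self.collected.add(item)
                    reward += 1.0
            return obs, reward, done, info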
Since diamonds are surrounded by danger, and if it dies it loses its items and such, why would it not be satisfied after discovering the iron pickaxe or some such? Is it in a mode where it doesn't lose its items when it dies? Does it die a lot? Does it ever try digging vertically down? Does it ever discover other items/tools you didn't expect it to? An open world with sparse reward seems like such a hard problem. Also, once it gets an item, does it stop getting reward for it? I assume so. Surprised that it can work with this level of sparse rewards.
In all reinforcement learning there is (explicitly as part of a fitness function, or implicitly as part of the algorithm) some impetus for exploration. It might be adding a tiny reward per square walked, a small reward for each block broken and a larger one for each new block type broken. Or it could be just forcing a random move every N steps so the agent encounters new situations through “clumsiness”.
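(Two of those mechanisms, sketched roughly; parameter values and names are arbitrary.)

    import random

    def epsilon_greedy(policy_action, action_space, epsilon=0.05):
        # Occasionally force a random move so the agent stumbles into new situations.
        return random.choice(action_space) if random.random() < epsilon else policy_action

    def block_break_bonus(block_type, seen_blocks, small=0.01, large=0.1):
        # Small reward for breaking any block, a larger one the first time a new type is broken.
        if block_type not in seen_blocks:
            seen_blocks.add(block_type)
            return large
        return small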
While I agree with your comment, this sentence:
"This was given to the AI by having it watch a video that explains it."
That was not as trivial as it may seem, even just a few months ago...
EDIT: Incorrect, see below
It didn't watch 'a video', it watched many, many hours of video of people playing Minecraft (with another specialised model feeding in predictions of keyboard and mouse inputs from the video). It's still a neat trick, but it's far from the implied one-shot learning.
The author replied in this thread and says the opposite.
Ah, I was incorrect. I got that impression from one of the papers linked at the end of the article, but I suspect that's actually some previous work.
AlphaStar was also trained initially from YouTube videos of pros playing StarCraft. I would argue that it was pretty trivial a few years ago.
I don't think it was videos. Almost certainly it was replay files with a bunch of work to transform them into something that could be compared to the model's outputs. (Alphastar never 'sees' the game's interface, only a transformed version of information available via an API)
This was my understanding as well, as the replay files are all available anyway.
The YouTube documentary is actually very detailed about how they implemented everything.
Do you know if it was actual videos or some simpler inputs like game state and user inputs? I’d be impressed if it was the former at that time.
StarCraft provides replay files that start with the initial game state and then record every action in the game. Not user inputs, but the actions bound to them.
>This was given to the AI by having it watch a video that explains it.
That is not what the article says. It says that was separate, previous research.
I don't get it. How can you reduce this achievement down to this?
Have you gotten used to some ai watching a video and 'getting it' so fast that this is boring? Unimpressive?
I feel like you are jumping to conclusions here, I wasn't talking about the achievement or the AI, I was talking about the article and the way it explains finding diamonds in minecraft to people who don't know how to find diamonds in minecraft.
The AI is able to learn from video and you don't find that even a little bit impressive? Well I disagree.
see [0]
[0] https://news.ycombinator.com/item?id=43609826
Slightly off-topic from the article itself, but… does anyone else feel like Nature's cookie banner just never goes away? I have vivid memories of trying to reject cookies multiple times, eventually giving up and accepting them just to get to the article, only for the banner to show up again the next time I visit. I swear it's giving me déjà vu every single visit. Am I the only one experiencing this, or is this just how their site works?
I didn't know that Nature did movie promotions.
Attempting to train this on a real workload I converted over the weekend: after ~8M "steps" so far, it rarely scores above 5% and most runs are 0%, though it did score 60% once, about 7M steps ago.
Adding more than one GPU didn't improve speed, but that's pretty standard since we don't have fancy interconnect. A bit annoying that they didn't use TensorBoard for logging, but overall it seems like a pretty cool lib - will leave it a few days and see if it can learn (no other algo has, so I don't have much hope).
There's a YouTube channel that does a lot of videos focused on LLMs in Minecraft:
https://www.youtube.com/@EmergentGarden
I very much like the comparative approach this guy takes looking at how different LLMs fare... including how they interact together. Worth a look.
Minecraft is ubiquitous now.
But I remember the alpha version, and NOBODY knew how to make a pickaxe. Humans were also very bad at figuring out these steps.
People were de-compiling the java and posting help guides on the internet.
How to break a tree, get sticks, make a wood pick. In Alpha, that was a big deal for humans also.
https://archive.is/XutGu
How robust is this?
Isn't something like finding diamonds in Minecraft something that old-school AI could already do decently?
Those were trained on human play. This had to figure it out from scratch.
Ah, is this full RL?
I was reading something about LLMs earlier and was thinking that LLMs could probably write a simple case-based script for controlling a player that could achieve a decent success rate.
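(Something like this is what I mean by a case-based script; entirely hypothetical, not from the article.)

    # Hypothetical hand-written (or LLM-written) controller: match the current
    # situation against a few cases and emit a scripted high-level action.
    def choose_action(state):
        inv = state["inventory"]
        if inv.get("log", 0) < 4:
            return "chop_nearest_tree"
        if "wooden_pickaxe" not in inv:
            return "craft_wooden_pickaxe"
        if "iron_pickaxe" not in inv:
            return "mine_and_smelt_iron"
        if state["y_level"] > 12:           # dig toward assumed diamond depth
            return "dig_down_staircase"
        return "strip_mine"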
Finally a use case for AI
Isn't "masters" when you build a working copy of Minas Tirith or something like that?
I'd accept "build a tnt trap for your buddy" or "defeated the end dragon"
This looks like an article about the recent Nature publication. I was confused at first because DreamerV3 is a couple of years old now.
Pretty impressive. Minecraft’s a complex environment, so for an AI to figure out how to find diamonds on its own shows real progress in learning through exploration — not just pattern recognition.
Characterizing finding diamonds as "mastering" Minecraft is extremely silly. Tantamount to saying "AI masters Chess: Captures a pawn." Getting diamonds is not even close to the hardest challenge in the game, but most readers of Nature probably don't have much experience playing Minecraft so the title is actually misleading, not harmless exaggeration.
> Getting diamonds is not even close to the hardest challenge in the game
Mining diamonds isn't even necessary if you build, e.g., ianxofour's iron farm on day one and trade that iron[0] with a toolsmith, armourer, and weaponsmith. You can get full diamond armour, tools, and weapons pretty quickly (probably a handful of game weeks?)
[0] Main faff here is getting them off their base trade level.
Isn't this DeepMind achievement from 2023?
Who would have thought you could get your TAS run published in Nature if you used enough hot buzzwords? (They have been using various old-school-definition "artificial intelligence" algorithms for a long time.)
https://tasvideos.org/
So can I, and no one needed to teach me either, but you don't see Nature writing articles on it...
This is too dismissive, and there are a zillion articles about human learning.
They write: "Below, we show uncut videos of runs during which Dreamer collected diamonds."
... but the first video only shows the player character digging downwards without using any tools and eventually dying in lava. What?
I guess we can look forward
to a bright future
where we focus 100% on work
and AI will play our games
/s
Once again, we see that it's much easier to teach machines to perceive and decide well, in many cases well above human performance - while at the same time, making machines that can navigate the same physical environment humans do, and do a variety of manual tasks that mix power and precision, remains extremely challenging.
The message this sends is pretty clear: machines are better at thinking, humans are better at manual work. That is the natural division of labor that plays into strengths and weaknesses of both computers and human beings.
And so, I'm sorry to say this, but the near future is one in which computers play our games and do the thinking and creative work and management (and ultimately governance), because they're going to be better at this than us, leaving us to do all the physical labor, because that's the one thing we will remain better at for a while.
That, or we move past the existing economic structures, so that we no longer need to worry about being competitive with AI labor.
/s, but only a little.
> where we focus 100% on work
Lol, that's crazy optimistic. What work?
Picking up dropped pencils, for example. Robots are still hilariously bad at that. Or driving your new AI overlord around the country from LAN to LAN.
> Picking up dropped pencils, for example. Robots are still hilariously bad at that
It's only hilarious because we're allowed to laugh. For now. Wait a few years; it's possible these things will demand respect.