Spatial Understanding
Copyright 2025 Google LLC.
2D spatial understanding with Gemini
This notebook introduces object detection and spatial understanding with the Gemini API, as shown in the Spatial understanding example from AI Studio and demonstrated in the Building with Gemini 2.0: Spatial understanding video.
You'll learn how to use Gemini the same way as in the demo and perform object detection like this:

There are many examples, including object detection with
- simply overlaying information
- searching within an image
- translating and understanding things in multiple languages
- using Gemini thinking abilities
Note
There's no "magical prompt". Feel free to experiment with different ones. You can use the dropdown to see different samples, but you can also write your own prompts. Also, you can try uploading your own images.
Setup
Install SDK
Set up your API key
To run the following cell, your API key must be stored in a Colab Secret named GOOGLE_API_KEY. If you don't already have an API key, or you're not sure how to create a Colab Secret, see Authentication for an example.
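As a minimal sketch of that setup, the helper below reads the key from the Colab Secret and, outside Colab, falls back to an environment variable of the same name. The fallback and the helper name `load_api_key` are assumptions added for illustration; they are not part of the notebook.

```python
import os


def load_api_key(name="GOOGLE_API_KEY"):
    """Return the API key from a Colab Secret, else from the environment."""
    try:
        # This module is only importable inside a Colab runtime.
        from google.colab import userdata
        return userdata.get(name)
    except ImportError:
        # Assumed fallback for local runs: an environment variable.
        key = os.environ.get(name)
        if not key:
            raise RuntimeError(f"{name} is not set")
        return key
```

The returned string is what you would pass as the API key when initializing the client in the next step.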
Initialize SDK client
With the new SDK you now only need to initialize a client with your API key.
Select and configure a model
Spatial understanding works best with the Gemini 2.0 Flash model. It's even better with 2.5 models like gemini-2.5-pro, but slightly slower as it's a thinking model.
Some features, like segmentation, only work with 2.5 models.
The Object detection contains good examples of what previous models were able to do.
For more information about each of the Gemini models, check the documentation.
System instructions
With the new SDK, the system_instructions and the model parameters must be passed in every generate_content call, so let's save them to avoid retyping them each time.
The system instructions are mainly used to make the prompts shorter by not having to repeat the format each time. They also tell the model how to deal with similar objects, which is a nice way to let it be creative.
The Spatial understanding example uses a different strategy, with no system instructions but a longer prompt. You can see its full prompts by clicking on the "show raw prompt" button on the right. There is no optimal solution: experiment with different strategies and find the one that suits your use case best.
It is also recommended to always disable thinking, as so far it adds latency without improving the results.
Import
Import all the necessary modules.
Utils
Some scripts will be needed to draw the bounding boxes. They are just examples, and you are free to write your own.
For example, the Spatial understanding example from AI Studio uses HTML to render the bounding boxes. You can find its code in the GitHub repo.
Output: the fonts-noto-cjk package (61.2 MB) is installed.
Get example images
Overlaying Information
Let's start by loading an image, the origami one for example:
Let's start with a simple prompt to find all items in the image.
To prevent the model from repeating itself, it is recommended to use a temperature over 0, in this case 0.5. Limiting the number of items (25 in the system instructions) also prevents the model from looping and speeds up the decoding of the bounding boxes. You can experiment with these parameters and find what works best for your use case.
It is also recommended to always disable thinking, as so far it adds latency without improving the results.
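The settings discussed so far (shared system instructions, a temperature of 0.5, thinking disabled) can be bundled once and reused for every call. The sketch below is illustrative rather than the notebook's actual cell: the wording of `SYSTEM_INSTRUCTIONS`, the model name, and the `detect` helper are assumptions, and it relies on the SDK accepting a plain dict for `config`. `client` is assumed to be the client initialized earlier, and `image` anything the SDK accepts in `contents`, such as a PIL image.

```python
# Illustrative wording only; the notebook's real system instructions are not shown here.
SYSTEM_INSTRUCTIONS = (
    "Return bounding boxes as a JSON array with labels. "
    "Limit to 25 objects. "
    "If an object appears several times, name each instance by a "
    "distinguishing characteristic (color, size, position, ...)."
)

MODEL_ID = "gemini-2.5-flash"  # assumed model name; use the one you selected

# Generation settings shared by all detection calls (plain-dict form).
DETECTION_CONFIG = {
    "system_instruction": SYSTEM_INSTRUCTIONS,
    "temperature": 0.5,  # above 0 to discourage repetition
    "thinking_config": {"thinking_budget": 0},  # a budget of 0 disables thinking
}


def detect(client, image, prompt, model=MODEL_ID, config=DETECTION_CONFIG):
    """Send one prompt and image with the shared settings; return the raw text."""
    response = client.models.generate_content(
        model=model,
        contents=[prompt, image],
        config=config,
    )
    return response.text
```

A call would then look like `detect(client, im, "Detect the 2d bounding boxes of the items")`, where the prompt text is again only an example.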
```json
[
{"box_2d": [390, 64, 574, 203], "label": "red sprinkles"},
{"box_2d": [382, 250, 537, 369], "label": "pink and blue sprinkles"},
{"box_2d": [365, 397, 501, 509], "label": "pink frosting"},
{"box_2d": [355, 529, 521, 650], "label": "pink frosting with blue balls"},
{"box_2d": [384, 737, 535, 866], "label": "chocolate frosting"},
{"box_2d": [443, 432, 595, 564], "label": "pink frosting with googly eyes"},
{"box_2d": [477, 627, 638, 770], "label": "white frosting with colorful sprinkles"},
{"box_2d": [556, 40, 726, 200], "label": "white frosting with colorful sprinkles"},
{"box_2d": [510, 799, 690, 959], "label": "white frosting with colorful candies"},
{"box_2d": [545, 295, 702, 444], "label": "white frosting with googly eyes"},
{"box_2d": [559, 514, 712, 663], "label": "white frosting with googly eyes"},
{"box_2d": [429, 597, 638, 770], "label": "white frosting with colorful sprinkles"},
{"box_2d": [713, 271, 874, 497], "label": "white frosting with googly eyes"},
{"box_2d": [655, 350, 816, 516], "label": "white frosting with googly eyes"},
{"box_2d": [743, 133, 921, 307], "label": "white frosting with googly eyes"}
]
```
As you can see, even without any instructions about the format, Gemini is trained to always use this format with a label and the coordinates of the bounding box in a "box_2d" array.
Be careful: the y coordinates come first and the x coordinates second, contrary to common usage.
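To make the coordinate order concrete, here is a small sketch that parses an answer like the one above and converts each box to pixel coordinates. It assumes the values are normalized to a 0 to 1000 range, as described in the segmentation section of this notebook. The helper names `parse_boxes` and `to_pixel_box` are hypothetical and are not the notebook's own utilities.

```python
import json

FENCE = "`" * 3  # the Markdown code-fence marker the model may wrap its JSON in


def parse_boxes(response_text):
    """Parse the model's JSON answer, tolerating an optional Markdown fence."""
    text = response_text.strip()
    if text.startswith(FENCE):
        # Drop the opening fence line, then everything from the closing fence on.
        text = text.split("\n", 1)[1]
        text = text.rsplit(FENCE, 1)[0]
    return json.loads(text)


def to_pixel_box(box_2d, image_width, image_height):
    """Convert [y0, x0, y1, x1] on a 0-1000 scale to (left, top, right, bottom)."""
    y0, x0, y1, x1 = box_2d  # note: y comes before x
    left = round(x0 * image_width / 1000)
    top = round(y0 * image_height / 1000)
    right = round(x1 * image_width / 1000)
    bottom = round(y1 * image_height / 1000)
    return left, top, right, bottom
```

The (left, top, right, bottom) order is the one most drawing libraries expect for rectangles, which is why the conversion also swaps the axes.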
Search within an image
Let's complicate things and search within the image for specific objects.
```json
[
{"box_2d": [46, 246, 385, 526], "label": "light blue sock with cat face on left"},
{"box_2d": [233, 661, 650, 862], "label": "light blue and grey sock with cat face on right"}
]
```
(640, 482)
Try it with different images and prompts. Different samples are proposed but you can also write your own.
Multilinguality
As Gemini is able to understand multiple languages, you can combine spatial reasoning with multilingual capabilities.
You can give it an image like this and prompt it to label each item with Japanese characters and an English translation. The model reads the text, recognizes the pictures in the image itself, and translates them.
(640, 640)
Use Gemini reasoning capabilities
The model can also reason about the image: you can ask it about the positions of items, their utility, or, as in this example, to find the shadow of a specific item.
(640, 482)
You can also use Gemini's knowledge to enhance the labels returned. In this example, Gemini will give you advice on how to fix your little mistake.
Note that this time the image is resized to 1024px, which helps the model see the bigger picture and give you advice. There's no clear rule about when to do this: experiment and find what works best for you.
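As a rough illustration of the resizing step, the helper below computes the dimensions an image would have once its longest side is capped at 1024px while keeping the aspect ratio. The name `fit_within` is hypothetical, and the rounding may differ slightly from what an image library does internally.

```python
def fit_within(width, height, max_side=1024):
    """Return (width, height) scaled down so the longest side equals max_side.

    Images that are already small enough are returned unchanged (no upscaling).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    new_width = max(1, round(width * max_side / longest))
    new_height = max(1, round(height * max_side / longest))
    return new_width, new_height
```

With Pillow, the result could be passed to `Image.resize`, or `Image.thumbnail((1024, 1024))` could be used directly, since it also preserves the aspect ratio and never enlarges the image.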
(640, 480)
'```json\n[\n {"box_2d": [447, 0, 992, 420], "label": "Use this cloth to absorb the spilled liquid by gently dabbing or wiping the area."},\n {"box_2d": [431, 603, 663, 871], "label": "This spill needs to be cleaned immediately to prevent staining the countertop."},\n {"box_2d": [303, 147, 461, 314], "label": "Move these knives away from the spill area to avoid contamination or accidental cuts during cleaning."},\n {"box_2d": [0, 465, 320, 808], "label": "Ensure this sugar container is sealed or moved to prevent it from getting wet or contaminated during cleaning."},\n {"box_2d": [196, 301, 508, 514], "label": "Carefully move this sugar pot away from the spill to protect its contents and prevent it from being knocked over."},\n {"box_2d": [182, 0, 998, 1000], "label": "This white marble-patterned countertop needs to be wiped clean of the spill. After wiping, you may want to use a damp cloth to remove any residue and then dry it thoroughly."}\n]\n```'
If you check the previous examples, the Japanese food one in particular, you'll find several other prompt samples for experimenting with Gemini's reasoning capabilities.
Experimental: Segmentation
2.5 models can also segment the image: in addition to a bounding box, they provide a mask of each item's contour. This is especially useful if you plan to edit images, as in the Virtual try-on example.
```json
[
{"box_2d": [168, 97, 360, 228], "mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAACHElEQVR42u3dwY7bMBBEwfn/n55gD7klGy+syJJe9cUXw0AXSJqmCWhG5JTsv/L1jlThF5MtfiOO/Vza7T/psJdMuftJEnufhKsfT7C3Tbz+AQJ7/6TLvyOwz0m5u/4Afi6w2xbYOMDGATYOsBsX2LoAAAAAAAAAAKBLAAAAAABpAQAAAABICwAAACANsAAAALAG2gg6D+j2dyyeByDgjki9v2tSLsq5KgnAdeHJjwACA4AAAAC2Ag8D8IMAwDgVcC7kZPCF9vXrknWAfXbq/bfeH8DW+38vsI3U+/9doA6woQCI9996/z8I7MYFAMQFFkCboN4fwJoCbYDduACAugAAU4AAAABWQQDmAABzwBAAACAKMAAAAADga8AIMAIAAABQXQX9Rw4AQBxgAAAAAABAGcCNYQAAAAAgUO4PAAAAAgAAAAj3BzAAzAEAAABYBcP9AQCwBgAAAACArwEAAAAAANAEGADx/kYAAAAAAAAAYCMAAAAAAAAAALATCo4Az1yyCAAAAAAAAAAAANgLR/sDyAMQMALqAJ5EX++fEagDzMQF9G8LTBxgJi4wcYEB0BaYaQsMgLbAtAHm9dT7P1Fgfph6/4nXf5jATFpgpg0wkxaYd1Lvf3+BeTvx+ncWmKMSr39TgTk27fY3I5j/lXb7GxDMKWm3vyjBnJx2+8sgzOdT7v4Zhbla+RMN5uoJVz8O4fcnfL2OyAn5BQvWXX8AGwU/AAAAAElFTkSuQmCC", "label": "glass jar with sprinkles"},
{"box_2d": [276, 804, 467, 936], "mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAACPklEQVR42u3dSXbCMAAE0br/pTvhAMnLYMB2Ve9Z9Ec28iABb84eYdw2+18enxfW/gLDV/kCDHtjzN3fL7JTR13+iQq7YNztDxTYlSOvfwDCbhJ7/78S7F6x9/+twW4ZPcD0ALP3/7HA7AKzCwRgF2gEBBBAAAEE0EwwgI6ABkAA3Q4IoCOgARBAh0DToAAC0AAsADfA5AALIAA1wAIIQA2wyQUCCKCTQAABBBBAU8GuB8V3hLopGkCPxno6GkAAnQTNAvVvIhxA/bsY7nZIAEKByQEmB9jUApsbYHMLTA6wANwAm1tgAbgFFoAbYJML2AEWgFwgADnAApALBCAHWAABBKAWCCCAANQCCyCAAMwCawQ0AgIIIIAAAggggAACMAIsgAACCCCAfgYD0PYPIIAAAghADdCzQT1Ar4n1qmgvzLdqpnVTrR1s+ayPgADaSKS9ZNQGbSjVlmLtqtfGkgG0u3Cb7GsNsAuAnAC5AMgFsAtgFwjALoBcAOQC2AVAToBdAOQE2AVATgByAnATcET0AJ8EyAGuOwywC4CcALsABwsgB7jcKAC5AMgJArALYBfALoBdALsAdgGQEwRgF0AuwAuiB0APcGYC7ALIBXhhArADEIAdAHv/APQA6AFORxCAXQC5AMgFsAtgF7AD8L4kYO9/DoEA5AIE4AYAuUAAdgE7AMgFApADIAcAtwByAOQA4BZADkAAbgDcAJwu9VfXfx3AY1X3OSP94l8HwOlj6HisAbfOt/8ATXliPgD0TRTLO11fEAAAAABJRU5ErkJggg==", "label": "glass jar with sprinkles"},
{"box_2d": [260, 699, 431, 810], "mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAACNklEQVR42u3a3UorQRQG0Xr/l95yQMSDCEZnku6u+rzSq9TK/CQm0NpLN+8/H7+/j//+PgeEXjRZ7i4Onx/ZPHenP9PLGsyCE6c/T2P2mTr+eoTZd+76CwzmjMnzfykwh81d/yjCjFtgDp69/ycIM26Bccze/63AiKYHmADk/QFMAAG4+78K6AG6C3QNDCCA+gPoJhhArwMDCCAAH8AE4AaYkQt0BnQGBBBAl8AAArAKTIdAd4EAAui9QAABBKDt72ORPhsNQA/QV6T0AH1Z3t7fEdBFsCOgIyCAAHozEEAAUoAJIIAAAggggAACCKBXwgEEEEDXwI6AAOoPIIAAPP0BBBCAuz+AAOT9Adj7AwgggAACCCAAb38AAQQQQADdBQPoDAggAGd/AAF0DQig/gACCCCAAALo3wG9Eg4ggAACCCCAAAIIoLcCAQQQQADdBAT9AQQQQAABBBBAAAEEEICzP4AA5P0BBCDvp353P/b+AOT55wMgB0AOgBsAtAD/ukALwCPTPvNHAfCXect3B+CiyfN3BcANAGYALp48fy8A7pg8fyMA1ADcN3s/9n7c9csDcP/s/ZjbFwYANwByAFALgBoA1AC8YvZ+5PnrACAHALcAcgBwCxCAWwA5AMgFkAsgBwC3AMgFkAvgBmCp2ftfIABuAnALgFuAVWfvfxIBa0+efzsBW8zefx8BuAXYavb+6w3YcnqAywjYefb+Cww4Y/b+3zpw3uT5jzlgmLv+K8fSj+4NG1SUWRDiKGwAAAAASUVORK5CYII=", "label": "glass jar with milk"},
{"box_2d": [272, 296, 376, 396], "mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAACFklEQVR42u3b0U6DQBQE0Pn/n75GX9TEkipLBebMk0mNZY53t5RA8opM5ufk44X317/99k3ysPZG8vnDZSFmaY49ynV/bHHtbY89h77Od86RvQvyyt2fLrF2ic1J82DPXLrHzAXy2yO9VfnD9pC5cbrb79xA7w8wBeluvwUwAGoBpirl9ScAqutPef3x77f81Qfg9Kewf/npb7r7d3/1ewQwAMqvfgCobu/yZ8fV73H5vxogALoBYgJMAAD3ABgAAK0CAWATNAEAANgC3AwLoBMgALoBAgAAAI8EAagVKH8m8OmnIgGU9weQ9v4A6gGiP4BugAAA0N0fAIB2gJTXBwAAAAAAAJr7AwDQDhATAMASAGAJFA9AAAAAAAAAAAAAAAAAAAAAAABt9QGkfQUAGGvABJgAEwAAAAAAAAAAcCLU2B8AAAAAAABwGgAAAACbIAAAAAAAAAAAAAAAAAAAAAAAAAAAFQABAAAAAAAAAAAAAAAAAAAAAAAAAKAMICbABAAAAMAuaAIAWAIAAAAAAAAAAAAAAAAA0CQAAAAAuyAAAAAA+BgAAMAm0NkfAIB2gAAAUA6Q9v4A2vvfQgAAgG6AAOgGSMoFUi4QAOUCKRdIN0BWpb1/2vtfkyDlAEm3QACUC6RcIIekvP5VBHJsyusDOLdAXpTy+qdEyH/kyztPqpqfZTJy6jR3P1Ij1011+T+KRF6eN+0Awyp4XkX9AAAAAElFTkSuQmCC", "label": "glass body of red jar"},
{"box_2d": [272, 296, 376, 396], "mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAACE0lEQVR42u3cwWqEQBAE0Pr/n54QiLksGlldo1OvIaeVYL3pHiaLJrmixlotH2V8X7VcnTlqHK7W3JcgnPjbx9mZ/wQ5cuvn+Y571NF1enL23SG2evT56X+TjLdud4rsr4E+MUKTVnX4fQBj7iqPvw0wGqo8/gbAAFALMKqqPD4A618df1h+4y8+AMefwvzlx9905+/+028NYAAo//bDDlC9/AAqADafu+le/g6AAOgGiA7QAQA8A6ABALQKBIBNUAcAMAEAPAwLoBMgALoBAgAAAK8EAagVKH8ncPdbkQDK8wNIe34A7fnnFIgOAGAEABQDBAAAAMUC0/yDLAAAAAC4Ir8O0AEAAHQDRAcAMAIAjEBxAwQAAAAAAAAAAAAAAAAAAAAA0BYfQNonAAAAAHZBHQAAAAAAAAAAAOAgCABAH0AAAABgEwQAAAAAAAAAOAgBAAAAAAAAAAAAAAAAAAAAADoAAgAAAAAAAAAAAAAAAAAAAAAAAKANIDpABwAAAMAuqAMAGAEAAAAAAAAAAABHQR0AAAAAAgAAAAAAwFlYB+gAAADmBwgAAOUAAQAAAIDm/HMItAMEAAAAzQIB0A2QlAukXCAAygXaAZJqgQRAs0DOrPb8ac//PIEAKBdIuUDSLZA0C+SDVR7/CQJJt0CqAXJRVYe/HUH+qcrj30EgN6if+6jLvcqx/HTmfw8t6mb1BbN1QbqSUk8QAAAAAElFTkSuQmCC", "label": "metal lid of red jar"},
{"box_2d": [267, 617, 368, 706], "mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAACIklEQVR42u3dDWrCUBAA4bn/pbe2FIQibcUkvmRm8AL7sc9E8Qfq1twe3wmHf9yXinj+nxpygCsjzJPpAS4HMS+lB7gAxWyYHuCUFLNjeoBTUMzx6QHWEpmFSuAtGrN0B7won3Nkn/++EZuDzLnTA7wOMteMf7+pOdcuAL3A33cTo+g3gzGlB3iEMMYSGD3AnWDaALuA/QhoARLoEATQhSCAAAIIIIAEAggggAAC6DoYQABqAL1AR6AjEEAAAagBCKAj0AIEEIB3fqNAAAEEEEAAAfSxea1A35xoAwLoCATQGQgggAACCCCAAAIIIIAAEggggQACCCCAAALoRsA7Py1AGxBAAAEEIAZQEfQjswEE0HNAAB2BNsArEEAAAQQQQAABeAkCCCCAANQCAQQQQADuPyEMwA1AAHKBAPT/xtsKBOAWIIAAAlALBGAXCCAAuUAAAcgFsAsEEIBcIIAA5AIB2AWwCwRgFwjALhCAXQC7QAAByAWwCwQgF4AE3AIgFwjALhCAXSAAuwB2AewCdgCQCwRgF8AuEIBdALtAAMgJ7PO3AdgFsAtgFwjALoBdwA4AcgHsAtgFsAtgF8AugF0Au4AdAOQCICewz7/DBqAHQA+Aff4zEWAXADkBuAnYuwTQA6xNwEHpAZYlwC7AsdnnX48A1AS8MT3ACgSskB7gjQwslh7gaAIWTT7+URScJvn422twlT5HeWbu60xeVfv0AauqMNlxozMQAAAAAElFTkSuQmCC", "label": "small metal bowl"},
{"box_2d": [147, 391, 362, 584], "mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAACVUlEQVR42u3aQU7DMAAEwP3/p40QB1oJKC04cbyz9yreqe0kihMREXmQ8ZHy+q0A4ya9zQsBxtcBsHHjz+I/AGxpcNduPE7JZL8wwVNDHK9l+X90NsCqBHdreLLAmgjPj238MctuZ0cBLIPw6qjG/2WNlX8iwEEMN1f5j5GMWTn/qeVkgFcNvv/1nOuP2cmZf8QSAg9HNPd6DyVGVd77dgt8NRdHO8EgUC9gFdgKUz8DSg3qnwnqnw3rH5AXek1aEKCBYLGX5fX67y6QdBMEQLfAYZ/PLl5/U4DZhyh2qr+hwCEHaTZqvxHAkUepdqq/BcDxp+k2an9xgLPOU+5U/7IA5x6p3an+JQFSDpBygADoBgiAboGYAeUAsQQAAAAAoBcgAAAAAAAAAAAABAAAqAQIAAAAAAAAAAAAAAAAAAAAUAcQAAAAAADQCxAAAAAAAOAuaAYAAGAPMAMAAABgF9QfAAAAAAAAAODrOAAAAAAAAADAgxAAAPYAAAAAAAAAAACALfsDAAAAAAAAAAAAAADAuxAAAAB8GQIAAAAAdwEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADAdwEAAADYBQFYAgC2AkgtQKamoOK6ADm1+ZkAWSjV5Y8GyIrpbn+YQBZOef3pAFk/1eXnAeQ6Ka8/AyDpBkg5QMoBkm6BAOgGSDlAUi4AAACAaoCkXABAuwAAAOUCaRcAQKBdIO0CaRdIuwCAlBMk5QRJOUF2ST3AqwRJN0E2Sz1A2vs/i5B0E2Tn1AP8RiFdae9/i5DW/rcQS43nDXxdMcXw6Pv5AAAAAElFTkSuQmCC", "label": "large metal pot"},
{"box_2d": [190, 252, 340, 381], "mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAACkklEQVR42u3dQU7DQBBE0br/pZtFJJCIMLE9jtP9f2/YJVWP8WQEiiZxtqdSv4ZSfGPQ5bsJHAhbL0yn3+UF9dsAHAk7EKAuACg8QPWpvy9rzQE4kLV2zECAqqkAdUH/6tS/iP13AlRNXgD/x63pAIXeAV6KPAygBNibeRbA/tCz9sAjoScBHElddIAQ+m/lnrMAjqYmACzcBJr2BwAczj1kDziRewLAqeADAIoNcDZ45/5LkvcFWJS87SOwLPieV6mW5Sc9AnV0Fr1mz/IrT0JN6y99Bt6MUKert/0YqKWz9H0a9v879PI95VMKXwtwAUI/gH0WtXhT+yyAzTf8PkbeO3etvI8ZAegAJQC8vwB4gBJAAHZ/AQSgA5QA8P4C4AFKAHj/JwEB6AICCMAGKAHgAgIIABcQQAAB0AAlAFxAAAEEYAsIIIAAAggABqhSQAAByAAlgAJsgBJAAAEEEEAAroAAAgjgLugKEMBHwCUggAACCCCAAAIIIIAAAggggAACKMABcAUIAO8vgAACCCCAn4KuAAGw/TOhw0+Tx8+fVtlq2w7g+aqOU3eYfNRXRrpe3tW5+P0CPe8sHFb+FoCeF7YOLf9ega43Vs9uf+MFZxSAsAESNkDgAIEDBA4QOEDCFggcIIELCCAAGyCBCwggAFxAAAHYABFAAAU8CAiggAACeBZ0CQiggAD+ZdQl4BIQQAAFkAD+j1QANwEBPAwKIACYQAAB4AICCAAXCF0gLgGXgAACoAUEoAsIQBcIXUAAukDoAgLQBQSgC4QuIABdIIETCEAXCF0gdAEB6AKhCwhAFwhdQAC6QOACCVxAgMAJBKALhC6QcUPvv08ggQskbILMHTzAKwYhDr3/A6FR1i/dflNrAPrT7QAAAABJRU5ErkJggg==", "label": "small metal container"},
{"box_2d": [203, 552, 278, 896], "mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAAC2klEQVR42u3dy24jMRBDUf7/TzMZDBAjmMTTL0ml4uUmq4aKJ+puwV5YIjfi7PavhNcPNXA4gX9Iev8oBL8LAOkACQx2NoGPJh6gq4KdTeDziQfoZeB0ATubwHcSD+B4gAYGdjaBH0k8wMYEdjMCLwP4JHA4QImNsBygws3gtf3XE6wHWGvgEgALCU4s7eFJOQRsSmCHE3halpWrAuB1/1n9Xd1r+/8+xNStvRRg3k44vLzXZOmTrUD/wQbeob8XvtleDyL3FDi8tt1S4PjiDhdwOIDdVeDgp5fhADUSL8AOYAeE7wAE2AEAABDdHwAAAAAAAAAAAAAAAGIBHA5gANLvAQAAAAAAAAAAgJMQALH9AYgHMAAAAAAAAPRHILY/zwC2QDyA0/v7YsGwHXD42pYAZ65ueA+cu9zNBE5f3gng2vVuQnD5cu+aWx06ADy2g9IFHL8FzBYAAAGeAumvQsUDxAtwD3APKB5A8QAIyAoHSNsFShdQuoDSBZQuoHQBpQv0++b8UQG2QDxAfwQBEE4AAAAAZAsoXQAAAAAAIFoAAAAAiAYQAAAAAAAAAAAQKwAAAAAAAAAAAOQKAAAAAAAAAAAAAAAQCiAAwgUAAAAAAKIFAAAAAAAAACAXQAAAAAAAAAQLAAAAAAAAsAXAoMV3Afi2vPIAxs2xB8DASbb4UaqRk2wM8MgwO/ws2dBh6gJMmqciwNSBCgJIE4dSOQBdzA7aI/tfHUuNBK5NVgxA97PzFtAjmbBk6f7n5lMhAWm+gAoB6NnsBqDHU/GlO7P/oSEnv3In9z8wYxUADcvIdXfo/37MyUeuNf3fzakaAhqfKsfuVf1/Hnb+sXsxwGvgP3/mH7pL9P+ae8Gpu1D9FWfuvv1vCaQDqEviARQPoPT+8z6Q7kQghQtI2QRSuEA6gLomHmD0NxJdCKRwAbVPPMB/CJSR9P6/Gygq6f3/NRAAhBBCCCFV8gGqVR5Ju5mW4gAAAABJRU5ErkJggg==", "label": "wooden rolling pin"},
{"box_2d": [57, 233, 220, 332], "mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAAB5UlEQVR42u3bO27DQBBEwb3/pdsGHDmTYEKW+GpiJlvoWZL7Oed37eyh+n7whrWnqz36exnsDxUf/h0ItrTALqn6+D+XYBdWffwAlgdYHmB5gAGoAywPMAB1gAGoAwxAHWAA6gCfRABAEwAg4FtABETATzEBS2NWRwnYI7FNJgIiQMBxAQDOzABIn5wjcOrjPyNQByAgAnkAAgcAAQIACMQBCBwABOoALtYC0AWmAQKaQAREQAREgAAAAgQIWCIlAICAQwMiIAIECDg+RsA5YgAEXCeQARkg4F6Vy4UEggIjAKBOACCfAQAE6gIACAAgEBcQAREgUBcQAQKagACAuMDqAhMBERABESBAAAABAAQI+BwCQMDSiAiIgAgQAGAWIEAAgO1iEfAqFAECAAh4FYqACBDQBCJAwG0KESCgCQgAIFADIKAJCAAgQMA/AQFNIALWSO2Zi4D7FAQIECgLjECdwCygBwjUBUbAi4BAXGAE6gKTgTzBCNQJRmBxg/wsYBrcCyoP8NYKe2HlBURgeYC3JCCwOsEWJwCwOMHqAqsLbAjqAnWAASDQJtgItAU2AmsjAKgLjAAAAgC8BxDEvwjzACcPQMCPsdWhH4L6btn/AIT3zE/5HN35mGqN9kGQo5S6sr4A3wNIY6KbybAAAAAASUVORK5CYII=", "label": "wooden utensils in small container"},
{"box_2d": [0, 369, 167, 586], "mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAACu0lEQVR42u3cS5IjIRAE0bj/pbNtDjDdRX0gwT23WkA8BSxKpkpenqqQpv4z6PAMgfpr6PlPF6ADVLEFqhRgAxS9AnSAsgFWQAAByAIlABugBDhd4I/dnw1wIcDBANcSnApwOYEAJwKMZDgRYChEnScwFuI8gMEURIA6GqAEGMtRpwnQAYZzQAHqLkCdBlCHARS9AeNBBBCgJ8DN5YaT9LwE76/5PUAt6TEK4NGqo0mqH8CzhYeTtAN4uPRolHYNeLr6BIAayXF5t7m+nVcF3gaoKZOlW8h6gF+3sdS/Zk4/gJo+qU4AtWjwAM9+6tnp7H3yoP81gFo8eICJAgK0zP/gOe8pANME8ADVNP/SCgjQS2A6QBWpAunfgNgAOkAEoANEAAEEEAANECJAVgF0LABSQAABJu1qjzvgw33tUQAcwMTnNAIIsAlAKhyATASobQC+2lznHwYX5G8NMEOgBGicXwA8QBAAEWCugA2gAxT8BOwFEAHoADkeIAIIMFug6T8FBJgmIMBWV8DhDXj/1RKbNUCAFQC11xXw/kY3eRwoAAHgi3csCXAkwAeHoAoM0GUiAFsgAgiABogAcAEB4AARAC4gAF0gcIEIIAAa4Jt3VwsgwOEAEUAAAdgAoeePDRDAMyCAAAIIIAAXIAJ4BgQQwEvACggggJeAFbACAggABYgAdIAIQAeIAHSACCAAHCD0/AIoYAPoABFAAAHYABGADhAB4Pm3FhCALkAHiABwAQEEgAsIAAdI4AICCAAXEAAOkMAFBKAL0AESuIAAdAE6QASACwggAFsgAgjAFqADRICvAMLOv0sH6AARQAC2gABwgEQBAdACEQAuIABdgA6QwAVCFwhdgA6QaUPP31MgcwcPEDxAL4MsG3r+HgjpMPD4ywzScO5GGfrs3zLpO7e/v/5f7zs4K9b9AZb2XPn0LsNsAAAAAElFTkSuQmCC", "label": "wooden utensils in large pot"},
{"box_2d": [508, 798, 691, 960], "mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAACmElEQVR42u3d2U7bAAAF0fv/Pz1VH1rRVoWwxcvMfUOA5Dmxwhbs7eLj9yYcf02ebxOA+wu8lgQ3F3gjCu4s8HYVdwZ4oAtuLPBAGNcA4Cvz/yi7CsD7D4uH0rgEwGNH9p6yXx/HVQH+PbiX7+GRPa502qeyFwfIB3aZ/n32cfzv594B4Fsnz6fHv34zQP0B1C8GEH/tP76fE8xdfyAAqAE40ez9RxBwusnzn2vAeWfvfwYBZ5+9/5sJwC0AbgFwC4BcALkAcgDcAKAGADcAqAXADYAbQJ5/2f6vEgC3AKgFuPz0ANj7P0UAbgECuI0A7v6PAXCv6QGw9wdg73+vgB0A3ALIAZADgFoA3ADgFiAAtQAEoAZADgByATsAcgCQC9gBkAOAWwA5AAG4AcAtQAABBBCAWYBOgU6BABLoaaBzIIAE1AIBJBCAXCCAANwABOAWsPfbAXAuAHn/4zf4uTsAAQSgBiAANwABBBBAAAEEoAUAuYAeYAEEEEAAfSMQgLi/X4kFoAfoL0MLwA4we//s/b1EplfJ9VrhAPqvmf6HePb+APQXkwlgAcj7u6RYF9Wz9+svK6kQWABdXNkM0AXmA+gWC91kI4BuMxOAEqBbTXWvsQAC6H6DXoAF4BZYAAF04+UAAvAKLAA3wCYXCEAOsLkFJgdYAG6ATS5gB1gAboEFEIBaYHMLLIAA1ACbXMAOsMkFApADbHIBO8ACcAtMDrDJBSYX2NwCm1tgkwtsboLJATa3wOYW2NwCm1zADrC5BTa5wOQCm1zADrC5BTa3wOYW2FNn7z8dwQ6YvT+AnwZz95/kRNjBk+cfTbCTzF1/kMFON3n+UwV23qnjvx9hF5q5/asNdu19oPfyza/94PTy7bVj9wPduj0yKMKyJQAAAABJRU5ErkJggg==", "label": "metal lid of sprinkle jar"},
{"box_2d": [260, 699, 431, 810], "mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAACNElEQVR42u3a3UrDQBgG4bn/m/78wQMRFCtJ3d2ZFwRP52mSrq3Q2lKbmfef19+OC7ti69ddGPsjxBz0qm54ScxSM7c/WWI2mjr+eoPZder4axDmiMnz/2wwR81d/+mY7s1/5DqYg2fv/xXBHD97/88EI5m9/1uBGTfBjFwggADkAgEEIBcIoINQAD0DA+ggGEC3QAAdgwIQAkwAAQRgBpiRC3QBBBBAAD0CAwjAKhBA7wIBBBBAAH0eEoC3P4C+GOvL4QDsAP2PVADy/q6AAAIIIIBOgv05GEAAUoAJIIAAAggggAACCKCTcAAB9AgIoFsggPoDCCAAT38AAQTg7g8gAHl/APb+AAIIIIAAAgjA2x9AAAEE0JtAAAH0CAggAGd/AAEE0DMwgAACCCCAAAIIoINgAAEEEEAAAQQQQAfBALoDAggggAAE/QEEEEAAAQQQQAABBBCAs58ugC6AAAIIwNuPvT8AeX4AyPPPByAANwBuANAC8Lv6UwF4YPL8UwA+Ot56eHSql/ssAK6ZPH9XAHADoAbg2snzdwMANQCoAcANgBqA+2bvR56/AQB3b/GPtvAC8KTJ8xcFADcAcgBQC4AaANQAoAbgn2bvx96PPH8NAQJwC4BbADkAyAWQCyAHALcAuAVwA7DU9ADY+58uAG4Blpy9H3v/kwhYe/b+2wXYYPb+OwnALcBGk+ffQgBuAracvf86AjaeOv4KBE6ZHuCvChw3e/8DDGj2NVyUvt9eANXI0RxL6dvMAAAAAElFTkSuQmCC", "label": "metal lid of milk jar"},
{"box_2d": [706, 654, 900, 1000], "mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAACFUlEQVR42u3dQW7cQAxFwX//SzNAVoaTjG3Ym+gV19p0gezRSE1qE3/GvY/26mMG9++ILz9BcB9GfPnPFrhPRn39DyW4L0Ue4HkEd22Bu7bAXVzg4gJ3cYGLC9zFBS4ucADiAgAuTlAHuB+IPMDlAS4PcHkAAv8vAYC6AIC6AIC6wNUFANQFri5wdQEABADEBdSAFDj3AgQIACBAAIDbIQKKQApIASkgBQh4S2AXkAE2AS9L1YAzI86MSAEZIAMAEADgADEBAPoois1EBDSU6SkkQIDAzUYIgICNEAEBAgTyAgbuGLr0QwYEJgsQbHUDAlveAMGWNxiDLY+AYKsbLE+wPMHyBMsTEBgCAgQQECBAAAEB/w0kAQFlQICAR2WelxJQBgQIICDg14AAAQIE/BhIAgLOVBEg8EgCAgQIECBAwEl7AgQIaDoioPmOAAECBBDoxCZAgICpFAQI2AgIKAMCBIzqIkCAAAFzG03vJEDg25OMCagCZaAK5IAkIECAAAECPnFCoCIwAosLTA68QogItKtgr78c/fxboo8uSQi8vO75VfD7ogcDfPt+oCEw/w09KfW+wAECZyo1WwHQfG0MxaMBjKczsdcIfx8zWR5geYDlAZYHWB5geYDlAb5AsLrA6gKrC6wusLrA6gJbnABAXWB1ga1NsLUFtrbAOlFf/18FtrjA4gLrRR7gLcGqkQd4H78AhWCAr9qFpz4AAAAASUVORK5CYII=", "label": "wooden spoon with frosting"},
{"box_2d": [680, 756, 812, 1000], "mask": "data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAQAAAAEACAAAAAB5Gfe6AAACSElEQVR42u3dAWrjQBAF0X//S1fIEkgwhN0Njiypqg5g6EcjizGMt6qqqqqqqqqqqqqqqqp6CLDPbyYANwCf2edXCvCQfX6bAHIAUAuAGwDcAsgBwCzAX7LPj33+mwsgBwC3AAG4BQjADQBuAZAL2AGQA4BbgP/PPj/2+W8kwE+zz38TAXALgFsA3ALgFuAp2QHaAD0AAdgB0ANgnz8APQD2+QMgAPn8AegBCCAAOQD2+S8ngF0AuwB2gQDsAgEEIBcIIIAAegq2AmqCAAKQC9AKBBCAHODPNWnmp+D7EsgBRqfDCQQQQA+BADoaCiCBTgfFAiSQQGcjHZB2OuQmoCUAO0ECegHsArQCdgISsAuAnUAvQAJ2AewCJKAn0AuAnQA7AQnICbADHCmAHuCcBHoBkBNgFwA5AQnYBUBOAHIC7AIgJwgAOQG4CUAugF2AM5QAeoAXEmAXADkBRGAXwL4EnK0ESOBoA70A2AnAboCegPMSIAc4ag3QExBBBhCBnIALFMEvG3x+vJWAiyxEL8nHGZBBBL0nn/8LMoHekSI4iOARYx9Xd2tfkfqRtV/aT/mE2MvTM7QJLcE5EPQPg01OsMkJdpb0AK8CmV3gxABd5NN1Vt3pdgDCrpEeoL//+DWGTU4QQABygl0yPcATDXbh9ABPQdjkBJucYJMTbHKCTU6wO6UHmB7gBwqbnGB2gd01PcD0AP9IsMkJNjnBJifY5ASzC2xygmnSA3xDsMkJNjnBJieYXWDK9ABfCGYXmDg9wPQA7wRvEjyakklXaIEAAAAASUVORK5CYII=", "label": "wooden spoon"}
]
```
The model predicts a JSON list, where each item represents a segmentation mask. Each item has a bounding box ("box_2d") in the format [y0, x0, y1, x1] with normalized coordinates between 0 and 1000, a label ("label") that identifies the object, and lastly the segmentation mask inside the bounding box, as base64 encoded png.
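As a dependency-free sketch of the first decoding step, the code below strips the data-URI prefix from a "mask" value, base64-decodes it, and reads the image size from the PNG header. The function names are hypothetical. Actually reading the pixel values would need an image library, which this sketch deliberately leaves out.

```python
import base64
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
DATA_URI_PREFIX = "data:image/png;base64,"


def decode_mask_png(mask_field):
    """Strip the data-URI prefix and return the raw PNG bytes."""
    if not mask_field.startswith(DATA_URI_PREFIX):
        raise ValueError("mask is not a base64 PNG data URI")
    png_bytes = base64.b64decode(mask_field[len(DATA_URI_PREFIX):])
    if not png_bytes.startswith(PNG_SIGNATURE):
        raise ValueError("decoded payload is not a PNG")
    return png_bytes


def png_size(png_bytes):
    """Read (width, height) from the IHDR chunk, which is always the first chunk."""
    # Layout: 8-byte signature, 4-byte chunk length, 4-byte "IHDR" tag,
    # then big-endian 32-bit width and height.
    width, height = struct.unpack(">II", png_bytes[16:24])
    return width, height
```

Applied to the masks listed above, the header reports a 256 by 256 image, which is why the mask has to be resized to the bounding box before use.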
To use the mask, first base64-decode it, then load the result as a PNG. This gives you a probability map with values between 0 and 255. Resize the mask to match the bounding box dimensions, then apply your confidence threshold, e.g. binarizing at 127 for the midpoint. Finally, pad the mask into an array of the size of the full image.
All these steps are done by the parse_segmentation_masks function provided earlier.
Then use the plot_segmentation_masks function to visualize the decoded masks by overlaying them on the image.
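The resize, threshold, and pad steps can be sketched in plain Python as follows. This is not the notebook's parse_segmentation_masks function: `place_mask` is a hypothetical helper that works on nested lists and uses nearest-neighbour resizing for simplicity, whereas an image library would typically interpolate.

```python
def place_mask(prob_map, box_2d, image_width, image_height, threshold=127):
    """Resize a mask to its box, binarize it, and pad it to the full image.

    prob_map is a 2D list of 0-255 values (the decoded PNG pixels).
    box_2d is [y0, x0, y1, x1], normalized to 0-1000.
    Returns a 2D list of 0/1 values with image_height rows and image_width columns.
    """
    # Convert the normalized box to pixel coordinates, clamped to the image.
    y0 = max(0, box_2d[0] * image_height // 1000)
    x0 = max(0, box_2d[1] * image_width // 1000)
    y1 = min(image_height, box_2d[2] * image_height // 1000)
    x1 = min(image_width, box_2d[3] * image_width // 1000)
    box_h, box_w = y1 - y0, x1 - x0

    # Start from an empty full-size mask; this is the "padding" step.
    full = [[0] * image_width for _ in range(image_height)]
    if box_h <= 0 or box_w <= 0:
        return full

    src_h, src_w = len(prob_map), len(prob_map[0])
    for row in range(box_h):
        # Nearest-neighbour resize: pick the source row and column
        # that correspond to this pixel of the box.
        src_row = prob_map[row * src_h // box_h]
        for col in range(box_w):
            if src_row[col * src_w // box_w] > threshold:
                full[y0 + row][x0 + col] = 1
    return full
```

Raising or lowering `threshold` trades tighter masks against more complete ones, so it is worth checking the overlay visually before settling on a value.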
Preliminary capabilities: pointing and 3D boxes
Pointing and 3D bounding boxes are experimental model capabilities. Check this other notebook to get a sneak peek at those upcoming capabilities.
What's next?
For a more end-to-end example, the code from the AI Studio Spatial understanding example is available on GitHub.
You'll also find multiple other examples of Gemini capabilities in the quickstart folder, in particular the Live API and the video understanding ones.
Related to image recognition and reasoning, the Market a jet backpack and Guess the shape examples are worth checking out to continue your Gemini API discovery (note: these examples still use the old SDK). And of course, there is the pointing and 3D boxes example referenced earlier.
