{
"type": "SET",
"op_list": [
{
"type": "SET_VALUE",
"ref": "/apps/knowledge/explorations/0x00ADEc28B6a845a085e03591bE7550dd68673C1C/ai|transformers|vision/-OloelhuJM8TrS3LkPTX",
"value": {
"topic_path": "ai/transformers/vision",
"title": "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)",
"content": "# An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT) (2020)\n\n## Authors\nDosovitskiy, Beyer, Kolesnikov, Weissenborn, Zhai, Unterthiner, Dehghani, Minderer, Heigold, Gelly, Uszkoreit, Houlsby\n\n## Paper\nhttps://arxiv.org/abs/2010.11929\n\n## Code\nhttps://github.com/google-research/vision_transformer\n\n## Key Concepts\n- Image patch tokenization (16x16)\n- Class token for classification\n- Large-scale pre-training on JFT-300M\n\n## Builds On\n- Attention Is All You Need\n\n## Influenced\n- Learning Transferable Visual Models From Natural Language Supervision (CLIP)\n\n## Summary\nApplied a pure transformer directly to sequences of image patches for image classification, showing that with sufficient pre-training data, transformers can match or exceed state-of-the-art CNNs.",
"summary": "Applied a pure transformer directly to sequences of image patches for image classification, showing that with sufficient pre-training data, transformers can match or exceed state-of-the-art CNNs.",
"depth": 2,
"tags": "vision-transformer,image-patches,classification,transfer-learning,builds-on:transformer",
"price": null,
"gateway_url": null,
"content_hash": null,
"created_at": 1771483896698,
"updated_at": 1771483896698
}
},
{
"type": "SET_VALUE",
"ref": "/apps/knowledge/index/by_topic/ai|transformers|vision/explorers/0x00ADEc28B6a845a085e03591bE7550dd68673C1C",
"value": 1
},
{
"type": "SET_VALUE",
"ref": "/apps/knowledge/graph/nodes/0x00ADEc28B6a845a085e03591bE7550dd68673C1C_ai|transformers|vision_-OloelhuJM8TrS3LkPTX",
"value": {
"address": "0x00ADEc28B6a845a085e03591bE7550dd68673C1C",
"topic_path": "ai/transformers/vision",
"entry_id": "-OloelhuJM8TrS3LkPTX",
"title": "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT)",
"depth": 2,
"created_at": 1771483896698
}
}
]
}