OpenAI, the artificial intelligence startup co-founded by Elon Musk and the company behind the popular text-to-image generator DALL-E, announced on Tuesday the release of POINT-E, its newest image-making machine, which can produce 3D point clouds directly from text prompts. While existing systems like Google’s DreamFusion typically require multiple hours, and multiple GPUs, to generate their images, Point-E needs only a single GPU and a minute or two.
3D modeling is used across a variety of industries and applications. The CGI effects of modern blockbuster movies, video games, virtual and augmented reality, NASA’s lunar crater mapping missions, Google’s heritage site preservation projects and Meta’s vision for the Metaverse all depend on 3D modeling capabilities. However, creating photorealistic 3D images is still a time- and resource-intensive process, despite NVIDIA’s work to automate object generation and Epic Games’ RealityCapture mobile app, which allows anyone with an iOS phone to scan real-world objects as 3D images.
Text-to-image systems such as OpenAI’s DALL-E 2, Craiyon, DeepAI, Prisma Labs’ Lensa and Stability AI’s Stable Diffusion have rapidly gained popularity, notoriety and infamy in recent years. Text-to-3D is an offshoot of this research. Point-E, unlike similar systems, “leverages a large corpus of (text, image) pairs, allowing it to follow diverse and complex prompts, while our image-to-3D model is trained on a smaller dataset of (image, 3D) pairs,” the OpenAI research team led by Alex Nichol wrote in Point-E: A System for Generating 3D Point Clouds from Complex Prompts, published last week. “To produce a 3D object from a text prompt, we first sample an image using the text-to-image model and then sample a 3D object conditioned on the sampled image. Both of these steps can be performed in seconds, and do not require costly optimization procedures.”
If you were to enter a text prompt, for example “A cat eating a burrito”, Point-E will first generate a synthetic-view 3D rendering of said burrito-eating cat. It will then run that generated image through a series of diffusion models to create a 3D RGB point cloud of the initial image, first producing a coarse 1,024-point cloud and then a finer 4,096-point one. “In practice, we assume that the image contains the relevant information from the text and do not explicitly condition the point clouds on the text,” the research team points out.
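To make the order of those stages concrete, here is a minimal Python sketch of the data flow only. It is not OpenAI's code and does not use the Point-E library: every function name below is a hypothetical stand-in, and the "models" are seeded random-number generators rather than diffusion models. The only details taken from the description above are the stage order, the 1,024 and 4,096 point counts, the RGB color attached to each point, and the fact that the point-cloud stages see the image but not the text. The choice to keep the coarse points inside the finer cloud is an assumption made for this sketch.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# One point = position (x, y, z) plus color (r, g, b).
Point = Tuple[float, float, float, float, float, float]


@dataclass
class SyntheticView:
    """Stand-in for the single rendered image generated from the prompt."""
    prompt: str
    seed: int


def text_to_image(prompt: str) -> SyntheticView:
    # Hypothetical placeholder for the text-to-image stage.
    # The seed just makes the toy pipeline deterministic per prompt.
    return SyntheticView(prompt=prompt, seed=sum(prompt.encode("utf-8")))


def image_to_coarse_cloud(view: SyntheticView, n_points: int = 1024) -> List[Point]:
    # Hypothetical placeholder for the image-conditioned stage.
    # It reads only the image stand-in (view.seed), never the prompt text.
    rng = random.Random(view.seed)
    return [tuple(rng.random() for _ in range(6)) for _ in range(n_points)]


def upsample_cloud(view: SyntheticView, coarse: List[Point],
                   n_points: int = 4096) -> List[Point]:
    # Hypothetical placeholder upsampler: keeps the coarse points and adds
    # slightly jittered copies until the cloud holds n_points points.
    rng = random.Random(view.seed + 1)
    fine = list(coarse)
    while len(fine) < n_points:
        x, y, z, r, g, b = rng.choice(coarse)
        fine.append((x + rng.gauss(0, 0.01),
                     y + rng.gauss(0, 0.01),
                     z + rng.gauss(0, 0.01),
                     r, g, b))
    return fine


def text_to_point_cloud(prompt: str) -> List[Point]:
    view = text_to_image(prompt)            # stage 1: text -> image
    coarse = image_to_coarse_cloud(view)    # stage 2: image -> 1,024 points
    return upsample_cloud(view, coarse)     # stage 3: 1,024 -> 4,096 points


cloud = text_to_point_cloud("A cat eating a burrito")
print(len(cloud), len(cloud[0]))  # prints "4096 6"
```

The real system replaces each placeholder with a trained diffusion model; the sketch only shows how the image acts as the bridge between the text prompt and the point cloud.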
These diffusion models were each trained on “millions” of 3D models, all converted into a standardized format. “Although our method performs worse on this evaluation than state-of-the-art techniques,” the team concedes, “it produces samples in a small fraction of the time.” If you want to try it out for yourself, OpenAI has released the project’s open-source code on GitHub.