Show HN: Infinity – Realistic AI characters that can speak
468 by lcolucci | 292 comments on Hacker News.
Hey HN, this is Lina, Andrew, and Sidney from Infinity AI ( https://infinity.ai/ ). We've trained our own foundation video model focused on people. As far as we know, this is the first time someone has trained a video diffusion transformer that’s driven by audio input. This is cool because it allows for expressive, realistic-looking characters that actually speak. Here’s a blog with a bunch of examples: https://ift.tt/TSZcdkI

If you want to try it out, you can either (1) go to https://ift.tt/Si5wh9s , or (2) post a comment in this thread describing a character and we’ll generate a video for you and reply with a link. For example:

“Mona Lisa saying ‘what the heck are you smiling at?’”: https://bit.ly/3z8l1TM

“A 3D pixar-style gnome with a pointy red hat reciting the Declaration of Independence”: https://bit.ly/3XzpTdS

“Elon Musk singing Fly Me To The Moon by Sinatra”: https://bit.ly/47jyC7C

Our tool at Infinity allows creators to type out a script with what they want their characters to say (and eventually, what they want their characters to do) and get a video out. We’ve trained for about 11 GPU years (~$500k) so far and our model recently started getting good results, so we wanted to share it here. We are still actively training.

We had trouble creating videos of good characters with existing AI tools. Generative AI video models (like Runway and Luma) don’t allow characters to speak. And talking avatar companies (like HeyGen and Synthesia) just do lip syncing on top of the previously recorded videos. This means you often get facial expressions and gestures that don’t make sense with the audio, resulting in the “uncanny” look you can’t quite put your finger on. See blog.

When we started Infinity, our V1 model took the lip syncing approach. In addition to mismatched gestures, this method had many limitations, including a finite library of actors (we had to fine-tune a model for each one with existing video footage) and an inability to animate imaginary characters.

To address these limitations in V2, we decided to train an end-to-end video diffusion transformer model that takes in a single image, audio, and other conditioning signals and outputs video. We believe this end-to-end approach is the best way to capture the full complexity and nuances of human motion and emotion. One drawback of our approach is that the model is slow despite using rectified flow (2-4x speed up) and a 3D VAE embedding layer (2-5x speed up).

Here are a few things the model does surprisingly well on: (1) it can handle multiple languages, (2) it has learned some physics (e.g. it generates earrings that dangle properly and infers a matching pair on the other ear), (3) it can animate diverse types of images (paintings, sculptures, etc) despite not being trained on those, and (4) it can handle singing. See blog.

Here are some failure modes of the model: (1) it cannot handle animals (only humanoid images), (2) it often inserts hands into the frame (very annoying and distracting), (3) it’s not robust on cartoons, and (4) it can distort people’s identities (noticeable on well-known figures). See blog.

Try the model here: https://ift.tt/Si5wh9s

We’d love to hear what you think!
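
For a rough sense of scale, the figures quoted in the post (about 11 GPU years, ~$500k, and speed-ups of 2-4x and 2-5x) can be turned into a back-of-envelope estimate. This is only a sketch, not anything the authors published: it assumes 365-day years, a flat price per GPU-hour, and that the two speed-ups compose multiplicatively, none of which the post states. The function names are made up for illustration.

```python
# Back-of-envelope arithmetic on the figures quoted in the post.
# Assumptions (NOT from the post): 365-day years, a flat hourly price,
# and that the two quoted speed-ups multiply independently.

HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year


def gpu_hours(gpu_years: float) -> float:
    """Convert GPU-years to GPU-hours."""
    return gpu_years * HOURS_PER_YEAR


def implied_hourly_rate(total_cost_usd: float, gpu_years: float) -> float:
    """Average cost per GPU-hour implied by a total spend."""
    return total_cost_usd / gpu_hours(gpu_years)


def combined_speedup(*ranges: tuple[float, float]) -> tuple[float, float]:
    """Multiply (low, high) speed-up ranges together."""
    low, high = 1.0, 1.0
    for range_low, range_high in ranges:
        low *= range_low
        high *= range_high
    return low, high


if __name__ == "__main__":
    hours = gpu_hours(11)
    rate = implied_hourly_rate(500_000, 11)
    low, high = combined_speedup((2, 4), (2, 5))
    print(f"GPU-hours: {hours:,.0f}")
    print(f"Implied cost per GPU-hour: ${rate:.2f}")
    print(f"Combined speed-up range: {low:.0f}x to {high:.0f}x")
```

Under those assumptions, this works out to roughly 96,000 GPU-hours at a little over $5 per GPU-hour, and a combined speed-up somewhere between 4x and 20x. The real numbers may differ if the speed-ups overlap or the hardware pricing varied over the training run.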
New best story on Hacker News: Wikipedia’s nonprofit status questioned by D.C. U.S. attorney
840 by coloneltcb | 777 comments on Hacker News.
-
Inhuman abuse of students – file a case against the headmaster and the superintendent and take immediate action. Adivasi Tiger Sena's Chandrapur district ...
-
Provide financial aid of 5 lakh to the heirs of the woman killed in the accident at Yensa: Tulsi Alam. Bawane Layout and Colliery Ward in Warora city...
-
Gondi religious awareness gathering at Pawna (Rai). Gondi traditional dance and Gondi recording dance competition organized. Bhadravati (dt. 3...