**Views: 67,755 | Rating: 4.97 | View Time: 15:05 minutes | Likes: 2,869 | Dislikes: 20**

In this episode, we dive into Variational Autoencoders, a class of neural networks that can learn to compress data completely unsupervised!

VAEs are a very hot topic right now in unsupervised modelling of latent variables and offer a unique way to tackle the curse of dimensionality.

This video starts with a quick intro to normal autoencoders and then goes into VAEs and disentangled beta-VAEs.

I also touch on related topics like learning causal latent representations, image segmentation, and the reparameterization trick!

Get ready for a pretty technical episode!

Paper references:

– Disentangled VAEs (DeepMind 2016):

– Applying disentangled VAEs to RL: DARLA (DeepMind 2017):

– Original VAE paper (2013):

If you enjoy my videos, all support is super welcome!

The Variational Autoencoders section starts at 5:40.

In the code shown, can someone explain why the standard deviation is `1e-6 + softplus(parameters)`?

Why softplus?
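For anyone puzzling over the same question: softplus maps any real number to a strictly positive one, so the network can output an unconstrained value and still yield a valid standard deviation, and the `1e-6` floor keeps it bounded away from zero. A minimal NumPy sketch (variable names are illustrative, not from the video's code):

```python
import numpy as np

def softplus(x):
    # softplus(x) = log(1 + exp(x)): smooth, monotonic, and always > 0
    return np.log1p(np.exp(x))

raw = np.array([-5.0, 0.0, 3.0])   # unconstrained encoder outputs
std = 1e-6 + softplus(raw)         # strictly positive standard deviations
```

Unlike `exp`, softplus grows only linearly for large inputs, which tends to be more numerically stable.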

Excellent video!! Probably the best VAE video I've seen. Thanks a lot!

Informative. Thank you

Sublime text editor is so aesthetic. Anyway, yes, great point, the input dimensionality needs to be reduced. Even the original Atari DeepMind breakthrough relied on a smaller (handcrafted) representation of the pixel data. With the disentangled variational autoencoder it may be feasible or even an improvement to deal with the full input.

Awesome explanation! Thanks!

The loss shown is not a loss but a negated one: according to the cited paper we are maximising the expected likelihood, not minimising it. Be careful!
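To make the sign convention concrete: implementations typically minimise the negative ELBO, which is equivalent to maximising the likelihood bound from the paper. A hedged NumPy sketch, with squared error standing in for the reconstruction log-likelihood up to constants:

```python
import numpy as np

def negative_elbo(x, x_recon, mu, log_var):
    # Reconstruction term (Gaussian log-likelihood up to constants)
    recon = np.sum((x - x_recon) ** 2)
    # Closed-form KL( N(mu, exp(log_var)) || N(0, 1) )
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    # Minimising recon + kl maximises the ELBO
    return recon + kl

x = np.array([0.5, -0.2])
# Perfect reconstruction and a posterior equal to the prior give zero loss
loss = negative_elbo(x, x, np.zeros(2), np.zeros(2))
```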

6:35 KL Divergence purpose is explained in a graphical form here: https://youtu.be/3-UDwk1U77s?t=321

Amazing explanation to a complicated topic! Thank you so much!!!!

Great content. Congrats, you've won a new subscriber!

Thank you!

In the original VAE paper they use a covariance matrix instead of a variance vector.

The same goes for the beta-VAE paper by Irina Higgins.

I don't know if the disentangling with beta > 1 even works with just a variance vector, so I think that's an important detail.

Nice and clear, good job!

Very useful. Great content. Continue what you're doing. Great job.

I discovered your channel today and I'm hooked! Excellent work. Thank you so much for your hard work

Fantastic explanation! Thanks a lot!

Brilliant video, you explain it way better than most of the literature on the subject! It's guys like you who make the world (and me) smarter! Thanks!

So a mean and a std are taken over all the outputs [the outputs resulting from each of the training samples] for each output neuron of the encoder?

So if there are 10 encoder outputs and 100 training samples, is a mean and a std calculated for each of the 10 outputs using the 100 values that each output produced across the 100 samples?
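For what it's worth, in a standard VAE the mean and std are not statistics computed across the training set: the encoder outputs its own mean vector and std vector for each individual input. A toy sketch with illustrative shapes, where a random linear map stands in for the real encoder network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "encoder": a linear layer mapping 4 input features to
# 10 means and 10 log-variances -- one pair PER INPUT SAMPLE.
W_mu = rng.normal(size=(4, 10))
W_logvar = rng.normal(size=(4, 10))

x = rng.normal(size=(100, 4))   # 100 training samples
mu = x @ W_mu                   # shape (100, 10): a mean vector per sample
log_var = x @ W_logvar          # shape (100, 10): a log-variance per sample
# Nothing is averaged across the batch of 100 samples.
```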

I would like to see more videos from you. Clear explanation of concept and gentle presentation of math. Great job!

Really liked it. Firstly giving an intuition of the concept, its application and then to the objective function while explaining its individual terms, in a way everyone can understand, it was simply professional and elegant. Nice work and thanks!

I'm clear on the purpose of autoencoders, viz. compressing a high-dimensional space into a lower-dimensional latent space. I'm still unclear on why VAEs: what's the advantage of learning the mean and variance of a distribution?

Hi, thanks for this nice tutorial. In fact, I have a question on VAEs; I hope you have time to answer it. When we use the decoder network after training is finished, we sample from N(0, 1), not from N(m, s), where m and s came from the encoder during training. So it seems we feed the decoder a different distribution than at training time. Can you explain this more clearly? I read some articles on this but I cannot get it. My guess is as follows: during training we use the KL cost to push z toward N(0, 1), so when the model is trained well enough, the mean and standard deviation should be close to 0 and 1. Under that assumption, we can use z ~ N(0, 1), with m = 0, s = 1. Did I understand roughly correctly? Thanks for reading, and more thanks if you can teach me about this.
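A sketch of the two regimes this question contrasts (names are illustrative): during training, z is drawn from the per-sample posterior via the reparameterization trick; at generation time there is no input, so z is drawn from the prior N(0, 1), which the KL term has pushed the posteriors toward:

```python
import numpy as np

rng = np.random.default_rng(42)

def reparameterize(mu, std):
    # Training time: z = mu + std * eps, with eps ~ N(0, 1);
    # this keeps the sampling step differentiable w.r.t. mu and std
    eps = rng.standard_normal(mu.shape)
    return mu + std * eps

# Training: sample from the posterior N(mu, std) produced by the encoder
z_train = reparameterize(np.array([0.3, -0.1]), np.array([0.9, 1.1]))

# Generation: no encoder output exists, so sample from the prior N(0, 1)
z_generate = rng.standard_normal(2)
```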

Why do we need a bottleneck in the case of a denoising autoencoder or neural inpainting? There we don't use the compressed representation for anything, in comparison to dimensionality reduction or visualization, where we have a clear use for it.

I hope I don't offend anyone by asking why on earth this guy magnified his head?

Great video. Thank you so much for the explanation. I am trying to implement disentangled variational autoencoders to regenerate grid layouts, so I started by implementing a plain variational autoencoder first, but from a programming perspective I couldn't figure out what the beta we're talking about is. Is it the one in the sampling function? Thanks in advance.
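In case it helps: in the beta-VAE objective, beta multiplies only the KL term of the loss; it does not appear in the sampling (reparameterization) function at all. A minimal sketch (names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(mu, log_var):
    # Reparameterized sampling -- note: beta plays no role here
    return mu + np.exp(0.5 * log_var) * rng.standard_normal(mu.shape)

def beta_vae_loss(recon_error, mu, log_var, beta=4.0):
    # beta scales ONLY the KL term; beta = 1 recovers the plain VAE
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var))
    return recon_error + beta * kl

z = sample_z(np.zeros(3), np.zeros(3))
loss = beta_vae_loss(0.0, np.zeros(3), np.zeros(3), beta=4.0)
```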

Amazing! Thank God! I was researching recurrent variational autoencoders and badly wanted to understand VAEs. Thank you!! I understood a lot! Please, please, please make more videos!

Hi – what a great video. Where can I find the rest of the code that you show?

We needed a serious and technical channel about the latest findings in DL. That Siraj crap is useless. Keep going! Awesome.

Considering this in combination with CapsNets, I'm starting to feel that GANs are perhaps overrated.

Thanks mate, finally I've found a tutorial in human language!

Native language Dutch, by any chance?

Great thanks for the video and explanation, very informative.

https://www.youtube.com/watch?v=vhtAAMw7ons&list=PLJgl1GI_pBenD0G4wJQ6xxvY7ZhNc2Zh8

Fantastic video, Xander! Additionally, if anyone is looking for implementational resources, Francois Chollet wrote a blog post a couple years back detailing a step-by-step VAE implementation in Keras. Here it is(you have to scroll to near the end of the post to find it): https://blog.keras.io/building-autoencoders-in-keras.html

Absolutely great stuff, Arxiv Insights! Subscribed to your videos for life.

Very helpful, Arxiv! Keep the good-quality videos coming.

Nice video. But about the disentangled VAE part: a larger beta leads to more focus on learning a distribution close to the prior. So what's your prior? I am assuming it is a zero-mean, unit-variance Gaussian. And what is the prior for the plain VAE? If it is a diagonal Gaussian, then the benefit of the disentangled variant is simply a scaling factor.

What a fantastic channel. Insta-sub.

I had to replay a few parts twice because of the information density, but I guess that's a good thing!

this is so good..

Question: in a Stanford video lecture, Justin Johnson said (about two years ago) that Variational Autoencoders are a great idea but don't work that well in practice. Is this still true, or have VAEs become more useful over time?

very cool and amazing!

Great presentation. Very informative. I'd love to see one on GQN.

Awesome explanation, thanks. I still don't really get why it's useful to store things as distributions rather than point estimates, though.

You should make more videos, it's awesome!