Generative models have recently shown the ability to realistically generate data and model the distribution accurately. However, joint modeling of an image with the attribute that it is labeled with requires learning a cross modal correspondence between image and attribute data. Though the information present in a set of images and its attributes possesses completely different statistical properties altogether, there exists an inherent correspondence that is challenging to capture. Various models have aimed at capturing this correspondence either through joint modeling of a variational autoencoder or through separate encoder networks that are then concatenated. We present an alternative by proposing a bridged variational autoencoder that allows for learning cross-modal correspondence by incorporating cross-modal hallucination losses in the latent space. In comparison to the existing methods, we have found that by using a bridge connection in latent space we not only obtain better generation results, but also obtain highly parameter-efficient model which provide 40% reduction in training parameters for bimodal dataset and nearly 70% reduction for trimodal dataset. We validate the proposed method through comparison with state of the art methods and benchmarking on standard datasets.