Dense captioning methods generally detect events in videos first and then generate captions for the individual events. Events are localized solely based on the visual cues while ignoring the associated linguistic information and context. Whereas end-to-end learning may implicitly take guidance from language, these methods still fall short of the power of explicit modeling. In this paper, we propose a Visual-Semantic Embedding (ViSE) Framework that models the word(s)-context distributional properties over the entire semantic space and computes weights for all the n-grams such that higher weights are assigned to the more informative n-grams. The weights are accounted for in learning distributed representations of all the captions to construct a semantic space. To perform the contextualization of visual information and the constructed semantic space in a supervised manner, we design Visual-Semantic Joint Modeling Network (VSJM-Net). The learned ViSE embeddings are then temporally encoded with a Hierarchical Descriptor Transformer (HDT). For caption generation, we exploit a transformer architecture to decode the input embeddings into natural language descriptions. Experiments on the large-scale ActivityNet Captions dataset and YouCook-II dataset demonstrate the efficacy of our method. © 2021 IEEE.