Hexiang Hu is a Computer Science Ph.D. student in the Viterbi School of Engineering at the University of Southern California (USC), working with Prof. Fei Sha. Prior to this, he was a Ph.D. student in the Henry Samueli School of Engineering and Applied Science at the University of California, Los Angeles (UCLA). He earned his Bachelor’s degrees in Computer Science from Zhejiang University and Simon Fraser University with honors. He worked with Prof. Greg Mori during his undergraduate studies. His research interests lie in the fields of Machine Learning, Computer Vision, and Natural Language Processing. [ Résumé ]
This paper presents an alternative evaluation task for visual-grounding systems: given a caption, the system is asked to select the image that best matches the caption from a pair of semantically similar images. The system's accuracy on this Binary Image SelectiON (BISON) task is not only interpretable, but also measures the ability to relate fine-grained text content in the caption to visual content in the images.
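As a rough illustration of how such an accuracy could be computed, here is a minimal sketch. It is not the paper's evaluation code: the function name `bison_accuracy`, the `(caption, correct_image, distractor_image)` example layout, the toy word-overlap scorer, and the choice to count ties as errors are all assumptions made for this example.

```python
from typing import Callable, Sequence, Set, Tuple

# Hypothetical layout: (caption, correct_image, distractor_image).
# Images are stood in for by sets of tags so the sketch stays self-contained.
Example = Tuple[str, Set[str], Set[str]]


def bison_accuracy(
    examples: Sequence[Example],
    score: Callable[[str, Set[str]], float],
) -> float:
    """Fraction of examples where the correct image outscores the distractor."""
    if not examples:
        raise ValueError("need at least one example")
    correct = 0
    for caption, positive, distractor in examples:
        # Assumption: a tie counts as an error (strict comparison).
        if score(caption, positive) > score(caption, distractor):
            correct += 1
    return correct / len(examples)


def overlap_score(caption: str, image_tags: Set[str]) -> float:
    """Toy caption-image score: number of caption words that appear as tags."""
    return float(len(set(caption.lower().split()) & image_tags))


examples = [
    ("a dog on a couch", {"dog", "couch"}, {"cat", "couch"}),
    ("a red bus", {"car", "red"}, {"bus", "red"}),
]
accuracy = bison_accuracy(examples, overlap_score)
```

In this toy run the scorer picks the right image for the first caption and the distractor for the second, so any real caption-image matching model could be dropped in as `score` and compared on the same footing.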
In this paper, we consider the problem of learning to transfer simultaneously across both environments (ENV) and tasks (TASK) and, perhaps more importantly, of doing so by learning from only a sparse set of (ENV, TASK) pairs out of all possible combinations. We propose a compositional neural network that represents a meta rule for composing policies from the environment and task embeddings.
We propose a generic structured model that leverages diverse label relations to improve image classification performance. It employs a novel stacked label prediction neural network, capturing both inter-level and intra-level label semantics. The design of this framework naturally extends to leveraging partial observations in the label space to infer the rest of the label space.
Visual data and text data are composed of information at multiple granularities. In this paper, we investigate modeling techniques for such hierarchical sequential data where there are correspondences across multiple modalities.
We show that the design of the decoy answers has a significant impact on how and what the learning models learn from the datasets. In particular, the resulting learner can ignore the visual information, the question, or both while still doing well on the task.
We propose a novel probabilistic model for visual question answering.
We present a novel segment proposal framework, namely FastMask, which takes advantage of the hierarchical structure in deep convolutional neural networks to segment multi-scale objects in one shot. By leveraging a feature pyramid and sliding-window region attention, we make instance proposal not only fast but also more accurate.