Academic Papers

SPFresh: Incremental In-Place Update for Billion-Scale Vector Search
Approximate Nearest Neighbor Search (ANNS) is now widely used in various applications, including information retrieval, question answering, and recommendation. As the amount of vector data grows continuously, it becomes important to support updates to the vector index, the enabling technique for efficient and accurate ANNS on vectors. Because of the curse of high dimensionality, it is often costly to identify the right neighbors of a single new vector, a necessary step in index update. To amortize update costs, existing systems maintain a secondary index to accumulate updates, which are periodically merged into the main index by globally rebuilding the entire index. However, this approach causes high fluctuations in search latency and accuracy, not to mention that rebuilds require substantial resources and are extremely time-consuming. We introduce SPFresh, a system that supports in-place vector updates. At the heart of SPFresh is LIRE, a lightweight incremental rebalancing protocol that splits vector partitions and reassigns vectors in nearby partitions to adapt to data distribution shifts. LIRE achieves low-overhead vector updates by reassigning only the vectors at the boundaries between partitions, where in a high-quality vector index the number of such vectors is expected to be small. With LIRE, SPFresh provides superior query latency and accuracy to solutions based on global rebuilds, while needing only 1% of the DRAM and less than 10% of the cores at peak compared to the state of the art, on a billion-scale disk-based vector index with a 1% daily vector update rate.
Optimizing Dynamic Neural Networks with Brainstorm
In this paper, we present Brainstorm, a deep learning framework for optimizing dynamic neural networks (NNs) that unifies how dynamism should be expressed. Brainstorm proposes (1) Cell, the key data abstraction that lets model developers express the data granularity at which dynamism exists, and (2) Router, a unified interface that lets model developers express how Cells should be dynamically dispatched. Brainstorm handles the efficient execution of routing actions. This design allows Brainstorm to collect profiles of fine-grained dataflow at the correct granularity. The traceability further opens up a new space of dynamic optimizations that specialize the execution of dynamic NNs to the runtime distribution of dynamism. Extensive evaluations show Brainstorm brings up to 11.7× speedup (3.29× on average) or 42% less memory consumption for popular dynamic neural networks with the proposed dynamic optimizations.
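To make the Cell/Router idea concrete, here is a minimal sketch under illustrative assumptions: the class names and the `score_fn`/`dispatch` interface below are hypothetical stand-ins, not Brainstorm's actual API. A router scores each cell, dispatches it to one of several branches, and records the decision, which is the kind of fine-grained routing trace the abstract describes.

```python
# Minimal sketch of the Cell/Router idea (illustrative names only; not
# Brainstorm's actual API). A Cell wraps a payload at the granularity
# where dynamism exists; a Router decides, per Cell, which branch should
# process it, and records each decision for profiling.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Cell:
    payload: list  # e.g. one token's features

@dataclass
class Router:
    score_fn: Callable[[Cell], int]         # maps a Cell to a branch index
    branches: List[Callable[[Cell], Cell]]  # candidate paths (e.g. experts)
    trace: List[int] = field(default_factory=list)  # routing decisions

    def dispatch(self, cells: List[Cell]) -> List[Cell]:
        out = []
        for cell in cells:
            k = self.score_fn(cell)
            self.trace.append(k)  # fine-grained dataflow profile
            out.append(self.branches[k](cell))
        return out

# Usage: route "long" cells to a heavy branch, the rest to a light one.
router = Router(
    score_fn=lambda c: 1 if len(c.payload) > 4 else 0,
    branches=[lambda c: c,                               # light path
              lambda c: Cell(payload=c.payload[::-1])],  # heavy path
)
outputs = router.dispatch([Cell([1, 2]), Cell([1, 2, 3, 4, 5, 6])])
print(router.trace)  # e.g. [0, 1] -- the runtime dynamism distribution
```

A trace like this is what would let a runtime specialize execution (e.g. kernel selection, placement) to the observed distribution of routing decisions.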
PIT: Optimization of Dynamic Sparse Deep Learning Models via Permutation Invariant Transformation
Dynamic sparsity, where the sparsity patterns are unknown until runtime, poses a significant challenge to deep learning. State-of-the-art sparsity-aware deep learning solutions are restricted to pre-defined, static sparsity patterns due to the significant overhead of preprocessing. Efficient execution of dynamic sparse computation often faces a misalignment between the GPU-friendly tile configurations required for high utilization and the sparsity-aware tile shapes that minimize coverage waste (tile area spent covering zero values). In this paper, we propose PIT, a deep-learning compiler for dynamic sparsity. PIT introduces a novel tiling mechanism that leverages Permutation Invariant Transformation (PIT), a mathematically proven property, to transform multiple sparsely located micro-tiles into a GPU-efficient dense tile without changing the computation results, thus achieving both high GPU utilization and low coverage waste. Given a model, PIT first finds feasible PIT rules for all its operators and generates efficient GPU kernels accordingly. At runtime, with the SRead and SWrite primitives, PIT rules can be executed extremely fast to support dynamic sparsity in an online manner. Extensive evaluation on diverse models shows that PIT can accelerate dynamic sparsity computation by up to 5.9× over state-of-the-art compilers.
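The underlying property is easy to see in a few lines of numpy. In this minimal sketch, plain row gather/scatter stands in for PIT's SRead and SWrite GPU primitives: sparsely located micro-tiles (here, whole rows) are compacted into a dense tile, computed with a dense matrix multiply, and scattered back; because matrix multiplication is invariant under row permutation, the result matches computing on the original layout.

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.zeros((8, 4))
nonzero_rows = np.array([1, 3, 6])  # sparsity pattern known only at runtime
A[nonzero_rows] = rng.standard_normal((3, 4))
B = rng.standard_normal((4, 5))

# SRead (simplified): gather scattered micro-tiles (rows) into a dense tile.
dense_tile = A[nonzero_rows]        # shape (3, 4), fully dense

# Dense, GPU-friendly computation on the compacted tile.
partial = dense_tile @ B            # shape (3, 5)

# SWrite (simplified): scatter results back to their original positions.
C = np.zeros((8, 5))
C[nonzero_rows] = partial

# Permutation invariance: identical to computing on the original layout.
assert np.allclose(C, A @ B)
```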
End-to-End Word-Level Pronunciation Assessment with MASK Pre-training
Pronunciation assessment is a major challenge in computer-aided pronunciation training systems, especially at the word (phoneme) level. To obtain word (phoneme)-level scores, current methods usually rely on alignment components to obtain the acoustic features of each word (phoneme), which limits assessment performance to the accuracy of the alignments. To address this problem, we propose a simple yet effective method, Masked pre-training for Pronunciation Assessment (MPA). Specifically, by incorporating a mask-predict strategy, MPA supports end-to-end training without any alignment component and can largely resolve misalignment issues during prediction. Furthermore, we design two evaluation strategies that enable our model to conduct assessments in both unsupervised and supervised settings. Experimental results on the SpeechOcean762 dataset demonstrate that MPA achieves better performance than previous methods, without any explicit alignment. MPA still has some limitations, such as requiring more inference time and reference text; we expect to address these in future work.
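As a rough illustration of mask-predict scoring (the toy model and scoring rule below are assumptions for exposition, not MPA's actual formulation): mask the positions belonging to a target word, let a masked-prediction model reconstruct them, and read the probability assigned to the canonical units as the word's pronunciation score, with no forced alignment involved.

```python
# Illustrative sketch of mask-predict word scoring (not MPA's exact model).
# A toy "model" returns a distribution over units at each masked position;
# the word-level score is the mean probability of the canonical units.

def toy_masked_predict(units, masked_positions):
    """Stand-in for a masked acoustic/phoneme model: returns, for each
    masked position, a {unit: probability} distribution."""
    vocab = ["AE", "K", "S", "EH", "N", "T"]
    dists = []
    for pos in masked_positions:
        probs = {u: 0.1 for u in vocab}            # flat toy baseline
        probs[vocab[pos % len(vocab)]] += 1.0 - 0.1 * len(vocab)
        dists.append(probs)                        # sums to 1.0
    return dists

def word_score(units, masked_positions, canonical):
    """Score = average probability the model gives the canonical units
    at the masked (word) positions; no alignment component needed."""
    dists = toy_masked_predict(units, masked_positions)
    probs = [d.get(c, 1e-9) for d, c in zip(dists, canonical)]
    return sum(probs) / len(probs)

units = ["AE", "[MASK]", "[MASK]", "EH", "N", "T"]  # one word span masked
print(round(word_score(units, [1, 2], ["K", "S"]), 3))
```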
Accurate and Structured Pruning for Efficient Automatic Speech Recognition
Automatic Speech Recognition (ASR) has seen remarkable advancements with deep neural networks such as Transformer and Conformer. However, these models typically have large model sizes and high inference costs, making them challenging to deploy on resource-limited devices. In this paper, we propose a novel compression strategy that leverages structured pruning and knowledge distillation to reduce the model size and inference cost of the Conformer model while preserving high recognition performance. Our approach uses a set of binary masks to indicate whether to retain or prune each Conformer module, and employs L0 regularization to learn the optimal mask values. To further enhance pruning performance, we use a layerwise distillation strategy to transfer knowledge from the unpruned to the pruned model. Our method outperforms all pruning baselines on the widely used LibriSpeech benchmark, achieving a 50% reduction in model size and a 28% reduction in inference cost with minimal performance loss.
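For readers unfamiliar with learning binary masks under L0 regularization, here is a minimal PyTorch sketch using the hard-concrete gate of Louizos et al. (2018), the standard construction for this technique; the paper's exact parameterization may differ. Each gate multiplies one prunable module's output, and the expected-L0 penalty pushes gates toward exactly zero.

```python
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """Differentiable relaxation of a binary keep/prune mask
    (hard-concrete distribution, Louizos et al. 2018)."""
    def __init__(self, beta=2/3, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(()))  # gate logit
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        if self.training:
            u = torch.rand(()).clamp(1e-6, 1 - 1e-6)    # reparam. noise
            s = torch.sigmoid(
                (u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        s = s * (self.zeta - self.gamma) + self.gamma   # stretch
        return s.clamp(0.0, 1.0)                        # hard rectification

    def l0_penalty(self):
        # Probability the gate is nonzero (expected L0 norm).
        return torch.sigmoid(
            self.log_alpha
            - self.beta * torch.log(torch.tensor(-self.gamma / self.zeta)))

# One gate per prunable module; loss = task_loss + lambda * sum(penalties).
gate = HardConcreteGate()
module_out = torch.randn(4, 8)
masked_out = gate() * module_out   # gated module output
penalty = gate.l0_penalty()        # add to the training loss
```

After training, modules whose gates settle at zero are removed outright, which is what makes the pruning structured rather than element-wise.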
DragNUWA: Fine-grained Control in Video Generation by Integrating Text, Image, and Trajectory
Controllable video generation has gained significant attention in recent years. However, two main limitations persist: Firstly, most existing works focus on either text, image, or trajectory-based control, leading to an inability to achieve fine-grained control in videos. Secondly, trajectory control research is still in its early stages, with most experiments being conducted on simple datasets like Human3.6M. This constraint limits the models’ capability to process open-domain images and effectively handle complex curved trajectories. In this paper, we propose DragNUWA, an open-domain diffusion-based video generation model. To tackle the issue of insufficient control granularity in existing works, we simultaneously introduce text, image, and trajectory information to provide fine-grained control over video content from semantic, spatial, and temporal perspectives. To resolve the problem of limited open-domain trajectory control in current research, we propose trajectory modeling with three aspects: a Trajectory Sampler (TS) to enable open-domain control of arbitrary trajectories, a Multiscale Fusion (MF) to control trajectories in different granularities, and an Adaptive Training (AT) strategy to generate consistent videos following trajectories. Our experiments validate the effectiveness of DragNUWA, demonstrating its superior performance in fine-grained control in video generation.
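Of the three components, Multiscale Fusion is the most mechanical, and a rough sketch helps fix ideas. Everything below is an illustrative assumption rather than DragNUWA's architecture (the rasterization scheme, the 1×1 projections, and all shapes are invented for exposition): a drag trajectory is rasterized into a two-channel (dx, dy) flow map, resized to each feature scale, and added to the features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiscaleTrajectoryFusion(nn.Module):
    """Project a (dx, dy) trajectory flow map to each feature scale and
    add it to the features (one 1x1 projection per scale)."""
    def __init__(self, channels=(32, 64)):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Conv2d(2, c, kernel_size=1) for c in channels)

    def forward(self, features, flow):
        fused = []
        for feat, proj in zip(features, self.projs):
            f = F.interpolate(flow, size=feat.shape[-2:],
                              mode="bilinear", align_corners=False)
            fused.append(feat + proj(f))
        return fused

def rasterize_trajectory(points, size=64):
    """Rasterize drags ((x0, y0) -> (x1, y1)) into a (1, 2, H, W) flow map."""
    flow = torch.zeros(1, 2, size, size)
    for (x0, y0), (x1, y1) in points:
        flow[0, 0, y0, x0] = x1 - x0   # horizontal displacement
        flow[0, 1, y0, x0] = y1 - y0   # vertical displacement
    return flow

flow = rasterize_trajectory([((10, 12), (30, 20))])
feats = [torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32)]
fusion = MultiscaleTrajectoryFusion()
print([t.shape for t in fusion(feats, flow)])
```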
Generative Retrieval for Conversational Question Answering
Effective passage retrieval is crucial for conversational question answering (QA) but challenging due to the ambiguity of questions. Current methods rely on the dual-encoder architecture to embed contextualized vectors of questions in conversations. However, this architecture is limited by the embedding bottleneck and the dot-product operation. To alleviate these limitations, we propose generative retrieval for conversational QA (GCoQA). GCoQA assigns distinctive identifiers to passages and retrieves them by generating their identifiers token by token via an encoder–decoder architecture. In this generative way, GCoQA eliminates the need for a vector-style index and can attend to crucial tokens of the conversation context at every decoding step. We conduct experiments on three public datasets over a corpus containing about twenty million passages. The results show GCoQA achieves relative improvements of +13.6% in passage retrieval and +42.9% in document retrieval. GCoQA is also efficient in terms of memory usage and inference speed: it consumes only 1/10 of the memory and takes less than 33% of the time. The code and data are released at https://github.com/liyongqi67/GCoQA.
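To see how token-by-token generation can return only valid passage identifiers, here is a minimal sketch of greedy decoding constrained by a prefix trie, a common generative-retrieval technique; GCoQA's exact decoding procedure may differ, and the toy scorer below stands in for the encoder–decoder language model.

```python
# Minimal sketch: trie-constrained greedy decoding over passage identifiers.
# Only token continuations that stay on the trie (i.e., prefix some real
# identifier) are ever considered, so every output is a valid identifier.

def build_trie(identifiers):
    trie = {}
    for ident in identifiers:
        node = trie
        for tok in ident:
            node = node.setdefault(tok, {})
        node["<eos>"] = {}
    return trie

def toy_score(context, token):
    """Stand-in for LM next-token scores conditioned on the conversation."""
    return context.count(token)

def constrained_decode(context, trie):
    prefix, node = [], trie
    while True:
        allowed = list(node)  # legal next tokens only
        best = max(allowed, key=lambda t: toy_score(context, t))
        if best == "<eos>":
            return prefix
        prefix.append(best)
        node = node[best]

identifiers = [["world", "cup", "2022"], ["world", "war", "two"]]
trie = build_trie(identifiers)
print(constrained_decode("who won the world cup", trie))
```

Restricting each step to the trie's children is what removes the need for a vector index: the "index" is the identifier space itself, traversed during decoding.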
Research Topics
Societal AI
How Will AI Affect Human Society?
With the strong support and assistance of Dr. Junming Huang of Princeton University, Microsoft Research Asia held the society and technology session of its "Societal AI" workshop series. At the workshop, experts from the social sciences and computer science focused on major questions about future trends under the influence of AI, including digital equality, social fairness, and shifts in economic structure, striving to bring rigorous thinking to this still-nascent topic.
In the Era of Large Models, How Do We Evaluate Artificial and Human Intelligence?
With the strong support and assistance of Professor Fang Luo of the Faculty of Psychology at Beijing Normal University, Microsoft Research Asia held the psychology and education session of its "Societal AI" workshop series. At the workshop, leading experts from psychometrics, education, and computer science discussed the feasibility of applying psychometric techniques to AI evaluation and how large models can empower psychological assessment, and looked ahead to AI-assisted education of the future.
Intellectual Property, Privacy, and Technology Misuse: How Do We Face the Legal and Ethical Challenges of the Large-Model Era?
With the strong support and assistance of Associate Professor Rui Guo of Renmin University of China Law School, Microsoft Research Asia held the law and ethics session of its "Societal AI" workshop series. At the workshop, leading experts from law and computer science focused on questions of legal norms and social ethics raised by AI development, including large models and intellectual property, large models and privacy, and the misuse of large-model technology, in the hope of sparking deeper reflection and exploration on this most urgent and critical topic.
Multimodal
DragNUWA: A Highly Controllable Video Generation Model
DragNUWA lets users drag objects or the background directly in an image; the model then automatically translates the drag into plausible camera movement or object motion and generates the corresponding video. By fusing three key control factors (text, image, and trajectory), DragNUWA achieves excellent controllable video generation at the semantic, spatial, and temporal levels.
Document Foundation Models Lead Document Intelligence Toward a Unified Multimodal Approach
Microsoft Research Asia has developed a series of document foundation models for multimodal tasks in document intelligence. They achieve excellent results on visually rich document datasets such as forms, receipts, invoices, and reports, have earned broad recognition from academia and industry, and are already used in multiple Microsoft products, empowering the digital transformation of enterprises and institutions.
From LLM to MLLM: The Multimodal Large Language Model KOSMOS-1 Gives Language Models the Ability to See the World
Multimodal perception is a prerequisite for artificial general intelligence, both for acquiring knowledge and for engaging with the real world. Unlocking multimodal input can greatly broaden the application of language models to more high-value domains. KOSMOS-1, the multimodal large language model proposed by Microsoft Research Asia, can perceive general modalities, follow instructions (zero-shot learning), and learn in context (few-shot learning).
AI for Science
Microsoft Research Team Wins the First AI Drug Development Algorithm Competition
AI-driven drug discovery is one of the most important future applications of artificial intelligence. Since the first outbreak of SARS-CoV-2, small-molecule drug development against the virus has drawn wide attention, and the recently held first AI drug development algorithm competition focused on exactly this. In the competition, the team from Microsoft Research's AI for Science center won the championship with outstanding results, powered by its innovative AI model systems AI2BMD and ViSNet.
Distributional Graphormer: From Molecular Structure Prediction to Equilibrium Distribution Prediction
Microsoft Research has released Distributional Graphormer (DiG), a deep learning framework for predicting the equilibrium distribution of molecular structures. DiG can rapidly generate realistic and diverse conformations, laying the groundwork for a breakthrough from single-structure prediction to equilibrium distribution prediction.
AI for Science (AI4Science) Empowers the Fifth Paradigm of Scientific Discovery
Microsoft Research has formed a new AI for Science team dedicated to turning the fifth paradigm into reality.
Sustainability
Protecting Our Planet with the Power of Technology: Microsoft Research Asia Advances Sustainable Development
On the occasion of the 50th World Environment Day in 2023, take a look at Microsoft Research Asia's research and application achievements in sustainability.
Microsoft Launches the Climate Research Initiative, Joining the Global Academic Community to Promote Transformative Innovation in Climate Science
Microsoft Works with Domain Experts to Empower Global Sustainable Development
Climate change, pandemics, development divides... What more must we do to meet these challenges?
Microsoft scientists from around the world discussed how to build a resilient and sustainable global society.
Industry Empowerment
BatteryML: An Open-Source Machine Learning Tool for One-Stop Analysis and Prediction of Battery Performance
To better analyze battery performance and predict battery lifetime, Microsoft Research Asia has developed and open-sourced BatteryML, a one-stop machine learning tool, in the hope of bringing together more expertise to jointly advance battery research.
MABIM: A Crucible for Multi-Agent Reinforcement Learning Algorithms
Microsoft Research Asia has open-sourced MABIM on GitHub, a learning and testing platform that flexibly adapts to the diverse challenges of multi-agent reinforcement learning, making it easier to evaluate MARL algorithms and transfer them to real-world application scenarios.
Qlib Upgraded: Can Reinforcement Learning Reshape Financial Decision-Making?
After more than two years of in-depth exploration, Qlib has received a major update. Building on the original AI-powered quantitative finance framework, it introduces new paradigms based on reinforcement learning and meta-learning, along with new scenarios for order-execution optimization and market-dynamics modeling, helping practitioners tackle more complex financial challenges with more advanced and diverse AI techniques.
Research Activities