I am an AI Resident at Facebook AI. My research interests include human-AI / human-robot interaction, multimodal representation learning, and reinforcement learning.

In 2021, I graduated with my M.S. and B.S. in Computer Science from Stanford University, with a concentration in artificial intelligence. I was a member of the Stanford Intelligent and Interactive Autonomous Systems Group, where I was fortunate to be advised by Prof. Dorsa Sadigh and mentored by Sidd Karamcheti and Erdem Bıyık.

Previously, I did research at MIT Lincoln Laboratory on machine learning for threat prediction and at The University of Edinburgh on accessible voice interfaces. I have also interned at Facebook Messenger and at Telling.ai, a startup spun off from Carnegie Mellon's Language Technologies Institute that gleans biometric information from voice samples.

At Stanford, I was President of Stanford Speakers Bureau, managing high-profile speaker events for the university and surrounding community. I also have thirteen years of training in piano. I led Stanford MELODY, a volunteer organization that provides free piano tutoring to children from underprivileged backgrounds. I am the proud younger brother of pianist Rishi Mirchandani.

Research

Conference Papers

S. Mirchandani, S. Karamcheti, and D. Sadigh. ELLA: Exploration through Learned Language Abstraction. Conference on Neural Information Processing Systems (NeurIPS), 2021 (to appear).

Abstract
Building agents capable of understanding language instructions is critical to effective and robust human-AI collaboration. Recent work focuses on training these instruction-following agents via reinforcement learning in environments with synthetic language; however, these instructions often define long-horizon, sparse-reward tasks, and learning policies requires many episodes of experience. To this end, we introduce ELLA: Exploration through Learned Language Abstraction, a reward-shaping approach that correlates high-level instructions with simpler low-level instructions to enrich the sparse rewards afforded by the environment. ELLA has two key elements: 1) a termination classifier that identifies when agents complete low-level instructions, and 2) a relevance classifier that correlates low-level instructions with success on high-level tasks. We learn the termination classifier offline from pairs of instructions and terminal states. Notably, in a departure from prior work on language and abstraction, we learn the relevance classifier online, without relying on an explicit decomposition of high-level instructions into low-level instructions. On a suite of complex grid-world environments with varying instruction complexities and reward sparsity, ELLA shows a significant gain in sample efficiency across several environments compared to competitive language-based reward-shaping and no-shaping methods.
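
The core shaping step is simple to state in code. Below is a minimal sketch of the idea, assuming hypothetical `termination_clf` and `relevance_clf` callables; it illustrates the mechanism rather than reproducing the paper's implementation.

```python
# Minimal sketch of ELLA-style reward shaping. `termination_clf` and
# `relevance_clf` are hypothetical stand-ins for the paper's two learned
# classifiers, not its actual API.

def shaped_reward(env_reward, state, high_level_instr, low_level_instrs,
                  termination_clf, relevance_clf, bonus=0.1):
    """Add a bonus when a relevant low-level instruction is completed."""
    shaping = 0.0
    for instr in low_level_instrs:
        # Termination classifier: did the agent just complete `instr`?
        completed = termination_clf(state, instr)
        # Relevance classifier: is `instr` relevant to the high-level task?
        relevant = relevance_clf(high_level_instr, instr)
        if completed and relevant:
            shaping += bonus
    return env_reward + shaping
```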


Preprints

R. Bommasani et al. On the Opportunities and Risks of Foundation Models. Center for Research on Foundation Models (CRFM), 2021.
  • Robotics (§2.3): S. Karamcheti, A. Chen, S. Mirchandani, S. Nair, K. Srinivasan, K. Hsu, J. Bohg, D. Sadigh, and C. Finn.

Abstract
AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature.

Older Projects

Developing a Pragmatic Framework for Evaluating Color Captioning Systems (2019)

with Benjamin Louis Newman and Julia Gong
Abstract
We present a framework for evaluating natural language descriptions in the color captioning problem. In this task, two agents are given a set of three colors, and one of them generates a description of a target color for the other agent. Our approach is pragmatically motivated: we measure the effectiveness of a caption in terms of how well a trained model can select the correct color given the caption. We investigate four models, two of which explicitly model pragmatic reasoning, and we formulate a performance metric based on Gricean maxims to compare the effectiveness of the models. Our results indicate that though modeling pragmatic reasoning explicitly does improve evaluation performance by a small margin, it may not be essential from a practical perspective. Overall, we believe this evaluation framework is a promising start for evaluating the natural language descriptions produced by captioning systems.
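
The central measurement is compact enough to sketch. Here is a hedged version, assuming a hypothetical `listener_probs` model; any trained listener could be plugged in.

```python
# Sketch of the pragmatic evaluation: score captions by whether a trained
# listener model picks the target color out of three candidates.
# `listener_probs` is a hypothetical placeholder for any such model.

def caption_accuracy(examples, listener_probs):
    """examples: iterable of (caption, [c0, c1, c2], target_index)."""
    correct, total = 0, 0
    for caption, colors, target in examples:
        probs = listener_probs(caption, colors)  # distribution over 3 colors
        guess = max(range(len(colors)), key=lambda i: probs[i])
        correct += int(guess == target)
        total += 1
    return correct / total
```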

Analyzing an Extension of the Rational Speech Acts Model for the Figurative Use of Number Words (2019)

with Bhagirath Mehtha
Abstract
Rules governing cooperative speaking have been quantified through the Rational Speech Acts (RSA) model. The model encapsulates Gricean pragmatics through a series of probabilistic inferences over a hypothetical literal listener, a pragmatic speaker, and a pragmatic listener to encode the probabilities of utterances and meanings. We study the work of Kao et al. (2014), who extend the RSA model to account for different conversational goals across the dimensions of literal meaning and affect, or subtext, in the interpretation of number words. The central objective of this work is to computationally implement this extended RSA modeling framework as devised by Kao et al. and to analyze the strengths and weaknesses of the approach.
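
For context, the vanilla RSA recursion that this extension builds on can be implemented in a few lines. The sketch below covers only the standard literal-listener / pragmatic-speaker / pragmatic-listener chain, not Kao et al.'s extension with latent conversational goals over literal meaning and affect.

```python
import numpy as np

# Vanilla RSA (textbook form). lexicon[u, m] = 1 if utterance u is literally
# true of meaning m; prior[m] = P(m); alpha is the speaker rationality.

def rsa(lexicon, prior, alpha=1.0):
    # Literal listener: L0(m | u) ∝ lexicon[u, m] * P(m)
    l0 = lexicon * prior
    l0 = l0 / l0.sum(axis=1, keepdims=True)
    # Pragmatic speaker: S1(u | m) ∝ exp(alpha * log L0(m | u))
    s1 = np.exp(alpha * np.log(l0.T + 1e-12))
    s1 = s1 / s1.sum(axis=1, keepdims=True)
    # Pragmatic listener: L1(m | u) ∝ S1(u | m) * P(m)
    l1 = s1.T * prior
    return l1 / l1.sum(axis=1, keepdims=True)  # rows indexed by utterance
```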

Designing a Spoken Dialogue System to Assist L2 Acquisition in People who are Hard of Hearing (2018)

Advised by Dr. Maria Wolters, University of Edinburgh
Abstract
Spoken dialogue systems can be a tool for conversational practice to support second language learning. In order to assess the inclusivity of such systems, we conduct a pilot study on the effect of hearing loss on interaction efficacy with a conversation practice system. Specifically, we examine the restaurant ordering context, and measure how well participants (both fluent speakers and language learners) identify and recall food options under three conditions of simulated hearing loss. We find that inherent qualities of the synthetic voice under hearing loss conditions impact subjective intelligibility scores. Additionally, for both selection and recall tasks, the effect of hearing loss under moderate conditions was exacerbated by lack of language fluency. Based on these results, we explore the optimization of an objective speech intelligibility metric by preprocessing the text with Speech Synthesis Markup Language (SSML) properties, and offer a technique for this optimization based on Gaussian processes.
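
The Gaussian-process step can be sketched with an off-the-shelf Bayesian optimizer. The snippet below uses scikit-optimize's `gp_minimize`; the intelligibility function is a dummy stand-in (a real objective would score synthesized speech), and the SSML-style parameter ranges are illustrative assumptions.

```python
from skopt import gp_minimize

# Dummy intelligibility score peaking at default rate/pitch; a real system
# would synthesize speech with these SSML-style settings and score it with
# an objective intelligibility metric.
def intelligibility(rate, pitch):
    return -((rate - 1.0) ** 2 + (pitch / 6.0) ** 2)

result = gp_minimize(
    lambda p: -intelligibility(p[0], p[1]),  # gp_minimize minimizes
    dimensions=[(0.5, 2.0),                  # speech-rate multiplier
                (-6.0, 6.0)],                # pitch shift (semitones)
    n_calls=20,
)
print(result.x)  # best (rate, pitch) found
```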

Using Partially Observable MDPs to Learn Language in a Spatial Reference Game (2018)

with Benjamin Louis Newman and Levi Lian
Abstract
Much of early human language learning takes place in an unsupervised setting. In this work, we investigate how autonomous agents can use goal-oriented tasks in a spatial reference game to learn language. This problem is made difficult by the high dimensionality of the state and action spaces, as well as the fact that it relates achieving one objective (i.e., reaching a goal) to achieving a secondary one (i.e., learning directional language). We formalize this problem as both a Markov decision process (MDP) and a partially observable Markov decision process (POMDP). We analyze the performance of the agent under different conditions using dynamic programming and online POMDP solution techniques. We perform and visualize simulations of the policies and real-time updates of belief states. We observe that knowing the language can influence the time it takes to arrive at a goal state, and that completely learning the language can be incentivized by explicitly optimizing for that task.
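
The belief-state update driving those simulations is the standard discrete POMDP filter. A generic sketch (textbook form, not this project's code):

```python
import numpy as np

# Discrete POMDP belief update: b'(s') ∝ O(o | s', a) * Σ_s T(s' | s, a) b(s)

def belief_update(b, a, o, T, O):
    """b: belief over states, shape (S,); T[a]: (S, S) array with
    T[a][s, s'] = P(s' | s, a); O[a]: (S, num_obs) array with
    O[a][s', o] = P(o | s', a)."""
    predicted = b @ T[a]              # Σ_s T(s' | s, a) b(s)
    updated = predicted * O[a][:, o]  # weight by observation likelihood
    return updated / updated.sum()    # renormalize
```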

Real-time Acoustic Modeling with Convolutional Neural Networks (2018)

with Ying Hang Seah and Levi Lian
Abstract
Acoustic modeling with Hidden Markov Models and Gaussian Mixture Models has been the standard approach for automatic speech recognition (ASR) until the introduction of Convolutional Neural Networks (CNNs). We investigate the use of CNNs for a smaller task—phoneme recognition—and extend the model to allow for real-time classification. The real-time nature of the task poses challenges for streaming both the input and the output. We show that the CNN is able to produce decent performance for audio inputs given its unique characteristics. Additionally, we adapt the real-time classification task to streaming data visualization. This provides a base for a phoneme practicing tool that can be used by people with speaking difficulties. Future research can improve the usability of this system and extend the approach beyond English phonemes.
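
As a rough illustration of the model family, here is a toy 1-D CNN over log-mel frames in PyTorch; the layer sizes and the 39-class phoneme set are assumptions for the sketch, not the project's architecture, and a streaming system would apply it over sliding windows of frames.

```python
import torch.nn as nn

# Toy 1-D CNN for phoneme classification over 40-dim log-mel features.
# Illustrative only; not the original architecture.

class PhonemeCNN(nn.Module):
    def __init__(self, n_mels=40, n_phonemes=39):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_mels, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, n_phonemes),  # logits per phoneme class
        )

    def forward(self, x):  # x: (batch, n_mels, frames)
        return self.net(x)
```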

Analyzing Approaches to Remove Gender Bias from Word Embeddings (2019)

Abstract
Recent literature has diagnosed that word embeddings reify social biases to a disturbing degree. In this paper, I focus on gender bias. After introducing how embeddings can capture bias, I present techniques for identifying and mitigating bias from the perspective of Gendered Innovations and the Methods of Sex and Gender Analysis (Schiebinger et al., 2018). Next, I explain the implications of debiased word embeddings for natural language processing. Finally, I discuss the inadequate focus on nonbinary gender in current work on word embeddings.
This work is an attempt to bridge the gap between the technical approaches to removing computational gender bias and the methods of analysis in the Gendered Innovations project.
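
A representative mitigation technique from this literature is the "neutralize" projection of Bolukbasi et al. (2016), which removes a word vector's component along an estimated gender direction. A minimal sketch (the direction `g` below is the simplest possible estimate, an assumption for illustration):

```python
import numpy as np

# "Neutralize" step in the style of Bolukbasi et al. (2016): a sketch of one
# technique from this literature, not this paper's full analysis.

def neutralize(w, g):
    """w: word vector; g: gender direction, e.g. v_she - v_he."""
    g = g / np.linalg.norm(g)
    return w - np.dot(w, g) * g  # project out the gender component
```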

Approximate Solutions to the Vehicle Routing Problem with Time Windows (2016)

Summary
At the 2016 Pennsylvania Governor's School for the Sciences, I was part of a ten-person team working on a variant of the Traveling Salesman Problem known as the Vehicle Routing Problem with Time Windows (VRPTW). The objective of the problem is to find the most efficient routes for a fleet of trucks making deliveries to customers with given locations and delivery time windows. Our approach solved eight instances of VRPTW as efficiently as the world-record solutions to those instances. More information about the project is available at cmu.edu and the PGSS Blog.
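
The time-window constraint that makes the problem hard is easy to state in code. Below is a minimal feasibility check for a single route; the travel times, windows, and service time are hypothetical inputs, not our actual solver.

```python
# Check one vehicle's route against VRPTW time windows: arriving early means
# waiting; arriving after the window closes makes the route infeasible.

def route_feasible(route, travel_time, windows, service_time=0.0):
    """route: customer IDs visited in order, starting from the depot (0);
    travel_time[i][j]: travel time from i to j; windows[c] = (earliest, latest)."""
    t, prev = 0.0, 0
    for c in route:
        t += travel_time[prev][c]
        earliest, latest = windows[c]
        if t > latest:
            return False                     # missed the delivery window
        t = max(t, earliest) + service_time  # wait if early, then serve
        prev = c
    return True
```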

Automated Illustration of Text to Improve Semantic Comprehension (2016)

Abstract
Over a million Americans suffer from aphasia, a disorder that severely inhibits language comprehension. Medical professionals suggest that individuals with aphasia have a noticeably greater understanding of pictures than of the written or spoken word. Accordingly, we design a text-to-image converter that augments lingual communication, overcoming the highly constrained input strings and predefined output templates of previous work. This project offers four primary contributions. First, we develop an image processing algorithm that finds a simple graphical representation for each noun in the input text by analyzing Hu moments of contours in photographs and clipart images. Next, we construct a dataset of human-centric action verbs annotated with corresponding body positions. We train models to match verbs outside the dataset with appropriate body positions. Our system illustrates body positions and emotions with a generic human representation. Third, we design an algorithm that maps abstract nouns to concrete ones that can be illustrated easily. To accomplish this, we use spectral clustering to identify abstract noun classes and match these classes with representative concrete nouns. Finally, our system parses two datasets of pre-segmented and pre-captioned real-world images (ImageClef and Microsoft COCO) to identify graphical patterns that accurately represent semantic relationships between the words in a sentence. Our tests on human subjects establish the system's effectiveness in communicating text using images.
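
The shape-matching step in the first contribution relies on Hu moments, which OpenCV exposes directly; the sketch below shows the descriptor and a standard invariant comparison, not the full matching pipeline.

```python
import cv2

# Hu-moment shape comparison (OpenCV's standard API); illustrative of the
# contour-matching idea, not the project's actual pipeline.

def hu_signature(contour):
    """7-element Hu-moment descriptor of a contour."""
    return cv2.HuMoments(cv2.moments(contour)).flatten()

def shape_distance(contour_a, contour_b):
    """Scale- and rotation-invariant distance between two contours."""
    return cv2.matchShapes(contour_a, contour_b, cv2.CONTOURS_MATCH_I1, 0.0)
```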

Taiwan International Science Fair, 2016
– Selected, one of two students to represent the U.S.
– First Prize, Computer Science and Information Engineering category
MIT THINK Scholars Program, 2015
– Selected, one of six national finalists
Pittsburgh Regional Science and Engineering Fair, 2015
– First Place, Computer Science category
– Recipient, Intel Excellence in Computer Science Award
– Recipient, Sponsor Award from Carnegie Mellon University
– Recipient, Carnegie Science Award (awarded to the top project overall in grades 9-12)
Intel International Science and Engineering Fair, 2015
– Recipient, Fourth Place Grand Award in Systems Software category
– Recipient, Sponsor awards from China Association for Science and Technology, Association for the Advancement of Artificial Intelligence
– Recipient, Trip to European Organization for Nuclear Research – CERN

Fuzzy Logic Based Eye-Brain Controlled Web Access System (2014)

Abstract
Accessing the Web is crucially important in today’s society because of the communication, education, and entertainment opportunities it affords. Paralyzed or paretic individuals are unable to capitalize on these opportunities using traditional human-computer interaction methods. We develop a low-cost web browsing system for such individuals, integrating eye and brain control in a novel fashion to relay and interpret navigation commands. The system combines gaze position estimates obtained from a new image processing algorithm with brain concentration levels sensed and transmitted by an electroencephalogram headset. Since user intent may itself be uncertain, the system incorporates a novel fuzzy logic algorithm for combining brainwave and eye position inputs to determine the user’s targeted hyperlink. The algorithm adopts an approach based on exponential smoothing to efficiently keep a record of historical signals. Experimental evaluation established that the first-attempt success rate of the system lies between 87% and 95% with 95% confidence. Error recovery accuracy is 98.4%, resulting in a second-attempt success rate of 99.1%.
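
The historical-signal record mentioned above uses standard exponential smoothing, which is one line of math: s_t = α·x_t + (1 − α)·s_{t−1}. A generic sketch (the smoothing factor is a placeholder, not the system's tuned value):

```python
# Exponential smoothing of a noisy signal stream: each output blends the new
# sample with the running estimate. alpha here is an illustrative placeholder.

def smooth(samples, alpha=0.3):
    s = None
    for x in samples:
        s = x if s is None else alpha * x + (1 - alpha) * s
        yield s
```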

Intel International Science and Engineering Fair, 2014
Recipient, $1500 Web Innovator Award from GoDaddy

The ElderBots Project: An Open-Source Social Robot for the Elderly (2014)

Summary
During an internship at Carnegie Mellon University’s Quality of Life Technology Center, I worked on an open-source “social robot” designed to help the elderly cope with isolation or depression. Specifically, I worked on an iOS app controller to interface with the robot. The app was published in the App Store in August 2014. (A video is available here.) Additional information about the project is available at elderbots.org and romibo.org.

The Effect of Typeface on Ink and Toner Costs (2014)

Summary
In my sixth-grade science fair project, I estimated how much my school district would save in ink and toner costs by switching to a more ink-efficient typeface. I published my findings in the Journal of Emerging Investigators, and the journal's editors encouraged me to extend the analysis to the United States Government. The response was extremely positive: I was fortunate to share my findings on CNN, to be featured in Science Magazine, and to receive an invitation from HP's CEO to present my findings at HP's headquarters in Palo Alto, CA.

Media Coverage
Science, CNN, Associated Press, CBS This Morning, Financial Times Magazine, Forbes, HuffPost Live, TIME

Featured, Science Magazine
– Invited by Ms. Meg Whitman, CEO of Hewlett-Packard, to tour HP Labs in Palo Alto, CA and present findings to HP engineers
Recognized, Outstanding Community Service, State of California Senate, 2014

Note: I have presented a response to some misinformed articles attempting to "debunk" the study. To obtain a copy of the full rebuttal, please contact me.

Music

Prior to college, I studied piano for a number of years under the tutelage of Prof. Luz Manríquez. Some old recordings are featured below.

Senior Recital

July 2017

Piano Concerto No. 2, I. Moderato (S. Rachmaninoff)
Ballade No. 1 in G minor (F. Chopin)
Toccata in E-flat minor (A. Khachaturian)
Piano Concerto No. 3, Op. 50, I. Allegro molto (D. Kabalevsky)
Blue Johannes (R. Vali, b. 1952)
Piano Concerto No. 3, I. Andante – Allegro (S. Prokofiev)
Toccata in E minor (J. S. Bach)
Suggestion Diabolique (S. Prokofiev)

Contact

Feel free to email me at suvir@cs.stanford.edu, or use the form below.