SUGILITE

(CMU Human-Computer Interaction Institute REU Internship)

Completed: Summer 2020

SUGILITE: the name of a pink-purple rock, and a character on the Cartoon Network animated show, Steven Universe.

But, since I was a Research Intern at Carnegie Mellon’s Human-Computer Interaction Institute (HCII), SUGILITE refers to a multi-modal smartphone intelligent Android agent that allows users to use programming by demonstration (PBD) & natural language instructions to teach it to perform various tasks in different third-party apps.

I worked under my Ph.D. mentor Toby Li as well as Professor Brad Myers, and this opportunity was via CMU’s REU program (Research Experiences for Undergraduates) in their HCI Institute (check it out here). I wanted to reflect upon my experiences and what I learned. To skip technical mumbo-jumbo, feel free to glaze over the recaps. Otherwise, let’s get started!

Brief Table of Contents

prep, problems, goals
project 1/2: fuzzierLookup (recap, challenges, iterations)
project 2/2: time expressions (recap, challenges, iterations)
key takeaways

1. prep, problems, goals

For SUGILITE, the core problem I had to address was the app’s semantic parsing: translating a user utterance into a logical, executable form that can automatically execute a script, such as ordering some kind of drink via the Starbucks app (e.g. “order a coffee” → call (get “order something”) procedureName call set_param (string “drink name”) (string “coffee”)).

(For a bit of context, SUGILITE’s main purpose is to address the domain limitations of intelligent agents like Siri or Alexa - it uses various app interfaces to learn how to automate smartphone tasks across apps and different domains. For instance, knowing how to use the Weather and Starbucks app to retrieve information to execute a simple command, “order an iced latte if it’s hot”).

Oh, and in my kind of freaked-out fashion, I decided to do some semantic parsing prep in the few weeks leading up to the internship. I:

Realized the idea of lambda calculus from learning OCaml in CS51 won’t be leaving me anytime soon
Read and outlined my mentor’s thesis proposal paper
Did Codecademy Java brushup (it’s been a while since taking AP CS in senior year!)
Got annoyed with myself for not taking Semantics in the spring, before reminding myself I got to do a cool R/Python twitter analysis instead
Read and outlined a good chunk of a Semantics textbook
Set up Linux on my Windows machine via WSL & VSCode (quite a nightmare. Let this be a cautionary tale - please just get an Apple computer)
Set up SEMPRE and briefly did the tutorial
Read papers and/or watched video tutorials, took notes on:

Because of my interest in natural language processing, and also because I was a complete newbie to any kind of research or internship for that matter, I decided to set some modest, overarching goals as well:

Have an independent contribution that I implement
Build a new skill (become more expert in NLP)
Understand how my work fits into the overall project goals

2. project 1/2: fuzzierLookup

[ recap ]

Key Tools: Java, SEMPRE framework (Stanford CoreNLP), WordNet

After onboarding, getting familiar with the codebase, and overcoming some challenges (see below), I set out on the central part of my summer experience: fuzzierLookup.

Zooming into the semantic parsing problem, I worked on how to match a user instruction to commands the agent has stored, with increased flexibility. For instance, my agent knows the utterance “call a taxi” and I say “get me a cab,” how could SUGILITE know to match those two up?

Before, the “fuzzy lookup” method just checked whether a searched utterance versus a stored utterance contains the first or last word of the other (so, “call a taxi” would not match to “get me a cab”).

So, I designed a fuzzierLookup: a sentence similarity scorer between the lexicon utterance and the user utterance to retrieve possible answers, inspired by concepts/papers I read about NLP, information retrieval, corpus/sentence/string similarity-scoring techniques, and even some measures of machine translation accuracy (ex: BLEU). I leveraged Stanford CoreNLP and was incorporated WordNet, a lexical database. A key principle in my design choices was that the end goal was to retrieve stored procedures for an end-user, rather than a most accurate similarity scorer. For instance, I adjusted to less computationally intensive methods and reconfigured to store local caches of data to account for speed/efficiency. Moreover, buy a coffee/buy a tea, and I am at Starbucks/I am near Starbucks are very similar semantically – but are very distinct in regards to user tasks, which is why my scorer was purposely stricter in the word-relation aspect.

First, to calculate a syntax (sentence structure) score, I substitute in any word roots/synonyms into the two sentences, then find a score of 1.0 minus the normalized Levenshtein’s distance, which calculates order variability.
I calculated the semantic similarity by 1.0 - the cosine distance between two vectors of size n, for n unique words amongst the two sentences. Each vector represents the semantics of each sentence with component scores determined by the presence of the word/synonym/root, weighed by its part-of-speech and frequency to prioritize content words (e.g. verbs, nouns) and rarer words.
Finally, I optimized the weight given to these two scores when retrieving possible sentences by training the data. This resulted in a 46% to 80% improvement in accuracy for the dataset we used for testing, being able to handle synonyms (like taxi/cab), increased variation, such as “it’s raining outside” to “it is rainy”, and it could link Boolean operators with more flexibility, knowing that “I have unread emails and unseen messages” is equal to “I have unread emails” AND “I have unseen messages”.
There are still restrictions in variation: not recognizing My battery is low” is “I have low battery,” and interesting “nearby” candidates, with “turn the light on” close to “turn down the brightness.” The iterative dialogue between the agent and user to confirm choices thus remains an important safeguard against incorrect retrievals, which is still less costly than over-generous procedure retrievals that are harder to reverse.

So, with a fuzzier lookup functionality than before, SUGILITE users can give commands to execute with more flexibility in their speech, and thus interact with the agent in a more natural manner!

[ challenges ]

Getting all the code-base, debugger and parser set-up on VSCode (after an emotional roller-coaster of 2 days. Thank you to my mentor for his everlasting patience with Windows.)
Familiarizing myself with parser rules and the workflow of the codebase: I ended up needing to diagram it out in an actual notebook.
Figuring out how to incorporate the surprisal measure (hint: found a csv file of word frequencies online, and read it into a local cache)
Navigating WordNet (also, handy Googling)
Running many training instances to debug 100000 typos in my datasets, also general debugging for everything! Learning how to use a debugger efficiently was a difficult but rewarding experience. The pain of seeing literally thousands of NullPointerExceptions is… quite painful.
Scaling my idea to an appropriate efficiency (see iterations!)
Finding purpose in my work (see key takeaways)

[ iterations ]

After building my initial similarity scorer, I had to make many more adjustments, after retraining the model and such, to improve my results in either efficiency (not taking 10000 years to generate a score), or accuracy. So, a couple included:

Scaled up syntax score by replacing lemmas and synonyms in the sentences
Reconfigured queries with a new class to avoid repeated WordNet retrievals
Use simple heuristics to avoid noisy data (ex: any score < 0.4 → 0) and eliminate unlikely matches (if sentences differed in word count by 3+ words)
To avoid repeated computations, set up local caches to store already-computed scores for phrase pairs
At the end of the program: with the initial class hierarchy I set up, I decided to take the risk and refactor my scorer with separate classes for cleanliness, which made training 3x as fast. I was so angry at myself, and pleased at the time time.
Completely refactored fuzzierLookup to generate two separate scores and weigh them separately in training for maximized accuracy
incorporated POS weighing into the semantic score for increased accuracy

Below are the slides I used to present my fuzzierLookup at the end of the internship! The 5-minute time constraint unfortunately didn’t allow me to mention part 2.

SUGILITE (final copy) presentation (1).jpg

SUGILITE (final copy) presentation (2).jpg

SUGILITE (final copy) presentation (3).jpg

SUGILITE (final copy) presentation (4).jpg