Large language models hold potential as conversational assistants for language learners. A common issue, however, is that LLM assistants can produce output using vocabulary or grammar beyond the learner's ability to understand. In this report we demonstrate a novel decoding-time technique, "lexical logit boost", that encourages an LLM to use words from an arbitrary "vocabulary set" supplied by the user at inference time. We focus specifically on the task of generating learner-appropriate vocabulary example sentences, and we provide simple evaluations for this task. The "boost" hyperparameter required tuning to balance the LLM's adherence to the vocabulary set against generation quality. In the end, using lexical logit boost with Llama 3.1 8B, we are able to outperform prompting techniques applied to the larger model Gemini 1.5 Flash.
In language learning, one of the most effective ways to acquire fluency is conversational practice. A persistent problem, however, is that predefined proficiency "levels" of a language do not generalize well to individual learners. The learning path of a language learner outside the classroom often cannot be expressed as a linear progression through such levels: depending on the individual's reasons for learning the language, such as travel, casual conversation, or mastery, different vocabulary may be more relevant at different times.
When the vocabulary in a conversation is too advanced, language learners are forced to rely on external aids such as dictionaries and translation tools. This disrupts the flow of practice and may expose the learner to even more advanced, unfamiliar words, which is especially problematic for learners who are just starting or have a limited vocabulary base. Repeated disruptions of this kind can lead to discouragement and disengagement from the language-learning process. The goal of this project is to adapt generated language to an individual's vocabulary so that the flow of practice is more natural, with fewer disruptions.
LLMs, with their ability to generate diverse and context-rich responses, hold promise as great assets for language learners: they are able to simulate a conversational partner. In this report, we present the results of this project's exploration of various methods for sentence generation.
Our experiment tested the lexical logit boosting (LLB) prototype, focusing on generating and evaluating example sentences with specific vocabulary words. The vocabulary set consisted of the 500 most common English words scraped from a website that did not lemmatize words. This ensured words like "is" were included, unlike in lemmatized lists where they would be grouped under "be." The choice of this vocabulary set was arbitrary and motivated by its simplicity and suitability for prototyping, although other sets (e.g., words starting with "e") could have been used.
From this list, 20 target words were randomly selected. Various configurations of large language models (LLMs) were then prompted to generate example sentences using these target words. The Llama 3.1 8B Instruct model was tested with different values of the boost hyperparameter 𝑏, where 𝑏 = 0 served as the control. The Gemini 1.5 Flash model, accessed via Google's API, was used as a reference state-of-the-art unmodified model.
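To make the mechanism concrete, the following is a minimal sketch of how a lexical logit boost could be implemented as a decoding-time logits processor in Hugging Face transformers. The class name LexicalLogitBoost, the way vocabulary words are mapped to token ids, and the additive boost on the raw logits are our illustrative assumptions, not the exact implementation used in the experiments.

```python
import torch
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          LogitsProcessor, LogitsProcessorList)


class LexicalLogitBoost(LogitsProcessor):
    """Add a constant boost b to the logits of tokens derived from words
    in a user-supplied vocabulary set (illustrative sketch)."""

    def __init__(self, tokenizer, vocab_words, boost):
        token_ids = set()
        for word in vocab_words:
            # Cover mid-sentence (" word") and sentence-initial forms.
            for form in (word, " " + word, word.capitalize(), " " + word.capitalize()):
                token_ids.update(tokenizer.encode(form, add_special_tokens=False))
        self.boosted_ids = torch.tensor(sorted(token_ids))
        self.boost = boost

    def __call__(self, input_ids, scores):
        # Raise the logits of all boosted token ids before sampling.
        scores[:, self.boosted_ids.to(scores.device)] += self.boost
        return scores


tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

vocab_words = ["is", "the", "time", "good", "people"]  # e.g., the 500 most common words
prompt = "Write an example sentence using the word 'time'."
inputs = tokenizer(prompt, return_tensors="pt")

output = model.generate(
    **inputs,
    max_new_tokens=40,
    logits_processor=LogitsProcessorList(
        [LexicalLogitBoost(tokenizer, vocab_words, boost=4.0)]
    ),
)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Because the boost is additive on the logits, 𝑏 = 0 leaves the distribution unchanged, which corresponds to the control condition described above.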
Unlike the Llama models, which were prompted only with the target word and instructions, the Gemini model was supplied with the entire vocabulary list via its prompt. This approach, though effective for small vocabularies, is less scalable and potentially costly for larger vocabularies. The generated sentences were evaluated to produce the final results. Details on the prompts and hyperparameters are provided in Appendix A.
The results demonstrated that stronger boosts led to increased adherence to the vocabulary set (Figure 3). However, this came at the cost of generation quality, as observed in human evaluations (Figure 5) and in the LLM's ability to follow prompts and include target words (Figure 4). This tradeoff between control and quality aligns with findings from Liang et al. (2024). Among the tested configurations, the moderate boost value (𝑏 = 4) provided the best balance: its quality was only slightly worse than the control (𝑏 = 0) and the Gemini model, while its mean percentage of non-vocabulary words was significantly lower than that of the Gemini model.
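As an illustration of the adherence metric, the following is a small sketch of how the percentage of non-vocabulary words in a generated sentence could be computed; the simple whitespace splitting and punctuation stripping here are our assumptions rather than the exact procedure behind the reported numbers.

```python
import string


def non_vocab_percentage(sentence, vocab_set):
    """Percentage of words in `sentence` not present in `vocab_set`
    (case-insensitive, punctuation stripped) -- illustrative sketch."""
    words = [w.strip(string.punctuation).lower() for w in sentence.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    out_of_vocab = sum(1 for w in words if w not in vocab_set)
    return 100.0 * out_of_vocab / len(words)


vocab = {"the", "cat", "is", "on", "a", "mat"}
print(non_vocab_percentage("The cat is sleeping on the mat.", vocab))  # ~14.3: only "sleeping" is out of vocabulary
```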
Inter-evaluator agreement, measured using Krippendorff's 𝛼, was 0.76, 0.66, and 0.69 for the mechanics, semantics, and context metrics, respectively. Qualitatively, higher boost values (𝑏 > 4) tended to produce long, run-on sentences, likely because the vocabulary set excluded punctuation, leaving punctuation tokens unboosted. Adding punctuation to the vocabulary set could address this issue, a change made straightforward by LLB's transparency. Additionally, the LLB model struggled to generate high-quality sentences when target words were lexically complex, suggesting a limitation in handling more challenging vocabulary.
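For reference, agreement of the kind reported above can be computed with the third-party krippendorff package. The ratings matrix below is made-up illustrative data, and the choice of an ordinal level of measurement is our assumption about the rating scale, not a statement about the exact setup used in our evaluation.

```python
import numpy as np
import krippendorff  # pip install krippendorff

# Rows are evaluators, columns are rated sentences; np.nan marks missing ratings.
# These values are illustrative only, not the ratings from our evaluation.
ratings = np.array([
    [3, 4, 2, 5, np.nan, 4],
    [3, 4, 3, 5, 2,      4],
    [2, 4, 2, 4, 2,      np.nan],
], dtype=float)

alpha = krippendorff.alpha(reliability_data=ratings, level_of_measurement="ordinal")
print(f"Krippendorff's alpha: {alpha:.2f}")
```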
These were the results of our exploration of various sentence-generation methods that may have applications in the domain of language learning. We intend for this project to be a first step towards building an LLM conversation partner that actively monitors the user's language proficiency and vocabulary, and tailors its own language and behavior to challenge the language learner in a level-appropriate manner.
Future Work
- Multiple lexical complexity levels
- Adaptive 𝑏
- Investigate run-on sentence problems. Is punctuation the problem?
- For this exact application, constrained decoding may be useful
- More scalable evaluation via perplexity or LLM-as-a-judge
- Larger human eval
- Create system to monitor user’s vocabulary use (the “input half” of the adaptive-vocab chat assistant)
We extend our thanks to our project mentors Shirley Anugrah Hayati and James Mooney for the guidance they have offered throughout this project. Additionally, we thank Andrew Hale for volunteering to contribute to the human evaluation. Finally, we thank our instructors: Dongyeop Kang, Shirley Anugrah Hayati, James Mooney, and Robert Jia for equipping us with the knowledge, both theoretical and practical, that has allowed us to execute this project.