Core Concepts for Small Language Models
Foundational Computational Linguistics Concepts and how they relate to Language Models
LLMs are all the rage, and to use them effectively, engineers and users of any kind can benefit from thinking about them in the context of some core Natural Language Processing (NLP) concepts. At New Math Data, we’ve been building a lot of retrieval-augmented generation (RAG) and other models lately, so I have been thinking about how to tie key academic NLP concepts to the ideas you need to build successful RAG models. Thus came the vision for this article and its follow-up. The concepts discussed here were critical to the design of the rules-based, statistical, neural network, and deep learning models that preceded the LLMs of today. I hope this article helps you dip your toes in the ocean that is computational linguistics.
An NLP system must have the following capabilities:
- Tokenization – separating text into smaller tokens that can be analyzed
- Morphological Analysis – word formation
- Syntactic Analysis – sentence structure
- Semantic Analysis – meaning
- Pragmatic Analysis – meaning in context
- Natural language generation – creating relevant text
This article will discuss what these terms mean in an easy-to-understand way, and in another article we will discuss how we satisfy each of them when building RAG models.
Core Concepts
Tokenization
Tokenization is the process of separating text, whether a single sentence, a short note, or many, many sentences, into the individual components (tokens) that make it up, regardless of whether those components are complete or correctly spelled words.
For example, if you have the statement “I love playing videogames with my friends”, a tokenized version would be [“I”, “love”, “playing”, “videogames”, “with”, “my”, “friends”]. Sometimes we may break up larger compound words into smaller parts that are easier for the computer to understand. So, “videogames” might become “video” “games”. We do this to speed up processing of information and get the computer to recognize patterns in the words, sentences, and phrases.
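To make this concrete, here is a minimal Python sketch of both steps: splitting the sentence into word tokens, then breaking a compound word into smaller parts. The `KNOWN_PIECES` vocabulary and the `split_compound` helper are made up for illustration; production tokenizers learn their subword vocabularies from data rather than using a hand-written set.

```python
import re

# Hypothetical mini-vocabulary of known word pieces, for illustration only.
KNOWN_PIECES = {"video", "games", "play", "friends"}

def tokenize(text):
    """Split text into word tokens, dropping whitespace and punctuation."""
    return re.findall(r"[A-Za-z0-9']+", text)

def split_compound(token, pieces=KNOWN_PIECES):
    """Split a token into two known pieces if possible, else return it unchanged."""
    lower = token.lower()
    for i in range(1, len(lower)):
        head, tail = lower[:i], lower[i:]
        if head in pieces and tail in pieces:
            return [head, tail]
    return [token]

sentence = "I love playing videogames with my friends"
tokens = tokenize(sentence)
subword_tokens = [piece for tok in tokens for piece in split_compound(tok)]

print(tokens)
print(subword_tokens)
```

The first list matches the tokenized sentence above, and in the second list “videogames” has become “video” and “games” while every other token is left alone.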
Morphological Analysis
Morphological analysis is like looking at a word as if it’s made of different LEGO bricks, and seeing how those parts fit together to make meaning.
For example, let’s look at the word “playing.”
- “Play” is one LEGO piece – it tells us what action we’re doing.
- “-ing” is another LEGO piece – it shows us that the action is happening right now.
When we put these LEGO bricks together, we get the full word “playing”, which means doing the action of play right now. So, morphological analysis is like taking apart words to see the little pieces inside. By understanding each piece, we know what the whole word means!
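The LEGO idea can be sketched in a few lines of Python. This is a deliberately naive suffix stripper, not how a real morphological analyzer works: the `SUFFIXES` table and the minimum stem length of three letters are assumptions chosen only so the example behaves sensibly on our sentence.

```python
# Hypothetical suffix table for illustration; real analyzers use far richer rules.
SUFFIXES = {
    "ing": "action happening right now",
    "ed": "action that already happened",
    "s": "more than one",
}

def analyze(word):
    """Return (stem, suffix, suffix_meaning); suffix is None when no rule applies."""
    lower = word.lower()
    # Try longer suffixes first, and keep at least three letters in the stem.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if lower.endswith(suffix) and len(lower) - len(suffix) >= 3:
            return lower[: -len(suffix)], suffix, SUFFIXES[suffix]
    return lower, None, None

for word in ["playing", "friends", "love"]:
    print(word, "->", analyze(word))
```

Here “playing” comes apart into “play” and “ing”, “friends” into “friend” and “s”, and “love” stays in one piece. A rule this simple will get plenty of real words wrong, which is exactly why morphological analysis is a field of its own.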
Syntactic Analysis
Syntactic analysis is building sentences using the puzzle pieces of who is doing what in the sentence, who they are doing it with, etc. Sentence diagrams are a tool you might have learned in middle school that visually show how the words in a sentence fit together. While educators might debate the usefulness of sentence diagramming for children, if we can write algorithms to diagram sentences for machines, we can graphically show how the words relate to each other, making it easier to understand the meaning.
For example, in a diagram of “I love playing videogames with my friends”, “I” is the subject, “love” is the action, and “playing videogames” is the object of my affection (direct object). The prepositional phrase “with my friends” provides some details about the way in which I like playing videogames. If I changed that to “for my friends” or “about my friends” the sentence would have a different meaning. The linkage to “playing videogames” shows that connection.
Sentence diagrams help a machine understand meaning because they visually map out the structure of a sentence, making it clear how each word relates to the others. They show the role of the words (subject, verb, direct object, etc.), the relationship between those words (how they are connected), and the hierarchy (main ideas vs. modifiers), and they help the computer start to recognize patterns by giving it a consistent approach for interpreting sentences.
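One way to picture what a machine-readable sentence diagram might look like is a small tree in Python. Note that the tree below is written by hand for our example sentence; producing it automatically is the job of a parser, which is out of scope here. The `Node` class and the role labels are assumptions made for this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    role: str                     # grammatical role, e.g. "subject"
    text: str                     # the words that fill that role
    children: list = field(default_factory=list)

# Hand-built diagram of the example sentence (a real parser would produce this).
tree = Node("verb", "love", [
    Node("subject", "I"),
    Node("direct object", "playing videogames", [
        Node("prepositional phrase", "with my friends"),
    ]),
])

def render(node, depth=0):
    """Return the tree as indented lines, one per node, to show the hierarchy."""
    lines = [f"{'  ' * depth}{node.role}: {node.text}"]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines

def find_role(node, role):
    """Return the first node with the given role, or None if there is none."""
    if node.role == role:
        return node
    for child in node.children:
        found = find_role(child, role)
        if found is not None:
            return found
    return None

print("\n".join(render(tree)))
```

The indentation in the printed output plays the part of the lines in a hand-drawn diagram: “with my friends” sits under “playing videogames” because that is the phrase it modifies.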
Semantic Analysis
Semantic analysis is about understanding each word in a sequence of words, putting those words together, and validating that it makes sense.
1. Understanding Each Word: Semantic analysis looks at each word in the sentence and often uses a reference dictionary to understand what it means:
- “I” means the person talking.
- “Love” means really, really liking something.
- “Playing” means doing something fun or engaging.
- “Videogames” are electronic games that you play on a screen.
- “With” means together with someone else.
- “My friends” means the people that the speaker cares about and likes to spend time with.
2. Putting the Words Together: Next, semantic analysis sees how these words work together:
- “I” am the person who loves something.
- “Playing videogames” is the thing that I love.
- “With my friends” explains that I’m not playing alone; I’m playing together with friends.
3. Making Sure It Makes Sense: Finally, the computer or algorithm checks if the sentence makes sense. It sees that a person (I) can love an activity (playing videogames) and that it’s possible to play with friends, so the sentence is meaningful and sounds natural.
Semantic analysis is like seeing the forest for the trees: putting all the pieces together and seeing the big picture of what the sentence is saying to determine if it’s meaningful and logical.
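The three steps above can be sketched as a toy Python check. Everything here is an assumption made for illustration: the tiny `LEXICON` stands in for the reference dictionary, and the type sets are a made-up rule saying that only a person can love something and only a person can be played with.

```python
# Hypothetical mini-lexicon: each phrase has a meaning and a coarse semantic type.
LEXICON = {
    "i": ("the person talking", "person"),
    "playing videogames": ("playing electronic games on a screen", "activity"),
    "my friends": ("people the speaker likes to spend time with", "person"),
    "the table": ("a piece of furniture", "object"),
}

# Hypothetical rules for the verb "love" and for "with <companion>".
LOVE_SUBJECT_TYPES = {"person"}
LOVE_OBJECT_TYPES = {"person", "activity", "object"}
COMPANION_TYPES = {"person"}

def semantic_type(phrase):
    """Step 1: look the phrase up; return its type, or None if it is unknown."""
    entry = LEXICON.get(phrase.lower())
    return entry[1] if entry else None

def makes_sense(subject, obj, companion=None):
    """Steps 2 and 3: combine the pieces and check that the result is plausible."""
    if semantic_type(subject) not in LOVE_SUBJECT_TYPES:
        return False
    if semantic_type(obj) not in LOVE_OBJECT_TYPES:
        return False
    if companion is not None and semantic_type(companion) not in COMPANION_TYPES:
        return False
    return True

print(makes_sense("I", "playing videogames", "my friends"))
print(makes_sense("the table", "playing videogames", "my friends"))
```

Our example sentence passes the check, while a sentence in which “the table” loves playing videogames does not, because a table is not something that can love.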
Pragmatic Analysis
Pragmatic analysis is the process of layering context and hidden meaning into the understanding of words. This part had, until recently, been a challenge for large language models and sometimes it still is.
“I love playing videogames with my friends” could have different meanings depending on the context in which it is said. If you are a child who doesn’t want to leave a friend’s house, it is a plea to stay. If you are playing videogames alone, it is a wish that your friends were there with you. And if you are grounded, it is your case for why you should be released to play videogames with your friends.
This part can be tricky for humans and machines alike because context about the person and/or the specific situation can influence the pragmatic analysis.
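As a very rough illustration, the idea that the same words map to different intents can be written as a lookup keyed on both the utterance and its context. The context labels and intent descriptions below are invented for this sketch, and real systems infer context rather than receiving it as a tidy label.

```python
UTTERANCE = "I love playing videogames with my friends"

# Hypothetical (utterance, context) -> intent table based on the situations above.
INTENTS = {
    (UTTERANCE, "asked to leave a friend's house"): "request to stay longer",
    (UTTERANCE, "playing alone"): "wish that friends were here",
    (UTTERANCE, "grounded"): "plea to be allowed to go play",
}

def interpret(utterance, context):
    """Return the likely intent in this context, or a literal reading if unknown."""
    return INTENTS.get((utterance, context), "literal statement of preference")

for context in ["asked to leave a friend's house", "playing alone", "grounded", "no context"]:
    print(f"{context}: {interpret(UTTERANCE, context)}")
```

The words never change between the four lines of output; only the context does, and with it the meaning.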
Natural Language Generation
Natural language generation (NLG) has three core steps:
- Understanding the question you are asking (called Natural Language Understanding or NLU)
- Finding the answer
- Turning that information into words humans can understand
The first two steps involve a lot of iterating on the components discussed above: tokenization, morphological analysis, syntactic analysis, semantic analysis, and pragmatic analysis. Once these pieces have been satisfied to a reasonable degree, natural language generation is relatively easy because the machine can respond in a formulaic way. Getting the right data and deriving the right meaning from that data is the tricky part.
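Here is a small Python sketch of those three steps with a formulaic, template-based response at the end. It is only a toy under stated assumptions: `understand` recognizes a single question pattern (“What do you <verb>?”), and the `FACTS` dictionary is a hypothetical stand-in for whatever data source holds the answer.

```python
import re

# Hypothetical fact store standing in for the data the system can look up.
FACTS = {"love": "playing videogames with my friends"}

def understand(question):
    """Step 1 (NLU): extract the verb from 'What do you <verb>?', else None."""
    match = re.fullmatch(r"what do you (\w+)\?", question.strip().lower())
    return match.group(1) if match else None

def find_answer(verb):
    """Step 2: look up the answer for the verb; None if nothing is known."""
    return FACTS.get(verb)

def generate(verb, answer):
    """Step 3: turn the pieces into a sentence using a fixed template."""
    if verb is None or answer is None:
        return "Sorry, I don't know."
    return f"I {verb} {answer}."

def respond(question):
    verb = understand(question)
    return generate(verb, find_answer(verb))

print(respond("What do you love?"))
print(respond("Why is the sky blue?"))
```

Notice how short `generate` is compared with the work that has to happen before it. That mirrors the point above: once the question is understood and the answer is found, producing the words is the easy part.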
Hope this helps you understand the core concepts in an NLP system! Stay tuned for our next article on how we apply these concepts when designing retrieval-augmented generation (RAG) models!