How Does Natural Language Processing (NLP) Work?
In general, NLP tries to break natural language data down into its most basic parts and then examine the relationships between those parts to extract meaning. It can be used in a variety of ways, and the NLP model you construct should reflect the type of problem you are trying to solve.
If, for example, you are trying to understand how customers feel about your product or service based on their social media comments, you can build a relatively simple text classification model by comparing text and emoticons against the ratings they correspond to, even if commenters never directly use obvious watchwords such as “like,” “dislike,” or “hate.”
To do this, we would train our NLP model to match words, phrases and other bits of text with their corresponding star ratings across a large review dataset (containing millions of ratings). From this, the model can infer things like a “:/” typically meaning someone is less than satisfied, since the reviews it appears in average 2.2 stars.
Other, potentially less obvious signals can also emerge from the data, such as slang, character swearing (for example, “#$@@”), and the usage frequency of exclamation points and question marks.
Statistics in hand, the NLP model can now automatically assign a “sentiment” to pure text input. Actions can then be developed in response, such as alerting a customer service agent to directly address a negative comment, or simply measuring consumer feedback about a new policy, product, or service on social media.
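Here is a minimal sketch of that idea in Python using scikit-learn (our choice here; the article names no specific library). The tiny dataset below is a toy stand-in for the millions of real review-and-rating pairs described above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for millions of real reviews; star ratings act as sentiment labels
reviews = [
    "Love it! Works perfectly :)",
    "Absolutely fantastic, would buy again!",
    "It's fine, nothing special :/",
    "Arrived broken. #$@@ waste of money",
    "Terrible support, never again",
]
stars = [5, 5, 2, 1, 1]

# A permissive token pattern keeps emoticons and symbol-swearing as features
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"\S+"),
    LogisticRegression(max_iter=1000),
)
model.fit(reviews, stars)

# With real training data, ":/" would pull this toward a low-star prediction
print(model.predict(["meh, it's ok :/"]))
```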
If you use Gmail, you’ve been seeing this in action for quite some time now. It filters out spam and auto-sorts emails into Primary, Social, Promotions, and Updates based on language patterns it has identified with each category.
Speech Queries
If, however, you want to build a system capable of recognizing and responding to speech, you’ve got a few more steps ahead of you.
Remember breaking sentences down into subject, object, verb, indirect object, etc. in elementary school? At least a little? Then you’ve done this type of work before.
We’ll walk through a quick example with the following sentence: “Andrew is on a flight to Bali, an island in Indonesia.”
Tokenization:
The sentence is first split into individual tokens: “Andrew,” “is,” “on,” “a,” “flight,” “to,” “Bali,” “,” “an,” “island,” “in,” “Indonesia,” “.”
Punctuation also becomes part of our token set because it affects meaning.
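Here’s how that looks with an open source library like spaCy (mentioned again in the parsing step below). This sketch assumes the small English model, en_core_web_sm, has been downloaded.

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Andrew is on a flight to Bali, an island in Indonesia.")

# Punctuation gets its own tokens, just like the list above
print([token.text for token in doc])
# ['Andrew', 'is', 'on', 'a', 'flight', 'to', 'Bali', ',',
#  'an', 'island', 'in', 'Indonesia', '.']
```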
Parts of Speech Prediction:
Then we look at each word or token and try to determine whether it is a noun, pronoun, adjective, verb, etc. This is done by running the sentence through a pre-trained “parts-of-speech” classification model that has already statistically examined the relationships within millions of English sentences.
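In spaCy, each token carries the tag that its pre-trained model predicted:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Andrew is on a flight to Bali, an island in Indonesia.")

# Print each token with its predicted part-of-speech tag
for token in doc:
    print(token.text, token.pos_)
# e.g. Andrew PROPN, is AUX, on ADP, a DET, flight NOUN, to ADP, Bali PROPN, ...
```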
Lemmatization:
This step examines each word to find its base form, understanding that “person” is the singular form of “people,” and that “is,” “was,” “were,” and “am” are all forms of “to be.”
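With spaCy, the lemma is exposed directly on each token:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("People were saying he is planning on living there.")

# Each token is reduced to its base (dictionary) form
print([(token.text, token.lemma_) for token in doc])
# e.g. ('were', 'be'), ('saying', 'say'), ('is', 'be'), ('living', 'live'), ...
```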
Stop Word Removal:
Articles like “the,” “an,” and “a” are often removed because their high frequency can cause relational confusion without adding much meaning. However, each model’s list of stop words has to be carefully crafted, as there is no standard set that works across all applications.
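As a sketch, spaCy ships a default English stop-word list that you can filter against (and, per the caveat above, would usually tune for your application):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Andrew is on a flight to Bali, an island in Indonesia.")

# Drop tokens on spaCy's default stop-word list, plus punctuation
content = [token.text for token in doc if not token.is_stop and not token.is_punct]
print(content)
# ['Andrew', 'flight', 'Bali', 'island', 'Indonesia']
```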
Dependency Parsing:
In this phase, a syntax structure is derived that allows the AI to understand sentence attributes such as subject, object, etc.
This allows the AI to understand that, while John and ball are both nouns, in the phrase “John hit the ball out of the park,” John has done the action of hitting. Open source parsers like spaCy can be used to define the properties of each word and build syntax trees.
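With spaCy, every token points to its syntactic head, which is how the parse tree is built:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("John hit the ball out of the park.")

# Each token's dependency label and its head in the parse tree
for token in doc:
    print(f"{token.text:>6} --{token.dep_}--> {token.head.text}")
# e.g. John --nsubj--> hit   (John is the subject doing the hitting)
#      ball --dobj--> hit    (ball is the direct object)
```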
Named Entity Recognition:
The goal of this phase is to extract nouns from the text, including people, brand names, acronyms, places, dates, etc.
A good NLP model can differentiate between noun types, such as June the person versus June the month, by statistical inference from surrounding words, like the presence of the preposition “in.”
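spaCy exposes detected entities with their predicted types (exact labels can vary by model version):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Andrew is on a flight to Bali, an island in Indonesia.")

# Each detected entity comes with a predicted type label
print([(ent.text, ent.label_) for ent in doc.ents])
# e.g. [('Andrew', 'PERSON'), ('Bali', 'GPE'), ('Indonesia', 'GPE')]
```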
Coreference Resolution:
Coreference resolution tracks pronouns across sentences and links them back to the entities they refer to. It is, many argue, one of the most difficult steps in NLP programming.
At this stage, we have our parts of speech mapped as subjects, objects, verbs and more. But our pipeline so far examines only one sentence at a time, so pronouns still need to be matched to their entities across sentences. For example, in:
“Andrew is on a flight to Bali, an island in Indonesia. He is planning on living there. It has a warm climate.”
Without coreference resolution, our model would understand that “he” and “it” are the subjects of sentences two and three, but it would not be able to connect those pronouns to “Andrew” and “Bali,” respectively.
Because of this complexity, a more detailed explanation of coreference resolution is beyond the scope of this article, but you can read about it in some recent research from Stanford University.
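For a quick taste of what resolved clusters look like, here is a sketch using neuralcoref, a now-archived Hugging Face extension that works with spaCy 2.x only; treat it as illustrative rather than a current recommendation.

```python
import spacy
import neuralcoref  # archived extension; compatible with spaCy 2.x only

nlp = spacy.load("en_core_web_sm")
neuralcoref.add_to_pipe(nlp)

doc = nlp("Andrew is on a flight to Bali, an island in Indonesia. "
          "He is planning on living there. It has a warm climate.")

# Clusters group each pronoun with the entity it refers to
print(doc._.coref_clusters)
# e.g. [Andrew: [Andrew, He], Bali: [Bali, there, It]]
```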