Intent Classification for Natural Language Queries

EE 541: A Computational Introduction to Deep Learning — Final Project

Introduction

Intent classification determines what a user wants to accomplish from their text input. This is essential for conversational systems, virtual assistants, and task-oriented dialogue systems. The challenge lies in understanding user intent despite variations in phrasing, vocabulary, and sentence structure.

Dataset

The ATIS (Airline Travel Information System) dataset contains natural language queries about flight travel and related services. It includes approximately 5,800 training utterances and 900 test utterances across 26 intent classes.

Intent Classes Include:

  • flight - queries about flight schedules and availability
  • airfare - questions about ticket prices
  • ground_service - ground transportation inquiries
  • airline - airline-specific questions
  • abbreviation - requests to expand abbreviations
  • aircraft - questions about aircraft types
  • flight_time - departure/arrival time queries
  • quantity - counting queries
  • city - city information
  • meal - in-flight meal inquiries
  • Plus 16 additional classes covering various travel-related intents

Queries typically range from 5 to 50 tokens and use natural conversational language.

Dataset Access: https://github.com/howl-anderson/ATIS_dataset

The dataset is provided in standard train/test splits with intent labels. Some versions also include slot labels for named entity recognition, but this project focuses on intent classification.

Problem Statement

Build a deep learning system that classifies natural language queries into one of the 26 intent categories. This is a multi-class text classification problem where the input is a variable-length sequence of words and the output is a single intent label.

Alternative Problem Formulations

Joint Intent and Slot Classification: Some versions of ATIS include slot labels marking entities like cities, dates, and times within queries. Train a model that simultaneously predicts both the overall intent and tags each word with its slot type. This multi-task learning approach can improve intent classification by leveraging entity information.

Confidence-Based Rejection: Not all queries fit cleanly into one intent—some are ambiguous or malformed. Build a model that outputs confidence scores and rejects low-confidence predictions. This requires calibrating model outputs and establishing rejection thresholds based on validation data.
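
As a minimal sketch of the rejection step, assume a trained PyTorch classifier whose forward pass returns raw logits; the function name and default threshold below are illustrative, and the threshold itself should be tuned on validation data:

    import torch
    import torch.nn.functional as F

    def predict_with_rejection(logits: torch.Tensor, threshold: float = 0.7):
        """Return class predictions, with -1 marking rejected low-confidence queries.

        logits: (batch, num_classes) raw outputs from a trained model.
        threshold: minimum top-1 softmax probability to accept a prediction.
        """
        probs = F.softmax(logits, dim=-1)           # convert logits to class probabilities
        confidence, prediction = probs.max(dim=-1)  # top-1 probability and its class index
        prediction[confidence < threshold] = -1     # reject uncertain predictions
        return prediction, confidence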

Cross-Domain Transfer: Train on ATIS (travel domain) and test generalization to queries from a different domain (e.g., restaurant booking, smart home commands). This evaluates whether learned representations capture general language understanding beyond domain-specific vocabulary.

Text Representation and Tokenization

Natural language text must be converted to numerical representations for neural network processing.

Tokenization

Tokenization splits text into discrete units called tokens. The choice of tokenization strategy affects vocabulary size, sequence length, and handling of unknown words.

Word-Level Tokenization: Each word becomes a token: “Show me flights” → [“Show”, “me”, “flights”]. This is intuitive but creates large vocabularies. Words not seen during training become unknown tokens, losing information.

Character-Level Tokenization: Each character becomes a token: “Show” → [“S”, “h”, “o”, “w”]. This eliminates out-of-vocabulary problems entirely since any word can be represented. Sequences become much longer (5-10× more tokens per sentence) and the model must learn to compose characters into meaningful words.

Subword Tokenization: Algorithms like Byte-Pair Encoding (BPE) split words into frequently-occurring pieces: “flights” → [“flight”, “s”]. This balances vocabulary size and sequence length. Rare words decompose into subword units, so the model can generalize to unseen words through familiar components.

For ATIS, either word-level or subword tokenization is a reasonable choice given the domain-specific vocabulary. Build a vocabulary from the training data and decide how to handle unknown words at test time (mapping them to a special <UNK> token or decomposing them into subwords).
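
As a concrete sketch of the word-level option, assume a vocab dictionary mapping words to indices has already been built from the training set (vocabulary construction itself is sketched under Suggested Approach below):

    def encode(query: str, vocab: dict) -> list:
        """Map a query to token indices, sending out-of-vocabulary words to <UNK>."""
        unk_id = vocab["<UNK>"]
        return [vocab.get(word, unk_id) for word in query.lower().split()]

    # e.g. encode("show me flights to boston", vocab) -> [34, 7, 5, 12, 1]  (indices illustrative)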

Embeddings

Once the text is tokenized, each token is mapped to a continuous vector representation. An embedding layer converts discrete token indices to dense vectors:

\[ \mathbf{e}_i = \mathbf{W}_E[\text{token}_i] \]

where \(\mathbf{W}_E \in \mathbb{R}^{V \times D}\) is the embedding matrix, \(V\) is vocabulary size, and \(D\) is embedding dimension.
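
In PyTorch this lookup is a single nn.Embedding layer; the sizes below are illustrative, not prescriptions:

    import torch
    import torch.nn as nn

    V, D = 8000, 100                                # illustrative vocabulary and embedding sizes
    embedding = nn.Embedding(V, D, padding_idx=0)   # rows are W_E; index 0 reserved for <PAD>

    token_ids = torch.tensor([[4, 17, 256, 0, 0]])  # one padded query as token indices
    vectors = embedding(token_ids)                  # shape (1, 5, D): one vector per token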

Learned Embeddings: Initialize embedding vectors randomly and learn them during training through backpropagation. The model learns to place semantically similar words (like “flight” and “plane”) closer in embedding space based on their role in predicting intents.

Pre-Trained Embeddings: Word vectors like GloVe or word2vec are trained on large text corpora (Wikipedia, web crawl) and capture semantic relationships. These can be used as initialization for the embedding layer, potentially speeding up training and improving generalization when training data is limited.
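
A hedged sketch of the initialization step: it assumes a local copy of a GloVe text file (each line is a word followed by its vector components), and the file path and dimension in the usage line are placeholders:

    import numpy as np
    import torch
    import torch.nn as nn

    def glove_embedding(path, vocab, dim):
        """Embedding layer initialized from GloVe vectors where available.

        Words absent from the GloVe file keep a small random initialization.
        """
        weights = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype("float32")
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, *values = line.rstrip().split(" ")
                if word in vocab and len(values) == dim:
                    weights[vocab[word]] = np.asarray(values, dtype="float32")
        emb = nn.Embedding(len(vocab), dim)
        emb.weight.data.copy_(torch.from_numpy(weights))
        return emb  # freeze with emb.weight.requires_grad_(False) if desired

    # e.g. embedding = glove_embedding("glove.6B.100d.txt", vocab, 100)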

Suggested Approach

Text Preprocessing: Decide on case normalization (lowercase everything or preserve case for entities like airport codes). Consider how to handle punctuation—it may carry meaning (“What?” vs. “What.”) or be noise. Numerical entities (dates, times, flight numbers) can be normalized to special tokens or left as-is.
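
One possible set of preprocessing choices is sketched below; whether each helps is an empirical question (lowercasing, for instance, discards case cues in airport codes):

    import re

    def normalize(query: str) -> str:
        """Lowercase, collapse digit strings to a <NUM> token, drop stray punctuation."""
        query = query.lower()
        query = re.sub(r"\d+", "<NUM>", query)    # dates, times, flight numbers
        query = re.sub(r"[^\w<>\s]", " ", query)  # remove punctuation, keep the <NUM> marker
        return re.sub(r"\s+", " ", query).strip()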

Vocabulary Construction: Build a vocabulary from training data by counting token frequencies. Set a maximum vocabulary size or minimum frequency threshold to limit rare words. Reserve special tokens for padding (<PAD>), unknown words (<UNK>), and optionally sentence boundaries.
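
A minimal sketch of this step; the min_freq default is an assumption to tune:

    from collections import Counter

    def build_vocab(training_queries, min_freq=2):
        """Index words seen at least min_freq times; ids 0 and 1 are reserved specials."""
        counts = Counter(w for q in training_queries for w in q.lower().split())
        vocab = {"<PAD>": 0, "<UNK>": 1}
        for word, freq in counts.most_common():
            if freq >= min_freq:
                vocab[word] = len(vocab)
        return vocab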

Data Augmentation: Text augmentation preserves intent while varying surface form. Synonym replacement swaps words with similar meanings. Random word deletion or insertion adds noise without changing intent. Back-translation (translate to another language and back) generates paraphrases.
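
As a sketch of the simplest of these, random word deletion (the deletion probability is an assumption and should stay small so key words usually survive):

    import random

    def random_deletion(tokens, p=0.1):
        """Drop each token with probability p, always keeping at least one token."""
        kept = [t for t in tokens if random.random() > p]
        return kept if kept else [random.choice(tokens)]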

Sequence Handling: Queries have variable length but many architectures expect fixed-size inputs. Pad shorter sequences to a maximum length with a special padding token. During training, ignore padding positions when computing loss.
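
A sketch of batch padding that also returns a mask marking real tokens; pad_id and max_len are illustrative defaults:

    import torch

    def pad_batch(sequences, pad_id=0, max_len=50):
        """Truncate/pad lists of token ids to max_len; mask is True at non-pad positions."""
        ids = torch.full((len(sequences), max_len), pad_id, dtype=torch.long)
        for i, seq in enumerate(sequences):
            seq = seq[:max_len]
            ids[i, : len(seq)] = torch.tensor(seq, dtype=torch.long)
        return ids, ids.ne(pad_id)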

Evaluation: Classification accuracy measures overall performance on this multi-class problem. Per-class precision, recall, and F1-scores identify which intents are handled poorly. Confusion matrices reveal which intent pairs are confused with each other and which overlap semantically.
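
scikit-learn provides all three directly; the labels below are dummy values standing in for real model predictions:

    from sklearn.metrics import classification_report, confusion_matrix

    y_true = [0, 0, 1, 2, 1]  # in practice: gold intent labels on the test set
    y_pred = [0, 1, 1, 2, 1]  # in practice: the model's predicted labels
    print(classification_report(y_true, y_pred, zero_division=0))  # per-class P/R/F1
    print(confusion_matrix(y_true, y_pred))  # entry (i, j): true class i predicted as j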

Dataset Considerations

Class Imbalance: Some intents appear much more frequently than others in ATIS. A model can achieve decent accuracy by always predicting common classes. Monitor per-class performance to ensure all intents are learned.
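
One standard mitigation, sketched here with made-up class counts, is inverse-frequency weighting of the cross-entropy loss:

    import torch
    import torch.nn as nn

    counts = torch.tensor([3700., 400., 150., 50., 20.])  # illustrative per-intent counts
    weights = counts.sum() / (len(counts) * counts)       # inverse-frequency class weights
    criterion = nn.CrossEntropyLoss(weight=weights)       # rare intents cost more to miss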

Intent Ambiguity: Some queries could match multiple intents depending on interpretation. A query about “flight prices” could be flight or airfare. These cases represent true ambiguity in the data rather than model failure.

Domain Vocabulary: ATIS contains specialized terms (airport codes, airline names, cities) alongside general language. How your model handles this domain-specific vocabulary versus common words affects performance.

Semantic Similarity: Many intents are related (flight vs. flight_time vs. flight_no). Distinguishing these requires attention to specific keywords and context rather than just general topic.

Technical Notes

Computational Requirements: Text classification is generally less demanding than image or audio tasks. Pre-computing embeddings and caching tokenized sequences speeds up training. Typical embedding dimensions (50-300) and sequence lengths (<100 tokens) are manageable without dedicated GPUs.

Sequence Length: Maximum query length in ATIS is around 50 tokens, but most queries are much shorter. Padding all sequences to 50 wastes computation on short inputs. Experiment with different maximum lengths to balance coverage and efficiency.
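
A quick way to choose the cutoff is a length percentile over the training queries; the two utterances below are illustrative stand-ins for the full training set:

    import numpy as np

    training_queries = ["show me flights from denver to boston", "what is fare code h"]
    lengths = np.array([len(q.split()) for q in training_queries])
    max_len = int(np.percentile(lengths, 99))  # covers 99% of queries; truncate the rest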

Vocabulary Size: Larger vocabularies increase the embedding matrix size and memory usage. ATIS's domain vocabulary is small by NLP standards (on the order of a thousand distinct words), so even a modest vocabulary cap covers nearly all tokens while keeping unknowns manageable.

Expected Outcomes

Your analysis should identify which intents are most confused and why. Examine whether errors stem from true ambiguity in the data or model limitations. Investigate how vocabulary size, embedding dimension, and sequence length affect performance. Compare learned embeddings to pre-trained embeddings to understand the value of transfer learning for this specialized domain. Analyze failure cases to determine whether they involve rare words, unusual phrasing, or semantic overlap between intent classes.