Preprocessing
- Remove any stop words, such as “a”, “is”, “are”, “the”, …
- Stemming:
  - Strips the last few characters from a word (e.g. transporting → transport, caring → car)
  - Fast, so it is often used for large data sets
- Lemmatisation:
  - Considers the context of a word and converts it to its base form (“lemma”; e.g. caring → care)
  - Computationally expensive
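The three preprocessing steps can be sketched in a few lines of plain Python. This is only an illustration: the stop-word list, the suffix rules in `SUFFIXES`, and the `LEMMAS` lookup table are all hypothetical stand-ins, much cruder than a real stemmer or a context-aware lemmatiser.

```python
# Tiny illustrative stop-word list (real lists are much longer).
STOP_WORDS = {"a", "is", "are", "the"}

# Hypothetical suffix rules; a real stemmer uses a far richer rule set.
SUFFIXES = ("ing", "ed", "s")

# Hypothetical lookup table standing in for a real lemmatiser,
# which would also take the word's context into account.
LEMMAS = {"caring": "care", "transporting": "transport"}


def remove_stop_words(tokens):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def lemmatise(word):
    """Look the word up in the table; unknown words are returned unchanged."""
    return LEMMAS.get(word, word)


tokens = "the nurse is caring".split()
content = remove_stop_words(tokens)
stems = [naive_stem(t) for t in content]
lemmas = [lemmatise(t) for t in content]
```

Here the stemmer turns "caring" into "car" because it only cuts characters, while the lookup-based lemmatiser returns "care". That difference is the trade-off between speed and accuracy noted above.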
Feature Extraction
- All unique words together make up the alphabet (the vocabulary)
- One-hot encode the text, as we did with strings
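A minimal sketch of word-level one-hot encoding, assuming whitespace tokenisation and lower-casing. The helper names `build_vocabulary` and `one_hot` are made up for this example.

```python
def build_vocabulary(documents):
    """Map every unique word to an index (sorted for a stable order)."""
    words = sorted({w for doc in documents for w in doc.lower().split()})
    return {w: i for i, w in enumerate(words)}


def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


docs = ["hello world", "hello there"]
vocab = build_vocabulary(docs)
encoded = [one_hot(w, vocab) for w in "hello world".split()]
```

Each word becomes a vector as long as the vocabulary, so the representation grows quickly with the number of unique words.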
Popular features to extract from text:
- term frequency
- inverse document frequency
- continuous bag of words
- skip-grams / n-grams
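Several of these features can be computed directly from a token list. The sketch below assumes whitespace tokenisation and uses hypothetical helper names. The skip-gram and continuous-bag-of-words helpers only build the (input, target) training pairs in the word2vec sense; they do not train a model.

```python
from collections import Counter


def term_frequency(tokens):
    """Relative frequency of each term within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {t: c / total for t, c in counts.items()}


def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skipgram_pairs(tokens, window):
    """(centre, context) pairs: the centre word is used to predict each neighbour."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((centre, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


def cbow_pairs(tokens, window):
    """(context, centre) pairs: the neighbours are used to predict the centre word."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        context = [tokens[j] for j in range(lo, hi) if j != i]
        pairs.append((context, centre))
    return pairs


tokens = "the cat sat on the mat".split()
tf = term_frequency(tokens)
bigrams = ngrams(tokens, 2)
sg = skipgram_pairs(tokens, window=1)
cbow = cbow_pairs(tokens, window=1)
```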
Popular methods to vectorise text:
- TF-IDF
- BM25
- word2vec
- LLMs
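As an illustration of the simplest of these methods, here is a sketch of TF-IDF vectorisation in plain Python. It assumes whitespace tokenisation, non-empty documents, and the unsmoothed variant idf(t) = log(N / df(t)). Libraries typically use smoothed or normalised variants, so their numbers will differ. The function name `tf_idf_vectors` is hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(documents):
    """Return the vocabulary and one TF-IDF vector per document."""
    tokenised = [doc.lower().split() for doc in documents]
    vocab = sorted({t for doc in tokenised for t in doc})
    n_docs = len(tokenised)

    # Document frequency: in how many documents does each term occur?
    df = Counter(t for doc in tokenised for t in set(doc))

    # Unsmoothed inverse document frequency (an assumption of this sketch).
    idf = {t: math.log(n_docs / df[t]) for t in vocab}

    vectors = []
    for doc in tokenised:
        counts = Counter(doc)
        # Term frequency (relative count) times inverse document frequency.
        vectors.append([counts[t] / len(doc) * idf[t] for t in vocab])
    return vocab, vectors


docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab, vectors = tf_idf_vectors(docs)
```

A term that occurs in every document, such as "the" here, gets an IDF of zero and therefore contributes nothing to any vector, while a rare term such as "dog" receives the largest weight.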
Graph Representation
One possible graph representation of the string “Hello World” is: