data-analysis

Preprocessing

  • Remove any stop words, such as “a”, “is”, “are”, “the”, …
  • Stemming:
    • Truncates a word by removing its last few characters (e.g. transporting → transport, caring → car)
    • Fast, so it is commonly used for large data sets
  • Lemmatisation:
    • Considers the context of a word and converts it to its base form (“lemma”; e.g. caring → care)
    • More computationally expensive than stemming
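
The three preprocessing steps can be sketched in plain Python. This is a toy illustration, not a production pipeline: the stop-word set, the suffix rules, and the `LEMMAS` lookup table are all assumptions made up for this example, and a real lemmatiser would use a dictionary together with part-of-speech context.

```python
# Toy stop-word list and suffix rules; both are illustrative assumptions.
STOP_WORDS = {"a", "is", "are", "the"}
SUFFIXES = ("ing", "ed", "s")

# Hypothetical lemma lookup standing in for a real, context-aware lemmatiser.
LEMMAS = {"caring": "care", "transporting": "transport"}


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def lemmatise(word):
    """Look the word up in the lemma table, falling back to the word itself."""
    return LEMMAS.get(word, word)


tokens = "the nurse is caring and transporting a patient".split()
filtered = remove_stop_words(tokens)
stems = [stem(t) for t in filtered]
lemmas = [lemmatise(t) for t in filtered]
```

The stemmer turns “caring” into “car” because it only looks at characters, while the lookup-based lemmatiser returns “care”. That difference is the trade-off between speed and accuracy described above.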

Feature Extraction

Popular features to extract from text:

  • term frequency
  • inverse document frequency
  • continuous bag of words
  • skip-grams and n-grams
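
As a rough sketch of how some of these features can be computed from a tokenised text, the helpers below derive term frequencies, n-grams, and skip-gram pairs. The function names and the `window` parameter are assumptions chosen for this example. The skip-gram pairs shown here are the (centre, context) training pairs used by word2vec-style models, not the learned embeddings themselves.

```python
from collections import Counter


def term_frequency(tokens):
    """Relative frequency of each token within one document."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


def ngrams(tokens, n):
    """All contiguous runs of n tokens."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


def skip_gram_pairs(tokens, window):
    """(centre, context) pairs for each context word within `window` positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        pairs.extend((centre, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs


tokens = "to be or not to be".split()
tf = term_frequency(tokens)
bigrams = ngrams(tokens, 2)
pairs = skip_gram_pairs(["a", "b", "c"], window=1)
```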

Popular methods to vectorise text:

  • TF-IDF
  • BM25
  • word2vec
  • LLMs
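
To illustrate the first of these methods, here is a minimal TF-IDF sketch in plain Python. It assumes the simplest common variant (relative term frequency multiplied by the unsmoothed logarithm of the inverse document frequency). Libraries typically apply smoothing and normalisation on top of this, so their numbers will differ. The function name `tf_idf` and the sample documents are made up for this example.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per tokenised document."""
    n_docs = len(documents)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


docs = [
    "the cat sat".split(),
    "the dog sat".split(),
    "the cat ran".split(),
]
vectors = tf_idf(docs)
```

A term that occurs in every document, such as “the”, receives a weight of zero, while a term unique to one document, such as “dog”, receives the highest weight in that document.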

Graph Representation

One possible graph representation of string “Hello World” is: