Introduction to Natural Language Processing

and how you can use it as part of your research in social science

Roman Jurowetzki / AAU - Social Data Science PhD Course, 28 Nov. 2019

Exploration of text elements


  • Delineate some elements of interest

    These can be specific terms, labels, #hashtags.

  • Observe their (co-)occurrence

    For instance, you can track them over time.

  • Sounds easy?

    Yes and no. Dictionary-based approaches are easy; entity extraction is not (a minimal sketch of the dictionary-based idea follows below).
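
A minimal sketch of the dictionary-based idea, assuming pandas; the documents, dates, and term list are invented toy data, not course material:

```python
# Dictionary-based exploration: count a small set of terms/hashtags per
# document and track them over time. All data below is a toy example.
import pandas as pd

docs = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-05", "2019-01-20", "2019-02-11"]),
    "text": [
        "Climate change dominates the debate #climate",
        "New paper on migration and climate",
        "Budget talks continue, no mention of the environment",
    ],
})

terms = ["climate", "migration", "#climate"]  # our 'dictionary'

# Case-insensitive occurrence counts per document (note: 'climate' also
# matches inside '#climate'; refine the patterns if that matters)
for term in terms:
    docs[term] = docs["text"].str.lower().str.count(term.lower())

# Aggregate per month to observe the terms over time
monthly = docs.set_index("date")[terms].resample("M").sum()
print(monthly)
```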

Unsupervised methods in NLP

  • Identifying themes in a corpus

    Topic modelling belongs to this category. These approaches exploit the structure of the data: words co-occur within texts, and comparable texts form a corpus. They allow us to assign a number of themes to each text in a corpus without explicitly defining these themes upfront (see the first sketch after this list).

  • Building upon semantic similarity

    We can focus on discourses as such, or we can bring the level of analysis down to the individual text. Vector representations of our texts allow us to calculate similarity, and from there we can derive a lot: we can construct text networks, observe how content spreads over time, and more (see the second sketch after this list).

  • Combining unsupervised ML with exploration

    Computational techniques are excellent at finding patterns. Unsupervised NLP methods can be used to support qualitative exploration.
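
As a first sketch, for the theme-identification bullet above: a small topic model, assuming scikit-learn's LDA implementation; the corpus, the number of topics, and all parameters are illustrative choices, not course code.

```python
# Topic modelling sketch: learn latent themes from a toy corpus with LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the government announced a new climate policy",
    "researchers published a study on climate change",
    "the football team won the championship final",
    "fans celebrated the championship in the streets",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(corpus)            # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)               # topic shares per document

# Inspect the top words of each learned topic
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {top}")
```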
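As a second sketch, for the semantic-similarity bullet: TF-IDF vectors and cosine similarity as one simple text representation, again assuming scikit-learn; the similarity threshold used to define network edges is an arbitrary illustrative value.

```python
# Text-similarity sketch: vectorise documents, compute pairwise cosine
# similarity, and treat sufficiently similar pairs as edges of a text network.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the government announced a new climate policy",
    "a new climate policy was announced by the ministry",
    "the football team won the championship final",
]

tfidf = TfidfVectorizer(stop_words="english")
vectors = tfidf.fit_transform(corpus)

sim = cosine_similarity(vectors)   # pairwise document similarity matrix
print(sim.round(2))

# Pairs above an (arbitrary) threshold become edges of a text network
edges = [(i, j)
         for i in range(len(corpus))
         for j in range(i + 1, len(corpus))
         if sim[i, j] > 0.3]
print(edges)
```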

Supervised ML and NLP

Supervised ML gives you the ability to use text within (traditional) quantitative pipelines. Here, the NLP-heavy part is finding an appropriate representation of the text data that can be fed to an SML algorithm (a minimal pipeline sketch follows the list below).

  • Sentiment analysis

    actually, a regression problem
  • Find labels

    in descriptions
  • Identify outliers

    e.g. odd or extraordinary observations
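
A minimal sketch of the supervised setup described above, assuming scikit-learn: a TF-IDF representation of the text feeding a standard classifier. The texts and labels are invented placeholders; sentiment analysis, for instance, could equally be framed as regression on a continuous score.

```python
# Supervised NLP sketch: represent texts numerically, then fit any SML model.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "this policy is a great success",
    "an excellent and promising initiative",
    "a complete failure of governance",
    "the worst decision in years",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative (toy labels)

model = make_pipeline(
    TfidfVectorizer(),      # text -> numeric representation (the NLP-heavy part)
    LogisticRegression(),   # any SML algorithm could sit here
)
model.fit(texts, labels)

print(model.predict(["a promising policy success"]))
```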

A brief history of NLP

and some random references

[Figure: a brief history of NLP, from the talk linked below]

source: Yoav Goldberg: https://youtu.be/e12danHhlic?t=428

From traditional NLP to Neural Networks

  • While NLP before 2013-2014 was very much focused on computational linguistics, the introduction of word embeddings (or word vectors) such as Word2Vec (Mikolov et al. 2013) triggered a revolution in the field and led to a retreat of grammar/rule-based approaches.

  • This is, for instance, reflected in the course design at top institutions, where NLP courses have been merged with deep learning courses and many linguistic approaches have vanished from the curriculum.

  • In this course we will not cover embeddings in depth, but we will present them briefly in the outlook, alongside deep learning and neural networks.

Today’s curriculum covers the most central techniques in a broad fashion to give you an overview of the different concepts and approaches in this discipline. The goal is to provide modular and extensible recipes that you can use in many research contexts.

Resources

Aside from Datacamp