Introduction to Natural Language Processing

and how you can use it as part of your research in social science

Roman Jurowetzki / AAU - Social Data Science PhD Course, 28 Nov. 2019

Exploration of text elements


  • Delineate some elements of interest

    These can be specific terms, labels, #hashtags.

  • Observe their (co-)occurrence

    For instance, you can track them over time.

  • Sounds easy?

    Yes and no. Dictionary-based approaches are easy; entity extraction is not (a minimal sketch of the dictionary-based idea follows below).
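
A minimal sketch of the dictionary-based idea, assuming pandas; the documents, dates, and term list are invented toy data, not course material:

```python
# Dictionary-based exploration: count a small set of terms/hashtags per
# document and track them over time. All data below is a toy example.
import pandas as pd

docs = pd.DataFrame({
    "date": pd.to_datetime(["2019-01-05", "2019-01-20", "2019-02-11"]),
    "text": [
        "Climate change dominates the debate #climate",
        "New paper on migration and climate",
        "Budget talks continue, no mention of the environment",
    ],
})

terms = ["climate", "migration", "#climate"]  # our 'dictionary'

# Case-insensitive occurrence counts per document (note: 'climate' also
# matches inside '#climate'; refine the patterns if that matters)
for term in terms:
    docs[term] = docs["text"].str.lower().str.count(term.lower())

# Aggregate per month to observe the terms over time
monthly = docs.set_index("date")[terms].resample("M").sum()
print(monthly)
```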

Unsupervised methods in NLP

  • Identifying themes in a corpus

    Topic modelling belongs to this category. These approaches exploit the structure of the data: words co-occur within texts, and comparable texts form a corpus. They allow us to assign a number of themes to each text in a corpus without explicitly defining these themes upfront (see the first sketch after this list).

  • Building upon semantic similarity

    We can focus on discourses as such, or we can bring the level of analysis down to the individual text. Vector representations of our texts allow us to calculate similarity, and from there we can derive a lot: we can construct text networks, observe how content spreads over time, and more (see the second sketch after this list).

  • Combining unsupervised ML with exploration

    Computational techniques are excellent at finding patterns. Unsupervised NLP methods can be used to support qualitative exploration.
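
As a first sketch, for the theme-identification bullet above: a small topic model, assuming scikit-learn's LDA implementation; the corpus, the number of topics, and all parameters are illustrative choices, not course code.

```python
# Topic modelling sketch: learn latent themes from a toy corpus with LDA.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the government announced a new climate policy",
    "researchers published a study on climate change",
    "the football team won the championship final",
    "fans celebrated the championship in the streets",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(corpus)            # document-term matrix

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(dtm)               # topic shares per document

# Inspect the top words of each learned topic
words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-5:][::-1]]
    print(f"Topic {i}: {top}")
```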
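As a second sketch, for the semantic-similarity bullet: TF-IDF vectors and cosine similarity as one simple text representation, again assuming scikit-learn; the similarity threshold used to define network edges is an arbitrary illustrative value.

```python
# Text-similarity sketch: vectorise documents, compute pairwise cosine
# similarity, and treat sufficiently similar pairs as edges of a text network.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "the government announced a new climate policy",
    "a new climate policy was announced by the ministry",
    "the football team won the championship final",
]

tfidf = TfidfVectorizer(stop_words="english")
vectors = tfidf.fit_transform(corpus)

sim = cosine_similarity(vectors)   # pairwise document similarity matrix
print(sim.round(2))

# Pairs above an (arbitrary) threshold become edges of a text network
edges = [(i, j)
         for i in range(len(corpus))
         for j in range(i + 1, len(corpus))
         if sim[i, j] > 0.3]
print(edges)
```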

Supervised ML and NLP

Supervised ML gives you the ability to use text within (traditional) quantitative pipelines. Here, the NLP-heavy part is finding an appropriate representation of the text data that can be fed to an SML algorithm (a minimal pipeline sketch follows the list below).

  • Sentiment analysis

    actually, a regression problem
  • Find labels

    in descriptions
  • Identify outliers

    e.g. odd or extraordinary observations
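
A minimal sketch of the supervised setup described above, assuming scikit-learn: a TF-IDF representation of the text feeding a standard classifier. The texts and labels are invented placeholders; sentiment analysis, for instance, could equally be framed as regression on a continuous score.

```python
# Supervised NLP sketch: represent texts numerically, then fit any SML model.
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "this policy is a great success",
    "an excellent and promising initiative",
    "a complete failure of governance",
    "the worst decision in years",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative (toy labels)

model = make_pipeline(
    TfidfVectorizer(),      # text -> numeric representation (the NLP-heavy part)
    LogisticRegression(),   # any SML algorithm could sit here
)
model.fit(texts, labels)

print(model.predict(["a promising policy success"]))
```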

A brief history of NLP

and some random references

[Figure: a brief history of NLP, from the talk linked below]

source: Yoav Goldberg: https://youtu.be/e12danHhlic?t=428

From traditional NLP to Neural Networks

  • While NLP before 2013-2014 was very much focused on computational linguistics, the introduction of word embeddings (or word vectors) such as Word2Vec (Mikolov et al. 2013) triggered a revolution in the field and led to a retreat of grammar/rule-based approaches.

  • This is, for instance, reflected in the course design at top institutions, where NLP courses have been merged with deep learning courses and many linguistic approaches have vanished from the curriculum.

  • In this course we will not cover embeddings in depth, but we will present them briefly in the outlook, alongside deep learning and neural networks.

Today’s curriculum covers the most central techniques in a broad fashion to give you an overview of the different concepts and approaches in this discipline. The goal is to provide modular and extensible recipes that you can use in many research contexts.

Resources

Aside from Datacamp