Repeatable NLP of news headlines using Airflow, Newspaper3k, Quilt T4, and Vega

Rob Newman · Published in Quilt · Feb 27, 2019 · 7 min read

Online newspapers like The New York Times, The Guardian, and CNN are invaluable linguistic sources for sentiment analysis. However, systematically acquiring newspaper article data to build a text corpus for calculating sentiment can be difficult. Even with access to databases and tools, it can be tedious to construct a daily newspaper article corpus in a manageable format.

This article will illustrate how a Python-based stack of Apache Airflow, newspaper3k, Quilt T4, and Vega can be used to execute fail-safe daily extract-transform-load (ETL) of article keywords, deposit the scraped data into version control, and visualize the corpus for a series of online news sources. (Code is available on GitHub as a template for your own projects.)

Tools

ETL using Apache Airflow

Apache Airflow is a popular data engineering tool for collecting and migrating data from location to location and format to format. It has native operators for a wide variety of languages and platforms. Developed at Airbnb in 2014 (and open-sourced from the start), its generic data toolbox allows for rapid plugin development for a wide variety of technology stacks. It uses Directed Acyclic Graph objects (DAGs) that offer granular control (in the form of atomic Tasks) over the flow of data from origin, through transformation(s), to destination. From the Airflow docs:

In Airflow, a DAG – or a Directed Acyclic Graph – is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.

DAGs describe how to process your workflow, but not what your workflow actually does. They ensure that what they do happens at the right time, or in the right order, or with the right handling of any unexpected issues.

(Further discussion of foundational Airflow concepts is beyond scope for this article, although sample code will illustrate the core components.)

Scraping newspaper articles

newspaper3k is a Python library that simplifies article scraping and curation, similar to the requests library for HTTP requests. When installing newspaper3k, you can optionally install NLTK (one of the leading libraries for working with human language data) and add multiple corpora for various types of analysis.

newspaper3k uses Natural Language Processing (NLP) to extract keywords from an article. Keywords are the most salient words in a text or corpus and provide an entry point for understanding what a certain article, or collection of articles, is about. newspaper3k identifies keywords based on word frequency in the corpus. For example, a recent article from The Guardian returns the following keywords:

'discriminate', 'rights', 'hair', 'york', 'hairstyle', 'policies', 'commission', 'school', 'black', 'ban', 'hairstyles', 'rules'
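
For a single article, the extraction workflow looks roughly like the following minimal sketch (the URL is purely illustrative):

from newspaper import Article

# Illustrative URL -- substitute any article page
url = 'https://www.theguardian.com/us-news/2019/feb/example-article'

article = Article(url)
article.download()   # fetch the raw HTML
article.parse()      # extract the title, authors, and body text
article.nlp()        # keyword and summary extraction (requires the NLTK corpora)

print(article.keywords)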

Installing packages

I currently use pyenv for managing Python versions on my local machine, but you might use pipenv. Ensure that you install Airflow against Python 3.6.* (as of writing, Airflow is not compatible with Python 3.7.*). We also install Quilt T4 for managing our data packages:

$ pyenv local 3.6.8
$ python -m venv ~/venv/etl-airflow-s3
$ source ~/venv/etl-airflow-s3/bin/activate
$ pip install apache-airflow
$ pip install t4

newspaper3k utilizes lxml for speed, so you need to install several system packages first. On macOS, using Homebrew:

$ brew install libxml2 libxslt
$ brew install libtiff libjpeg webp little-cms2
$ pip install newspaper3k

Installation on other platforms will be different — follow directions for your preferred package manager.

Finally, install the recommended NLP corpora:

$ curl https://raw.githubusercontent.com/codelucas/newspaper/master/download_corpora.py | python

Get Airflowing

Configuration

Post-install, you'll want to set the environment variable AIRFLOW_HOME, which defaults to ~/airflow:

export AIRFLOW_HOME=~/airflow

You may want to configure AIRFLOW_HOME/airflow.cfg manually. Specifically, if you are storing your DAGs in a local repository (as part of a larger version-controlled data engineering infrastructure) rather than globally in AIRFLOW_HOME/dags, you'll need to update the entry in airflow.cfg to reflect the new DAG folder location:

[core]
# The home folder for airflow, default is ~/airflow
airflow_home = /Users/<username>/airflow
# The folder where your airflow pipelines live, most likely a
# subfolder in a code repository
# This path must be absolute
dags_folder = /Users/<username>/<path-to-repo>/dags

Initialization, scheduling, and webserver (DAG UI)

By default, Airflow uses SQLite as its metadata database, which isn't recommended for production environments but is suitable for this demo. You may also want to de-clutter the UI by hiding the ~15 example DAGs (set load_examples = False in airflow.cfg).

# Initialize SQLite database (default)
$ airflow initdb
# Start up the scheduler daemon (continuously runs)
$ airflow scheduler
# Start up the webserver UI daemon (continuously runs)
$ airflow webserver

Daily ETL of newspaper article keywords

The DAG we'll build scrapes online news sources and generates keywords for each article (task1); saves the keywords to local JSON files (task2); and adds the files to a Quilt data package and uploads them to S3 (task3). DAGs allow you to specify dependencies between tasks by setting related operators as upstream (or downstream) of one another. You link these operations explicitly using set_upstream() and set_downstream() or, as of Airflow 1.8, using the Python bitshift operators:

task1 >> task2 >> task3

If task1 fails, task2 and task3 won’t execute. If task2 fails, task3 won’t execute. All failures are logged and accessible in the Airflow UI.

Before building our tasks we initialize our DAG with some default values:
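
Here is a minimal sketch of that initialization; the dag_id, owner, start date, and retry settings are illustrative placeholders (the full version is in the Gist linked further down):

from datetime import datetime, timedelta

from airflow import DAG

# Illustrative defaults -- adjust owner, start_date, and retries for your setup
default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2019, 2, 20),
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'newspaper_keywords_etl',    # hypothetical dag_id
    default_args=default_args,
    schedule_interval='@daily',  # run the ETL once per day
)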

task1: Scraping the headlines

We’re going to start by scraping three different online newspapers — The Guardian, The New York Times, and CNN — although the code is extensible to any number of sources. Additionally, we’ll define a category of article to scrape (politics) in our task definition. We use newspaper3k’s methods to build() a newspaper object; loop over the articles; then download, parse and perform NLP on each article text to generate a list of keywords:
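
A sketch of the task1 callable is below; the source URLs, dictionary layout, and function name are assumptions for illustration:

import newspaper

# Illustrative source list -- extensible to any number of outlets
SOURCES = {
    'guardian': 'https://www.theguardian.com/us-news/us-politics',
    'nytimes': 'https://www.nytimes.com/section/politics',
    'cnn': 'https://www.cnn.com/politics',
}

def scrape_keywords(**context):
    """task1: build each source, then download, parse, and run NLP per article."""
    all_keywords = {}
    for source, url in SOURCES.items():
        paper = newspaper.build(url, memoize_articles=False)
        keywords = []
        for article in paper.articles:
            try:
                article.download()
                article.parse()
                article.nlp()
                keywords.extend(article.keywords)
            except Exception:
                continue  # skip articles that fail to download or parse
        all_keywords[source] = keywords
    # The return value is pushed to XCom for the downstream save task
    return all_keywords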

task2: Saving scraped keywords

Our second task takes the returned keywords from the first task (using Airflow’s Xcom for operator cross-communication) and saves to local per-source JSON files:
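
A sketch of the task2 callable, assuming the hypothetical task_id and output directory shown here:

import json
import os

OUTPUT_DIR = '/tmp/newspaper_keywords'  # illustrative local path

def save_keywords(**context):
    """task2: pull task1's keywords from XCom and write one JSON file per source."""
    keywords_by_source = context['ti'].xcom_pull(task_ids='scrape_keywords')
    os.makedirs(OUTPUT_DIR, exist_ok=True)
    for source, keywords in keywords_by_source.items():
        path = os.path.join(OUTPUT_DIR, '{}.json'.format(source))
        with open(path, 'w') as fh:
            json.dump({'source': source, 'keywords': keywords}, fh)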

task3: Save keywords to Quilt data package & store on S3

To compare political article keywords over time we snapshot the keywords JSON files into a Quilt data package. We then use Quilt T4 to upload the data package to an S3 bucket:
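
A sketch of the task3 callable follows. The bucket, package handle, and exact push() arguments are assumptions (they mirror the current Quilt package API), so check the T4 docs for the release you install:

import t4

S3_BUCKET = 's3://my-example-bucket'       # hypothetical bucket
PACKAGE_NAME = 'robnewman/news-keywords'   # hypothetical package handle

def push_to_quilt(**context):
    """task3: snapshot the JSON files into a Quilt T4 package and push to S3."""
    pkg = t4.Package()
    for source in ('guardian', 'nytimes', 'cnn'):
        local_path = '/tmp/newspaper_keywords/{}.json'.format(source)
        pkg.set('{}.json'.format(source), local_path)  # add each file to the package
    # push() uploads the package contents and registers a new package version
    pkg.push(PACKAGE_NAME, registry=S3_BUCKET)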

The final DAG consists of three PythonOperator tasks that are executed daily. Each task should do a single thing, which keeps the DAG atomic (an indivisible and irreducible series of operations: if one task fails, the DAG fails). View the full DAG as a Gist.
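
As a sketch, the wiring looks like this (assuming the dag object and the three callables from the sketches above):

from airflow.operators.python_operator import PythonOperator

task1 = PythonOperator(task_id='scrape_keywords', python_callable=scrape_keywords,
                       provide_context=True, dag=dag)
task2 = PythonOperator(task_id='save_keywords', python_callable=save_keywords,
                       provide_context=True, dag=dag)
task3 = PythonOperator(task_id='push_to_quilt', python_callable=push_to_quilt,
                       provide_context=True, dag=dag)

task1 >> task2 >> task3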

Check your DAGs

Once you have minted a new DAG, check that it loads correctly into Airflow with the following command:

$ airflow list_dags

If successful, you will see your DAG returned in the output. You can also check that Airflow can process each individual task inside your DAG:

$ airflow list_tasks <dag_id>

Finally, you can test your DAG tasks end-to-end directly from the command line:

$ airflow test <dag_id> <task_id> <execution_date>
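
For example, with the hypothetical IDs from the sketches above, a single task run for a given date would look like:

$ airflow test newspaper_keywords_etl scrape_keywords 2019-02-22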

Visualization of sentiment

We now have a daily update of political article keywords across our online news sources: a proxy for sentiment. Sentiment analysis classifies the polarity of a given text, i.e. whether the expressed opinion is positive, negative, or neutral. Using Quilt T4's native Vega support, we can create interactive visualizations for our data homepage.

An informative way of quickly visualizing sentiment is a word cloud. Vega's Word Cloud allows customization of all components, including color ranges (aligned to the color schemes of our sources) and word size scaled by keyword frequency. A quick review suggests that CNN's political sentiment is diverse but not deep, with many single keyword occurrences:

Vega Word Cloud generated from CNN political article keywords (scraped February 22, 2019)

Political sentiment from The New York Times is much more tightly constrained (fewer keywords at higher frequencies):

Word Cloud generated from The New York Times political article keywords (scraped February 22, 2019)

This could be due to (1) a lower total number of political articles, and/or (2) highly focused articles on specific topics. Different visualizations could be used to determine which hypothesis is correct.

Political sentiment from The Guardian is focused on the impending Brexit and its implications for national politics:

Vega Word Cloud generated from The Guardian political article keywords (scraped February 22, 2019)

Diffing dashboards

A nice feature of T4’s web-based data package UI is functionality to compare visualizations over time by selecting a visualization and using the drop-down menu to view previous versions. Below you can see how word clouds generated from The New York Times articles change over time.

Vega Word Clouds generated from The New York Times political article keywords: scraped February 22, 2019 (left) and February 23, 2019 (right)

Depending on your data sources' lifespan and your DAG frequency, you can readily see how your data changes over time on your package homepage.

Conclusion

We built an Apache Airflow DAG to scrape political article keywords from multiple online sources, created data snapshots and uploaded them to an S3 bucket using Quilt T4, and built simple qualitative visualizations using Vega's declarative grammar.

Quilt T4 provides a simple UI to compare data over time (as defined by DAG frequency). All the libraries used in this article are open source — if you find them useful please consider contributing to them!

I would be really interested to learn about other, more quantitative ways of visualizing data over time, especially parent-child relationships between keywords (such as treemaps). What visualizations have you found to be most effective?
