Suppose you would like to construct a data set of short definitions and facts drawn from a large text corpus such as the set of all article abstracts found in the English-language edition of Wikipedia. You would also like the workflow you assemble to be reusable and flexible, allowing you to update the data set and/or enhance its capabilities quickly. What Python features and libraries, both built-in and from the broader community, can you use to accomplish this?
While built-in string methods have limited flexibility and regular expressions have limited expressive power, both can still be leveraged in creative ways to implement scalable workflows that process and analyze text data. This article explores these tools and introduces a few useful peripheral packages and techniques within the context of the use case described above.
Archives containing all abstracts found in the English-language edition of Wikipedia are updated regularly and are available at https://dumps.wikimedia.org/enwiki/latest/ (a link reachable from the Wikimedia dumps landing page). You decide to use the consolidated archive file that contains all of the abstracts. How can you retrieve it in an automated manner? One way to retrieve a large file that is available online, while also getting user-friendly feedback about progress, is to use the requests library together with the tqdm library.
The tqdm library makes it possible to display an animated progress bar while an instance of an iterable data structure emits each of its elements. The example below illustrates the output. The position and leave parameters ensure that the progress bar is not interrupted by newline characters if the output is being displayed within Jupyter Notebook.
import tqdm
import time
for i in tqdm.tqdm(range(3), position=0, leave=True):
    time.sleep(1)
The requests library makes it possible to create an iterable stream of the data from a file that is located at a specified URL by setting the stream parameter to True. You can then retrieve the data in chunks and write them to a file while displaying progress to the user. In the example below, the total parameter supplied to tqdm.tqdm is the number of 1 MB (i.e., $1024^2$ byte) chunks of data in the file; it is determined by dividing the file size indicated in the HTTP response header by the chosen chunk_size of 1 MB.
import requests
from math import ceil
url = "https://python.supply/"
url += "strings-regular-expressions-and-text-data-analysis/"
url += "enwiki-20191120-abstract-sample.xml.gz"
with requests.get(url, stream=True) as response:
    file_size = int(response.headers['Content-length'])
    chunk_size = 1024*1024
    with open("abstracts.xml.gz", "wb") as file:
        chunks = response.iter_content(chunk_size=chunk_size)
        for chunk in tqdm.tqdm(
            chunks, total=ceil(file_size/chunk_size),
            position=0, leave=True
        ):
            file.write(chunk)
You can use the built-in gzip library to inspect the compressed archive without extracting it; the snippet below counts the number of lines it contains.
import gzip
with gzip.open("abstracts.xml.gz") as file:
    print(sum(1 for line in file))
Given that the data set is in XML format, the most general approach to parsing it is to use an XML parser such as lxml. However, in this case the text content in the file follows a consistent format: the text for each abstract appears on its own line in its entirety, surrounded by the two element delimiters <abstract> and </abstract>. Thus, you can iterate over every line in the file and determine whether or not you want to extract it. You divide this workflow into two steps: extracting the lines, and then extracting the abstracts themselves.
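For reference, a parser-based version of this step might look like the sketch below. It is not part of the workflow that follows; it assumes the lxml package is installed and that each abstract is stored in an element tagged abstract, as is the case in these dump files.
import gzip
from lxml import etree

# A minimal sketch of the parser-based alternative: stream the XML
# and yield the text of each <abstract> element as it is parsed.
def abstracts_via_lxml():
    with gzip.open("abstracts.xml.gz", "rb") as file:
        for _, element in etree.iterparse(file, tag="abstract"):
            yield element.text or ""
            element.clear()  # Free the memory used by the parsed element.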
First, you decide to implement a generator that iterates over all the lines in the file and yields them one at a time. In case the text data on a particular line has an unknown encoding, the chardet library is used to check each line. This is not likely to be an issue in this particular data set, but allowing only those lines that are valid ASCII or UTF-8 can help you avoid any unexpected issues later in the process. Before yielding it, you convert each line from binary data into a string using the decode method of the binary data type bytes and then remove any whitespace at the beginning or end of the string using strip.
from chardet import detect
def lines():
    with gzip.open("abstracts.xml.gz") as file:
        for line in file:
            if detect(line)["encoding"] in ['ascii', 'utf-8']:
                yield line.decode().strip()
How do you extract the actual text of an abstract from each line of the form <abstract>...</abstract>? You can take advantage of the startswith and endswith built-in string methods to identify these lines, and then you can use string slice indices to extract the raw text of each abstract. The function below defines a generator that yields every abstract that appears on any line in the compressed file.
def abstracts():
    for line in lines():
        if line.startswith("<abstract>") and \
           line.endswith("</abstract>"):
            line = line[len("<abstract>"):-len("</abstract>")]
            yield line
You can use the islice function from the built-in itertools module to iterate over only a portion of the abstracts. The line below calculates the maximum length of the first 100 abstracts.
from itertools import islice
max(len(a) for a in islice(abstracts(), 0, 100))
Regular expressions are mathematical constructs that can be used to describe sets of strings that follow a particular pattern or format. The history and conceptual details of regular expressions are far beyond the scope of this article, but you only need to understand a few basic building blocks to begin using the built-in re library to identify when a string adheres to a specific pattern.
In Python, regular expressions are represented as strings. In the example below, the regular expression "abc" is satisfied only by those strings that begin with the exact sequence of characters abc. To check whether a string satisfies a regular expression, it is sufficient to compile the regular expression string using compile and then to use the match method to perform the check.
import re
re.compile("abc").match("abcd")
To ensure that only an entire string (as opposed to only a portion of the string) can satisfy a regular expression, it is possible to add ^ to the beginning of the regular expression and $ to the end.
print(re.compile("^abc$").match("abc"))
print(re.compile("^abc$").match("abcd"))
The syntax for regular expression strings consists of a large collection of special characters and character sequences. For the purposes of the motivating example, the syntactic constructs below are sufficient (a short demonstration follows the list):
- [a-z] is satisfied by any single lowercase letter;
- \s is satisfied by any single whitespace character and \. is satisfied by a period;
- for any regular expression r, the set of strings that satisfy r also satisfy (r) and vice versa;
- for any regular expression r, the set of strings that satisfy r* consists of all strings that are concatenations of zero or more individual strings that each independently satisfy r (e.g., "abc", "aaaa", and "" all satisfy [a-z]*);
- for any regular expressions r and s, the regular expression r|s is satisfied by any string that satisfies either r, s, or both;
- for any regular expressions r and s, the regular expression rs is satisfied by any string consisting of a portion that satisfies r followed immediately by a portion that satisfies s (e.g., "xy" satisfies [a-z][a-z], while "x" and "xyz" do not).
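As a quick illustration (not part of the main workflow; the strings being matched are arbitrary examples), each of these constructs can be checked directly with the re library:
import re
# Illustrative checks of the constructs listed above.
print(re.compile(r"^[a-z]*$").match("abc") is not None)        # True: letters only
print(re.compile(r"^[a-z]*$").match("") is not None)           # True: "" also satisfies [a-z]*
print(re.compile(r"^(([a-z])|(\s))$").match(" ") is not None)  # True: alternation with \s
print(re.compile(r"^[a-z][a-z]\.$").match("xy.") is not None)  # True: concatenation followed by \.
print(re.compile(r"^[a-z][a-z]$").match("xyz") is not None)    # False: three letters do not satisfy [a-z][a-z]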
To make it easier to experiment quickly with larger and more complex regular expressions while reusing common building blocks, you can define functions that take smaller expressions and build larger ones. Because regular expressions are represented using Python strings, you can use string methods such as join to accomplish this. The functions below also use argument unpacking syntax to make code that invokes them more concise.
def atom(s):
    # Wrap an expression in parentheses.
    return "(" + s + ")"

def options(*rs):
    # Given expressions r1, r2, ..., rN, build
    # the expression (r1)|(r2)|...|(rN).
    return atom("|".join([atom(r) for r in rs]))

def adjacent(*rs):
    # Given expressions r1, r2, ..., rN, build
    # the expression (r1)(r2)...(rN).
    return atom("".join([atom(r) for r in rs]))

def spaced(*rs):
    # Given expressions r1, r2, ..., rN, build
    # the expression (r1)\s(r2)\s...\s(rN).
    return atom(r"\s".join([atom(r) for r in rs]))

def repeated(r):
    # Given the expression r0, build ((r0)*).
    return atom(atom(r) + "*")

def repeated_spaced(r):
    # Given the expression r0, build (((r0)(\s))*(r0)).
    return adjacent(repeated(adjacent(r, r"\s")), r)
The definitions below are fairly self-explanatory and use the above functions to build more complex regular expressions. These serve as building blocks for a regular expression that matches a sentence of the form you are trying to identify in the data.
article = options("A", "a", "An", "an", "The", "the")
word = atom("[a-z]+")
verb = options("are", "is")
period = atom(r"\.")
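As a quick, illustrative spot check (the test strings are arbitrary and not drawn from the data set), you can verify that these building blocks behave as expected:
# Illustrative spot checks; each pattern is anchored so that only an
# entire string can satisfy it.
print(re.compile("^" + article + "$").match("The") is not None)   # True
print(re.compile("^" + word + "$").match("planet") is not None)   # True
print(re.compile("^" + verb + "$").match("is") is not None)       # True
print(re.compile("^" + period + "$").match(".") is not None)      # True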
The definition below is for a regular expression that is satisfied by any string that (1) begins with a definite or indefinite article, (2) has at least two words following the article, (3) has a linking verb followed by one or more additional words, and (4) ends with a period.
sentence = adjacent(
    spaced(article, word, repeated_spaced(word), verb, repeated_spaced(word)),
    period
)
You can see the actual string that the function invocations above construct. It should be apparent that assembling (or modifying) a string such as the one below manually in its raw form is likely to be an error-prone process. Abstractions such as those above are an effective way to manage complexity in regular expressions.
sentence
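Before applying the pattern to the data, you can try it on an illustrative sentence (the example strings below are made up rather than drawn from the data set):
# An illustrative check of the assembled expression on made-up sentences.
print(re.compile("^" + sentence + "$").match("The red fox is an animal.") is not None)  # True
print(re.compile("^" + sentence + "$").match("Suns are hot.") is not None)              # False: no leading article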
You decide to use this regular expression to identify facts and definitions in the abstracts, storing a reusable compiled version of it as sentence_re for better performance. Before you check whether an abstract matches the regular expression sentence, you use the string method split to separate the string into a list of substrings (i.e., those substrings that are found between the instances of the . character in the string). You then use list index notation to keep only the first entry. Since the split method does not include the character used to split the string, you need to concatenate it back onto the first substring. The example below applies this workflow to a portion of the abstracts.
sentence_re = re.compile("^" + sentence + "$")
for a in islice(abstracts(), 0, 400):
    a = a.split(".")[0] + '.'  # Keep only first sentence.
    if sentence_re.match(a) is not None:
        print(a)
You are interested in learning more about the particular terms or phrases being described by the first sentence of each abstract. In particular, you would like to know the distribution of the number of words before the verb. One approach is to use the search method to find the first instance of a verb and then to inspect the contents of the Match object to determine the location of the match. Then it is possible to use the string count method on the substring (up to and including that matched verb) to determine the number of spaces (and thus, words) before the first verb.
counts = []
verb_re = re.compile(verb)
for abstract in tqdm.tqdm(
    abstracts(), position=0, leave=True, total=210777
):
    abstract = abstract.split(".")[0] + '.'
    if sentence_re.match(abstract) is not None:
        if (m := verb_re.search(abstract)) is not None:
            counts.append(abstract[:m.end()].count(" "))
You may notice in the above example the use of an assignment expression involving the := operator. Scenarios such as this one play a significant role in the rationale for this new feature (introduced in Python 3.8). The alternative syntax would have required a separate line for the statement m = verb_re.search(abstract).
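For comparison, the innermost check could be written without an assignment expression as in the sketch below, which reuses verb_re and counts from the snippet above (the abstract string here is a made-up stand-in used only for illustration):
# Equivalent logic written without the assignment expression.
abstract = "The sun is a star."  # Illustrative stand-in for a real abstract.
m = verb_re.search(abstract)
if m is not None:
    counts.append(abstract[:m.end()].count(" "))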
Now that the number of words before the verb in each abstract is stored in the list of integers counts, it is straightforward to generate a histogram using Matplotlib. The results suggest that the most common case is one involving three words before the verb (which, given that the initial definite or indefinite article is also being counted, suggests that most abstracts refer to a concept described by a two-word phrase).
import matplotlib.pyplot as plt
plt.hist(
    counts,
    bins=range(max(counts)), density=False, histtype="bar",
    color="#DD7600", edgecolor="k", alpha=0.8
)
plt.ylabel("number of abstracts")
plt.xlabel("number of words before verb")
plt.xticks(range(0, max(counts)))
plt.show()
This article reviews a use case while referencing or employing techniques drawn from a broad array of disciplines and topic areas: data science (specifically, data cleaning), ETL, parsing and data structure search and traversal, text data analysis and natural language processing, regular expressions and regular language grammars, web scraping, and a few others. Providing more detailed directions for exploring each of these further is beyond the scope of this article, but a few points that relate specifically to implementing such workflows in Python are worth noting.
The regex package is backwards-compatible with the built-in re library but offers additional features such as fuzzy matching. The long-lived Beautiful Soup library is described as a tool for parsing HTML and XML documents (typically in scenarios involving web scraping), but is actually a powerful text data processing tool in its own right. An assortment of sophisticated natural language processing tools and techniques are available in the Natural Language Toolkit, including the ability to tokenize, tag, and parse text. Finally, towards the machine learning end of the spectrum are libraries like Gensim, which provide capabilities such as topic modeling.