`<p>` tag with the CSS class "title"?
Let's look at the **arguments to `find_all()`**.
### The `name` argument
Pass in a value for `name` and you'll tell Beautiful Soup to only
consider **tags with certain names**. Text strings will be ignored, as
will tags whose names don't match.
This is the simplest usage:
```python
soup.find_all("title")
# [<title>The Dormouse's story</title>]
```
---
`find_all()` (Contd)
================
### Searching by CSS class
Keep in mind that a single tag can have multiple
values for its "class" attribute. When you search for a tag that
matches a certain CSS class, you're **matching against `any` of its CSS
classes**:
```python
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]
css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]
```
You can also search for the **exact string value** of the `class` attribute:
```python
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
```
But searching for variants of the string value **won't work**:
```python
css_soup.find_all("p", class_="strikeout body")
# []
```
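`class_` also accepts a function, which is applied to each CSS class value separately (and to `None` for tags with no class), as the official docs describe. A minimal sketch, reusing the hypothetical `css_soup` markup from above:

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

def is_strikeout(css_class):
    # Called once per class value; None means the tag has no class at all.
    return css_class is not None and css_class == "strikeout"

matches = css_soup.find_all("p", class_=is_strikeout)
# Matches, because *one* of the tag's two classes is "strikeout".
```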
---
`find_all()` (Contd)
================
If you want to search for tags that match **two or more CSS classes**, you
should use a **CSS selector**:
```python
css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]
```
In **older versions** of Beautiful Soup, which **don't have the `class_`**
shortcut, **you can use the `attrs` trick** mentioned above. Create a
dictionary whose value for "class" is the string (or regular
expression, or whatever) you want to search for:
```python
soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
---
`find_all()` (Contd)
================
### The `string` argument
With `string` you can search for **strings instead of tags**. As with
`name` and the keyword arguments, you can pass in `a string`, `a
regular expression`, `a list`, `a function`, or `the value True`.
Here are some examples:
```python
soup.find_all(string="Elsie")
# [u'Elsie']
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']
soup.find_all(string=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]
def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)
soup.find_all(string=is_the_only_string_within_a_tag)
# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
```
---
`find_all()` (Contd)
================
Although `string` is for finding strings, you can **combine it with
arguments that find tags**: Beautiful Soup will find all tags whose
`.string` matches your value for `string`. This code finds the `<a>`
tags whose `.string` is "Elsie":
```python
soup.find_all("a", string="Elsie")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
```
The `string` argument is new in Beautiful Soup 4.4.0. **In earlier
versions it was called `text`**:
```python
soup.find_all("a", text="Elsie")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
```
---
`find_all()` (Contd)
================
### The `limit` argument
`find_all()` returns **all** the tags and strings that match your
filters. This can **take a while** if the document is large. If you don't
need `all` the results, you can pass in **a number for `limit`**. This
works just like the LIMIT keyword in SQL. It tells Beautiful Soup to
**stop gathering results after it's found a certain number**.
There are three links in the "three sisters" document, but this code
only **finds the first two**:
```python
soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
---
`find_all()` (Contd)
================
### The `recursive` argument
If you call `mytag.find_all()`, Beautiful Soup will examine all the
descendants of `mytag`: its children, its children's children, and
so on. If you only want Beautiful Soup to consider **direct children**,
you can pass in `recursive=False`. See the difference here:
```python
soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
# []
```
---
`find_all()` (Contd)
================
Here's that part of the document:
```html
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...
```
The `<title>` tag is beneath the `<html>` tag, but it's not `directly`
beneath the `<html>` tag: the `<head>` tag is in the way. Beautiful Soup
finds the `<title>` tag when it's allowed to look at all descendants of
the `<html>` tag, but when `recursive=False` restricts it to the
`<html>` tag's immediate children, it finds nothing.
Beautiful Soup offers a lot of tree-searching methods (covered below),
and they mostly take the same arguments as `find_all()`: `name`,
`attrs`, `string`, `limit`, and the keyword arguments. But the
**`recursive` argument is different: `find_all()` and `find()` are
the only methods that support it**. Passing `recursive=False` into a
method like `find_parents()` wouldn't be very useful.
---
Searching the tree (Contd)
================
Calling a tag is like calling `find_all()`
--------------------------------------------
Because `find_all()` is the **most popular** method in the Beautiful
Soup search API, you can use a **shortcut** for it. If you treat the
`BeautifulSoup` object or a `Tag` object **as though it were a
function**, then it's the same as calling `find_all()` on that
object. These two lines of code are **equivalent**:
```python
soup.find_all("a")
soup("a")
```
These two lines are also **equivalent**:
```python
soup.title.find_all(string=True)
soup.title(string=True)
```
---
Searching the tree (Contd)
================
`find()`
----------
Signature: find(`name`, `attrs`, `recursive`, `string`, `**kwargs`)
The `find_all()` method scans the entire document looking for
results, but sometimes you only want to **find one result**. If you know a
document **only has one** `<body>` tag, it's a waste of time to scan the
entire document looking for more. Rather than **passing in `limit=1`**
every time you call `find_all`, you can use the `find()`
method. These two lines of code are `nearly` **equivalent**:
```python
soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]
soup.find('title')
# <title>The Dormouse's story</title>
```
---
`find()` (Contd)
================
The only difference is that `find_all()` returns a **list** containing
the single result, and `find()` just returns the **result**.
If `find_all()` can't find anything, it returns an **empty list**. If
`find()` can't find anything, it **returns `None`**:
```python
print(soup.find("nosuchtag"))
# None
```
Remember the `soup.head.title` trick from `Navigating using tag
names`? That trick works by **repeatedly calling `find()`**:
```python
soup.head.title
# <title>The Dormouse's story</title>
soup.find("head").find("title")
# <title>The Dormouse's story</title>
```
---
Searching the tree (Contd)
================
`find_parents()` and `find_parent()`
----------------------------------------
I spent **a lot of time** above covering `find_all()` and
`find()`. The Beautiful Soup API defines **ten other** methods for
searching the tree, but don't be afraid. **Five** of these methods are
basically the **same as `find_all()`**, and the **other five** are basically
the **same as `find()`**. The **only differences** are in what **parts of the
tree** they search.
First let's consider `find_parents()` and
`find_parent()`. Remember that `find_all()` and `find()` work
their way down the tree, looking at tag's descendants. These methods
**do the opposite**: they work their way **`up` the tree**, looking at a tag's
(or a string's) **parents**.
---
`find_parents()` and `find_parent()` (Contd)
================
Let's **try them** out, starting from a string
buried deep in the "three daughters" document:
```python
a_string = soup.find(string="Lacie")
a_string
# u'Lacie'
a_string.find_parents("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
a_string.find_parent("p")
# <p class="story">Once upon a time there were three little sisters; and their names were
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#  and they lived at the bottom of a well.</p>
a_string.find_parents("p", class_="title")
# []
```
---
`find_parents()` and `find_parent()` (Contd)
================
One of the three `<a>` tags is the direct parent of the string in
question, so our search finds it. One of the three `<p>` tags is an
indirect parent of the string, and our search finds that as
well. There's a `<p>` tag with the CSS class "title" `somewhere` in the
document, but it's not one of this string's parents, so we can't find
it with `find_parents()`.
You may have made the connection between `find_parent()` and
`find_parents()`, and the `.parent` and `.parents` attributes
mentioned earlier. The **connection is very strong**. These search methods
actually **use `.parents`** to iterate over all the parents, and **check
each one against the provided filter** to see if it matches.
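That connection can be sketched directly. Assuming a pared-down snippet of the "three sisters" markup and the stdlib `html.parser`, `find_parents("a")` behaves like a filtered walk over `.parents`:

```python
from bs4 import BeautifulSoup

markup = ('<p class="story"><a class="sister" href="http://example.com/lacie"'
          ' id="link2">Lacie</a></p>')
soup = BeautifulSoup(markup, "html.parser")
a_string = soup.find(string="Lacie")

# Roughly what find_parents("a") does: walk .parents, keep the matches.
manual = [parent for parent in a_string.parents if parent.name == "a"]
assert manual == list(a_string.find_parents("a"))
```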
---
Searching the tree (Contd)
================
`find_next_siblings()` and `find_next_sibling()`
----------------------------------------------------
Signature: find_next_siblings(`name`, `attrs`, `string`, `limit`, `**kwargs`)
Signature: find_next_sibling(`name`, `attrs`, `string`, `**kwargs`)
---
`find_next_siblings()` and `find_next_sibling()` (Contd)
================
These methods use `.next_siblings` to iterate over the rest of an element's siblings in the tree. The
`find_next_siblings()` method returns all the siblings that match,
and `find_next_sibling()` only returns the first one:
```python
first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
first_link.find_next_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_next_sibling("p")
# <p class="story">...</p>
```
---
Searching the tree (Contd)
================
`find_previous_siblings()` and `find_previous_sibling()`
------------------------------------------------------------
Signature: find_previous_siblings(`name`, `attrs`, `string`, `limit`, `**kwargs`)
Signature: find_previous_sibling(`name`, `attrs`, `string`, `**kwargs`)
---
`find_previous_siblings()` and `find_previous_sibling()` (Contd)
================
These methods use `.previous_siblings` to iterate over an element's
siblings that precede it in the tree. The `find_previous_siblings()`
method returns all the siblings that match, and
`find_previous_sibling()` only returns the first one:
```python
last_link = soup.find("a", id="link3")
last_link
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
last_link.find_previous_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_previous_sibling("p")
# <p class="title"><b>The Dormouse's story</b></p>
```
---
Searching the tree (Contd)
================
`find_all_next()` and `find_next()`
---------------------------------------
Signature: find_all_next(`name`, `attrs`, `string`, `limit`, `**kwargs`)
Signature: find_next(`name`, `attrs`, `string`, `**kwargs`)
These methods use `.next_elements` to
iterate over whatever tags and strings come after an element in the
document. The `find_all_next()` method returns **all matches**, and
`find_next()` only returns the **first match**:
```python
first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
first_link.find_all_next(string=True)
# [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
# u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']
first_link.find_next("p")
# <p class="story">...</p>
```
---
Searching the tree (Contd)
================
`find_all_previous()` and `find_previous()`
-----------------------------------------------
Signature: find_all_previous(`name`, `attrs`, `string`, `limit`, `**kwargs`)
Signature: find_previous(`name`, `attrs`, `string`, `**kwargs`)
These methods use `.previous_elements` to
iterate over the tags and strings that come before an element in the
document.
---
`find_all_previous()` and `find_previous()` (Contd)
================
The `find_all_previous()` method returns **all matches**, and
`find_previous()` only returns the **first match**:
```python
first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
first_link.find_all_previous("p")
# [<p class="story">Once upon a time there were three little sisters; ...</p>,
#  <p class="title"><b>The Dormouse's story</b></p>]
first_link.find_previous("title")
# <title>The Dormouse's story</title>
```
---
`find_all_previous()` and `find_previous()` (Contd)
================
The call to `find_all_previous("p")` finds the first paragraph in
the document (the one with class="title"), but it also finds the
second paragraph, the `<p>` tag that contains the `<a>` tag we started
with. This shouldn't be too surprising: we're looking at all the tags
that show up earlier in the document than the one we started with. A
`<p>` tag that contains an `<a>` tag must have shown up before the `<a>`
tag it contains.
---
Searching the tree (Contd)
================
CSS selectors
-------------
Beautiful Soup supports the most commonly-used CSS selectors. Just
pass a string into the `.select()` method of a `Tag` object or the
`BeautifulSoup` object itself.
You can find tags:
```python
soup.select("title")
# [<title>The Dormouse's story</title>]
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]
```
---
CSS selectors (Contd)
================
Find tags beneath other tags:
```python
soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("html head title")
# [<title>The Dormouse's story</title>]
```
Find tags `directly` beneath other tags:
```python
soup.select("head > title")
# [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
---
CSS selectors (Contd)
================
```python
soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("body > a")
# []
```
Find the siblings of tags:
```python
soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
---
CSS selectors (Contd)
================
Find tags by CSS class:
```python
soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
Find tags by ID:
```python
soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
---
CSS selectors (Contd)
================
Find tags that match any selector from a list of selectors:
```python
soup.select("#link1, #link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
Test for the existence of an attribute:
```python
soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
---
CSS selectors (Contd)
================
Find tags by attribute value:
```python
soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
```
---
CSS selectors (Contd)
================
Match language codes:
```python
multilingual_markup = """
 <p lang="en">Hello</p>
 <p lang="en-us">Howdy, y'all</p>
 <p lang="en-gb">Pip-pip, old fruit</p>
 <p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup)
multilingual_soup.select('p[lang|=en]')
# [<p lang="en">Hello</p>,
#  <p lang="en-us">Howdy, y'all</p>,
#  <p lang="en-gb">Pip-pip, old fruit</p>]
```
---
CSS selectors (Contd)
================
Find only the first tag that matches a selector:
```python
soup.select_one(".sister")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```
This is all a convenience for users who **know the CSS selector syntax**.
You can do all this stuff with the Beautiful Soup API. And if CSS
selectors are all you need, you might as well **use lxml directly**: it's
a lot **faster**, and it supports **more CSS selectors**. But this lets you
`combine` simple CSS selectors with the Beautiful Soup API.
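For example, a child-combinator selector and a `recursive=False` search express the same constraint. A minimal sketch, assuming a tiny hypothetical document and the stdlib `html.parser`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head><body></body></html>",
    "html.parser")

# "head > title" in CSS is the same constraint as a non-recursive find_all:
via_css = soup.select("head > title")
via_api = soup.head.find_all("title", recursive=False)
assert via_css == via_api
```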
---
Modifying the tree
================
Beautiful Soup's main strength is in **searching the parse tree**, but you
can also **modify the tree** and write your changes as a new HTML or XML
document.
Changing tag names and attributes
---------------------------------
I covered this earlier, in `Attributes`, but it bears repeating. You
can **rename a tag, change the values of its attributes, add new
attributes, and delete attributes**:
```python
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>
```
---
Modifying the tree (Contd)
================
Modifying `.string`
---------------------
If you set a tag's `.string` attribute, the tag's contents are
replaced with the string you give:
```python
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a
tag.string = "New link text."
tag
# <a href="http://example.com/">New link text.</a>
```
Be careful: **if the tag contained other tags**, they and all their
contents will be **destroyed**.
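A quick sketch of that pitfall, using hypothetical markup and the stdlib `html.parser`:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
soup.a.string = "New link text."

# The <i> tag that lived inside the <a>, and its text, are gone for good:
assert soup.find("i") is None
assert soup.a.get_text() == "New link text."
```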
---
Modifying the tree (Contd)
================
`append()`
------------
You can **add to a tag's contents** with `Tag.append()`. It works just
like calling `.append()` on a Python list:
```python
soup = BeautifulSoup("<a>Foo</a>")
soup.a.append("Bar")
soup
# <html><head></head><body><a>FooBar</a></body></html>
soup.a.contents
# [u'Foo', u'Bar']
```
---
Modifying the tree (Contd)
================
`NavigableString()` and `.new_tag()`
-------------------------------------------------
If you need to **add a string to a document**, no problem--you can pass a
Python string in to `append()`, or you can call the `NavigableString`
constructor:
```python
from bs4 import NavigableString
soup = BeautifulSoup("<b></b>")
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
tag
# <b>Hello there</b>
tag.contents
# [u'Hello', u' there']
```
---
Modifying the tree (Contd)
================
If you want to **create a comment** or some other subclass of
`NavigableString`, just call the constructor:
```python
from bs4 import Comment
new_comment = Comment("Nice to see you.")
tag.append(new_comment)
tag
# <b>Hello there<!--Nice to see you.--></b>
tag.contents
# [u'Hello', u' there', u'Nice to see you.']
```
(This is a new feature in Beautiful Soup 4.4.0.)
---
Modifying the tree (Contd)
================
What if you need to **create** a whole **new tag**? The best solution is to
call the factory method `BeautifulSoup.new_tag()`:
```python
soup = BeautifulSoup("<b></b>")
original_tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>
new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>
```
Only the first argument, the tag name, is required.
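One wrinkle: `class` is a reserved word in Python, so it can't be passed as a keyword argument to `new_tag()`. A minimal workaround is to set the attribute by subscript after creating the tag (hypothetical tag and values):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("", "html.parser")
tag = soup.new_tag("p")          # only the tag name is required
tag["class"] = "title"           # set reserved-word attributes afterwards
tag.string = "The Dormouse's story"
assert str(tag) == '<p class="title">The Dormouse\'s story</p>'
```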
---
Modifying the tree (Contd)
================
`insert()`
------------
`Tag.insert()` is just like `Tag.append()`, except the new element
doesn't necessarily go at the end of its parent's
`.contents`. It'll be **inserted at whatever numeric position you
say.** It works just like `.insert()` on a Python list:
```python
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a
tag.insert(1, "but did not endorse ")
tag
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
tag.contents
# [u'I linked to ', u'but did not endorse ', <i>example.com</i>]
```
---
Modifying the tree (Contd)
================
`insert_before()` and `insert_after()`
------------------------------------------
The `insert_before()` method **inserts** a tag or string **immediately
before** something else in the parse tree:
```python
soup = BeautifulSoup("<b>stop</b>")
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
soup.b
# <b><i>Don't</i>stop</b>
```
---
Modifying the tree (Contd)
================
The `insert_after()` method moves a tag or string so that it
**immediately follows** something else in the parse tree:
```python
soup.b.i.insert_after(soup.new_string(" ever "))
soup.b
# <b><i>Don't</i> ever stop</b>
soup.b.contents
# [<i>Don't</i>, u' ever ', u'stop']
```
---
Modifying the tree (Contd)
================
`clear()`
-----------
`Tag.clear()` removes the contents of a tag:
```python
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a
tag.clear()
tag
# <a href="http://example.com/"></a>
```
---
Modifying the tree (Contd)
================
`extract()`
-------------
`PageElement.extract()` **removes** a tag or string from the tree. It
**returns** the tag or string that was extracted:
```python
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a
i_tag = soup.i.extract()
a_tag
# <a href="http://example.com/">I linked to</a>
i_tag
# <i>example.com</i>
print(i_tag.parent)
# None
```
---
Modifying the tree (Contd)
================
At this point you effectively have **two parse trees**: **one rooted** at the
`BeautifulSoup` object you used to parse the document, and **one rooted**
at the tag that was extracted. You can go on to call `extract` on
a child of the element you extracted:
```python
my_string = i_tag.string.extract()
my_string
# u'example.com'
print(my_string.parent)
# None
i_tag
# <i></i>
```
---
Modifying the tree (Contd)
================
`decompose()`
---------------
`Tag.decompose()` **removes** a tag from the tree, then `completely
destroys it and its contents`:
```python
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a
soup.i.decompose()
a_tag
# <a href="http://example.com/">I linked to</a>
```
---
Modifying the tree (Contd)
================
`replace_with()`
------------------
`PageElement.replace_with()` **removes** a tag or string from the tree,
and **replaces** it with the **tag or string of your choice**:
```python
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)
a_tag
# <a href="http://example.com/">I linked to <b>example.net</b></a>
```
`replace_with()` **returns** the tag or string that was **replaced**, so
that you can **examine** it or **add it back** to another part of the tree.
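A sketch of that round trip, using hypothetical markup and the stdlib `html.parser`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
new_tag = soup.new_tag("i")
new_tag.string = "there"

old_tag = soup.b.replace_with(new_tag)   # returns the <b> tag we removed
assert str(old_tag) == "<b>world</b>"

soup.p.append(old_tag)                   # ...and it can be re-attached
assert str(soup.p) == "<p>Hello <i>there</i><b>world</b></p>"
```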
---
Modifying the tree (Contd)
================
`wrap()`
----------
`PageElement.wrap()` **wraps an element** in the tag you specify. It
**returns** the new wrapper:
```python
soup = BeautifulSoup("<p>I wish I was bold.</p>")
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>
soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>
This method is new in Beautiful Soup 4.0.5.
---
Modifying the tree (Contd)
================
`unwrap()`
---------------------------
`Tag.unwrap()` is the **opposite** of `wrap()`. It **replaces** a tag with
**whatever's inside that tag**. It's good for stripping out markup:
```python
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a
a_tag.i.unwrap()
a_tag
# <a href="http://example.com/">I linked to example.com</a>
```
Like `replace_with()`, `unwrap()` returns the tag
that was replaced.
---
Output
================
Pretty-printing
---------------
The `prettify()` method will turn a Beautiful Soup parse tree into a
**nicely formatted Unicode string**, with each HTML/XML tag on **its own line**:
```python
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'
```
---
Pretty-printing (Contd)
================
```python
print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <a href="http://example.com/">
#    I linked to
#    <i>
#     example.com
#    </i>
#   </a>
#  </body>
# </html>
```
---
Pretty-printing (Contd)
================
You can call `prettify()` on the top-level `BeautifulSoup` object,
or **on any of its `Tag` objects**:
```python
print(soup.a.prettify())
# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>
```
---
Output (Contd)
================
Non-pretty printing
-------------------
If you **just want a string**, with no fancy formatting, you can call
`unicode()` or `str()` on a `BeautifulSoup` object, or a `Tag`
within it:
```python
str(soup)
# '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
unicode(soup.a)
# u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
```
The `str()` function returns a string encoded in UTF-8. See
`Encodings`_ for other options.
You can also call `encode()` to get a bytestring, and `decode()`
to get Unicode.
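For instance, a minimal sketch with the stdlib `html.parser` (`encode()` takes the target encoding as its first argument):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Sacré bleu!</b>", "html.parser")

as_bytes = soup.b.encode("utf-8")   # bytestring rendering of the tag
as_text = soup.b.decode()           # Unicode rendering of the tag
assert as_bytes == "<b>Sacré bleu!</b>".encode("utf-8")
assert as_text == "<b>Sacré bleu!</b>"
```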
---
Output (Contd)
================
Output formatters
-----------------
If you give Beautiful Soup a document that contains HTML entities like
"&ldquo;", they'll be **converted to Unicode** characters:
```python
soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
unicode(soup)
# u'\u201cDammit!\u201d he said.'
```
If you then **convert** the document **to a string**, the Unicode characters
will be **encoded as UTF-8**. You won't get the HTML entities back:
```python
str(soup)
# '\xe2\x80\x9cDammit!\xe2\x80\x9d he said.'
```
---
Output (Contd)
================
By default, the only characters that are escaped upon output are bare
ampersands and angle brackets. These get turned into "&amp;amp;", "&amp;lt;",
and "&amp;gt;", so that Beautiful Soup doesn't inadvertently generate
invalid HTML or XML:
```python
soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, &amp; Howe</p>")
soup.p
# <p>The law firm of Dewey, Cheatem, &amp; Howe</p>
soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
soup.a
# <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>
```
You can change this behavior by providing a value for the
`formatter` argument to `prettify()`, `encode()`, or
`decode()`. Beautiful Soup recognizes four possible values for
`formatter`.
---
Output (Contd)
================
The default is `formatter="minimal"`. Strings will **only be processed
enough to ensure that Beautiful Soup generates valid HTML/XML**:
```python
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french)
print(soup.prettify(formatter="minimal"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
#   </p>
#  </body>
# </html>
```
---
Output (Contd)
================
If you pass in `formatter="html"`, Beautiful Soup will **convert
Unicode characters to HTML entities whenever possible**:
```python
print(soup.prettify(formatter="html"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
#   </p>
#  </body>
# </html>
```
---
Output (Contd)
================
If you pass in `formatter=None`, Beautiful Soup **will not modify
strings** at all **on output**. This is the fastest option, but it may lead
to Beautiful Soup generating **invalid HTML/XML**, as in these examples:
```python
print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>
link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>
```
---
Output (Contd)
================
Finally, if you **pass in a function for `formatter`**, Beautiful Soup
will call that function once **for every string and attribute** value in
the document. You can do whatever you want in this function. Here's a
formatter that **converts strings to uppercase** and does absolutely
nothing else:
```python
def uppercase(str):
    return str.upper()

print(soup.prettify(formatter=uppercase))
# <html>
#  <body>
#   <p>
#    IL A DIT <<SACRÉ BLEU!>>
#   </p>
#  </body>
# </html>
print(link_soup.a.prettify(formatter=uppercase))
# <a href="http://example.com/?foo=val1&bar=val2">
#  A LINK
# </a>
```
---
Output (Contd)
================
If you're writing your own function, you should know about the
`EntitySubstitution` class in the `bs4.dammit` module. This class
implements Beautiful Soup's standard formatters as class methods: the
"html" formatter is `EntitySubstitution.substitute_html`, and the
"minimal" formatter is `EntitySubstitution.substitute_xml`. You can
use these functions to simulate `formatter="html"` or
`formatter="minimal"`, but then do something extra.
Here's an example that **replaces Unicode characters with HTML entities**
whenever possible, but **`also` converts all strings to uppercase**:
```python
from bs4.dammit import EntitySubstitution
def uppercase_and_substitute_html_entities(str):
    return EntitySubstitution.substitute_html(str.upper())

print(soup.prettify(formatter=uppercase_and_substitute_html_entities))
# <html>
#  <body>
#   <p>
#    IL A DIT &lt;&lt;SACR&Eacute; BLEU!&gt;&gt;
#   </p>
#  </body>
# </html>
```
---
Output (Contd)
================
`get_text()`
--------------
If you only **want the text** part of a document or tag, you can use the
`get_text()` method. It returns **all the text in a document or
beneath a tag**, as a **single Unicode string**:
```python
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)
soup.get_text()
# u'\nI linked to example.com\n'
soup.i.get_text()
# u'example.com'
```
You can **specify a string** to be used to **join the bits of text**
together:
```python
soup.get_text("|")
# u'\nI linked to |example.com|\n'
```
---
Output (Contd)
================
You can tell Beautiful Soup to **strip whitespace** from the **beginning and
end** of **each bit of text**:
```python
soup.get_text("|", strip=True)
# u'I linked to|example.com'
```
But at that point you might want to use the `.stripped_strings`
generator instead, and process the text yourself:
```python
[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']
```
---
Tool Safety
==========
Tool Safety is a zine Leonard Richardson wrote in 2017 about what writing Beautiful Soup taught him about software development.
.center[]
https://www.crummy.com/software/BeautifulSoup/zine/
---
# References
* https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* https://docs.python.org/3/howto/regex.html
* Ryan Mitchell, Web Scraping with Python: Collecting Data from the Modern Web
---
class: center, middle
.center[]
# Thank you.
Any questions?