`<p>` tag with the CSS class "title"?
Let's look at the **arguments to `find_all()`**.
### The `name` argument
Pass in a value for `name` and you'll tell Beautiful Soup to only
consider **tags with certain names**. Text strings will be ignored, as
will tags whose names don't match.
This is the simplest usage:
```python
soup.find_all("title")
# [<title>The Dormouse's story</title>]
```
---
`find_all()` (Contd)
================
### Searching by CSS class
Keep in mind that a single tag can have multiple
values for its "class" attribute. When you search for a tag that
matches a certain CSS class, you're **matching against `any` of its CSS
classes**:
```python
css_soup = BeautifulSoup('<p class="body strikeout"></p>')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]
css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]
```
You can also search for the **exact string value** of the `class` attribute:
```python
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
```
But searching for variants of the string value **won't work**:
```python
css_soup.find_all("p", class_="strikeout body")
# []
```
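`class_` also accepts a function, which is applied to each CSS class value separately (and to `None` for tags with no class), as the official docs describe. A minimal sketch, reusing the hypothetical `css_soup` markup from above:

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

def is_strikeout(css_class):
    # Called once per class value; None means the tag has no class at all.
    return css_class is not None and css_class == "strikeout"

matches = css_soup.find_all("p", class_=is_strikeout)
# Matches, because *one* of the tag's two classes is "strikeout".
```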
---
`find_all()` (Contd)
================
If you want to search for tags that match **two or more CSS classes**, you
should use a **CSS selector**:
```python
css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]
```
In **older versions** of Beautiful Soup, which **don't have the `class_`**
shortcut, **you can use the `attrs` trick** mentioned above. Create a
dictionary whose value for "class" is the string (or regular
expression, or whatever) you want to search for:
```python
soup.find_all("a", attrs={"class": "sister"})
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
---
`find_all()` (Contd)
================
### The `string` argument
With `string` you can search for **strings instead of tags**. As with
`name` and the keyword arguments, you can pass in `a string`, `a
regular expression`, `a list`, `a function`, or `the value True`.
Here are some examples:
```python
soup.find_all(string="Elsie")
# [u'Elsie']
soup.find_all(string=["Tillie", "Elsie", "Lacie"])
# [u'Elsie', u'Lacie', u'Tillie']
soup.find_all(string=re.compile("Dormouse"))
# [u"The Dormouse's story", u"The Dormouse's story"]
def is_the_only_string_within_a_tag(s):
    """Return True if this string is the only child of its parent tag."""
    return (s == s.parent.string)
soup.find_all(string=is_the_only_string_within_a_tag)
# [u"The Dormouse's story", u"The Dormouse's story", u'Elsie', u'Lacie', u'Tillie', u'...']
```
---
`find_all()` (Contd)
================
Although `string` is for finding strings, you can **combine it with
arguments that find tags**: Beautiful Soup will find all tags whose
`.string` matches your value for `string`. This code finds the `<a>`
tags whose `.string` is "Elsie":
```python
soup.find_all("a", string="Elsie")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
```
The `string` argument is new in Beautiful Soup 4.4.0. **In earlier
versions it was called `text`**:
```python
soup.find_all("a", text="Elsie")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
```
---
`find_all()` (Contd)
================
### The `limit` argument
`find_all()` returns **all** the tags and strings that match your
filters. This can **take a while** if the document is large. If you don't
need `all` the results, you can pass in **a number for `limit`**. This
works just like the LIMIT keyword in SQL. It tells Beautiful Soup to
**stop gathering results after it's found a certain number**.
There are three links in the "three sisters" document, but this code
only **finds the first two**:
```python
soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
---
`find_all()` (Contd)
================
### The `recursive` argument
If you call `mytag.find_all()`, Beautiful Soup will examine all the
descendants of `mytag`: its children, its children's children, and
so on. If you only want Beautiful Soup to consider **direct children**,
you can pass in `recursive=False`. See the difference here:
```python
soup.html.find_all("title")
# [<title>The Dormouse's story</title>]
soup.html.find_all("title", recursive=False)
# []
```
---
`find_all()` (Contd)
================
Here's that part of the document:
```html
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
...
```
The `<title>` tag is beneath the `<html>` tag, but it's not `directly`
beneath the `<html>` tag: the `<head>` tag is in the way. Beautiful Soup
finds the `<title>` tag when it's allowed to look at all descendants of
the `<html>` tag, but when `recursive=False` restricts it to the
`<html>` tag's immediate children, it finds nothing.
Beautiful Soup offers a lot of tree-searching methods (covered below),
and they mostly take the same arguments as `find_all()`: `name`,
`attrs`, `string`, `limit`, and the keyword arguments. But the
**`recursive` argument is different: `find_all()` and `find()` are
the only methods that support it**. Passing `recursive=False` into a
method like `find_parents()` wouldn't be very useful.
---
Searching the tree (Contd)
================
Calling a tag is like calling `find_all()`
--------------------------------------------
Because `find_all()` is the **most popular** method in the Beautiful
Soup search API, you can use a **shortcut** for it. If you treat the
`BeautifulSoup` object or a `Tag` object **as though it were a
function**, then it's the same as calling `find_all()` on that
object. These two lines of code are **equivalent**:
```python
soup.find_all("a")
soup("a")
```
These two lines are also **equivalent**:
```python
soup.title.find_all(string=True)
soup.title(string=True)
```
---
Searching the tree (Contd)
================
`find()`
----------
Signature: find(`name`, `attrs`, `recursive`, `string`, `**kwargs`)
The `find_all()` method scans the entire document looking for
results, but sometimes you only want to **find one result**. If you know a
document **only has one** `<body>` tag, it's a waste of time to scan the
entire document looking for more. Rather than **passing in `limit=1`**
every time you call `find_all`, you can use the `find()`
method. These two lines of code are `nearly` **equivalent**:
```python
soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]
soup.find('title')
# <title>The Dormouse's story</title>
```
---
`find()` (Contd)
================
The only difference is that `find_all()` returns a **list** containing
the single result, and `find()` just returns the **result**.
If `find_all()` can't find anything, it returns an **empty list**. If
`find()` can't find anything, it **returns `None`**:
```python
print(soup.find("nosuchtag"))
# None
```
Remember the `soup.head.title` trick from `Navigating using tag
names`? That trick works by **repeatedly calling `find()`**:
```python
soup.head.title
# <title>The Dormouse's story</title>
soup.find("head").find("title")
# <title>The Dormouse's story</title>
```
---
Searching the tree (Contd)
================
`find_parents()` and `find_parent()`
----------------------------------------
I spent **a lot of time** above covering `find_all()` and
`find()`. The Beautiful Soup API defines **ten other** methods for
searching the tree, but don't be afraid. **Five** of these methods are
basically the **same as `find_all()`**, and the **other five** are basically
the **same as `find()`**. The **only differences** are in what **parts of the
tree** they search.
First let's consider `find_parents()` and
`find_parent()`. Remember that `find_all()` and `find()` work
their way down the tree, looking at tag's descendants. These methods
**do the opposite**: they work their way **`up` the tree**, looking at a tag's
(or a string's) **parents**.
---
`find_parents()` and `find_parent()` (Contd)
================
Let's **try them** out, starting from a string
buried deep in the "three daughters" document:
```python
a_string = soup.find(string="Lacie")
a_string
# u'Lacie'
a_string.find_parents("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
a_string.find_parent("p")
# <p class="story">Once upon a time there were three little sisters; and their names were
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#  and they lived at the bottom of a well.</p>
a_string.find_parents("p", class_="title")
# []
```
---
`find_parents()` and `find_parent()` (Contd)
================
One of the three `<a>` tags is the direct parent of the string in
question, so our search finds it. One of the three `<p>` tags is an
indirect parent of the string, and our search finds that as
well. There's a `<p>` tag with the CSS class "title" `somewhere` in the
document, but it's not one of this string's parents, so we can't find
it with `find_parents()`.
You may have made the connection between `find_parent()` and
`find_parents()`, and the `.parent` and `.parents` attributes
mentioned earlier. The **connection is very strong**. These search methods
actually **use `.parents`** to iterate over all the parents, and **check
each one against the provided filter** to see if it matches.
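That connection can be sketched directly. Assuming a pared-down snippet of the "three sisters" markup and the stdlib `html.parser`, `find_parents("a")` behaves like a filtered walk over `.parents`:

```python
from bs4 import BeautifulSoup

markup = ('<p class="story"><a class="sister" href="http://example.com/lacie"'
          ' id="link2">Lacie</a></p>')
soup = BeautifulSoup(markup, "html.parser")
a_string = soup.find(string="Lacie")

# Roughly what find_parents("a") does: walk .parents, keep the matches.
manual = [parent for parent in a_string.parents if parent.name == "a"]
assert manual == list(a_string.find_parents("a"))
```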
---
Searching the tree (Contd)
================
`find_next_siblings()` and `find_next_sibling()`
----------------------------------------------------
Signature: find_next_siblings(`name`, `attrs`, `string`, `limit`, `**kwargs`)
Signature: find_next_sibling(`name`, `attrs`, `string`, `**kwargs`)
---
`find_next_siblings()` and `find_next_sibling()` (Contd)
================
These methods use `.next_siblings` to iterate over the rest of an element's siblings in the tree. The
`find_next_siblings()` method returns all the siblings that match,
and `find_next_sibling()` only returns the first one:
```python
first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
first_link.find_next_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_next_sibling("p")
# <p class="story">...</p>
```
---
Searching the tree (Contd)
================
`find_previous_siblings()` and `find_previous_sibling()`
------------------------------------------------------------
Signature: find_previous_siblings(`name`, `attrs`, `string`, `limit`, `**kwargs`)
Signature: find_previous_sibling(`name`, `attrs`, `string`, `**kwargs`)
---
`find_previous_siblings()` and `find_previous_sibling()` (Contd)
================
These methods use `.previous_siblings` to iterate over an element's
siblings that precede it in the tree. The `find_previous_siblings()`
method returns all the siblings that match, and
`find_previous_sibling()` only returns the first one:
```python
last_link = soup.find("a", id="link3")
last_link
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
last_link.find_previous_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
first_story_paragraph = soup.find("p", "story")
first_story_paragraph.find_previous_sibling("p")
# <p class="title"><b>The Dormouse's story</b></p>
```
---
Searching the tree (Contd)
================
`find_all_next()` and `find_next()`
---------------------------------------
Signature: find_all_next(`name`, `attrs`, `string`, `limit`, `**kwargs`)
Signature: find_next(`name`, `attrs`, `string`, `**kwargs`)
These methods use `.next_elements` to
iterate over whatever tags and strings come after an element in the
document. The `find_all_next()` method returns **all matches**, and
`find_next()` only returns the **first match**:
```python
first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
first_link.find_all_next(string=True)
# [u'Elsie', u',\n', u'Lacie', u' and\n', u'Tillie',
# u';\nand they lived at the bottom of a well.', u'\n\n', u'...', u'\n']
first_link.find_next("p")
# <p class="story">...</p>
```
---
Searching the tree (Contd)
================
`find_all_previous()` and `find_previous()`
-----------------------------------------------
Signature: find_all_previous(`name`, `attrs`, `string`, `limit`, `**kwargs`)
Signature: find_previous(`name`, `attrs`, `string`, `**kwargs`)
These methods use `.previous_elements` to
iterate over the tags and strings that come before an element in the
document.
---
`find_all_previous()` and `find_previous()` (Contd)
================
The `find_all_previous()` method returns **all matches**, and
`find_previous()` only returns the **first match**:
```python
first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
first_link.find_all_previous("p")
# [<p class="story">Once upon a time there were three little sisters; ...</p>,
#  <p class="title"><b>The Dormouse's story</b></p>]
first_link.find_previous("title")
# <title>The Dormouse's story</title>
```
---
`find_all_previous()` and `find_previous()` (Contd)
================
The call to `find_all_previous("p")` finds the first paragraph in
the document (the one with class="title"), but it also finds the
second paragraph, the `<p>` tag that contains the `<a>` tag we started
with. This shouldn't be too surprising: we're looking at all the tags
that show up earlier in the document than the one we started with. A
`<p>` tag that contains an `<a>` tag must have shown up before the `<a>`
tag it contains.
---
Searching the tree (Contd)
================
CSS selectors
-------------
Beautiful Soup supports the most commonly-used CSS selectors. Just
pass a string into the `.select()` method of a `Tag` object or the
`BeautifulSoup` object itself.
You can find tags:
```python
soup.select("title")
# [<title>The Dormouse's story</title>]
soup.select("p:nth-of-type(3)")
# [<p class="story">...</p>]
```
---
CSS selectors (Contd)
================
Find tags beneath other tags:
```python
soup.select("body a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("html head title")
# [<title>The Dormouse's story</title>]
```
Find tags `directly` beneath other tags:
```python
soup.select("head > title")
# [<title>The Dormouse's story</title>]
soup.select("p > a")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("p > a:nth-of-type(2)")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
---
CSS selectors (Contd)
================
```python
soup.select("p > #link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("body > a")
# []
```
Find the siblings of tags:
```python
soup.select("#link1 ~ .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("#link1 + .sister")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
---
CSS selectors (Contd)
================
Find tags by CSS class:
```python
soup.select(".sister")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select("[class~=sister]")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
Find tags by ID:
```python
soup.select("#link1")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select("a#link2")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
---
CSS selectors (Contd)
================
Find tags that match any selector from a list of selectors:
```python
soup.select("#link1, #link2")
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
```
Test for the existence of an attribute:
```python
soup.select('a[href]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
```
---
CSS selectors (Contd)
================
Find tags by attribute value:
```python
soup.select('a[href="http://example.com/elsie"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
soup.select('a[href^="http://example.com/"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href$="tillie"]')
# [<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]
soup.select('a[href*=".com/el"]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
```
---
CSS selectors (Contd)
================
Match language codes:
```python
multilingual_markup = """
 <p lang="en">Hello</p>
 <p lang="en-us">Howdy, y'all</p>
 <p lang="en-gb">Pip-pip, old fruit</p>
 <p lang="fr">Bonjour mes amis</p>
"""
multilingual_soup = BeautifulSoup(multilingual_markup)
multilingual_soup.select('p[lang|=en]')
# [<p lang="en">Hello</p>,
#  <p lang="en-us">Howdy, y'all</p>,
#  <p lang="en-gb">Pip-pip, old fruit</p>]
```
---
CSS selectors (Contd)
================
Find only the first tag that matches a selector:
```python
soup.select_one(".sister")
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
```
This is all a convenience for users who **know the CSS selector syntax**.
You can do all this stuff with the Beautiful Soup API. And if CSS
selectors are all you need, you might as well **use lxml directly**: it's
a lot **faster**, and it supports **more CSS selectors**. But this lets you
`combine` simple CSS selectors with the Beautiful Soup API.
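For example, a child-combinator selector and a `recursive=False` search express the same constraint. A minimal sketch, assuming a tiny hypothetical document and the stdlib `html.parser`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(
    "<html><head><title>The Dormouse's story</title></head><body></body></html>",
    "html.parser")

# "head > title" in CSS is the same constraint as a non-recursive find_all:
via_css = soup.select("head > title")
via_api = soup.head.find_all("title", recursive=False)
assert via_css == via_api
```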
---
Modifying the tree
================
Beautiful Soup's main strength is in **searching the parse tree**, but you
can also **modify the tree** and write your changes as a new HTML or XML
document.
Changing tag names and attributes
---------------------------------
I covered this earlier, in `Attributes`, but it bears repeating. You
can **rename a tag, change the values of its attributes, add new
attributes, and delete attributes**:
```python
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>')
tag = soup.b
tag.name = "blockquote"
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <blockquote class="verybold" id="1">Extremely bold</blockquote>
del tag['class']
del tag['id']
tag
# <blockquote>Extremely bold</blockquote>
```
---
Modifying the tree (Contd)
================
Modifying `.string`
---------------------
If you set a tag's `.string` attribute, the tag's contents are
replaced with the string you give:
```python
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a
tag.string = "New link text."
tag
# <a href="http://example.com/">New link text.</a>
```
Be careful: **if the tag contained other tags**, they and all their
contents will be **destroyed**.
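A quick sketch of that pitfall, using hypothetical markup and the stdlib `html.parser`:

```python
from bs4 import BeautifulSoup

markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup, "html.parser")
soup.a.string = "New link text."

# The <i> tag that lived inside the <a>, and its text, are gone for good:
assert soup.find("i") is None
assert soup.a.get_text() == "New link text."
```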
---
Modifying the tree (Contd)
================
`append()`
------------
You can **add to a tag's contents** with `Tag.append()`. It works just
like calling `.append()` on a Python list:
```python
soup = BeautifulSoup("<a>Foo</a>")
soup.a.append("Bar")
soup
# <html><head></head><body><a>FooBar</a></body></html>
soup.a.contents
# [u'Foo', u'Bar']
```
---
Modifying the tree (Contd)
================
`NavigableString()` and `.new_tag()`
-------------------------------------------------
If you need to **add a string to a document**, no problem--you can pass a
Python string in to `append()`, or you can call the `NavigableString`
constructor:
```python
from bs4 import NavigableString
soup = BeautifulSoup("<b></b>")
tag = soup.b
tag.append("Hello")
new_string = NavigableString(" there")
tag.append(new_string)
tag
# <b>Hello there</b>
tag.contents
# [u'Hello', u' there']
```
---
Modifying the tree (Contd)
================
If you want to **create a comment** or some other subclass of
`NavigableString`, just call the constructor:
```python
from bs4 import Comment
new_comment = Comment("Nice to see you.")
tag.append(new_comment)
tag
# <b>Hello there<!--Nice to see you.--></b>
tag.contents
# [u'Hello', u' there', u'Nice to see you.']
```
(This is a new feature in Beautiful Soup 4.4.0.)
---
Modifying the tree (Contd)
================
What if you need to **create** a whole **new tag**? The best solution is to
call the factory method `BeautifulSoup.new_tag()`:
```python
soup = BeautifulSoup("<b></b>")
original_tag = soup.b
new_tag = soup.new_tag("a", href="http://www.example.com")
original_tag.append(new_tag)
original_tag
# <b><a href="http://www.example.com"></a></b>
new_tag.string = "Link text."
original_tag
# <b><a href="http://www.example.com">Link text.</a></b>
```
Only the first argument, the tag name, is required.
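One wrinkle: `class` is a reserved word in Python, so it can't be passed as a keyword argument to `new_tag()`. A minimal workaround is to set the attribute by subscript after creating the tag (hypothetical tag and values):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("", "html.parser")
tag = soup.new_tag("p")          # only the tag name is required
tag["class"] = "title"           # set reserved-word attributes afterwards
tag.string = "The Dormouse's story"
assert str(tag) == '<p class="title">The Dormouse\'s story</p>'
```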
---
Modifying the tree (Contd)
================
`insert()`
------------
`Tag.insert()` is just like `Tag.append()`, except the new element
doesn't necessarily go at the end of its parent's
`.contents`. It'll be **inserted at whatever numeric position you
say.** It works just like `.insert()` on a Python list:
```python
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a
tag.insert(1, "but did not endorse ")
tag
# <a href="http://example.com/">I linked to but did not endorse <i>example.com</i></a>
tag.contents
# [u'I linked to ', u'but did not endorse ', <i>example.com</i>]
```
---
Modifying the tree (Contd)
================
`insert_before()` and `insert_after()`
------------------------------------------
The `insert_before()` method **inserts** a tag or string **immediately
before** something else in the parse tree:
```python
soup = BeautifulSoup("<b>stop</b>")
tag = soup.new_tag("i")
tag.string = "Don't"
soup.b.string.insert_before(tag)
soup.b
# <b><i>Don't</i>stop</b>
```
---
Modifying the tree (Contd)
================
The `insert_after()` method moves a tag or string so that it
**immediately follows** something else in the parse tree:
```python
soup.b.i.insert_after(soup.new_string(" ever "))
soup.b
# <b><i>Don't</i> ever stop</b>
soup.b.contents
# [<i>Don't</i>, u' ever ', u'stop']
```
---
Modifying the tree (Contd)
================
`clear()`
-----------
`Tag.clear()` removes the contents of a tag:
```python
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
tag = soup.a
tag.clear()
tag
# <a href="http://example.com/"></a>
```
---
Modifying the tree (Contd)
================
`extract()`
-------------
`PageElement.extract()` **removes** a tag or string from the tree. It
**returns** the tag or string that was extracted:
```python
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a
i_tag = soup.i.extract()
a_tag
# <a href="http://example.com/">I linked to</a>
i_tag
# <i>example.com</i>
print(i_tag.parent)
# None
```
---
Modifying the tree (Contd)
================
At this point you effectively have **two parse trees**: **one rooted** at the
`BeautifulSoup` object you used to parse the document, and **one rooted**
at the tag that was extracted. You can go on to call `extract` on
a child of the element you extracted:
```python
my_string = i_tag.string.extract()
my_string
# u'example.com'
print(my_string.parent)
# None
i_tag
# <i></i>
```
---
Modifying the tree (Contd)
================
`decompose()`
---------------
`Tag.decompose()` **removes** a tag from the tree, then `completely
destroys it and its contents`:
```python
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a
soup.i.decompose()
a_tag
# <a href="http://example.com/">I linked to</a>
```
---
Modifying the tree (Contd)
================
`replace_with()`
------------------
`PageElement.replace_with()` **removes** a tag or string from the tree,
and **replaces** it with the **tag or string of your choice**:
```python
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a
new_tag = soup.new_tag("b")
new_tag.string = "example.net"
a_tag.i.replace_with(new_tag)
a_tag
# <a href="http://example.com/">I linked to <b>example.net</b></a>
```
`replace_with()` **returns** the tag or string that was **replaced**, so
that you can **examine** it or **add it back** to another part of the tree.
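A sketch of that round trip, using hypothetical markup and the stdlib `html.parser`:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
new_tag = soup.new_tag("i")
new_tag.string = "there"

old_tag = soup.b.replace_with(new_tag)   # returns the <b> tag we removed
assert str(old_tag) == "<b>world</b>"

soup.p.append(old_tag)                   # ...and it can be re-attached
assert str(soup.p) == "<p>Hello <i>there</i><b>world</b></p>"
```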
---
Modifying the tree (Contd)
================
`wrap()`
----------
`PageElement.wrap()` **wraps an element** in the tag you specify. It
**returns** the new wrapper:
```python
soup = BeautifulSoup("<p>I wish I was bold.</p>")
soup.p.string.wrap(soup.new_tag("b"))
# <b>I wish I was bold.</b>
soup.p.wrap(soup.new_tag("div"))
# <div><p><b>I wish I was bold.</b></p></div>
This method is new in Beautiful Soup 4.0.5.
---
Modifying the tree (Contd)
================
`unwrap()`
---------------------------
`Tag.unwrap()` is the **opposite** of `wrap()`. It **replaces** a tag with
**whatever's inside that tag**. It's good for stripping out markup:
```python
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
a_tag = soup.a
a_tag.i.unwrap()
a_tag
# <a href="http://example.com/">I linked to example.com</a>
```
Like `replace_with()`, `unwrap()` returns the tag
that was replaced.
---
Output
================
Pretty-printing
---------------
The `prettify()` method will turn a Beautiful Soup parse tree into a
**nicely formatted Unicode string**, with each HTML/XML tag on **its own line**:
```python
markup = '<a href="http://example.com/">I linked to <i>example.com</i></a>'
soup = BeautifulSoup(markup)
soup.prettify()
# '<html>\n <head>\n </head>\n <body>\n  <a href="http://example.com/">\n...'
```
---
Pretty-printing (Contd)
================
```python
print(soup.prettify())
# <html>
#  <head>
#  </head>
#  <body>
#   <a href="http://example.com/">
#    I linked to
#    <i>
#     example.com
#    </i>
#   </a>
#  </body>
# </html>
```
---
Pretty-printing (Contd)
================
You can call `prettify()` on the top-level `BeautifulSoup` object,
or **on any of its `Tag` objects**:
```python
print(soup.a.prettify())
# <a href="http://example.com/">
#  I linked to
#  <i>
#   example.com
#  </i>
# </a>
```
---
Output (Contd)
================
Non-pretty printing
-------------------
If you **just want a string**, with no fancy formatting, you can call
`unicode()` or `str()` on a `BeautifulSoup` object, or a `Tag`
within it:
```python
str(soup)
# '<html><head></head><body><a href="http://example.com/">I linked to <i>example.com</i></a></body></html>'
unicode(soup.a)
# u'<a href="http://example.com/">I linked to <i>example.com</i></a>'
```
The `str()` function returns a string encoded in UTF-8. See
`Encodings`_ for other options.
You can also call `encode()` to get a bytestring, and `decode()`
to get Unicode.
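For instance, a minimal sketch with the stdlib `html.parser` (`encode()` takes the target encoding as its first argument):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<b>Sacré bleu!</b>", "html.parser")

as_bytes = soup.b.encode("utf-8")   # bytestring rendering of the tag
as_text = soup.b.decode()           # Unicode rendering of the tag
assert as_bytes == "<b>Sacré bleu!</b>".encode("utf-8")
assert as_text == "<b>Sacré bleu!</b>"
```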
---
Output (Contd)
================
Output formatters
-----------------
If you give Beautiful Soup a document that contains HTML entities like
"&ldquo;", they'll be **converted to Unicode** characters:
```python
soup = BeautifulSoup("&ldquo;Dammit!&rdquo; he said.")
unicode(soup)
# u'\u201cDammit!\u201d he said.'
```
If you then **convert** the document **to a string**, the Unicode characters
will be **encoded as UTF-8**. You won't get the HTML entities back:
```python
str(soup)
# '\xe2\x80\x9cDammit!\xe2\x80\x9d he said.'
```
---
Output (Contd)
================
By default, the only characters that are escaped upon output are bare
ampersands and angle brackets. These get turned into "&amp;amp;", "&amp;lt;",
and "&amp;gt;", so that Beautiful Soup doesn't inadvertently generate
invalid HTML or XML:
```python
soup = BeautifulSoup("<p>The law firm of Dewey, Cheatem, &amp; Howe</p>")
soup.p
# <p>The law firm of Dewey, Cheatem, &amp; Howe</p>
soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
soup.a
# <a href="http://example.com/?foo=val1&amp;bar=val2">A link</a>
```
You can change this behavior by providing a value for the
`formatter` argument to `prettify()`, `encode()`, or
`decode()`. Beautiful Soup recognizes four possible values for
`formatter`.
---
Output (Contd)
================
The default is `formatter="minimal"`. Strings will **only be processed
enough to ensure that Beautiful Soup generates valid HTML/XML**:
```python
french = "<p>Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;</p>"
soup = BeautifulSoup(french)
print(soup.prettify(formatter="minimal"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacré bleu!&gt;&gt;
#   </p>
#  </body>
# </html>
```
---
Output (Contd)
================
If you pass in `formatter="html"`, Beautiful Soup will **convert
Unicode characters to HTML entities whenever possible**:
```python
print(soup.prettify(formatter="html"))
# <html>
#  <body>
#   <p>
#    Il a dit &lt;&lt;Sacr&eacute; bleu!&gt;&gt;
#   </p>
#  </body>
# </html>
```
---
Output (Contd)
================
If you pass in `formatter=None`, Beautiful Soup **will not modify
strings** at all **on output**. This is the fastest option, but it may lead
to Beautiful Soup generating **invalid HTML/XML**, as in these examples:
```python
print(soup.prettify(formatter=None))
# <html>
#  <body>
#   <p>
#    Il a dit <<Sacré bleu!>>
#   </p>
#  </body>
# </html>
link_soup = BeautifulSoup('<a href="http://example.com/?foo=val1&bar=val2">A link</a>')
print(link_soup.a.encode(formatter=None))
# <a href="http://example.com/?foo=val1&bar=val2">A link</a>
```
---
Output (Contd)
================
Finally, if you **pass in a function for `formatter`**, Beautiful Soup
will call that function once **for every string and attribute** value in
the document. You can do whatever you want in this function. Here's a
formatter that **converts strings to uppercase** and does absolutely
nothing else:
```python
def uppercase(str):
    return str.upper()

print(soup.prettify(formatter=uppercase))
# <html>
#  <body>
#   <p>
#    IL A DIT <<SACRÉ BLEU!>>
#   </p>
#  </body>
# </html>
print(link_soup.a.prettify(formatter=uppercase))
# <a href="http://example.com/?foo=val1&bar=val2">
#  A LINK
# </a>
```
---
Output (Contd)
================
If you're writing your own function, you should know about the
`EntitySubstitution` class in the `bs4.dammit` module. This class
implements Beautiful Soup's standard formatters as class methods: the
"html" formatter is `EntitySubstitution.substitute_html`, and the
"minimal" formatter is `EntitySubstitution.substitute_xml`. You can
use these functions to simulate `formatter="html"` or
`formatter="minimal"`, but then do something extra.
Here's an example that **replaces Unicode characters with HTML entities**
whenever possible, but **`also` converts all strings to uppercase**:
```python
from bs4.dammit import EntitySubstitution
def uppercase_and_substitute_html_entities(str):
    return EntitySubstitution.substitute_html(str.upper())

print(soup.prettify(formatter=uppercase_and_substitute_html_entities))
# <html>
#  <body>
#   <p>
#    IL A DIT &lt;&lt;SACR&Eacute; BLEU!&gt;&gt;
#   </p>
#  </body>
# </html>
```
---
Output (Contd)
================
`get_text()`
--------------
If you only **want the text** part of a document or tag, you can use the
`get_text()` method. It returns **all the text in a document or
beneath a tag**, as a **single Unicode string**:
```python
markup = '<a href="http://example.com/">\nI linked to <i>example.com</i>\n</a>'
soup = BeautifulSoup(markup)
soup.get_text()
# u'\nI linked to example.com\n'
soup.i.get_text()
# u'example.com'
```
You can **specify a string** to be used to **join the bits of text**
together:
```python
soup.get_text("|")
# u'\nI linked to |example.com|\n'
```
---
Output (Contd)
================
You can tell Beautiful Soup to **strip whitespace** from the **beginning and
end** of **each bit of text**:
```python
soup.get_text("|", strip=True)
# u'I linked to|example.com'
```
But at that point you might want to use the `.stripped_strings`
generator instead, and process the text yourself:
```python
[text for text in soup.stripped_strings]
# [u'I linked to', u'example.com']
```
---
Tool Safety
==========
Tool Safety is a zine Leonard Richardson wrote in 2017 about what writing Beautiful Soup taught him about software development.
.center[]
https://www.crummy.com/software/BeautifulSoup/zine/
---
# References
* https://www.crummy.com/software/BeautifulSoup/bs4/doc/
* https://docs.python.org/3/howto/regex.html
* Ryan Mitchell, Web Scraping with Python: Collecting Data from the Modern Web
---
class: center, middle
.center[]
# Thank you.
Any questions?