class: center, middle

# Web Scraping in Python: XML, JSON, CSV

February 15, 2018

Instructor: [S. M. Masoud Sadrnezhaad](https://twitter.com/smmsadrnezh)

---

JSON
==========

- JSON (JavaScript Object Notation) is a lightweight, text-based open standard designed for human-readable data interchange.
- The JSON format was originally specified by Douglas Crockford and is described in RFC 4627 (since superseded by RFC 8259).
- The official Internet media type for JSON is `application/json`.
- The JSON filename extension is `.json`.
- JSON uses conventions familiar to programmers of C-family languages, including C, C++, Java, Python, and Perl.
- JSON is easy to read and write.
- JSON is language independent.

---

Uses of JSON
==========

- It is used when writing JavaScript-based applications, including browser extensions and websites.
- It is used for serializing and transmitting structured data over a network connection.
- It is primarily used to transmit data between a server and web applications.
- Web services and APIs use JSON to provide public data.
- It can be used with most modern programming languages.

---

Simple Example in JSON
==========

The following example shows how to use JSON to store information about books, based on their topic and edition.

```json
{
    "book": [
        {
            "id": "01",
            "language": "Java",
            "edition": "third",
            "author": "Herbert Schildt"
        },
        {
            "id": "07",
            "language": "C++",
            "edition": "second",
            "author": "E.Balagurusamy"
        }
    ]
}
```

---

Simple Example in JSON (Contd)
==========

After understanding the example above, let's try another one. Save the code below as json.htm:

```html
JSON example
```

---

Simple Example in JSON (Contd)
==========

Now open json.htm in IE or any other JavaScript-enabled browser to see the result.

---

JSON Syntax
==========

- Let's have a quick look at the **basic syntax of JSON**. JSON syntax is essentially a subset of JavaScript syntax; it includes the following:
  - Data is represented in **name/value pairs**.
  - **Curly braces** hold objects; each name is followed by a **colon (`:`)**, and the name/value pairs are **separated by commas (`,`)**.
  - **Square brackets** hold arrays, and values are separated by commas (`,`).
- JSON supports the following two data structures:
  - **Collection of name/value pairs**: this data structure is supported by many programming languages (as an object, dictionary, or map).
  - **Ordered list of values**: this corresponds to an array, list, vector, or sequence.
- JSON values
  - number (integer or floating point), string, boolean, array, object, null

---

JSON Array
==========

- It is an **ordered collection of values**.
- Arrays are enclosed in **square brackets**: an array begins with `[` and ends with `]`.
- The values are **separated by commas (`,`)**.
- JSON itself does not define array indexing; that depends on the language reading the data (Python and JavaScript both **index from 0**).
- Arrays should be used when the **key names are sequential integers**.

```json
[ value, ... ]
```

- Example showing an array containing multiple objects:

```json
{
    "books": [
        { "language": "Java", "edition": "second" },
        { "language": "C++", "edition": "fifth" },
        { "language": "C", "edition": "third" }
    ]
}
```

---

JSON Object
==========

- It is an **unordered set of name/value pairs**.
- Objects are enclosed in **curly braces**: an object starts with `{` and ends with `}`.
- Each **name is followed by a colon (`:`)**, and the **name/value pairs** are **separated by commas (`,`)**.
- The **keys must be strings** and should be **different from each other**.
- Objects should be used when the **key names are arbitrary strings**.

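As a rough illustration of the object and array rules above, here is a minimal Python sketch using the standard-library `json` module. The sample document and the variable names `text` and `record` are made up for this slide, not taken from any real data set.

```python
import json

# Hypothetical sample document, invented for illustration.
text = '{"id": "011A", "language": "Java", "price": 500, "tags": ["a", "b"]}'

record = json.loads(text)

# A JSON object is decoded to a dict; a JSON array is decoded to a list.
print(type(record).__name__)          # dict
print(type(record["tags"]).__name__)  # list

# JSON names are always strings, so a non-string key is converted on output.
print(json.dumps({1: "one"}))         # {"1": "one"}

# A trailing comma after the last pair is not valid JSON; the parser rejects it.
try:
    json.loads('{"id": "011A",}')
except json.JSONDecodeError as exc:
    print("invalid JSON:", exc.msg)
```

Note that Python indexes the decoded list from 0, so `record["tags"][0]` is the first element.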
```json
{ string : value, ... }
```

- Example showing an object:

```json
{
    "id": "011A",
    "language": "JAVA",
    "price": 500
}
```

---

JSON Comparison with XML
==========

- JSON and XML are both human-readable and language independent, and both can be created, read, and decoded in real-world applications. They can be compared on the following factors:
  - XML is more **verbose** than JSON, so JSON is faster for programmers to write.
  - XML describes structured data but has no built-in array type, whereas **JSON includes arrays**.
  - JavaScript's `eval` function can **parse JSON** and return the described object, but it also executes arbitrary code, so `JSON.parse` is the safer choice.

---

JSON Comparison with XML (Contd)
==========

- Individual examples of XML and JSON:

```json
{
    "company": "Volkswagen",
    "name": "Vento",
    "price": 800000
}
```

```xml
<car>
    <company>Volkswagen</company>
    <name>Vento</name>
    <price>800000</price>
</car>
```

---

Python to JSON
==========

Encoding basic Python object hierarchies:

```python
>>> import json
>>> json.dumps(['foo', {'bar': ('baz', None, 1.0, 2)}])
'["foo", {"bar": ["baz", null, 1.0, 2]}]'
>>> print(json.dumps({"c": 0, "b": 0, "a": 0}, sort_keys=True))
{"a": 0, "b": 0, "c": 0}
```

Pretty printing:

```python
>>> import json
>>> print(json.dumps({'4': 5, '6': 7}, sort_keys=True, indent=4))
{
    "4": 5,
    "6": 7
}
```

---

JSON to Python
==========

Decoding JSON:

```python
>>> import json
>>> json.loads('["foo", {"bar":["baz", null, 1.0, 2]}]')
['foo', {'bar': ['baz', None, 1.0, 2]}]
```

---

XML processing with lxml.etree
==========

- lxml provides a very simple and powerful API for parsing XML and HTML.
- It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).

A common way to import lxml.etree is as follows:

```python
>>> from lxml import etree
```

- Parsers are represented by parser objects.
- There is support for parsing both XML and (broken) HTML.
- Note that XHTML is best parsed as XML; parsing it with the HTML parser can lead to unexpected results.

---

XML parser objects
==========

- Here is a simple example for parsing XML from an in-memory string:

```python
>>> xml = '<a xmlns="test"><b xmlns="test"/></a>'
>>> root = etree.fromstring(xml)
>>> etree.tostring(root)
b'<a xmlns="test"><b xmlns="test"/></a>'
```

- To read from a file or file-like object, you can use the `parse()` function, which returns an ElementTree object:

```python
>>> from io import StringIO
>>> tree = etree.parse(StringIO(xml))
>>> etree.tostring(tree.getroot())
b'<a xmlns="test"><b xmlns="test"/></a>'
```

- Note how the `parse()` function reads from a file-like object here.

---

XML parser objects (Contd)
==========

- If parsing is done from a real file, it is more common (and also somewhat more efficient) to pass a filename:

```python
>>> tree = etree.parse("doc/test.xml")
```

- `lxml` can parse from a local file, an HTTP URL or an FTP URL.
- It also auto-detects and reads gzip-compressed XML files (.gz).
- If you want to parse from memory and still provide a base URL for the document (e.g. to support relative paths in an XInclude), you can pass the `base_url` keyword argument:

```python
>>> root = etree.fromstring(xml, base_url="http://where.it/is/from.xml")
```

---

Why lxml?
==========

"the thrills without the strangeness"

- To explain the motto:
  - "Programming with libxml2 is like the thrilling embrace of an exotic stranger. It seems to have the potential to fulfill your wildest dreams, but there's a nagging voice somewhere in your head warning you that you're about to get screwed in the worst way." (a quote by Mark Pilgrim)
- Mark Pilgrim was describing in particular the experience a Python programmer has when dealing with libxml2. The default Python bindings of libxml2 are fast, thrilling, powerful, and your code might fail in some horrible way that you really shouldn't have to worry about when writing Python code. lxml combines the power of libxml2 with the ease of use of Python.
- More about BeautifulSoup Parser
  - http://lxml.de/elementsoup.html
- More about Parsing XML and HTML with lxml
  - http://lxml.de/parsing.html

---

CSV
==========

- In computing, a comma-separated values (CSV) file stores **tabular data (numbers and text)** in plain text. Each line of the file is a data record. Each record consists of one or more fields, separated by commas.
- The use of the comma as a field separator is the **source of the name** for this **file format**.
- The so-called CSV (Comma Separated Values) format is the most common **import and export** format for spreadsheets and databases.
- CSV format was used for **many years prior** to attempts to describe the format in a **standardized way in RFC 4180**.
- The basic idea of **separating fields with a comma** is clear, but that idea gets **complicated** when the **field data** may also **contain commas or even embedded line-breaks**.
- CSV implementations may **not handle** such field data, or they may **use quotation marks** to surround the **field**.

---

CSV (Contd)
==========

- Quotation does not solve everything: some **fields may need embedded quotation marks**, so a CSV implementation may include **escape characters** or escape sequences.
- The lack of a **well-defined standard** means that subtle **differences** often exist in the data produced and consumed by different applications.
- These differences can make it annoying to **process CSV files from multiple sources**.
- Still, while the delimiters and quoting characters vary, the **overall format is similar enough** that it is possible to write a single module which can efficiently manipulate such data, hiding the details of reading and writing the data from the programmer.
- The csv **module** implements classes to **read and write tabular data** in **CSV format**.
- It allows programmers to say, “**write** this data in the **format preferred by Excel**,” or “**read data** from this file which was **generated by Excel**,” **without knowing** the precise details of the CSV format used by Excel.

---

CSV (Contd)
==========

- Programmers can also describe the CSV formats understood by other applications or **define their own special-purpose CSV formats**.
- CSV is a **common data exchange format** that is widely supported by consumer, business, and scientific applications.
- Among its most common uses is **moving tabular data between programs** that natively operate on **incompatible** (often proprietary or undocumented) formats.
- This works despite lack of adherence to RFC 4180 (or any other standard), because so many **programs support variations on the CSV format** for data import.
- For example, a user may need to **transfer information from a database** program that stores data in a proprietary format **to a spreadsheet** that uses a completely different format. The database program most likely can export its data as "CSV"; the exported CSV file can then be imported by the spreadsheet program.

---

CSV (Contd)
==========

- In practice the term "CSV" might **refer to any file that**:
  - is **plain text** using a **character set** such as ASCII, various Unicode character sets (e.g. UTF-8), EBCDIC, or Shift JIS,
  - consists of **records** (typically one record per line),
  - with the **records divided into fields** separated by **delimiters** (typically a single reserved character such as comma, semicolon, or tab; sometimes the delimiter may include optional spaces),
  - where every record has the **same sequence** of fields.

---

CSV (Contd)
==========

- Fields with **embedded commas** or double-quote characters must be quoted.

```csv
1997,Ford,E350,"Super, luxurious truck"
```

- Each of the **embedded double-quote** characters must be represented by a pair of double-quote characters.

```csv
1997,Ford,E350,"Super, ""luxurious"" truck"
```

- Fields with **embedded line breaks** must be quoted (however, many CSV implementations do not support embedded line breaks).

```csv
1997,Ford,E350,"Go get one now
they are going fast"
```

---

CSV in Python
==========

- Reader

```python
>>> import csv
>>> with open('eggs.csv', newline='') as csvfile:
...     spamreader = csv.reader(csvfile, delimiter=' ', quotechar='|')
...     for row in spamreader:
...         print(', '.join(row))
Spam, Spam, Spam, Spam, Spam, Baked Beans
Spam, Lovely Spam, Wonderful Spam
```

- Writer

```python
import csv

with open('eggs.csv', 'w', newline='') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=' ',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)
    spamwriter.writerow(['Spam'] * 5 + ['Baked Beans'])
    spamwriter.writerow(['Spam', 'Lovely Spam', 'Wonderful Spam'])
```

---

CSV in Python (Contd)
==========

- The simplest possible writing example is:

```python
import csv

# Placeholder data: any iterable of rows works, each row an iterable of fields.
someiterable = [['Spam', 'Eggs'], ['Lovely Spam', 'Baked Beans']]

with open('some.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(someiterable)
```

- Since `open` is used to open a CSV file for **reading**, the file will by default be **decoded into Unicode** using the system default encoding. To decode a file using a **different encoding**, use the **`encoding` argument** of `open`:

```python
import csv

with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
```

- The same applies to **writing** in something other than the system default encoding: specify the **`encoding` argument** when opening the output file.

---

CSV in Python (Contd)
==========

- A slightly more advanced use of the reader, **catching and reporting errors**:

```python
import csv, sys

filename = 'some.csv'
with open(filename, newline='') as f:
    reader = csv.reader(f)
    try:
        for row in reader:
            print(row)
    except csv.Error as e:
        # Report the file name and the line where parsing failed, then exit.
        sys.exit('file {}, line {}: {}'.format(filename, reader.line_num, e))
```

- And while the module doesn't directly support **parsing strings**, it can easily be done by passing a list of lines:

```python
import csv

for row in csv.reader(['one,two,three']):
    print(row)
```

---

References
==========

* https://www.tutorialspoint.com/json/json_overview.htm
* https://docs.python.org/3/library/json.html
* http://lxml.de/parsing.html
* https://en.wikipedia.org/wiki/Comma-separated_values
* https://docs.python.org/3/library/csv.html

---

class: center, middle

# Thank you. Any questions?