2014-04-21T07:00:03Z
A little over a year ago I wrote an article on web scraping using Node.js. Today I'm revisiting the topic, but this time I'm going to use Python, so that the techniques offered by these two languages can be compared and contrasted.
The Problem
As I'm sure you know, I attended PyCon in Montréal earlier this month. The video recordings of all the talks and tutorials have already been released on YouTube, with an index available at pyvideo.org.
I thought it would be useful to know which videos of the conference are the most watched, so we are going to write a scraping script that will obtain the list of available videos from pyvideo.org and then get viewer statistics for each video directly from its YouTube page. Sounds interesting? Let's get started!
The Tools
There are two basic tasks that are used to scrape web sites:
- Load a web page to a string.
- Parse HTML from a web page to locate the interesting bits.
Python offers two excellent tools for the above tasks. I will use the awesome requests to load web pages, and BeautifulSoup to do the parsing.
We can put these two packages in a virtual environment:
If you are using Microsoft Windows, note that the command that activates the virtual environment is different: you should use venv\Scripts\activate.
Basic Scraping Technique
The first thing to do when writing a scraping script is to manually inspect the page(s) to scrape to determine how the data can be located.
To begin with, we are going to look at the list of PyCon videos at http://pyvideo.org/category/50/pycon-us-2014. Inspecting the HTML source of this page we find that the structure of the video list is more or less as follows:
So the first task is to load this page, and extract the links to the individual pages, since the links to the YouTube videos are in these pages.
Loading a web page using requests is extremely simple:
That's it! After this function returns, the HTML of the page is available in response.text.
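As a minimal sketch of this step, a page can be loaded as shown here. The function name fetch_html and the 10-second timeout are my own illustrative choices, not necessarily what the original script used:

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string."""
    response = requests.get(url, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page.
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    # URL taken from the article; the site layout may have changed since then.
    html = fetch_html("http://pyvideo.org/category/50/pycon-us-2014")
    print(len(html))
```

Calling raise_for_status() is optional, but it avoids silently scraping an error page.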
The next task is to extract the links to the individual video pages. With BeautifulSoup this can be done using CSS selector syntax, which you may be familiar with if you work on the client side.
To obtain the links we will use a selector that captures the <a> elements inside each <div> with class video-summary-data. Since there are several <a> elements for each video, we will filter them to include only those that point to a URL that begins with /video, which is unique to the individual video pages. The CSS selector that implements the above criteria is div.video-summary-data a[href^=/video]. The following snippet of code uses this selector with BeautifulSoup to obtain the <a> elements that point to video pages:
Since we are really interested in the link itself and not in the <a> element that contains it, we can improve the above with a list comprehension:
And now we have a list of all the links to the individual pages for each session!
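To make the selector and the list comprehension concrete, here is a small sketch that runs against a made-up HTML fragment instead of the live page. The fragment, the /speaker link in it, and the function name get_video_page_urls are assumptions for illustration only. Note that the attribute value is quoted in this sketch, because current BeautifulSoup releases may reject an unquoted value that contains a slash:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment imitating the index page structure described above.
SAMPLE_HTML = """
<div class="video-summary">
  <div class="video-summary-data">
    <div><strong><a href="/video/2668/writing-restful-web-services-with-flask">Writing RESTful Web Services with Flask</a></strong></div>
    <div><a href="/speaker/123/example">Some Speaker</a></div>
  </div>
</div>
"""


def get_video_page_urls(html):
    """Return the href of every <a> that points to an individual video page."""
    soup = BeautifulSoup(html, "html.parser")
    # Only links whose href starts with /video are kept by the selector.
    anchors = soup.select('div.video-summary-data a[href^="/video"]')
    return [a.attrs.get("href") for a in anchors]


print(get_video_page_urls(SAMPLE_HTML))
```

The speaker link is skipped because its href does not start with /video.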
The following script shows a cleaned up version of all the techniques we have learned so far:
If you run the above script you will get a long list of URLs as a result. Now we need to parse each of these to get more information about each PyCon session.
Scraping Linked Pages
The next step is to load each of the pages in our URL list. If you want to see how these pages look, here is an example: http://pyvideo.org/video/2668/writing-restful-web-services-with-flask. Yes, that's me, that is one of my sessions!
From these pages we can scrape the session title, which appears at the top. We can also obtain the names of the speakers and the YouTube link from the sidebar that appears on the right side below the embedded video. The code that gets these elements is shown below:
A few things to note about this function:
- The URLs returned from the scraping of the index page are relative, so the root_url needs to be prepended.
- The session title is obtained from the <h3> element inside the <div> with id videobox. Note that [0] is needed because the select() call returns a list, even if there is only one match.
- The speaker names and YouTube links are obtained in a similar way to the links in the index page.
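The first two notes can be sketched as follows. The HTML fragment is invented, the helper names full_url and get_title are hypothetical, and I am assuming root_url is the site address without a trailing slash:

```python
from bs4 import BeautifulSoup

root_url = "http://pyvideo.org"  # assumed value of root_url

# Hypothetical fragment of a video page; only the <h3> inside
# <div id="videobox"> is described in the article.
SAMPLE_PAGE = """
<div id="videobox">
  <h3>Writing RESTful Web Services with Flask</h3>
</div>
"""


def full_url(relative_url):
    """Prepend the site root to a relative link scraped from the index page."""
    return root_url + relative_url


def get_title(html):
    """Extract the session title from a video page."""
    soup = BeautifulSoup(html, "html.parser")
    # select() always returns a list, so take the first (and only) match.
    return soup.select("div#videobox h3")[0].get_text()
```

If a page has no matching element, the [0] raises an IndexError, so a real script may want to check the length of the list first.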
Now all that remains is to scrape the views count from the YouTube page for each video. This is actually very simple to write as a continuation of the above function. In fact, it is so simple that while we are at it, we can also scrape the likes and dislikes counts:
The soup.select() calls above capture the stats for the video using selectors for the specific id names used in the YouTube page. But the text of the elements needs to be processed a bit before it can be converted to a number. Consider an example views count, which YouTube would show as '1,344 views'. To remove the text after the number, the contents are split at whitespace and only the first part is used. This first part is then filtered with a regular expression that removes any characters that are not digits, since the numbers can have commas in them. The resulting string is finally converted to an integer and stored.
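The conversion just described fits in a few lines. This is a sketch of the idea, not the exact code from the script, and the function name parse_count is my own:

```python
import re


def parse_count(text):
    """Convert a counter such as '1,344 views' to an integer."""
    # Keep only the part before the first whitespace, e.g. '1,344'.
    number_part = text.split()[0]
    # Drop every character that is not a digit (such as the comma).
    digits = re.sub(r"[^0-9]", "", number_part)
    return int(digits)


print(parse_count("1,344 views"))  # 1344
```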
To complete the scraping the following function invokes all the previously shown code:
Parallel Processing
The script up to this point works great, but with over a hundred videos it can take a while to run. In reality we aren't doing much work; most of the time is spent downloading all those pages, and during that time the script is blocked. It would be much more efficient if the script could run several of these download operations simultaneously, right?
Back when I wrote the scraping article using Node.js the parallelism came for free with the asynchronous nature of JavaScript. With Python this can be done as well, but it needs to be specified explicitly. For this example I'm going to start a pool of eight worker processes that can work concurrently. This is surprisingly simple:
The multiprocessing.Pool class starts eight worker processes that wait to be given jobs to run. Why eight? It's twice the number of cores I have on my computer. While experimenting with different sizes for the pool I've found this to be the sweet spot. Fewer than eight workers make the script run slower, and more than eight do not make it go faster.
The pool.map() call is similar to the regular map() call in that it invokes the function given as the first argument once for each of the elements in the iterable given as the second argument. The big difference is that it distributes these calls among the processes owned by the pool, so in this example up to eight tasks will run concurrently.
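Here is a self-contained sketch of the difference between map() and pool.map(). To keep it runnable without network access, the download is replaced by a hypothetical fake_download function that just sleeps for a moment, and the URLs are made up:

```python
import time
from multiprocessing import Pool


def fake_download(url):
    """Stand-in for a real page download: waits briefly, then returns a result."""
    time.sleep(0.1)  # simulates network latency
    return url, len(url)


def run_sequential(urls):
    """Process the URLs one after another with the regular map()."""
    return list(map(fake_download, urls))


def run_parallel(urls, workers=8):
    """Process the URLs with a pool of worker processes."""
    # The pool splits the calls among its workers; results come back
    # in the same order as the input.
    with Pool(workers) as pool:
        return pool.map(fake_download, urls)


if __name__ == "__main__":
    urls = ["/video/{}".format(n) for n in range(16)]  # made-up URLs
    start = time.time()
    run_sequential(urls)
    print("sequential: {:.1f}s".format(time.time() - start))
    start = time.time()
    run_parallel(urls)
    print("pool of 8:  {:.1f}s".format(time.time() - start))
```

The function passed to pool.map() has to be defined at module level so that the worker processes can find it, and the if __name__ == "__main__" guard keeps the workers from re-running the main code on platforms that start processes by importing the script.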
The time savings are considerable. On my computer the first version of the script completes in 75 seconds, while the pool version does the same work in 16 seconds!
The Complete Scraping Script
The final version of my scraping script does a few more things after the data has been obtained.
I've added a --sort command line option to specify a sorting criterion, which can be views, likes or dislikes. The script will sort the list of results in descending order by the specified field. Another option, --max, takes a number of results to show, in case you just want to see a few entries from the top. Finally, I have added a --csv option, which prints the data in CSV format instead of as an aligned table, to make it easy to export the data to a spreadsheet.
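One way these three options could be wired up with the standard library is sketched below. The default sort field, the dictionary keys and the function names are assumptions of mine; the linked gist is the reference for how the script actually does it:

```python
import argparse
import csv
import sys


def parse_args(argv=None):
    """Parse the --sort, --max and --csv command line options."""
    parser = argparse.ArgumentParser(description="Show PyCon video statistics.")
    parser.add_argument("--sort", choices=["views", "likes", "dislikes"],
                        default="views", help="field to sort by")  # default is assumed
    parser.add_argument("--max", type=int, metavar="N",
                        help="show only the top N results")
    parser.add_argument("--csv", action="store_true",
                        help="print CSV instead of an aligned table")
    return parser.parse_args(argv)


def select_results(results, sort_field, max_results=None):
    """Sort in descending order by the chosen field and keep the top entries."""
    ordered = sorted(results, key=lambda r: r[sort_field], reverse=True)
    return ordered[:max_results]  # a slice up to None keeps everything


def write_csv(results, out=sys.stdout):
    """Write the results as CSV, one row per video."""
    writer = csv.DictWriter(out, fieldnames=["title", "views", "likes", "dislikes"])
    writer.writeheader()
    writer.writerows(results)
```

With this in place, a call such as parse_args(["--sort", "likes", "--max", "5"]) returns an object whose sort and max attributes can be passed straight to select_results().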
The complete script is available for download at this location: https://gist.github.com/miguelgrinberg/5f52ceb565264b1e969a.
Below is an example output with the 25 most viewed sessions at the time I'm writing this:
Conclusion
I hope you have found this article useful as an introduction to web scraping with Python. I have been pleasantly surprised with the use of Python. The tools are robust and powerful, and the fact that the asynchronous optimizations can be left for the end is great compared to JavaScript, where there is no way to avoid working asynchronously from the start.
Miguel
Code for this chapter is here.
In the Flask Templates chapter, we built a functioning Flask app. In this chapter, we’ll explore how to add functional web forms to a similar app.
Flask forms app example (actors_app):
Introduction
Flask has an extension that makes it easy to create web forms.
WTForms is “a flexible forms validation and rendering library for Python Web development.” With Flask-WTF, we get WTForms in Flask.
WTForms includes security features for submitting form data.
WTForms has built-in validation techniques.
WTForms can be combined with Bootstrap to help us make clean-looking, responsive forms for mobile and desktop screens.
Setup for using forms in Flask
We will install the Flask-WTF extension to help us work with forms in Flask. There are many extensions for Flask, and each one adds a different set of functions and capabilities. See the list of Flask extensions for more.
In Terminal, change into your Flask projects folder and activate your virtual environment there. Then, at the command prompt, where you see $ (Mac) or C:\Users\yourname> (Windows), run pip install Flask-WTF.
We will also install the Flask-Bootstrap4 extension to provide Bootstrap styles for our forms.
This installation is done only once in any virtualenv. It is assumed you already have Flask installed there.
More details in WTForms docs
An alternative is Bootstrap Flask — but that is NOT used here
Imports for forms with Flask-WTF and Flask-Bootstrap
You will have a long list of imports at the top of your Flask app file:
Note as always that Python is case-sensitive, so upper- and lowercase must be used exactly as shown. The fourth line will change depending on your form’s contents. For example, if you have a SELECT element, you’ll need to import that. See the complete list of WTForms form field types.
Set up a form in a Flask app
After the imports, these lines follow in the app script:
Flask allows us to set a “secret key” value. You can grab a string from a site such as RandomKeygen. This value is used to prevent malicious hijacking of your form from an outside submission.
Flask-WTF’s FlaskForm will automatically create a secure session with CSRF (cross-site request forgery) protection if this key-value is set. Don’t publish the actual key on GitHub!
You can read more about app.config['SECRET_KEY'] in this StackOverflow post.
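One common way to keep the key out of a public repository is to read it from an environment variable. The sketch below assumes a variable named FLASK_SECRET_KEY, which is a name chosen for this illustration, and falls back to a placeholder that is only suitable for local experiments:

```python
import os

from flask import Flask

app = Flask(__name__)

# FLASK_SECRET_KEY is a hypothetical environment variable name.
# Reading the key from the environment keeps it out of the repository.
app.config["SECRET_KEY"] = os.environ.get("FLASK_SECRET_KEY", "dev-only-placeholder")
```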
Configure the form
Next, we configure a form that inherits from Flask-WTF’s class FlaskForm. Python style dictates that a class name starts with an uppercase letter and uses CapWords (also called PascalCase), so here our new class is named NameForm (we will use the form to search for a name).
In the class, we assign each form control to a unique variable. This form has only one text input field and one submit button.
Every form control must be configured here.
If you had more than one form in the app, you would define more than one new class in this manner.
Note that StringField and SubmitField were imported at the top of the file. If we needed other form-control types in this form, we would need to import those also. See a list of all WTForms field types.
Note that several field types (such as RadioField and SelectField) must have an option choices=[] specified, after the label text. Within the list, each choice is a pair in this format: ('string1', 'string2').
WTForms also has a long list of validators we can use. The DataRequired() validator prevents the form from being submitted if that field is empty. Note that these validators must also be imported at the top of the file.
Put the form in a route function
Now we will use the form in a Flask route:
A crucial line is where we assign our configured form object to a new variable:
We must also pass that variable to the template, as seen in the final line above.
Be aware that if we had created more than one form class, each of those would need to be assigned to a unique variable.
Put the form in a template
Before we break all that down and explain it, let’s look at the code in the template index.html:
Where is the form? This is the amazing thing about Flask-WTF — by configuring the form as we did in the Flask app, we can generate a form with Bootstrap styles in HTML using nothing more than the template you see above. Line 27 is the form.
Note that in the Flask route function, we passed the variable form to the template index.html:
So when you use wtf.quick_form(), the argument inside the parentheses must be the variable that represents the form you created in the app.
We discussed the configuration of NameForm above.
A quick note about Bootstrap in Flask
There’s more about this in the Resources section at the bottom of this page — but to summarize briefly:
You pip-installed Flask-Bootstrap4 in your Flask virtual environment.
You wrote from flask_bootstrap import Bootstrap at the top of the Flask app file.
Below that, you wrote Bootstrap(app) in the Flask app file.
In any Flask template using Bootstrap styles, the top line will be: {% extends 'bootstrap/base.html' %}
That combination of four things has embedded Bootstrap 4 in this app and made wtf.quick_form() possible.
There’s an excellent how-to video (only 9 minutes long) about using Bootstrap styles in Flask if you want to separate the forms information from the Bootstrap information in your mind. You can, of course, use Flask-Bootstrap4 without the forms!
Examining the route function
Before reading further, try out a working version of this app. The complete code for the app is in the folder named actors_app.
You type an actor’s name into the form and submit it.
If the actor’s name is in the data source (ACTORS), the app loads a detail page for that actor. (Photos of bears 🐻 stand in for real photos of the actors.)
Otherwise, you stay on the same page, the form is cleared, and a message tells you that actor is not in the database.
First we have the route, as usual, but with a new addition for handling form data: methods.
Every HTML form has two possible methods, GET and POST. GET simply requests a response from the server. POST, however, sends a request with data attached in the body of the request; this is the way most web forms are submitted.
This route needs to use both methods because when we simply open the page, no form was submitted, and we’re opening it with GET. When we submit the form, this same page is opened with POST if the actor’s name (the form data) was not found. Thus we cannot use only one of the two options here.
At the start of the route function, we get the data source for this app. It happens to be in a list named ACTORS, and we get just the names by running a function, get_names(). The function was imported from the file named modules.py.
We assign the previously configured form object, NameForm(), to a new variable, form. This has been discussed above.
We create a new, empty variable, message.
validate_on_submit() is a built-in Flask-WTF method, called on form (our variable). It returns True only if the form was submitted and every field passed its validators; in that case the commands and statements in the block will run. If not, the block is skipped, and invalid fields are flagged.
form.name.data is the contents of the text input field represented by name. Perhaps we should review how we configured the form:
That name is the name in form.name.data, the contents of which we will now store in a new variable, name. To put it another way: The variable name in the app now contains whatever the user typed into the text input field on the web page, that is, the actor’s name.
This if-statement is specific to this app. It checks whether the name (that was typed into the form) matches any name in the list names. If not, we jump down to else and text is put into the variable message. If name DOES match, we clear out the form, run a function called get_id() (from modules.py) and, importantly, open a different route in this app:
Thus redirect(url_for('actor', id=id)) is calling a different route here in the same Flask app script. (See actors.py, lines 46-55.) The redirect() function is specifically for this use, and we imported it from the flask module at the top of the app. We also imported url_for(), which you have seen previously used within templates.
As far as using forms with Flask is concerned, you don’t need to worry about the actors and their IDs, etc. What is important is that the route function can be used to evaluate the data sent from the form. We check to see whether it matched any of the actors in a list, and a different response will be sent based on match or no match.
Any kind of form data can be handled in a Flask route function.
You can do any of the things that are typically done with HTML forms — handle usernames and passwords, write new data to a database, create a quiz, etc.
The final line in the route function calls the template index.html and passes three variables to it:
Conclusion
Flask-WTF provides convenient methods for working with forms in Flask. Forms can be built easily and also processed easily, with a minimum of code.
Adding Flask-Bootstrap ensures that we can build mobile-friendly forms with a minimum amount of effort.
Note that it is possible to build a customized form layout using Bootstrap 4 styles in a Flask template, or to build a custom form with no Bootstrap styles. In either case, you cannot use {{ wtf.quick_form(form) }} but would instead write out all the form code in your Flask template as you would in a normal HTML file. To take advantage of WTForms, you would still create the form class with FlaskForm in the same way as shown above.
An example is the demo Flask app Books Hopper, which includes four separate Bootstrap forms:
a login form
a registration form
a search form
a form for writing a book review and selecting a rating
Bootstrap 4 was used in all templates in the Books Hopper app, but Flask-Bootstrap was not.
Important
You are using Bootstrap 4 in Flask if you installed with pip install Flask-Bootstrap4. In early 2018, Bootstrap 4 replaced Bootstrap 3. The differences are significant.
Resources
Sending form data: how web browsers interact with servers; request/response