Quickly scrape web data without coding
Turn web pages into structured spreadsheets in just a few clicks
Extract Web Data in 3 Steps
Feb 12, 2021
Web scraping can be done by people with varying degrees of experience and knowledge. Whether you're a developer wanting to perform large-scale data extraction on many websites or a growth hacker wanting to extract email addresses from directory websites, there are many options!
Point, click and extract. No coding needed at all!
- FMiner is a visual web data extraction tool for web scraping and web screen scraping. Its intuitive user interface lets you quickly harness the software's powerful data mining engine to extract data from websites. In addition to the basic web scraping features, it also offers AJAX/JavaScript processing and CAPTCHA solving.
Enter the website URL you'd like to extract data from
Click on the target data to extract
Run the extraction and get data
Advanced Web Scraping Features
Everything you need to automate your web scraping
- Easy to Use: Scrape all data with simple point and click. No coding needed.
- Deal With All Websites: Scrape websites with infinite scrolling, login, drop-downs, AJAX...
- Download Results: Download scraped data as CSV, Excel, or via API, or save it to databases.
- Cloud Services: Scrape and access data on the Octoparse Cloud Platform 24/7.
- Schedule Scraping: Schedule tasks to scrape at any specific time: hourly, daily, weekly...
- IP Rotation: Automatic IP rotation prevents your IP from being blocked.
What We Can Do
Easily Build Web Crawlers
Point-and-Click Interface - Anyone who knows how to browse can scrape. No coding needed.
Scrape data from any dynamic website - Infinite scrolling, dropdowns, log-in authentication, AJAX...
Scrape unlimited pages - Crawl and scrape from unlimited webpages for free.
Octoparse Cloud Service
Cloud Platform - Execute multiple concurrent extractions 24/7 with faster scraping speed.
Schedule Scraping - Schedule to extract data in the Cloud any time at any frequency.
Automatic IP Rotation - Anonymous scraping minimizes the chances of being traced and blocked.
Professional Data Services
We provide professional data scraping services. Tell us what you need, and our data team will meet with you to discuss your web crawling and data processing requirements. Save money and time by hiring web scraping experts.
Trusted by
- It is very easy to use even if you don't have any experience with website scraping. It can do a lot for you. Octoparse has enabled me to ingest a large number of data points and focus my time on statistical analysis versus data extraction.
- Octoparse is an extremely powerful data extraction tool that has optimized and pushed our data scraping efforts to the next level. I would recommend this service to anyone. The price for the value provides a large return on the investment.
- For the free version, which works great, you can run at least 10 scraping tasks at a time.
'Come on, I worked so hard on this project! And this is publicly accessible data! There's certainly a way around this, right? Or else, I did all of this for nothing... Sigh...'
Yep - this is what I said to myself, just after realizing that my ambitious data analysis project could get me into hot water. I intended to deploy a large-scale web crawler to collect data from multiple high profile websites. And then I was planning to publish the results of my analysis for the benefit of everybody. Pretty noble, right? Yes, but also pretty risky.
Interestingly, I've been seeing more and more projects like mine lately. And even more tutorials encouraging some form of web scraping or crawling. But what troubles me is the appalling, widespread ignorance of the legal aspects of it.
So this is what this post is all about - understanding the possible consequences of web scraping and crawling. Hopefully, this will help you to avoid any potential problem.
Disclaimer: I'm not a lawyer. I'm simply a programmer who happens to be interested in this topic. You should seek out appropriate professional advice regarding your specific situation.
What are web scraping and crawling?
Let's first define these terms to make sure that we're on the same page.
- Web scraping: the act of automatically downloading a web page's data and extracting very specific information from it. The extracted information can be stored pretty much anywhere (database, file, etc.).
- Web crawling: the act of automatically downloading a web page's data, extracting the hyperlinks it contains and following them. The downloaded data is generally stored in an index or a database to make it easily searchable.
For example, you may use a web scraper to extract weather forecast data from the National Weather Service. This would allow you to further analyze it.
In contrast, you may use a web crawler to download data from a broad range of websites and build a search engine. Maybe you've already heard of Googlebot, Google's own web crawler.
So web scrapers and crawlers are generally used for entirely different purposes.
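To make the distinction concrete, here is a minimal, hypothetical sketch in Python of both activities. It assumes the requests and BeautifulSoup libraries, and the URL and CSS selector are placeholders rather than a real National Weather Service endpoint: the scraper extracts specific fields from a single page, while the crawler collects hyperlinks and follows them.

```python
# Minimal sketch: 'https://example.com/forecast' and the '.period-name'
# selector are placeholders, not a real National Weather Service endpoint.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

def scrape_forecast(url):
    """Scraping: download one page and pull out specific pieces of data."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [node.get_text(strip=True) for node in soup.select(".period-name")]

def crawl(start_url, max_pages=10):
    """Crawling: download pages, extract their hyperlinks, and follow them."""
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        for link in soup.find_all("a", href=True):
            queue.append(urljoin(url, link["href"]))
    return seen  # in a real crawler this would feed an index or database
```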
Why is web scraping often seen negatively?
The reputation of web scraping has gotten a lot worse in the past few years, and for good reasons:
- It's increasingly being used for business purposes to gain a competitive advantage. So there's often a financial motive behind it.
- It's often done in complete disregard of copyright laws and of Terms of Service (ToS).
- It's often done in abusive ways. For example, web scrapers might send many more requests per second than a human would, causing an unexpected load on websites. They might also choose to stay anonymous and not identify themselves. Finally, they might perform prohibited operations on websites, like circumventing security measures put in place to prevent the automated download of data that would otherwise be inaccessible.
Tons of individuals and companies are running their own web scrapers right now. So much so that this has been causing headaches for companies whose websites are scraped, like social networks (e.g. Facebook, LinkedIn) and online stores (e.g. Amazon). This is probably why Facebook has separate terms for automated data collection.
In contrast, web crawling has historically been used by the well-known search engines (e.g. Google, Bing, etc.) to download and index the web. These companies have built a good reputation over the years, because they've built indispensable tools that add value to the websites they crawl. So web crawling is generally seen more favorably, although it may sometimes be used in abusive ways as well.
So is it legal or illegal?
Web scraping and crawling aren't illegal by themselves. After all, you could scrape or crawl your own website, without a hitch.
The problem arises when you scrape or crawl the website of somebody else, without obtaining their prior written permission, or in disregard of their Terms of Service (ToS). You're essentially putting yourself in a vulnerable position.
Just think about it; you're using the bandwidth of somebody else, and you're freely retrieving and using their data. It's reasonable to think that they might not like it, because what you're doing might hurt them in some way. So depending on many factors (and what mood they're in), they're perfectly free to pursue legal action against you.
I know what you may be thinking. 'Come on! This is ridiculous! Why would they sue me?'. Sure, they might just ignore you. Or they might simply use technical measures to block you. Or they might just send you a cease and desist letter. But technically, there's nothing that prevents them from suing you. This is the real problem.
Need proof? In Linkedin v. Doe Defendants, Linkedin is suing between 1 and 100 people who anonymously scraped their website. And for what reasons are they suing those people? Let's see:
- Violation of the Computer Fraud and Abuse Act (CFAA).
- Violation of California Penal Code.
- Violation of the Digital Millennium Copyright Act (DMCA).
- Breach of contract.
- Trespass.
- Misappropriation.
That lawsuit is pretty concerning, because it's really not clear what will happen to those 'anonymous' people.
Consider that if you ever get sued, you can't simply dismiss it. You need to defend yourself, and prove that you did nothing wrong. This has nothing to do with whether or not it's fair, or whether or not what you did is really illegal.
Another problem is that law isn't like anything you're probably used to. Because where you use logic, common sense and your technical expertise, they'll use legal jargon and some grey areas of law to prove that you did something wrong. This isn't a level playing field. And it certainly isn't a good situation to be in. So you'll need to get a lawyer, and this might cost you a lot of money.
Besides, based on the above lawsuit by LinkedIn, you can see that cases can undoubtedly become quite complex and very broad in scope, even though you 'just scraped a website'.
The typical counterarguments brought by people
I found that people generally try to defend their web scraping or crawling activities by downplaying their importance. And they do so typically by using the same arguments over and over again.
So let's review the most common ones:
'I can do whatever I want with publicly accessible data.'
False. The problem is that the 'creative arrangement' of data can be copyrighted, as described on cendi.gov:
Facts cannot be copyrighted. However, the creative selection, coordination and arrangement of information and materials forming a database or compilation may be protected by copyright. Note, however, that the copyright protection only extends to the creative aspect, not to the facts contained in the database or compilation.
So a website - including its pages, design, layout and database - can be copyrighted, because it's considered a creative work. And if you scrape that website to extract data from it, the simple fact of copying a web page into memory with your web scraper might be considered a copyright violation.
In the United States, copyrighted work is protected by the Digital Millennium Copyright Act (DMCA).
'This is fair use!'
This is a grey area:
- In Kelly v. Arriba Soft Corp., the court found that the image search engine Ditto.com made fair use of a professional photographer's pictures by displaying thumbnails of them.
- In Associated Press v. Meltwater U.S. Holdings, Inc., the court found that Meltwater's news aggregator service didn't make fair use of Associated Press' articles, even though scraped articles were only displayed as excerpts of the originals.
'It's the same as what my browser already does! Scraping a site is not technically different from using a web browser. I could gather data manually, anyway!'
False. Terms of Service (ToS) often contain clauses that prohibit crawling/scraping/harvesting and automated uses of their associated services. You're legally bound by those terms; it doesn't matter that you could get that data manually.
'The worst that might happen if I break their Terms of Service is that I might get banned or blocked.'
This is a grey area:
- In Facebook v. Pete Warden, Facebook's attorney threatened to sue Mr. Warden if he published his dataset comprising hundreds of millions of scraped Facebook profiles.
- In Linkedin Corporation v. Michael George Keating, Linkedin blocked Mr. Keating from accessing Linkedin because he had created a tool that they thought was made to scrape their website. They were wrong, and yet he has never been able to restore his account. Fortunately, this case didn't go further.
- In LinkedIn Corporation v. Robocog Inc, Robocog Inc. (a.k.a. HiringSolved) was ordered to pay $40,000 to Linkedin for its unauthorized scraping of the site.
'This is completely unfair! Google has been crawling/scraping the whole web since forever!'
True. But law has apparently nothing to do with fairness. It's based on rules, interpreted by people.
'If I ever get sued, I'll Good-Will-Hunting my way into defending myself.'
Good luck! Unless you know law and legal jargon extensively. Personally, I don't.
'But I used an automated script, so I didn't enter into any contract with the website.'
This is a grey area:
- In Internet Archive v. Suzanne Shell, Internet Archive was found guilty of breach of contract while copying and archiving pages from Mrs. Shell's website using its web crawlers. On her website, Mrs. Shell displays a warning stating that as soon as you copy content from her website, you enter into a contract, and you owe her US$5,000 per page copied (!!!). The two parties apparently reached an amicable resolution.
- In Southwest Airlines Co. v. BoardFirst, LLC, BoardFirst was found guilty of violating a browsewrap contract displayed on Southwest Airlines' website. BoardFirst had created a tool that automatically downloaded the boarding passes of Southwest's customers to offer them better seats.
'Terms of Service (ToS) are not enforceable anyway. They have no legal value.'
False. The Bingham McCutchen LLP law firm published a pretty extensive article on this matter, and they state that:
As is the general rule with any contract, a website's terms of use will generally be deemed enforceable if mutually agreed to by the parties. [...] Regardless of whether a website's terms of use are clickwrap or browsewrap, the defendant's failure to read those terms is generally found irrelevant to the enforceability of its terms. One court disregarded arguments that awareness of a website's terms of use could not be imputed to a party who accessed that website using a web crawling or scraping tool that is unable to detect, let alone agree, to such terms. Similarly, one court imputed knowledge of a website's terms of use to a defendant who had repeatedly accessed that website using such tools. Nevertheless, these cases are, again, intensely factually driven, and courts have also declined to enforce terms of use where a plaintiff has failed to sufficiently establish that the defendant knew or should have known of those terms (e.g., because the terms are inconspicuous), even where the defendant repeatedly accessed a website using web crawling and scraping tools.
In other words, Terms of Service (ToS) will be legally enforced depending on the court, and if there's sufficient proof that you were aware of them.
'I respected their robots.txt and I crawled at a reasonable speed, so I can't possibly get into trouble, right?'
This is a grey area.
robots.txt is recognized as a 'technological tool to deter unwanted crawling or scraping'. But whether or not you respect it, you're still bound by the Terms of Service (ToS).
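For what it's worth, checking robots.txt takes only a few lines with Python's standard library (the domain and bot name below are placeholders), but again, doing so doesn't release you from the ToS:

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder domain
rp.read()

# Is this bot allowed to fetch that page, and is a Crawl-delay requested?
print(rp.can_fetch("MY-BOT", "https://example.com/some/page"))
print(rp.crawl_delay("MY-BOT"))  # None if no Crawl-delay directive is set
```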
'Okay, but this is for personal use. For my personal research only. I won't re-publish it, or publish any derivative dataset, or even sell it. So I'm good to go, right?'
This is a grey area. Terms of Service (ToS) often prohibit automatic data collection, for any purpose.
According to the Bingham McCutchen LLP law firm:
The terms of use for websites frequently include clauses prohibiting access or use of the website by web crawlers, scrapers or other robots, including for purposes of data collection. Courts have recognized causes of action for breaches of contract based on the use of web crawling or scraping tools in violation of such provisions.
'But the website has no robots.txt. So I can do what I want, right?'
False. You're still bound by the Terms of Service (ToS), and the content is copyrighted.
General advice for your scraping or crawling projects
Based on the above, you can certainly guess that you should be extra cautious with web scraping and crawling.
Here are a few pieces of advice:
- Use an API if one is provided, instead of scraping data.
- Respect the Terms of Service (ToS).
- Respect the rules of robots.txt.
- Use a reasonable crawl rate, i.e. don't bombard the site with requests. Respect the crawl-delay setting provided in robots.txt; if there's none, use a conservative crawl rate (e.g. 1 request per 10-15 seconds).
- Identify your web scraper or crawler with a legitimate user agent string. Create a page that explains what you're doing and why, and link back to that page in your user agent string (e.g. 'MY-BOT (+https://yoursite.com/mybot.html)'). A sketch combining these two points appears after this list.
- If the ToS or robots.txt prevent you from crawling or scraping, ask the site owner for written permission before doing anything else.
- Don't republish your crawled or scraped data, or any derivative dataset, without verifying the license of the data or obtaining written permission from the copyright holder.
- If you have doubts about the legality of what you're doing, don't do it, or seek the advice of a lawyer.
- Don't base your whole business on data scraping. The website(s) that you scrape may eventually block you, just like what happened in Craigslist Inc. v. 3Taps Inc.
- Finally, you should be suspicious of any advice that you find on the internet (including mine), so please consult a lawyer.
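As a purely illustrative sketch (not legal cover), the advice about robots.txt, crawl rate, and user agent strings translates into very little code. The bot name, contact URL, and target site below are placeholders:

```python
import time
import requests
from urllib import robotparser

# Placeholder bot name and contact page; use your own.
USER_AGENT = "MY-BOT (+https://yoursite.com/mybot.html)"
FALLBACK_DELAY = 15  # seconds between requests when no Crawl-delay is given

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()
delay = rp.crawl_delay(USER_AGENT) or FALLBACK_DELAY

def polite_get(url):
    """Fetch a URL only if robots.txt allows it, then wait before the next one."""
    if not rp.can_fetch(USER_AGENT, url):
        return None  # disallowed by robots.txt: skip the page
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    time.sleep(delay)  # throttle so the site never sees a burst of requests
    return response
```

None of this makes scraping legal on its own; it simply keeps your crawler identifiable and gentle, which is the least a site owner can expect.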
Remember that companies and individuals are perfectly free to sue you, for whatever reasons they want. This is most likely not the first step that they'll take. But if you scrape/crawl their website without permission and you do something that they don't like, you definitely put yourself in a vulnerable position.
Conclusion
As we've seen in this post, web scraping and crawling aren't illegal by themselves. They might become problematic when you play on somebody else's turf, on your own terms, without obtaining their prior permission. The same is true in real life as well, when you think about it.
There are a lot of grey areas in law around this topic, so the outcome is pretty unpredictable. Before getting into trouble, make sure that what you're doing respects the rules.
And finally, the relevant question isn't 'Is this legal?'. Instead, you should ask yourself 'Am I doing something that might upset someone? And am I willing to take the (financial) risk of their response?'.
So I hope that you appreciated my post! Feel free to leave a comment in the comment section below!
Update (24/04/2017): this post was featured on Reddit and Lobsters. It was also featured in the Programming Digest newsletter. If you get a chance to subscribe to it, you won't be disappointed! Thanks to everyone for your support and your great feedback!