What is Content Scraping and How to Prevent It
Content creation often demands a considerable investment of working hours and other resources. The better the content, the more planning, research, and editing has to go into making it. We have tools that help us, sure. We can write better blog posts thanks to them, and even automate content distribution to channels such as Facebook. But we still pour resources into content creation for a straightforward reason: content matters a lot.
Unfortunately, content also matters to people who don’t want to deal with content creation. If you’ve ever searched the internet for a keyword you used for an article only to find a carbon copy of your article on another website, you’re keenly aware of it. You’re also a victim of content scraping.
In this article, we’ll show you what content scraping is, how it can hurt your website, how to spot it, and what you can do about it.
Content scraping is the practice of taking the content of one page, post, or a whole website and publishing it on another website. Scrapers rarely credit you for the content, although even if they did, it wouldn’t change anything. They don’t ask for your permission, either. Content scraping is often little more than simple theft, a plagiarism technique enacted against your property.
Content scraping happens because the content you create has value. You can use it to drive traffic, promote affiliate links, build a mailing list, get better search engine rankings, demonstrate thought leadership in your industry, and do so much more. But it’s because your content can help you achieve those goals that it becomes a target for content scraping. The scrapers want the same things you do, and they’re willing to let you do the heavy lifting for them.
Of course, you’re a better target for scraping if you’re likely to produce high-quality content. If you’re a publisher of high-quality material, like a blog or a business that keeps the production value of the content high, you’re a good target. But so are e-commerce websites, job portals, and review sites. If it’s content and it’s good enough to attract an audience, it will attract scrapers, too.
Not only is content scraping stealing and, therefore, wrong, but it might also hurt your website in more ways than one. For example, the content on your website will have to compete with the same content on another website, making it harder for your website to rank. Plus, Google isn’t too fond of plagiarism, and it might be your website that’s left holding the bag.
Then there’s the fact that traffic from scrapers isn’t real traffic. Scraping can create fake page views, influencing all the metrics that are calculated using page views. There’s nothing like phony traffic to throw a wrench into your website’s analytics.
But you shouldn’t forget the most practical and immediate implication of content scraping — the congestion it can cause. Scrapers can send numerous requests in a small timeframe, and they can download lots of images at the same time, slowing your website down to a crawl. And you can guess how willing the average website visitor is to browse a site that takes ages to load.
Scrapers come in various degrees of sophistication. On the lower end, a scraping attack is nothing more than a person who goes through your website page by page and copies and pastes the content onto their own site. It really can be as simple as that.
A more sophisticated attack would involve the use of a bot, script, scraper, or parser. These can do anything from sending tons of search queries to your website and extracting result links and titles, to opening your pages and taking screenshots of them. In these cases, scraping is automated.
Some businesses offer scraping as a service. You might expect someone who’s getting paid to take content from your website to put some effort into it, and maybe even employ techniques and tools that are not publicly available.
Whatever tool or technique they’re using to scrape your content, you should find out you’ve been a target of a scraper as soon as possible. There’s no single thing that can tell you your content has been scraped, so you’ll need to be vigilant for the signs that something is wrong.
You can monitor for possible content scraping by:
Setting up Google Alerts for your post titles. It works best if you don’t post too often.
Doing a Google search for pieces of your content. It’s the manual version of using Google Alerts.
Watching for abnormal website traffic and behavior. Look for lots of pageviews from a single IP address in a short amount of time and a high volume of searches from a single visitor.
Adding internal links. Then watch for trackbacks and links to your site in Google Search Console (formerly Webmaster Tools).
Whatever jumps out as irregular to you in your website’s logs can be a reason to do an exact match search for your content. Knowing the way your content was scraped can help you choose the type of prevention and protection you should employ.
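To make the "lots of pageviews from a single IP in a short amount of time" check concrete, here is a minimal Python sketch. It assumes your server writes access logs in the Common Log Format and that the lines are in chronological order. The function name, the thresholds, and the sample addresses (taken from the documentation-only IP ranges) are all illustrative, not recommendations.

```python
import re
from collections import defaultdict, deque
from datetime import datetime, timedelta

# Matches the start of a Common Log Format line: IP, identity, user, [timestamp].
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')


def find_bursty_ips(lines, max_hits=100, window=timedelta(minutes=1)):
    """Return {ip: peak_hits} for every IP that made more than max_hits
    requests inside any sliding window of the given length.

    Assumes the log lines are in chronological order.
    """
    recent = defaultdict(deque)  # ip -> timestamps still inside the window
    peaks = {}
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines that are not in the expected format
        ip, raw_time = match.groups()
        when = datetime.strptime(raw_time, "%d/%b/%Y:%H:%M:%S %z")
        hits = recent[ip]
        hits.append(when)
        # Drop requests that have fallen out of the window.
        while when - hits[0] >= window:
            hits.popleft()
        if len(hits) > max_hits:
            peaks[ip] = max(peaks.get(ip, 0), len(hits))
    return peaks


# Hypothetical sample: one address requests five pages in five seconds.
sample = [
    f'203.0.113.5 - - [10/Oct/2023:13:55:{i:02d} +0000] "GET /post-{i} HTTP/1.1" 200 512'
    for i in range(5)
] + ['198.51.100.7 - - [10/Oct/2023:13:55:10 +0000] "GET / HTTP/1.1" 200 512']

suspects = find_bursty_ips(sample, max_hits=3, window=timedelta(seconds=10))
print(suspects)
```

An address that shows up here is only a lead, not proof. Search engine crawlers and busy office networks can produce similar bursts, so check who owns the address before you act on it.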
The good news is that there are many ways that you can prevent content scraping on your website. The bad news is that the ones you can do without plugins or a third-party service tend to be tedious. And both can turn real users off.
Here’s an example. One way you can make it more difficult for content scrapers to take your content is by putting it behind a wall. You can easily enable registration on your website. With the use of plugins, you can add email confirmation and Captcha when registering and logging in. That will make it more difficult for scrapers to get to your content. But it will do the same for website visitors.
Some of the popular methods you can employ to deter content scrapers without driving away website visitors include:
Disabling right-clicking and keyboard commands.
Rate limiting to allow only a certain number of actions in a specified timeframe.
Changing your website HTML frequently.
Turning text into images.
Blocking by IP address, range, or browser ID.
Using cookies to identify and block scrapers.
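As an illustration of the rate limiting item in the list above, the following Python sketch allows each client a fixed number of requests per time window and refuses the rest. It is a simplified, in-memory example under stated assumptions: the class name and the idea of keying clients by IP address are hypothetical choices, and a real website would normally enforce this in the web server, a plugin, or a service in front of the site.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests, window_seconds, clock=time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without waiting
        self._hits = defaultdict(deque)  # client id -> times of accepted requests

    def allow(self, client_id):
        """Return True if the request may proceed, False if it should be refused."""
        now = self.clock()
        hits = self._hits[client_id]
        # Forget requests that are older than the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True


# Hypothetical usage: at most 2 requests per 10 seconds for each IP address.
limiter = SlidingWindowLimiter(max_requests=2, window_seconds=10)
for attempt in range(3):
    print(attempt, limiter.allow("203.0.113.5"))
```

A limit that is too strict will also slow down real visitors who browse quickly, so start with a generous threshold and tighten it only if scraping continues.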
For some, if not all, of these methods, you’ll need to know more than a few things about website administration or coding. To block by IP address, you’d need to track the appropriate addresses in your log files, and then block them in .htaccess. It might not seem hard to do, but it can take a while.
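Once you have a list of offending addresses, writing the block rules by hand is the tedious part. The Python sketch below generates an .htaccess fragment from such a list. It assumes your server runs Apache 2.4 or later, which uses the Require directive, and that your host allows authorization rules in .htaccess. The function name and the sample addresses are hypothetical.

```python
import ipaddress


def htaccess_block_rules(addresses):
    """Build an Apache 2.4 .htaccess fragment that denies the given
    IP addresses or CIDR ranges and lets everyone else through.
    """
    # Validate and de-duplicate the input; invalid entries raise ValueError.
    networks = sorted(
        {ipaddress.ip_network(a, strict=False) for a in addresses},
        key=lambda n: (n.version, n.network_address, n.prefixlen),
    )
    lines = ["<RequireAll>", "    Require all granted"]
    for net in networks:
        # Write single hosts as a plain address rather than as /32 or /128.
        target = str(net.network_address) if net.num_addresses == 1 else str(net)
        lines.append(f"    Require not ip {target}")
    lines.append("</RequireAll>")
    return "\n".join(lines)


# Hypothetical addresses from the documentation-only ranges.
print(htaccess_block_rules(["203.0.113.5", "198.51.100.0/24"]))
```

Review the output before pasting it into .htaccess, and keep a backup of the file. A mistake in a range can lock out legitimate visitors, or you.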
Using plugins and third-party services to perform protective and preventive actions is a course you could take. You can disable right-clicking, for example, using the WP Content Copy Protection & No Right Click plugin. Turning text into SVG images is one of the techniques SiteGuarding uses for those who subscribe to their services. Cloudflare is a popular service that offers rate limiting. These plugins and services might cost money, but they’ll save you lots of time.
If you want to spend neither time nor money to fight against content scraping, you don’t have to. It might happen that you’re not suffering any damage from the scrapers’ actions. You might even be able to use the scrapers’ activity to your advantage.
For example, you could add lots of internal links to your content. They will all be pointing back to your website once the scrapers publish the scraped content. You can also include your affiliate links in the content. Finally, you can edit the RSS Footer using a plugin to add a banner or a notice about the original content creator and a link to your website.
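The attribution notice is simple enough to sketch in a few lines. The Python example below shows the general idea of appending a linked notice to the HTML of a feed item before it is published. It is not how any particular RSS footer plugin works; the function name, the wording of the notice, and the sample values are all assumptions.

```python
from html import escape


def add_attribution(item_html, title, url, site_name):
    """Append a notice naming the original source to a feed item's HTML."""
    notice = (
        f'<p>The post <a href="{escape(url, quote=True)}">{escape(title)}</a> '
        f"first appeared on {escape(site_name)}.</p>"
    )
    return item_html + "\n" + notice


# Hypothetical feed item.
print(add_attribution(
    "<p>Body of the post.</p>",
    title="Tips & Tricks",
    url="https://example.com/tips",
    site_name="Example Blog",
))
```

Scrapers that republish your feed automatically tend to copy the notice along with everything else, which gives readers and search engines a link back to the original.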
If you’re up for the fight, going after content scrapers legally is an option, too. The easiest way to do this would be to send a DMCA (Digital Millennium Copyright Act) notice to the site’s web host. Just use a WHOIS service like who.is to find out the host, then look at their website for a DMCA notice email address — many web hosts have it. You can easily find DMCA notice templates and generators online to help you create your notice.
Let’s Wrap It Up!
Your content is an incredible asset that can propel your website to achieve any goal you set for it. But it can also be a magnet for people who’d like to have all of that without doing the content creation. If you catch their eye, they might try to scrape the content from your website.
You’ll have several ways to deal with them. You can try to defeat them with nothing but your wit and good old elbow grease. There’s also the option to get plugins and third-party services to do that work for you. You can send them a legal notice, hoping that will be enough to have them take down your content from their website. And, of course, you can simply do nothing and put that time and those resources into making more excellent content. The choice is yours.