Effects of Duplicate Content on SEO and Google Rankings
Duplicate content affects the performance of millions of websites, and because Google frowns on it, it is worth some attention. In this post, you will learn what duplicate content is and how you can fix it in specific cases.
1. What Is Duplicate Content
Duplicate content is simply content on your website that has already been written by someone, somewhere else. So, if you copy a piece of content off one website, paste it and then publish it on your website, you have duplicate content. Google frowns on it because it can represent an attempt to manipulate the search engines.
Duplicate content is one major factor that affects website quality. It has many sides and can be caused by many things: technical difficulties, unintentional mistakes or a deliberate action. Before we move on to greater details, let us first understand what content duplication actually is.
On the web, duplicate content is when the same (or very similar) content is found on at least two different URLs.
One key thing to keep in mind here is that the content must already be on pages Google has indexed. Even if the content appears on another website, if Google doesn’t have the original version of the copied content in its index, it can’t really consider it duplicate content, even though it is!
Another takeaway here is that old content which is converted with software, turning the pages and images into text, and then reused may not be right from a copyright point of view, but it should pass Google’s duplication filters. This isn’t encouraged, but Google is not likely to see it as duplicate content.
I would actually recommend that publications moving from print to digital repurpose the old content from their magazines on their websites.
We know Google loves quality content on websites. So, if evergreen content is not yet indexed and is still new and original, you can publish it without incurring any penalty for duplicate content.
Furthermore, you can take existing content, enrich it and turn it into something like ‘How people used to optimize websites in the 90s’. You can keep the content identical this way (although a small original introduction might be required) and it will still pass the duplicate content test. The discussion on what amounts to duplicate content is big and ongoing.
The other big issue is the solution to duplicate content. It is important to understand that there isn’t one single solution for fixing duplicate content issues, because content duplication shows up in many different scenarios. So there are multiple solutions, and one of them might be better than another in a given case. We will look further into these problems and solutions and hope to answer most of the questions you will have on content duplication.
However, we must first get some other things clear to better understand the nature of duplicate content. Then we will analyze different scenarios and give solutions for each and every one of them.
2. How Google Handles Duplicate Content
There’s a lot of content out there in the world. Compared to that, Google knows only about a small part of it. To be able to truly say if the content on your site has been copied, Google would have to know every piece of paper that has ever been written, which is impossible.
When you publish something on your website, it takes a while for Google to crawl and index it. If your site is popular and you publish content often, Google will crawl it more often. This means it can index the content sooner.
If you publish rarely, Google will probably not crawl your site so often and it might not index the content very quickly. Once a piece of content is indexed, Google can then relate other content to it to see if it’s duplicate or not.
The date of indexing is a good reference for which version of the content was the original.
So what happens when Google identifies a piece of content as duplicate? Well, it has 2 choices:
- Display it: Yes, Google might choose to display duplicate content in its search results if it finds it to be actually relevant to a user. A good example might be news publications making the same statements over and over again when something happens.
- Don’t display it: Google throws your content into something often called Google Omitted Results. If you SPAM the web all the time, it might even consider not indexing your site anymore.
3. The Myth of the Duplicate Content Penalty
As we mentioned earlier, Google does not punish sites for duplicate content. However, because Google doesn’t like duplicate content, people simply assume that it’s a bad practice which Google punishes with a penalty!
Despite popular belief, and although content duplication does cause issues, there’s no such thing as a duplicate content penalty!
This seems to contradict Google’s official page on duplicate content in the webmaster guidelines, which states that:
“In the rare cases in which Google perceives that duplicate content may be shown with intent to manipulate our rankings and deceive our users, we’ll also make appropriate adjustments in the indexing and ranking of the sites involved. As a result, the ranking of the site may suffer, or the site might be removed entirely from the Google index, in which case it will no longer appear in search results.” – Google
So while there is no clear duplicate content penalty, if you ‘try to manipulate the search results’ you might end up losing rankings or even getting deindexed. However, duplicate content isn’t something that you should avoid just because Google might hit you in the head. Google actually won’t hit you just because you have duplicate content.
Google responds differently to those who use content scrapers, deliberately steal content and try to get it ranked, or use mass content syndication only for links. That is not only about content duplication but actually about stealing content and filling the internet up with content that is not really valuable.
The fact is that there’s just so much non-deliberate duplicate content out there which makes it even harder for Google to detect the evil-doers with a 100% success rate. But even though Google won’t penalize you, it doesn’t mean that duplicate content can’t affect your website in a negative way.
Talking about duplicate content penalties, here’s what is written in the Google Search Quality Evaluator Guidelines from March 2017:
The Lowest rating is appropriate if all or almost all of the MC (main content) on the page is copied with little or no time, effort, expertise, manual curation, or added value for users. Such pages should be rated Lowest, even if the page assigns credit for the content to another source.
In a video on the topic, Andrey Lipattsev, senior Google search quality strategist, said that a content duplication penalty doesn’t exist. Lipattsev also said that:
- Google rewards unique content and correlates it with added value;
- The duplicate content is filtered;
- Google wants to find new content and duplicates slow the search engine down;
- If you want Google to quickly discover your new content, you should send XML sitemaps;
- What the search engine wants us to do is to concentrate signals in canonical documents, and optimize those canonical pages so they are better for users;
- It is not duplicate content that is hurting your ranking, but the lack of unique content.
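The sitemap point above can be illustrated with a short Python sketch. It is only a minimal example, assuming you already have a list of your canonical URLs; the function name build_sitemap and the yourdomain.com URLs are placeholders, and real sitemaps often carry extra fields that are left out here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string that lists each URL once, in order."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    seen = set()
    for url in urls:
        if url in seen:  # skip repeats so every URL appears only once
            continue
        seen.add(url)
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(urlset, encoding="unicode")


if __name__ == "__main__":
    print(build_sitemap([
        "https://yourdomain.com/",
        "https://yourdomain.com/account/register/",
    ]))
```

Listing only the canonical version of each page keeps the sitemap in line with the idea of concentrating signals in canonical documents.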
4. Why Google Dislikes Duplicate & Very Similar Content
Google is very concerned about the quality of its search results. For instance, when you search for something on Google, would you like to see the exact same thing 10 times? Of course not! You want different results, so that you may choose. You want different opinions, so that you can form your own.
Google wants to avoid SPAM and useless overload of its index and servers. It wants to serve its users the best content available.
As a general rule of thumb, Google tries to display only 1 version of the same content.
However, sometimes Google fails to do this, and multiple or very similar versions of the same pages get shown, many times even from the exact same website.
For example, an eCommerce website, XYZ, could generate pages dynamically from nearly all the searches that happen on its site. It can then have 3 top listings for keyword, keyword plural and keyword + preposition. All of these were searched internally on XYZ’s website, so the site automatically generated these pages and sent them to the index.
The titles and descriptions can be very similar and the content on those pages identical.
Normally, Google shouldn’t allow this to happen.
Although it may not be correct, it is possible that those results are actually the most relevant. But this doesn’t happen for every keyword out there.
Generally speaking, although this site could canonicalize these versions to a single page, it won’t be their fault that they get 3 top listings. It’s Google’s job to rank the pages, not theirs.
This is a classic example of a duplicate content issue. From a user’s perspective, this might not be a very good thing. Maybe the user wants to see other websites. Maybe they’ve had a bad experience with the site in the past.
Google is still trying to figure out ways to detect when this is an issue and when it is not. It’s not quite there, but it’s getting better and better.
5. How Much Copy/Paste Is Considered Duplicate Content?
In a video on the subject, Matt Cutts says that about 25-30% of the entire internet is made up of duplicate content. That figure might have changed in recent years, since the video is pretty old. Considering the expansion of the internet and the growing number of new websites (especially eCommerce ones, where content duplication is thriving), it has likely increased.
So what we get from the video is that not all duplicate content is bad. Sometimes people quote other people for a reason. They bring quality to their content by doing that and it isn’t something bad.
In essence, think about it like this:
Duplicate content is when content is identical or very similar to the original source.
Now of course, ‘very similar’ is open to interpretation. But that’s not the point. If you’re thinking about these numbers, then you’re obviously up to something bad. If you’ve contemplated deliberately copying/stealing some content to claim it as your own, then it’s duplicate content.
A popular type of duplicate content that is harmless is eCommerce product descriptions.
eCommerce site owners and editors are very good at simply copying and pasting product descriptions. This creates a ton of duplicate content, but users might still like to see it on the web because of different prices or service quality.
What ultimately sells a product though is its copy. So don’t just list a bunch of technical specifications. Write a story that sells.
Many eCommerce website owners are complaining that other websites are stealing their description content. As long as they don’t outrank you, that is still not a problem.
Another one is boilerplate content. Boilerplate content is content that repeats itself over and over again on multiple pages, such as the header, navigation, footer and sidebar content.
As long as you’re not trying to steal someone else’s content without their permission and claim it as your own, you’re mostly fine with using quotes or rewriting some phrases. However, if your page has 70-80% similarity and you only replace some verbs and subjects with synonyms… that’s not actually quoting.
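To get a rough feel for what a similarity percentage like the one above means, here is a minimal Python sketch. It compares two texts by their overlapping three-word sequences (word shingles) and computes the Jaccard similarity. This is only an illustration: the function names are made up for this example, and nothing here is claimed to be how Google measures similarity.

```python
def shingles(text, size=3):
    """Return the set of overlapping word sequences of length `size`."""
    words = text.lower().split()
    if len(words) < size:
        # Very short texts are treated as a single shingle (or none at all).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(a, b, size=3):
    """Share of shingles the two texts have in common, from 0.0 to 1.0."""
    sa, sb = shingles(a, size), shingles(b, size)
    if not sa or not sb:
        return 0.0  # nothing to compare
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    original = "the quick brown fox jumps"
    rewrite = "the quick brown fox sleeps"
    print(jaccard_similarity(original, rewrite))
```

Swapping a single word in a short sentence already leaves half of the shingles shared, which is why replacing a few verbs with synonyms does not make a page meaningfully different.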
Did You Know
Google Search Console no longer allows you to see your duplicate content issues. Some time ago, this was possible, but Google ‘let go’ of this old feature.
So how can you know if you have duplicate content issues?
You can use a Site Audit Tool for that. The tool automatically identifies any duplicate content issues. Therefore, you can quickly take a look at your duplicate pages, duplicate titles, descriptions, etc.
6. Duplicate Content Causes Problems
Some of the actual issues caused by duplicate content are that it burns up crawl budget (that especially happens to big sites) and that it dilutes link equity, because people will be linking to different pages which hold the same content.
6.1 It Burns Crawl Budget
This simply means that Google has to spend a lot of resources to crawl your website. This includes servers, personnel, internet and electricity bills and many other costs. Although Google’s resources seem unlimited (and probably are), the crawler does stop at some point if a website is very, very big.
If Google crawls your pages and keeps finding the same thing over and over again, it will ‘get bored’ and stop crawling your site.
This might leave important pages uncrawled, so new content or changes might be ignored. Make sure all of your most important pages are crawled and indexed by reducing the number of irrelevant pages your site is feeding to Google.
Since duplicate content is usually generated by dynamic URLs from search filters, it ends up being duplicated not once but many times, depending on how many filter combinations there are.
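To see how quickly filter combinations add up, here is a small Python sketch that enumerates every URL variant a set of optional filters can produce. The filter names and values are hypothetical examples, and the function facet_urls is made up for this illustration.

```python
from itertools import product
from urllib.parse import urlencode


def facet_urls(base_url, filters):
    """List every URL variant produced by combining optional filters."""
    # Each filter can be absent (None) or take one of its values.
    choices = [[None] + list(values) for values in filters.values()]
    urls = []
    for combo in product(*choices):
        params = [(name, value) for name, value in zip(filters, combo)
                  if value is not None]
        urls.append(base_url + ("?" + urlencode(params) if params else ""))
    return urls


if __name__ == "__main__":
    # Hypothetical filters for one category page.
    filters = {
        "order": ["asc", "desc"],
        "color": ["red", "black", "blue"],
        "price": ["500", "1000"],
    }
    urls = facet_urls("https://yourdomain.com/category/subcategory", filters)
    print(len(urls))  # (2 + 1) * (3 + 1) * (2 + 1) variants
```

With just three filters, a single category page already has 36 crawlable URLs, and that is before counting different parameter orders or pagination.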
6.2 Link signal dilution
When a site gets backlinks, they point to a specific URL on the website. The URL that is linked gets stronger and stronger the more links it gets.
If you have numerous versions of the same page and people can access all of them, different websites might link to different versions of that page.
While this is helpful for your domain overall, it might not be the best solution for your website or for specific, important pages that you want to rank high. We’ll later look at this problem, what causes it and how to fix it.
6.3 Non SEO Friendly URLs
Some filters might not produce Search Engine Friendly URLs. But Google recommends that you keep your URLs user friendly. Weird long URLs are associated with viruses, malware and scams.
Example of not friendly URL: https://yourdomain.com/category/default.html?uid=87YHG9347HG387H4G&action=register
Example of friendly URL: https://yourdomain.com/account/register/
Try to keep your URLs short and easy to read, so that they would help and not hurt your sites. For example, people will figure out what those filters mean if you say order=asc&price=500&color=red. But, unless you’re a very big and trustworthy brand, like Google, they won’t be so sure what’s happening if the URL parameter extension is ei=NgfZXLizAuqErwTM6JWIDA (that’s a Google search parameter suffix).
6.4 Bad user experience
Sometimes content duplication can result in a bad user experience, which will definitely hurt your website ranking. If there is one thing that can hurt or help website ranking, it is user experience. When visitors find that your content is exactly the same as what they have read previously, it could turn them away from your page, and that hurts.
Again, if you have been able to rank a page at the top of Google when it’s not actually relevant, users will notice that immediately and their behaviour will cause you to lose rankings. Quality content will keep you at the top in the search engines.
7. Internal Duplicate Content Issues
It’s finally time to list the most popular scenarios of how duplicate content gets created on the web. To check some SEO basics, let’s start with how it happens on websites internally, because it’s by far the most common issue.
7.1 HTTP / HTTPS & WWW / non-WWW
If you have an SSL certificate on your website, then there are two versions of your website. One with HTTP and one with HTTPS. The same thing applies for website versions with www and non www.
They might look very similar, but in Google’s eyes they’re different. First of all, they’re on different URLs. And since it’s the same content, it results in duplicate content. Second, one is secure and the other one is not. It’s a big difference regarding security.
If you’re planning to move your site to a secure URL, make sure to check an HTTP to HTTPS migration guide. There are also two more versions possible: one with www and one without.
It’s the same thing as above, whether they’re running on HTTP or HTTPS. Two separate URLs containing the same content. You might not see a big difference between those two, but www is actually a subdomain. You’re just so used to seeing them as the same thing because they display the same content and usually redirect to a single preferred version.
While Google might know how to display a single version on its result pages, it doesn’t always get the right one.
It’s a basic technical SEO check that every SEO should make, yet many forget to set a preferred version. Google can then end up displaying the HTTP version for some keywords and the HTTPS version of the same page for others.
So how can you fix this?
Solution: To solve this issue, make sure you’re redirecting all the other URL versions to your preferred version. This should be the case not only for the main domain but also for all the other pages. Each page on a non-preferred version should redirect to the same page on the preferred version.
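The redirect rule described above can be sketched in Python. This is a minimal illustration only, assuming the preferred version is https://www.yourdomain.com; the constants and the function redirect_target are hypothetical names, and in practice such redirects are usually configured in the web server or CMS rather than written by hand.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption for this sketch: HTTPS with www is the preferred version.
BARE_DOMAIN = "yourdomain.com"
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www." + BARE_DOMAIN


def redirect_target(url):
    """Return the preferred URL to 301 to, or None if no redirect is needed."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    bare = host[4:] if host.startswith("www.") else host
    if bare != BARE_DOMAIN:
        return None  # a different site; not ours to redirect
    target = urlunsplit((PREFERRED_SCHEME, PREFERRED_HOST,
                         parts.path or "/", parts.query, parts.fragment))
    return None if target == url else target


if __name__ == "__main__":
    for variant in (
        "http://yourdomain.com/account/register/",
        "http://www.yourdomain.com/account/register/",
        "https://yourdomain.com/account/register/",
        "https://www.yourdomain.com/account/register/",
    ):
        print(variant, "->", redirect_target(variant))
```

The path and query string are carried over unchanged, so every page on a non-preferred version lands on its own counterpart instead of the homepage.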
Just in case you’re wondering, a WWW version will help you in the long term if your site gets really big and you want to serve cookieless images from a subdomain. If you’re just building your website, use WWW. If you’re already on the root domain, leave it like that. The switch isn’t worth the hassle.
You can still set the preferred version from the old Google Search Console (GSC).
However, 301 redirects are mandatory. Without them, the link signals will be diluted between the 4 versions. Some people might link to you using one of the other variants. Without 301 redirects, you won’t take full advantage of those links and all your previous link building efforts will vanish.
Also, we’re not yet sure of the course the GSC is taking with its new iteration, so it’s unclear if this feature will still be available in the future.
7.2 Hierarchical Product URLs
Another very common issue that leads to duplicate content is using hierarchical product URLs. It typically shows up if you have an eCommerce store with many products and categories or a blog with many posts and categories.
On a hierarchical URL structure, the URL contains the category path followed by the product or article, so it would look something like yourdomain.com/category/subcategory/product.
At first glance, everything seems fine. The issue arises when you have the same product or article in multiple categories.
This is how to deal with this:
As long as you are 100% certain that your product/article won’t be in two different categories, you’re safe using hierarchical URLs.
For example, if you have a page called services and have multiple unique services with categories and subcategories, there’s no issue in having hierarchical URLs.
Solution: If you think your articles or products will be in multiple categories, then it’s better to separate post types and taxonomies with their own prefixes, such as a /product/ prefix for products and a /category/ prefix for categories.
Category pages can still remain hierarchical as long as a subcategory isn’t found in multiple root categories.
Another solution would be to specify a main category and then use canonical tags or 301 redirects to the main version, but this can still cause link signal dilution.
Warning: If you do plan on fixing this issue by changing your URL structure, make sure you set the proper 301 redirects! Each old duplicate version should 301 to the final and unique new one.
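As a rough illustration of the warning above, the following Python sketch builds a redirect map from old hierarchical product URLs to a single prefixed URL per product. It assumes the product slug is always the last path segment and uses a hypothetical /product/ prefix; the function names and example URLs are made up for this example.

```python
from urllib.parse import urlsplit


def flat_product_url(old_url, prefix="product"):
    """Map a hierarchical product URL to a single prefixed URL."""
    parts = urlsplit(old_url)
    segments = [s for s in parts.path.split("/") if s]
    if not segments:
        raise ValueError("URL has no product slug: " + old_url)
    slug = segments[-1]  # assumption: the slug is the last path segment
    return f"{parts.scheme}://{parts.netloc}/{prefix}/{slug}/"


def build_redirect_map(old_urls):
    """Return {old URL: new URL} pairs, each meant to be served as a 301."""
    return {old: flat_product_url(old) for old in old_urls}


if __name__ == "__main__":
    old = [
        "https://yourdomain.com/shoes/red-sneaker/",
        "https://yourdomain.com/sale/red-sneaker/",
    ]
    for source, target in build_redirect_map(old).items():
        print(source, "-> 301 ->", target)
```

Both category variants of the same product end up pointing at one URL, which is the outcome the 301 redirects are meant to guarantee.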
7.3 URL Variations (Parameters & Session IDs)
One of the most common causes of content duplication is URL variations. Parameters and URL extensions create multiple versions of the same content under different URLs.
They are especially popular on eCommerce websites, but can also be found on other types of sites, such as booking websites, rental services and blog category pages.
On an eCommerce store, if you have filters to sort items by ascending or descending price, you can get one of these two URLs: yourdomain.com/category/subcategory?order=asc or yourdomain.com/category/subcategory?order=desc.
These pages are called faceted pages. A facet is one side of an object with multiple sides. In the example above, the pages are very similar, but instead of being written A to Z they’re written Z to A.
Some people will link to the first variant, but others might link to the second, depending on which filter they were on last. And let’s not forget about the original version without any filters (yourdomain.com/category/subcategory). On top of that, these are only two filters, but there might be a lot more (reviews, relevancy, popularity, etc.).
This results in link signal dilution, making every version a little bit stronger instead of making a single version of that page really strong. Eventually, this will lead to fewer rankings overall.
You might want to argue that because of pagination, the pages will actually be completely different. That’s true if you have enough products in a category to fill multiple pages.
However, it could also be argued that the first page of “?order=desc” is a duplicate of the last page of yourdomain.com/category/subcategory?order=asc and vice versa. One of them is also a duplicate of the main version, unless the main version orders them randomly.
The good side is that Google doesn’t really care about pagination anymore.
Google still recommends using pagination the same way as you did before (either with parameters or subdirectories).
However, you should also make sure now that you properly interlink between these pages and that each page can ‘kind of’ stand on its own. Mihai Aperghis from Vertify asked John Mueller about this.
Just because parameters create duplicate content issues it doesn’t mean you should never index any pages that contain parameters.
Sometimes it’s a good idea to index faceted pages, if users are using those filters as keywords in their search queries.
For example, a filter which you should not index could be sorting by price. However, if your users search for “best second hand car under 3000” then filters with price might be relevant.
Another good example is color filters. If you don’t have a specific color scheme for a product but the filter exists, you don’t want to index that. However, if filtering by the color black completely changes the content of the page, then it might be a relevant page to index, especially if your users also use queries such as “black winter coats”.
Ian Lurie from Portent talks about fixing a HUGE duplicate content issue (links with parameters to contact pages on every page of the site). The solution was to use # instead of ? as an extension to the contact page URL. For some reason, Google completely ignores links with anchors.
However, in his article Ian mentions that he hasn’t even tried rel=canonical to fix the issue. While rel=canonical would probably not harm at all, in this case it might not have been helpful due to the scale of the issue.
Solution: The best solution here is to actually avoid creating duplicate content issues in the first place. Don’t add parameters when it’s not necessary and don’t add parameters when the pages don’t create a unique facet, at least to some extent.
If the deed is already done, the fix is to either use rel=canonical and canonicalize all the useless facets to the root of the URL or noindex those pages completely. Remember though that Google is the one to decide if it will take the ‘recommendation’ you give through robots.txt or noindex meta tags. This is also applicable to canonical tags, but from my experience, they work pretty well!
Remember to leave the important facets to be indexed (self referencing canonical), especially if they have searches. Make sure to also dynamically generate their titles.
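The canonicalization approach above can be sketched in Python. This is a minimal example under stated assumptions: the set INDEXABLE_PARAMS (here only a hypothetical color filter) marks the facets you want indexed, every other parameter is dropped from the canonical URL, and the function names are invented for this illustration.

```python
from html import escape
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption for this sketch: only the color facet deserves its own indexed page.
INDEXABLE_PARAMS = {"color"}


def canonical_url(url):
    """Drop every parameter that does not create an indexable facet."""
    parts = urlsplit(url)
    kept = sorted((k, v) for k, v in parse_qsl(parts.query)
                  if k in INDEXABLE_PARAMS)
    return urlunsplit((parts.scheme, parts.netloc, parts.path,
                       urlencode(kept), ""))


def canonical_tag(url):
    """Build the rel=canonical link tag for the page served at `url`."""
    return f'<link rel="canonical" href="{escape(canonical_url(url))}">'


if __name__ == "__main__":
    print(canonical_tag("https://yourdomain.com/category/subcategory?order=desc"))
    print(canonical_tag("https://yourdomain.com/category/subcategory?order=asc&color=black"))
```

Sorting pages canonicalize to the root of the URL, while the color facet gets a self-referencing canonical no matter which sort order the visitor picked.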
7.4 Bad Multilingual Implementation
Another issue that can result in content duplication is a bad hreflang implementation.
Most multilingual websites have a bad hreflang implementation. That’s because most plugins out there implement hreflang incorrectly.
When you have 2 languages and a page is translated to both languages, everything is fine. Each page has 2 hreflang tags pointing correctly to the other version. However, when a page is untranslated, the other language version points to the root of the other language, when it should not exist at all. This basically tells Google that the French language version of yourdomain.com/en/untranslated-page/ is yourdomain.com/fr/, which isn’t true.
However, it’s not the hreflang tag itself that causes duplicate content issues, but the links to these pages from the language selector.
The hreflang issue only confuses search engines into which page to display where. It doesn’t cause duplicate content issues. But while some plugins are smarter, others also create the pages and links to those other versions in the menu of the website. Now this is duplicate content.
Now you might think a fix is easy, but migrating from one plugin to another isn’t always the easiest thing to do. It takes a lot of time and effort to get it right.
Solution: The simple solution is to not create any links to untranslated variants. If you have 5 languages, a page which is translated to all 5 languages should include the other 4 links in the menu (under the flag drop down let’s say) and also have the appropriate hreflang tags implemented correctly.
However, if you have 5 languages but a particular page is only translated in 2 languages, the flags dropdown should only contain 1 link, to the other page (maybe 2 links to both pages, a self-referencing link isn’t really an issue). Also, only 2 hreflang tags should be present instead of all 5.
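The rule above, where only translated versions get an hreflang tag, can be sketched in a few lines of Python. The function hreflang_tags and the example URLs are hypothetical; the sketch simply assumes you can list, for each page, the languages it has really been translated into.

```python
from html import escape


def hreflang_tags(translations):
    """Build hreflang link tags from a {language code: URL} dict for one page.

    Only languages that actually have a translation belong in the dict.
    """
    return [
        f'<link rel="alternate" hreflang="{escape(lang)}" href="{escape(url)}">'
        for lang, url in sorted(translations.items())
    ]


if __name__ == "__main__":
    # The site supports 5 languages, but this page exists in only 2 of them.
    page = {
        "en": "https://yourdomain.com/en/some-page/",
        "fr": "https://yourdomain.com/fr/some-page/",
    }
    for tag in hreflang_tags(page):
        print(tag)
```

Because untranslated languages never enter the dict, no tag (and no language selector link built from the same data) can point at a page that does not exist.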
7.5 Indexed Landing Pages for Ads
The thing here is that landing pages built for ads are often very similar to your organic pages and offer the same content. Why similar and not identical? Well, there can be many reasons. Maybe you have different goals for those pages, or different rules from the advertising platform.
For example, you might only change an image because Adwords rules don’t let you use waist measuring tools when talking about weight loss products. However, when it comes to organic search, that’s not really an issue.
Solution: If your landing pages have been created specifically for ads and provide no SEO value, use a noindex meta tag on them. You can also try to canonicalize them to the very similar version that is actually targeted to organic search.
7.6 Boilerplate Content
Boilerplate content is the content that is found on multiple or every page of your site. Common examples are Headers, Navigation Menus, Footers and Sidebars. These are vital to a site’s functionality. We’re used to them and without them a site would be much harder to navigate.
However, it can sometimes cause duplicate content, for example when there is too little content. If you have only 30 words on 50 different pages, but the header, footer and sidebar have 250 words, then that’s about a 90% similarity. It’s mostly caused by the lack of content rather than the boilerplate.
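The arithmetic behind that figure is easy to check. The short Python sketch below uses a made-up helper, boilerplate_share, to compute how much of a page is boilerplate for a given word count.

```python
def boilerplate_share(unique_words, boilerplate_words):
    """Fraction of a page's words that come from boilerplate (0.0 to 1.0)."""
    total = unique_words + boilerplate_words
    if total == 0:
        return 0.0
    return boilerplate_words / total


if __name__ == "__main__":
    # 30 unique words next to 250 words of header, footer and sidebar.
    print(round(boilerplate_share(30, 250) * 100))   # thin page
    # The same boilerplate next to 500 unique words.
    print(round(boilerplate_share(500, 250) * 100))  # richer page
```

With 30 unique words the boilerplate makes up roughly 89% of the page, while 500 unique words bring it down to about a third, even though the header, footer and sidebar did not change.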
Solution: Try to keep your pages content rich and unique. If some facet pages from your filters list too few products, then the boilerplate content will be most of the content. In that case, you want to use the solution mentioned above in the URL Variations section.
8. External Duplicate Content (Plagiarism)
Duplicate content can also occur across domains. Again, Google doesn’t want to show its users the same thing 6 times, so it has to pick only one, most of the time the original article.
There are different scenarios where cross-domain content duplication occurs.
8.1 Someone steals your content
Generally, Google tries to reward the original creator of the content. However, sometimes it fails.
Google might not look at the publication date when trying to determine who was first, because that can be easily changed in the HTML. Instead, it figures out who published the content first by looking at when it indexed the first iteration of that content.
Sometimes a scraper copies the links as well, and then Google is able to figure out the original source if you do internal linking well. But often scrapers strip all links and sometimes even add links of their own.
Solution: When someone steals your content the best way to protect yourself is to have it indexed first. Get your pages indexed as soon as possible using the Google Search Console.
This might be tricky if you have a huge website. Another thing you can do is to try and block the scrapers from crawling you from within your server. However, they might be using different IPs each time.
8.2 You steal someone else’s content
Stealing content is not a content marketing strategy. In general, having only duplicate content on your website won’t give you great results with search engines. So using content scraping tools and automated blogs isn’t the way to go for SEO.
While Google considers that there’s no added value for its search engine and tries to reward the original creator whenever possible, we can’t say that a news scraping site is never useful. For example, a natural disaster warning might reach some people through that site and save some lives. You never know.
8.3 Content Curation
Content curation is the process of gathering information relevant to a particular topic or area of interest.
That's all we have to discuss on duplicate content. Missed something? Let us know in the comments.