“Can duplicate content hurt me?” This is a question that comes up in forums on almost a weekly basis. So what are some scenarios where duplicate content can exist, and what can web site owners do to make sure duplicate content doesn’t come back to haunt them?
First of all, some practical tips from panelists at a recent session dealing with duplicate content issues at Search Engine Strategies New York.
Anne Kennedy – Beyond Ink explained what duplicate content is and how it can exist. It can be multiple URLs with the same content, or identical home pages under different file names (index.htm, index.html, etc.). She went on to say that dynamic sites can inadvertently produce duplicate content. Why is it a problem? It confuses search engine bots as to which result to serve up to searchers.
Shari Thurow – Grandtastic Designs reveals some of the ways that search engines work to remedy duplicate content issues. One way is “content properties.” This is where engines look for unique content by removing “boilerplates” such as navigational areas, headers, etc. and then analyzing the “good stuff.” Next is “linkage properties.” The engines will look at inbound and outbound links to determine if the linkage properties are different for each site.
Jake Baillie – True Local talks about 6 duplicate content mistakes: 1.) circular navigation, 2.) print-friendly pages, 3.) inconsistent linking, 4.) product only pages, 5.) transparent serving and 6.) bad cloaking.
The solutions suggested were not to have duplicate content in the first place, but if you do, use robots.txt to exclude the duplicate content from crawling or use a 301 redirect to point the duped pages to the “real” ones. Kind of reminds me of the advice given to students in high school regarding sex: “Abstain, but if you must, use a condom.”
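As a quick illustration of the robots.txt approach, a minimal sketch like the following would keep compliant crawlers away from the duplicate versions (the /print/ and /pdf/ paths here are made-up placeholders; substitute whatever directories actually hold your duplicate pages):

    User-agent: *
    Disallow: /print/
    Disallow: /pdf/

Any spider that honors the robots exclusion standard will then skip those copies and stick to the primary HTML pages.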
Two search engine representatives were also on hand to give their analysis as well as field questions at the end.
Rajat Mukherjee of Yahoo basically stated that rather than looking for ways to demote content, they are trying to find the right content to promote, in other words, the original content.
Matt Cutts of Google says that honest webmasters often worry about duplicate content when they don’t need to. Google tries to return what it feels is the “best” version of a page. A lot of people ask about articles split into parts and about printable versions of the same content, and Matt says that Google will not penalize for this. Furthermore, he says not to worry about duplicate content across different top-level domains (such as “searchrank.com” and “searchrank.net”) or where you have a .com and a country-specific domain such as .co.uk.
So we know duplicate content exists and that search engines are trying to filter it out in one way or another. I have come across my own share of duplicate content issues while working with client sites. I find that Google works to filter out the duplicate content and display the best page.
That may not always be your site. Such would be the case where you authored an article but then had it republished on a major ezine. The ezine shows up first because it is considered more of an authority site than yours, even though the article may have been originally published on your own site.
While you cannot always control what Google decides to display, a webmaster can be assured that Google will not penalize for duplicate content unless it is some kind of extreme case (such as having 2500 mirror sites).
Yahoo on the other hand does penalize for duplicate content. I believe this is a part of their algorithm that was inherited with the integration of Inktomi technology into their own search engine.
That is why my advice is to always try to avoid duplicate content. If you have things like PDF files that are the same as the HTML versions, or “print friendly” pages that duplicate the HTML version, block access to them using the robots.txt file. If you are redesigning a site and your file structure is changing, set up 301 redirects from the old pages to the new ones. The same is true if you are re-branding and moving the site to a new domain: 301 redirect the old domain to the new one while removing the files at the old domain.
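For those wondering what the redirect piece actually looks like, here is a rough sketch assuming an Apache server and an .htaccess file (the file names and domains are hypothetical placeholders; other servers have their own equivalents):

    # Point a single renamed page at its new location
    Redirect 301 /old-page.html http://www.example.com/new-page.html

    # Or send everything on an old domain to the same path on the new one
    RewriteEngine On
    RewriteCond %{HTTP_HOST} ^old-domain\.com$ [NC]
    RewriteRule ^(.*)$ http://www.new-domain.com/$1 [R=301,L]

The 301 tells the engines the move is permanent, so over time the old URLs drop out of the index in favor of the new ones.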
Do not spread duplicate content across multiple domains. I see this happen with affiliate sites, where multiple affiliates use the same copy on their sites provided by the very company they are an affiliate of. I also see it happen when people assume that keywords in domains will help them rank better, so they register multiple domains targeting specific keywords, only to have it bite them in the end.
I don’t think MSN has matured enough to really deal with duplicate content yet, so you will probably find it abounding in their index. As for Ask, their index is more selective anyway, meaning they are not out to build the biggest index but rather one of quality. Therefore they probably work to index quality content and can easily identify duplicates of the original.