I asked a similar question a few months back about when a good link becomes toxic, but this time I'm asking "what is a toxic link?". You read all over the internet that you should disavow your toxic links, with little information about what actually classifies as a toxic link.
Sadly, most of the information classes links as toxic simply because they are self-created, when in fact this can be bad advice. Links on social networks, GitHub, the WordPress plugin directory, start-up sites, some infographics, citations, and so forth are generally self-created, and those shouldn't be deleted. Some advice holds that Penguin only cares about links that are not "nofollowed" and that the rest are ignored (some of the most respected SEOs have said this).
Other links could include blog comments using your brand or business name on relevant blogs, which actually improve relevancy and the user experience. So then... the idea of this question is to gather quality answers based on what you have read and learned, and possibly switch this question to a community wiki once established.
Ideal answers should be somewhat in depth. For example, "blog comment links are bad" is a little broad, because using your real name may be acceptable, using keywords in the name field is bad but a link in the body is okay, partial keyword matches are okay while exact matches are bad, and so on. But by no means am I expecting anyone to spend the next week writing a 100,000-word answer.
I'll kick this off...
Relevant Forum Signatures
The term toxic link is not used by Google. It is entirely an invention of the SEO chattering class, and some use the term either to drive users to their sites and scare the bejeebers out of you, or to carve out importance for themselves, or to sell you something.
As far as Google is concerned, there is no such thing as a toxic link. There are bad links of course, but any link that can actually gain a penalty does not happen in a vacuum. It is actually hard to do.
As well, this is a HUGE topic. I will give as much of a 64,000-foot view as I can while keeping within the 100,000-word limit. If you get dizzy, don't look down, look up. If you fall asleep, I suggest ice water. If you fall over, call LifeAlert.
I will reverse engineer this looking at it from Google's point of view as much as I can. It may seem like the long way around the barn, but it is actually a better learning experience.
I have talked about semantics and relational pairs and relational chains before. Semantic relationships are important to understand. Why? Because when Google talks about the topic of spam and bad links, they speak of semantics using terms such as nodes, links, clusters, etc. It is important to remember to think in these terms and not assign terms that are unrelated such as the term toxic. Applying computer science to the problem does not include emotional human terms. Think like a machine and you will be far better off. Nix that! I just thought of Hal, Bender, and a few others.
I have spoken about semantics with examples in the answer: Why would a website with keyword stuffing rank higher than one without in google search results?
In this case, there are a lot of factors at play. I will not cover them all here. In fact, I will not even try; the list is just too long. So I will only give you a clue.
The first thing you need to know is that Google started this process before 2003, only a few years after its first research paper was published in 1997. 2003 gives the first applicable hint of what Google has done and is thinking on the subject. Also know that Google collects as much data as it can on every website, looking for clues. What is collected will stun you. Even in 2003 we knew that registration information (whois), domain name registrar information, host information, and DNS information were collected, including network and domain stability testing: domain names changing IP addresses, which domain names are assigned to an IP address, and multi-hosting versus single-hosting. From this, what becomes clear is which networks are known for bad behavior, which registrars host low-quality domains, which web hosts host low-quality domains, which specific IP addresses are low quality, and which IP address blocks are low quality. Shortly after, Google used blacklist data as part of this analysis and even looked at the quality of the technical support of registrars and hosts, rating the companies in that respect. Really.
Also know that Google has used content evaluators to manually check websites. Here they are looking for sites that fit within certain categories that will act as seeds for AI learning methods. Among these categories are, of course, spam, trust, authoritative, non-authoritative, human generated, machine generated, etc. These seed sites are used for comparison in AI analysis.
We know that Google applies semantics and other analysis to title tags, links, and content. One of these methods is n-grams. N-grams are a simple way of breaking content down into n-word sets incrementally. For example, "The quick brown fox jumps over the lazy dog." as 3-grams would be The quick brown, quick brown fox, brown fox jumps, and so on. The n can then be incremented and the analysis restarted. Using this and comparing against the seed sites, Google can evaluate the language of the content and determine a few things, such as: was it written by a human, was it written by a machine, was a spinner used, what is the language of the content including variations such as American English versus any other, and so on. Using data pairs such as bylines, Google can even use n-grams to identify the author of an unsigned work by comparing it to work from known authors. Amazing.
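The incremental breakdown described above is easy to see in a few lines of Python (the function name is mine, and this is only the extraction step, not any of the comparison against seed sites):

```python
def ngrams(text, n):
    """Break text into overlapping n-word sets, as described above."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

# 3-grams of the example sentence from the text
for gram in ngrams("The quick brown fox jumps over the lazy dog.", 3):
    print(gram)
```

Run with n=3, the first few sets are "The quick brown", "quick brown fox", "brown fox jumps", matching the walk-through above; increment n and rerun for the next pass.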
In this semantic database, certain links are made and clusters are formed. A cluster is any set of entities with relational similarity or linkage. To clarify:
Company A has several websites which can be related using several factors including registration, host, IP, registrar, domain name patterns, templating, color schemes, image similarity, content similarity, content duplication, web-based contact information (e-mail is a particularly valuable clue), personnel lists, application profiles, resource profiles, link patterns, etc. I used the term realm before, and that is a correct term in some circles. The term cluster is the same notion in semantics. All of the sites that Company A has form a cluster. Please understand that clusters can be any relationship and clusters can overlap each other. So imagine this as we progress.
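As a toy illustration of how shared signals pull sites into one cluster (the data is entirely made up, and real clusters can overlap, which this simple partition does not capture), here any single shared attribute, such as registrant e-mail or host IP, is enough to join two sites:

```python
from itertools import combinations

# Made-up registration data for four sites
sites = {
    "a.example": {"email": "ops@corp.example", "ip": "203.0.113.5"},
    "b.example": {"email": "ops@corp.example", "ip": "198.51.100.7"},
    "c.example": {"email": "help@other.example", "ip": "198.51.100.7"},
    "d.example": {"email": "solo@else.example", "ip": "192.0.2.9"},
}

# Union-find: each site starts in its own cluster
parent = {s: s for s in sites}

def find(s):
    while parent[s] != s:
        parent[s] = parent[parent[s]]  # path compression
        s = parent[s]
    return s

def union(a, b):
    parent[find(a)] = find(b)

# One shared signal is enough to merge two sites into a cluster
for a, b in combinations(sites, 2):
    if any(sites[a][k] == sites[b][k] for k in ("email", "ip")):
        union(a, b)

clusters = {}
for s in sites:
    clusters.setdefault(find(s), set()).add(s)
print(list(clusters.values()))
```

Note the chaining: a.example and c.example share nothing directly, yet they end up in the same cluster because each shares a signal with b.example. That transitivity is what makes these relationships hard to hide.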
There are several ways to spot spam sites, including content similarity, templating, image similarity, application profiles, resource profiles, and link patterns, just to name a few. And oh yeah, there are other traits.
Spammer sites usually have a few things in common. One is a super-authority site. Why? Because without authority, the whole spam scheme falls apart and fails. The super-authority site will have many thousands of inbound links and fewer outbound links. As part of this, the traditional view of PageRank that we have all seen was thrown out the window in 2003. You remember seeing drawings of a PR 6 page linking to two other pages, passing PR 3 through each link. This is an overly simplistic view and almost completely wrong. Each link is evaluated for value, meaning actual value, which includes 0, and the trust/authority of any site/page is capped so that high trust/authority sites/pages pass less value than they possess, and only by a factor of the value of the link. Why is this done? To sculpt a more natural curve into the schema and to stop super-authority sites from passing on too much value. This seems to be the first salvo across the bow of spammers.
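No one outside Google knows the real formula, so treat this as nothing more than a sketch of the idea just described, with numbers I made up: the value a link passes is the page's authority, capped, multiplied by a per-link value that can be zero for links judged worthless.

```python
# Assumed cap, purely illustrative; Google's actual values are unknown
AUTHORITY_CAP = 6.0

def value_passed(page_authority, link_value):
    """Capped authority times per-link value (which can be 0)."""
    return min(page_authority, AUTHORITY_CAP) * link_value

print(value_passed(9.0, 0.5))  # super-authority page, decent link
print(value_passed(9.0, 0.0))  # link evaluated as worthless
```

The point of the cap is visible immediately: the page with authority 9 passes no more than a page with authority 6 would, and a link valued at 0 passes nothing regardless of how authoritative its source is.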
Links are evaluated not only for patterns, but also in much the same way that content is evaluated. From this, Google can tell whether a link is natural or unnatural. Link schemes follow patterns, especially when you consider that they are machine-made, and are for the most part detectable.
Semantics is used to store many, many factors in a database. Using the database, link maps can be evaluated and clusters determined. I mentioned clusters before mostly in relation to domains, but now I want you to think in smaller entities such as pages, links, templates, content, navigational links, sidebars, etc. Using the semantic link map, Google is able to strongly determine patterns and the likelihood that a set of entities is designed to be manipulative. Using clusters tied to link patterns and relationships, any penalty is handed out as a result of this analysis where it applies. Remember this.
While we cannot know any of Google's algorithms, we can know this. Panda runs periodically. Panda 4.2 is running slowly. Why? Because it requires refetching large portions of sites. It is also known that Panda is being reworked into the regular algorithm. I mentioned in another answer that AI is written in smaller units of code called agents. An agent generally answers a single, usually binary, question. This is not always the case of course, but agents generally perform one conceptual function. We also know that agents are used to create metadata, which can be scores of various types. Agents are sometimes dependent upon each other and can be referenced in other code to maintain this dependency. As well, when a refresh of a large database is required, code is written that references several or many agents. Panda, in this case, requires more information, meaning that there are likely new factors to be added to the semantic database or existing factors to be refreshed. It is also likely that algorithm values and tweaks require recalculations within the semantic database. Incidentally, we also know that Panda is likely a roll-up of other code, with agents likely a significant part of it. It appears to fit the AI pattern.
Google does not use the term toxic link. Google talks about bad links, clusters, and so forth. We know that Google caps trust and authority and evaluates the actual value of links, even giving them a value of 0. We know that Google can evaluate hundreds of factors, far exceeding the famous 200, to determine a bad link. We know that Google looks at everything and anything to find relationships and manipulative content, links, sites, etc. We know that penalties result from this analysis.
A toxic link, then, is any outbound link to your site from one of the entities of a cluster where manipulation has been determined. (Breathe. It's okay. More below.) It is not the link itself or any characteristic of the link, but the fact that a link pulls you relationally into a cluster of concern to Google.
No. Remember that a large site that links out to a lot of sites, and can even have many inbound links of its own, is generally not worthy of a penalty. This in and of itself is not spam. It is a site. Think of domaintools.com. It fits the pattern but is nowhere near worthy of a penalty. There are many examples of sites like this. For a penalty to happen, there has to be a gross example of manipulation somewhere.
This is just an illustration to make the point that it is not the link itself, nor the configuration or placement of the link, but rather the source of the link that is of concern. Generally, a lousy link from a lousy site does no harm. To state otherwise is a scare tactic or just plain wrong. In fact, enough lousy links from enough lousy sites can help a target site perform surprisingly well (not recommending it of course, and yes, I have examples). Even a single toxic link as I defined it probably will not hurt you either. It is more the pattern of toxic links from a single cluster or clusters that is the issue. They can potentially draw you into the cluster(s), and when those are penalized, the potential for your site to be included could be high.
Not quite 100,000 words, and possibly within the 40,000 character limit that SE enforces.