A new patent application from Google tells us about how the search engine may use context to find query suggestions before a searcher has completed typing in a full query. After seeing this patent, I’ve been thinking about previous patents I’ve seen from Google that have similarities.
This isn’t the first time I’ve written about a Google patent involving query suggestions. In the past, I’ve written about a couple of other patents that were very informative:
User query is an element that specifies an information need, but it is not the only one. Studies in literature have found many contextual factors that strongly influence the interpretation of a query. Recent studies have tried to consider the user’s interests by creating a user profile. However, a single profile for a user may not be sufficient for a variety of queries of the user. In this study, we propose to use query-specific contexts instead of user-centric ones, including context around query and context within query. The former specifies the environment of a query such as the domain of interest, while the latter refers to context words within the query, which is particularly useful for the selection of relevant term relations. In this paper, both types of context are integrated in an IR model based on language modeling. Our experiments on several TREC collections show that each of the context factors brings significant improvements in retrieval effectiveness.
The Google patent doesn’t take a user-based approach either, but does look at some user contexts and interests. It sounds like searchers might be offered a chance to select a context cluster before showing query suggestions:
In some implementations, a set of queries (e.g., movie times, movie trailers) related to a particular topic (e.g., movies) may be grouped into context clusters. Given a context of a user device for a user, one or more context clusters may be presented to the user when the user is initiating a search operation, but prior to the user inputting one or more characters of the search query. For example, based on a user’s context (e.g., location, date and time, indicated user preferences and interests), when a user event occurs indicating the user is initiating a process of providing a search query (e.g., opening a web page associated with a search engine), one or more context clusters (e.g., “movies”) may be presented to the user for selection input prior to the user entering any query input. The user may select one of the context clusters that are presented and then a list of queries grouped into the context cluster may be presented as options for a query input selection.
I often look up the inventors of patents to get a sense of what else they may have written and worked on. I looked up Jakob D. Uszkoreit on LinkedIn, and his profile doesn’t surprise me. He tells us there of his experience at Google:
Previously I started and led a research team in Google Machine Intelligence, working on large-scale deep learning for natural language understanding, with applications in the Google Assistant and other products.
This passage reminded me of the search results being shown to me by the Google Assistant, which are based upon interests that I have shared with Google over time, and that Google allows me to update from time to time. If the inventor of this patent worked on Google Assistant, that doesn’t surprise me. I haven’t been offered context clusters yet, and wouldn’t know what those might look like if Google did offer them. I suspect that if Google does start offering them, I will recognize them at the time they are offered to me.
Like many patents do, this one tells us what is “innovative” about it. It looks at:
…query data indicating query inputs received from user devices of a plurality of users, the query data also indicating an input context that describes, for each query input, an input context of the query input that is different from content described by the query input; grouping, by the data processing apparatus, the query inputs into context clusters based, in part, on the input context for each of the query inputs and the content described by each query input; determining, by the data processing apparatus, for each of the context clusters, a context cluster probability based on respective probabilities of entry of the query inputs that belong to the context cluster, the context cluster probability being indicative of a probability that at least one query input that belongs to the context cluster and provided for an input context of the context cluster will be selected by the user; and storing, in a data storage system accessible by the data processing apparatus, data describing the context clusters and the context cluster probabilities.
It also tells us that it will calculate probabilities that certain context clusters might be requested by a searcher. So how does Google know what to suggest as context clusters?
Each context cluster includes a group of one or more queries, the grouping being based on the input context (e.g., location, date and time, indicated user preferences and interests) for each of the query inputs, when the query input was provided, and the content described by each query input. One or more context clusters may be presented to the user for input selection based on a context cluster probability, which is based on the context of the user device and respective probabilities of entry of the query inputs that belong to the context cluster. The context cluster probability is indicative of a probability that at least one query input that belongs to the context cluster will be selected by the user. Upon selection of one of the context clusters that is presented to the user, a list of queries grouped into the context cluster may be presented as options for a query input selection. This advantageously results in individual query suggestions for query inputs that belong to the context cluster but that alone would not otherwise be provided due to their respectively low individual selection probabilities. Accordingly, users’ informational needs are more likely to be satisfied.
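To make the idea of a context cluster probability concrete, here is a minimal Python sketch. The patent excerpt does not give a formula, so this assumes, purely for illustration, that query selections are independent and that the cluster probability is the chance that at least one query in the cluster is selected. The query names and probabilities are made up.

```python
from math import prod


def cluster_probability(query_probs):
    """Chance that at least one query in the cluster is selected.

    Assumption (not from the patent): selections are independent, so the
    result is 1 minus the chance that none of the queries is selected.
    """
    return 1.0 - prod(1.0 - p for p in query_probs.values())


# Hypothetical per-query selection probabilities for one input context.
movies_cluster = {
    "movie times": 0.04,
    "movie trailers": 0.03,
    "movie reviews": 0.03,
    "movie tickets": 0.02,
}

movies_probability = cluster_probability(movies_cluster)
print(round(movies_probability, 3))
```

No single query here reaches even 0.05 on its own, yet the cluster as a whole comes out a little over 0.11, which is the effect the patent describes: grouping lets low-probability queries surface as suggestions.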
The patent application is:
(US20190050450) Query Composition System
Publication Number: 20190050450
Publication Date: February 14, 2019
Applicants: Google LLC
Inventors: Jakob D. Uszkoreit
Methods, systems, and apparatus for generating data describing context clusters and context cluster probabilities, wherein each context cluster includes query inputs based on the input context for each of the query inputs and the content described by each query input, and each context cluster probability indicates a probability that at a query input that belongs to the context cluster will be selected by the user, receiving, from a user device, an indication of a user event that includes data indicating a context of the user device, selecting as a selected context cluster, based on the context cluster probabilities for each of the context clusters and the context of the user device, a context cluster for selection input by the user device, and providing, to the user device, data that causes the user device to display a context cluster selection input that indicates the selected context cluster for user selection.
What are Context Clusters as Query Suggestions?
The patent tells us that context clusters might be triggered when someone is starting a query on a web browser. I tried it out, starting a search for “movies” and got a number of suggestions that were combinations of queries, or what seem to be context clusters:
One of those clusters involved “Movies about Business”, which I selected, and it showed me a carousel, and buttons with subcategories to also choose from. This seems to be a context cluster:
User Query Histories
The patent tells us that context clusters selected to be shown to a searcher might be based upon previous queries from a searcher, and provides the following example:
Further, a user query history may be provided by the user device (or stored in the log data) that includes queries and contexts previously provided by the user, and this information may also factor into the probability that a user may provide a particular query or a query within a particular context cluster. For example, if the user that initiates the user event provides a query for “movie show times” many Friday afternoons between 4 PM-6 PM, then when the user initiates the user event on a Friday afternoon in the future between these times, the probability associated with the user inputting “movie show times” may be boosted for that user. Consequentially, based on this example, the corresponding context cluster probability of the context cluster to which the query belongs may likewise be boosted with respect to that user.
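The history-based boost in that example might be sketched like this in Python. The function names, the two-hour time buckets, and the boost formula are all assumptions for illustration; the patent only says the probability "may be boosted."

```python
from collections import Counter
from datetime import datetime


def context_key(when):
    """Reduce a timestamp to a coarse context: (weekday, two-hour bucket)."""
    return (when.weekday(), when.hour // 2)


def boosted_probability(base_prob, query, history, now,
                        boost_per_hit=0.1, cap=0.95):
    """Raise base_prob once for each past use of `query` in the same context.

    The linear boost and the cap are hypothetical choices, not the patent's.
    """
    counts = Counter((q, context_key(t)) for q, t in history)
    hits = counts[(query, context_key(now))]
    return min(cap, base_prob * (1.0 + boost_per_hit * hits))


# Hypothetical query history: three Friday afternoons and one Monday morning.
history = [
    ("movie show times", datetime(2019, 1, 4, 16, 30)),
    ("movie show times", datetime(2019, 1, 11, 17, 5)),
    ("movie show times", datetime(2019, 1, 18, 16, 45)),
    ("weather", datetime(2019, 1, 7, 9, 0)),
]

friday = boosted_probability(0.05, "movie show times", history,
                             datetime(2019, 2, 1, 17, 0))
monday = boosted_probability(0.05, "movie show times", history,
                             datetime(2019, 2, 4, 10, 0))
print(friday, monday)
```

On a later Friday between 4 PM and 6 PM the query gets a higher probability than on a Monday morning, and any cluster built from that query would rise with it.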
It’s not easy to tell whether the examples I provided about movies above are related to this patent, or if the patent is tied more closely to the search results that appear in Google Assistant. It’s worth reading through the patent and thinking about experimental searches to try, to see if they might influence the results that you see. It is interesting that Google may attempt to anticipate what to show us as query suggestions, after showing us search results based upon what it believes are our interests, from searches that we have performed or interests that we have identified for Google Assistant.
The context cluster may be related to the location and time that someone accesses the search engine. The patent provides an example of what might be seen by the searcher like this:
In the current example, the user may be in the location of MegaPlex, which includes a department store, restaurants, and a movie theater. Additionally, the user context may indicate that the user event was initiated on a Friday evening at 6 PM. Upon the user initiating the user event, the search system and/or context cluster system may access the content cluster data 214 to determine whether one or more context clusters is to be provided to the user device as an input selection based at least in part on the context of the user. Based on the context of the user, the context cluster system and/or search system may determine, for each query in each context cluster, a probability that the user will provide that query and aggregate the probability for the context cluster to obtain a context cluster probability.
In the current example, there may be four queries grouped into the “Movies” cluster, four queries grouped into the “Restaurants” cluster, and three queries grouped into the “Dept. Store” cluster. Based on the analysis of the content cluster data, the context cluster system may determine that the aggregate probability of the queries in each of the “Movies” cluster, “Restaurant” cluster, and “Dept. Store” cluster have a high enough likelihood (e.g., meet a threshold probability) to be input by the user, based on the user context, that the context clusters are to be presented to the user for selection input in the search engine web site.
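The MegaPlex example can be sketched as a simple threshold test. This Python sketch assumes the aggregate is a plain sum of per-query probabilities for the user's context (the patent only says the probabilities are aggregated), and the numbers, the threshold, and the extra "Car Repair" cluster are hypothetical.

```python
def clusters_to_present(clusters, threshold):
    """Return names of clusters whose aggregate probability meets the
    threshold, ordered from most to least likely.

    Assumption: the aggregate is a capped sum of per-query probabilities.
    """
    scored = {name: min(1.0, sum(probs)) for name, probs in clusters.items()}
    qualifying = [name for name, score in scored.items() if score >= threshold]
    return sorted(qualifying, key=lambda name: scored[name], reverse=True)


# Hypothetical per-query probabilities for a user at MegaPlex on a Friday
# evening: four movie queries, four restaurant queries, three store queries.
megaplex_clusters = {
    "Movies": [0.10, 0.08, 0.06, 0.04],
    "Restaurants": [0.07, 0.06, 0.05, 0.04],
    "Dept. Store": [0.06, 0.05, 0.04],
    "Car Repair": [0.01, 0.01],  # made-up cluster that should not qualify
}

presented = clusters_to_present(megaplex_clusters, threshold=0.10)
print(presented)
```

With these made-up numbers the three clusters from the patent's example clear the threshold and the unrelated one does not.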
I could see running such a search at a shopping mall, to learn more about the location I was at, and what I could find there, from dining places to movies being shown. That sounds like it could be the start of an interesting adventure.
Sura gave up on her debugging for the moment. “The word for all this is ‘mature programming environment.’ Basically, when hardware performance has been pushed to its final limit, and programmers have had several centuries to code, you reach a point where there is far more significant code than can be rationalized. The best you can do is understand the overall layering, and know how to search for the oddball tool that may come in handy—take the situation I have here.” She waved at the dependency chart she had been working on. “We are low on working fluid for the coffins. Like a million other things, there was none for sale on dear old Canberra. Well, the obvious thing is to move the coffins near the aft hull, and cool by direct radiation. We don’t have the proper equipment to support this—so lately, I’ve been doing my share of archeology. It seems that five hundred years ago, a similar thing happened after an in-system war at Torma. They hacked together a temperature maintenance package that is precisely what we need.
In a science fiction novel set far in the future, Vernor Vinge writes about how people might engage in software archaeology. I understand the desire to do that when looking at patents that give us hints about how technology, and the processes behind search engines, are changing.
Google has just been granted a continuation patent for universal search. This post looks at how the patents covering universal search at Google have changed. It is not intended as a lesson on how patents work, but knowing something about how continuation patents work can provide some insights into the processes that people at Google are trying to protect when they have updated the universal search patent. This post is also not intended as an analysis of patents, but rather a look at how search works, and has changed in the last dozen years or so.
A patent is pursued by a company to protect the process described within the patent. It isn’t unusual for the process protected by a patent to change in some way as it is implemented and put into use. When that happens, the company that was assigned the original patent might file another patent, referred to as a continuation patent, which takes the filing date of the first version of the patent as the start time for protection under the patent.
Continuation patents are usually very similar to the earlier versions of the patents, with the description sections often being close to identical. The parts that change are the claims sections, which are what patent examiners look at when deciding whether a patent is new, non-obvious, and useful, and should be granted.
So, in looking at updated patents covering a specific process, ideally it makes sense to look at how the claims have changed over time.
The Original Universal Search Patent Application
Before the patent was granted, I wrote about it in the post How Google Universal Search and Blended Results May Work, which was about the Universal Search patent application published in 2008. That patent was granted in 2011, with claims updated from the original application. (Sometimes processes in original applications have to be amended for the patent to be granted, and the claims may change to match those.)
1. A computer-implemented method, comprising: receiving a plurality of first search results in a first presentation format, the first search results received from a first search engine, the first search results identified for a search query directed to the first search engine, the first search results having an associated order indicative of respective first quality scores that are used to rank the first search results; receiving one or more second search results in a second presentation format different from the first presentation format, the second search results received from a second search engine, the second search results identified for the search query directed to the second search engine, wherein the first search engine searches a first corpus of first resources, wherein the second search engine searches a second corpus of second resources, and wherein the first search engine and the second search engines are distinct from each other; obtaining a respective first quality score for a plurality of the first search results, the respective first quality score determined in relation to the corpus of first resources and obtaining a respective second quality score for each of the one or more second search results, each respective second quality score determined in relation to the corpus of second resources; and inserting one or more of the second search results into the order including decreasing one or more of the respective first quality scores by reducing a contribution of a scoring feature unique to the first search results and distinct from scoring features of the second search results so that the inserted second search results occur within a number of top-ranked search results in the order.
2. The method of claim 1, wherein the plurality of first search results comprises an ordered list of search results, and wherein the plurality of first search results is a number of highest-quality search results provided by the first search engine that are identified as responsive to the search query.
3. The method of claim 1, further comprising: receiving a third search result, the third search result received from a third search engine, wherein the third search engine searches a corpus of third resources, and wherein the third search engine is distinct from the first search engine and the second search engine; and inserting the third search result into the order.
4. The method of claim 1, wherein: the first resources are generic web pages and the second resources are video resources.
5. The method of claim 1, wherein: the first resources are generic web pages and the second resources are news resources.
6. The method of claim 4, further comprising: receiving a third search result from the second search engine; and inserting the third search result at a position between two otherwise adjacent first search results in the order, the position not being adjacent to the inserted one or more second search results.
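Claim 1 describes reducing the contribution of a scoring feature that only the first results have, so that results from the second search engine land among the top results. A minimal Python sketch of that idea follows. It assumes a link-based score is the feature unique to web results and that scores are simple sums; neither detail comes from the patent, and all names and numbers are hypothetical.

```python
def blend(web_results, video_results, link_weight=1.0):
    """Merge web and video results into one ranked list of titles.

    web_results:   (title, relevance, link_score) tuples
    video_results: (title, quality) tuples
    Lowering link_weight shrinks a feature that only web results have.
    """
    scored = [(title, relevance + link_weight * link_score)
              for title, relevance, link_score in web_results]
    scored += list(video_results)
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    return [title for title, _ in ranked]


# Hypothetical results for one query.
web = [("Web A", 0.6, 0.4), ("Web B", 0.5, 0.4), ("Web C", 0.4, 0.4)]
video = [("Video V", 0.65)]

full_weight = blend(web, video, link_weight=1.0)
reduced_weight = blend(web, video, link_weight=0.25)
print(full_weight)
print(reduced_weight)
```

With the link feature at full weight the video trails every web result; with its contribution reduced, the video moves up into the top-ranked results, between two web results.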
The Second Universal Search Patent
We know that Google introduced Universal Search Results at a Searchology presentation in 2007 (a few months before the patent was originally filed). The patent has been updated since then, with a continuation patent titled Interleaving Search Results granted in 2015. Its new claims add the concept of historic click data. Here are the first five claims from that version of the patent:
The invention claimed is:
1. A computer-implemented method comprising: receiving in a search engine system a query, the query comprising query text submitted by a user; searching a first collection of resources to obtain one or more first search results, wherein each of the one or more first search results has a respective first search result score; searching a second collection of web resources to obtain one or more second search results, wherein each of the one or more second search results has a respective second search result score, wherein the resources of the first collection of resources are different from the resources of the second collection of web resources; determining from historical user click data that resources from the first collection of resources are more likely to be selected by users than resources from other collections of data when presented by the search engine in a response to the query text; generating enhanced first search result scores for the first search results as a consequence of the determining, the enhanced first search result scores being greater than the respective first search result scores for the first search results; generating a presentation order of first search results and second search results in order of the enhanced first search result scores and the second search result scores; generating a presentation of highest-ranked first search results and second search results in the presentation order; and providing the presentation in a response to the query.
2. The method of claim 1, wherein the historical click data represents resource collections of search results selected by users after submitting the query.
3. The method of claim 1, wherein determining from historical user click data that resources from the first collection of resources are more likely to be selected by users than resources from other collections of data when presented by the search engine in a response to the query text comprises: obtaining one or more user characteristics of the user; and determining that users having the one or more user characteristics are more likely to select resources from the first collection of resources than resources from other collections of data.
4. The method of claim 1, wherein generating the presentation of highest-ranked first search results and second search results in the presentation order comprises generating the presentation so that at least one first search result occurs within a number of highest-ranked second search results.
5. The method of claim 1, wherein generating the presentation of highest-ranked first search results and second search results in the presentation order comprises: generating each of the second search results in a web search results presentation format; and generating each of the first search results in a different presentation format.
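The click-data step in claim 1 of this version can be sketched in Python as follows. The click log format, the majority test, and the multiplicative boost are assumptions for illustration; the claim only says that enhanced scores are greater than the original ones when historical clicks favor the first collection.

```python
def click_share(click_log, query, collection):
    """Fraction of past clicks for `query` that went to `collection`."""
    clicks = click_log.get(query, {})
    total = sum(clicks.values())
    return clicks.get(collection, 0) / total if total else 0.0


def presentation_order(query, first_results, second_results, click_log,
                       first_collection, boost=1.5, majority=0.5):
    """Boost first-collection scores when clicks favor that collection,
    then rank everything by score. Boost and majority are hypothetical."""
    if click_share(click_log, query, first_collection) > majority:
        first_results = [(title, score * boost)
                         for title, score in first_results]
    merged = first_results + second_results
    ranked = sorted(merged, key=lambda item: item[1], reverse=True)
    return [title for title, _ in ranked]


# Hypothetical click log and results.
CLICK_LOG = {"election results": {"news": 80, "web": 20}}
news = [("News story", 0.5)]
web = [("Web page 1", 0.7), ("Web page 2", 0.6)]

boosted = presentation_order("election results", news, web, CLICK_LOG, "news")
unboosted = presentation_order("python tutorial", news, web, CLICK_LOG, "news")
print(boosted)
print(unboosted)
```

For the query where searchers historically clicked news results, the news story moves to the top; for a query with no such history, the original scores decide the order.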
Publication Number: 3422216
Publication Date: 02.01.2019
Applicants: GOOGLE LLC
Inventors: Bailey David R, Effrat Jonathan J, Singhal Amit
(EN) Interleaving Search Results
(EN) A method comprising receiving a plurality of first search results that satisfy a search query directed to a first search engine, each of the plurality of first search results having a respective first score, receiving a second search result from a second search engine, the second search result having a second score, wherein the search query is not directed to the second search engine, wherein at least one of the first and second scores is based on characteristics of queries or results of queries learned from user click data; and determining from the second score whether to present the second search result, and if so, presenting the first search results in an order according to their respective scores, and presenting the second search result at a position relative to the order, the position being determined using the first scores and the second score
1. A method comprising:
receiving a plurality of first search results that satisfy a search query directed to a first search engine, each of the plurality of first search results having a respective first score;
receiving a second search result from a second search engine, the second search result having a second score, wherein the search query is not directed to the second search engine;
wherein at least one of the first and second scores is based on characteristics of queries or results of queries learned from user click data; and
determining from the second score whether to present the second search result, and if so:
presenting the first search results in an order according to their respective scores, and
presenting the second search result at a position relative to the order, the position being determined using the first scores and the second score.
2. The method of claim 1, wherein receiving a second search result from a second search engine comprises:
receiving a plurality of second search results, each second search result having a respective second score, each second search results from a respective second search engine, wherein the search query is not directed to the respective second search engines; and
determining from the respective second scores whether to present respective ones of the second search results.
3. The method of claim 1, wherein presenting the second search result at a position relative to the order comprises inserting the second search result at a position between two otherwise adjacent first search results in the order.
4. The method of any preceding claim, wherein the first and second search result scores are based on multiple distinct scoring features, the multiple distinct scoring features including at least one unique scoring feature of the first search engine distinct from the scoring features of the second search engine.
5. The method of any preceding claim, wherein the characteristics of queries or results of queries learned from user click data comprise a relationship between one of the first corpus of first resources and the second corpus of second resources and a particular search query.
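The two decisions in claim 1 of this version, whether to present the second result at all and where to put it, might look like this in Python. The sketch assumes the first and second scores are directly comparable and uses a made-up minimum score; the claims do not say how the position is computed.

```python
def place_second_result(first_results, second_result, min_score):
    """Return result titles in presentation order.

    first_results: (title, score) tuples from the first search engine
    second_result: one (title, score) tuple from the second search engine
    The second result is shown only if its score reaches min_score, and
    it is placed below every first result that scores at least as high.
    """
    ordered = sorted(first_results, key=lambda item: item[1], reverse=True)
    titles = [title for title, _ in ordered]
    title, score = second_result
    if score < min_score:
        return titles  # the second result is not presented
    position = sum(1 for _, first_score in ordered if first_score >= score)
    titles.insert(position, title)
    return titles


# Hypothetical scores.
web = [("Web 1", 0.9), ("Web 2", 0.7), ("Web 3", 0.5)]
video = ("Video", 0.6)

shown = place_second_result(web, video, min_score=0.4)
hidden = place_second_result(web, video, min_score=0.65)
print(shown)
print(hidden)
```

In the first call the video is inserted between two otherwise adjacent web results, as claim 3 describes; in the second call its score is too low and it is left out.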
Changes to Universal Search
If you look at them, you will see David Bailey’s name on those patents. He wrote a guest post at Search Engine Land about Universal Search that provides a lot of insight into how it works, and the title of the post refers to that: An Insider’s View Of Google Universal Search. It’s worth reading through his analysis of Universal Search carefully before trying to compare the claims from one version of the patent to another.
The second version of the claims refers to historic click data, and the newest version changes that to “user click data,” but the patent doesn’t provide any insight into why that change in the claims was made. We’ve heard spokespeople from Google tell us that they don’t utilize user click data to rank content, so this gets a little confusing if they are taken at their word.
Another difference in the latest claims is the reference to multiple distinct scoring features, and how each type of search that is blended into results has some unique scoring feature that sets it apart from the results inserted onto the search results page from a search engine before it. We do know that different types of search are ranked based upon different signals, such as freshness being important for news results, and links often for Web results. So results shown in universal search may all be relevant for a query, but each type is scored with some unique features, which adds diversity to what we see in SERPs.
Those of us who are used to doing Search Engine Optimization have been looking at URLs filled with content, and links between that content, and at how algorithms such as PageRank (based upon links pointing between pages) and information retrieval scores based upon the relevance of that content determine how well pages rank in search results in response to queries entered into search boxes by searchers. Web pages connected by links have been seen as nodes of information connected by edges. This was the first generation of SEO.
Search has been going through a transformation. Back in 2012, Google introduced something it refers to as the knowledge graph, and told us that it would begin focusing upon indexing things instead of strings. By “strings,” they were referring to words that appear in queries, and in documents on the Web. By “things,” they were referring to named entities, or real and specific people, places, and things. When people searched at Google, the search engine would show Search Engine Results Pages (SERPs) filled with URLs to pages that contained the strings of letters that we were searching for. Google still does that, but is slowly changing to showing search results that are about people, places, and things.
They now show us knowledge panels in search results that tell us about the people, places, and things they recognize in the queries we perform. In addition to crawling webpages and indexing the words on those pages, Google is collecting facts about the people, places, and things it finds on those pages.
A Google Patent that was just granted in the past week tells us about how Google’s knowledge graph updates itself when it collects information about entities, their properties and attributes and relationships involving them. This is part of the evolution of SEO that is taking place today – learning how Search is changing from being based upon search to being based upon knowledge.
What does the patent tell us about knowledge? This is one of the sections that details what a knowledge graph looks like, and the kind of information that Google might collect when it indexes pages these days:
Knowledge graph portion includes information related to the entity [George Washington], represented by [George Washington] node. [George Washington] node is connected to [U.S. President] entity type node by [Is A] edge with the semantic content [Is A], such that the 3-tuple defined by nodes and the edge contains the information “George Washington is a U.S. President.” Similarly, “Thomas Jefferson Is A U.S. President” is represented by the tuple of [Thomas Jefferson] node 310, [Is A] edge, and [U.S. President] node. Knowledge graph portion includes entity type nodes [Person], and [U.S. President] node. The person type is defined in part by the connections from [Person] node. For example, the type [Person] is defined as having the property [Date Of Birth] by node and edge, and is defined as having the property [Gender] by node 334 and edge 336. These relationships define in part a schema associated with the entity type [Person].
Note that SEO is no longer just about how often certain words appear on pages of the Web, what words appear in links to those pages, in page titles, and headings, alt text for images, and how often certain words may be repeated or related words may be used. Google is looking at the facts that are mentioned about entities, such as entity types like a “person,” and properties, such as “Date of Birth,” or “Gender.”
Note that the quote also mentions the word “schema,” as in “These relationships define in part a schema associated with the entity type [Person].” As part of the transformation of SEO from strings to things, the major search engines joined forces to offer us information on how to use Schema for structured data on the Web, to provide a machine-readable way of sharing information with search engines about the entities that we write about, their properties, and relationships.
I’m writing about this patent because I am participating in a Webinar online about Knowledge Graphs and how those are being used, and updated. The Webinar is tomorrow at: #SEOisAEO: How Google Uses The Knowledge Graph in its AE algorithm. I haven’t been referring to SEO as Answer Engine Optimization, or AEO, and it’s unlikely that I will start, but I see it as an evolution of SEO.
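The nodes and edges in the patent’s George Washington example can be pictured as a small set of 3-tuples. The Python sketch below is only an illustration of that structure: the [Is A] edge comes from the quoted passage, while the “Has Property” edge label and the helper functions are names I made up.

```python
# Each fact is a (subject node, edge, object node) 3-tuple.
triples = {
    ("George Washington", "Is A", "U.S. President"),
    ("Thomas Jefferson", "Is A", "U.S. President"),
    ("Person", "Has Property", "Date Of Birth"),  # hypothetical edge label
    ("Person", "Has Property", "Gender"),         # hypothetical edge label
}


def objects(subject, edge):
    """All nodes reached from `subject` along `edge`."""
    return sorted(o for s, e, o in triples if s == subject and e == edge)


def subjects(edge, obj):
    """All nodes that point at `obj` along `edge`."""
    return sorted(s for s, e, o in triples if e == edge and o == obj)


person_schema = objects("Person", "Has Property")
presidents = subjects("Is A", "U.S. President")
print(person_schema)
print(presidents)
```

Reading the properties attached to the [Person] node gives the part of the schema the patent mentions, and following [Is A] edges backwards lists the entities of a given type.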
I’m writing about this Google Patent, because it starts out with the following line which it titles “Background:”
This disclosure generally relates to updating information in a database. Data has previously been updated by, for example, user input.
This line points to the fact that the knowledge graph no longer needs to be updated by users; instead, the patent describes how Google’s knowledge graphs update themselves.
Updating Knowledge Graphs
I attended a Semantic Technology and Business conference a couple of years ago, where the head of Yahoo’s knowledge base presented, and he was asked a number of questions in a question and answer session after he spoke. Someone asked him what happens when information from a knowledge graph changes and it needs to be updated.
His answer was that a knowledge graph would have to be updated manually to have new information placed within it.
That wasn’t a satisfactory answer because it would have been good to hear that the information from such a source could be easily updated. I’ve been waiting for Google to answer a question like this, which made seeing a line like this one from this patent a good experience:
In some implementations, a system identifies information that is missing from a collection of data. The system generates a question to provide to a question answering service based on the missing information, and uses the response from the question answering service to update the collection of data.
This would be a knowledge graph update, so the patent provides details using language that reflects that exactly:
In some implementations, a computer-implemented method is provided. The method includes identifying an entity reference in a knowledge graph, wherein the entity reference corresponds to an entity type. The method further includes identifying a missing data element associated with the entity reference. The method further includes generating a query based at least in part on the missing data element and the type of the entity reference. The method further includes providing the query to a query processing engine. The method further includes receiving information from the query processing engine in response to the query. The method further includes updating the knowledge graph based at least in part on the received information.
How does the search engine do this? The patent provides more information that fills in such details.
The approaches to achieve this include:
…Identifying a missing data element comprises comparing properties associated with the entity reference to a schema table associated with the entity type.
…Generating the query comprises generating a natural language query. This can involve selecting, from the knowledge graph, disambiguation query terms associated with the entity reference, wherein the terms comprise property values associated with the entity reference, or updating the knowledge graph by updating the data graph to include information in place of the missing data element.
…Identifying an element in a knowledge graph to be updated based at least in part on a query record. Operations further include generating a query based at least in part on the identified element. Operations further include providing the query to a query processing engine. Operations further include receiving information from the query processing engine in response to the query. Operations further include updating the knowledge graph based at least in part on the received information.
A knowledge graph updates itself in these ways:
(1) The knowledge graph may be updated with one or more previously performed searches.
(2) The knowledge graph may be updated with a natural language query, using disambiguation query terms associated with the entity reference, wherein the terms comprise property values associated with the entity reference.
(3) The knowledge graph may use properties associated with the entity reference to include information updating missing data elements.
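To make the steps above more concrete, here is a minimal sketch of the loop the patent describes: compare an entity's properties to a schema for its type, generate a natural language question for anything missing, and write the answer back. The schema table, the entity layout, and the `fake_answer_service` function are all assumptions made up for illustration; the patent does not give an implementation.

```python
# Hypothetical schema table: the properties each entity type is expected to have.
SCHEMA = {
    "person": ["date_of_birth", "place_of_birth", "spouse"],
    "movie": ["director", "release_date"],
}


def find_missing_elements(entity):
    """Compare an entity's properties to the schema for its entity type."""
    expected = SCHEMA.get(entity["type"], [])
    return [prop for prop in expected if prop not in entity["properties"]]


def generate_query(entity, missing_prop):
    """Build a natural language question, adding a known property value
    as a disambiguation term when one is available."""
    question = f"What is the {missing_prop.replace('_', ' ')} of {entity['name']}"
    known_values = list(entity["properties"].values())
    if known_values:
        question += f" ({known_values[0]})"
    return question + "?"


def update_entity(entity, answer_service):
    """Ask about each missing property and store any answer that comes back."""
    for prop in find_missing_elements(entity):
        answer = answer_service(generate_query(entity, prop))
        if answer is not None:
            entity["properties"][prop] = answer
    return entity


def fake_answer_service(question):
    """Stand-in for a query processing engine (an assumption, not a real API)."""
    canned = {"What is the director of Jaws (1975)?": "Steven Spielberg"}
    return canned.get(question)


jaws = {"name": "Jaws", "type": "movie", "properties": {"release_date": "1975"}}
update_entity(jaws, fake_answer_service)
print(jaws["properties"])
```

In this sketch the release date doubles as the disambiguation term, which is one way to read "property values associated with the entity reference."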
The patent that describes how Google’s knowledge graph updates itself is:
Methods and systems are provided for a question answering. In some implementations, a data element to be updated is identified in a knowledge graph and a query is generated based at least in part on the data element. The query is provided to a query processing engine. Information is received from the query processing engine in response to the query. The knowledge graph is updated based at least in part on the received information.
I read that article from Dejan SEO about duplicate pages, and thought it was worth exploring more. As I was looking around at Google patents that included the word “Authority” in them, I found this patent which doesn’t quite say the same thing that Dejan does, but is interesting in that it finds ways to distinguish between duplicate pages on different domains based upon priority rules, which is interesting in determining which duplicate page might be the highest authority URL for a document.
A system and method identifies a primary version out of different versions of the same document. The system selects a priority of authority for each document version based on a priority rule and information associated with the document version and selects a primary version based on the priority of authority and information associated with the document version.
Since the claims of a patent are what patent examiners at the USPTO look at when they are prosecuting a patent and deciding whether or not it should be granted, I thought it would be worth looking at the claims contained within the patent to see if they helped encapsulate what it covered. The first one captures some aspects of it that are worth thinking about while talking about different document versions of particular duplicate pages, and how the metadata associated with a document might be looked at to determine which is the primary version of a document:
What is claimed is:
1. A method comprising: identifying, by a computer system, a plurality of different document versions of a particular document; identifying, by the computer system, a first type of metadata that is associated with each document version of the plurality of different document versions, wherein the first type of metadata includes data that describes a source that provides each document version of the plurality of different document versions; identifying, by the computer system, a second type of metadata that is associated with each document version of the plurality of different document versions, wherein the second type of metadata describes a feature of each document version of the plurality of different document versions other than the source of the document version; for each document version of the plurality of different document versions, applying, by the computer system, a priority rule to the first type of metadata and the second type of metadata, to generate a priority value; selecting, by the computer system, a particular document version, of the plurality of different document versions, based on the priority values generated for each document version of the plurality of different document versions; and providing, by the computer system, the particular document version for presentation.
This doesn’t advance the claim that the primary version of a document is considered the canonical version of that document, and all links pointed to that document are redirected to the primary version.
There is another patent that shares an inventor with this one that refers to one of the duplicate content URLs being chosen as a representative page, though it doesn’t use the phrase “canonical.” From that patent:
Duplicate documents, sharing the same content, are identified by a web crawler system. Upon receiving a newly crawled document, a set of previously crawled documents, if any, sharing the same content as the newly crawled document is identified. Information identifying the newly crawled document and the selected set of documents is merged into information identifying a new set of documents. Duplicate documents are included and excluded from the new set of documents based on a query-independent metric for each such document. A single representative document for the new set of documents is identified in accordance with a set of predefined conditions.
In some embodiments, a method for selecting a representative document from a set of duplicate documents includes: selecting a first document in a plurality of documents on the basis that the first document is associated with a query independent score, where each respective document in the plurality of documents has a fingerprint that identifies the content of the respective document, the fingerprint of each respective document in the plurality of documents indicating that each respective document in the plurality of documents has substantially identical content to every other document in the plurality of documents, and a first document in the plurality of documents is associated with the query-independent score. The method further includes indexing, in accordance with the query independent score, the first document thereby producing an indexed first document; and with respect to the plurality of documents, including only the indexed first document in a document index.
Systems and methods for indexing a representative document from a set of duplicate documents are disclosed. Disclosed systems and methods comprise selecting a first document in a plurality of documents on the basis that the first document is associated with a query independent score. Each respective document in the plurality of documents has a fingerprint that indicates that the respective document has substantially identical content to every other document in the plurality of documents. Disclosed systems and methods further comprise indexing, in accordance with the query independent score, the first document thereby producing an indexed first document. With respect to the plurality of documents, only the indexed first document is included in a document index.
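A rough sketch of that idea: group documents by a content fingerprint, then index only the document with the best query-independent score from each group. The SHA-256 hash of normalized text used here is an assumption that only catches exact matches after whitespace and case cleanup; the patent talks about "substantially identical" content and does not say how its fingerprints are computed. The `score` field is also a made-up stand-in for whatever query-independent metric is used.

```python
import hashlib
from collections import defaultdict


def fingerprint(content):
    """Hash of normalized content; a simplification standing in for the
    patent's unspecified fingerprinting method."""
    normalized = " ".join(content.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def pick_representatives(documents):
    """Group documents by fingerprint and keep the one with the highest
    query-independent score from each group."""
    groups = defaultdict(list)
    for doc in documents:
        groups[fingerprint(doc["content"])].append(doc)
    return [max(group, key=lambda d: d["score"]) for group in groups.values()]


docs = [
    {"url": "https://example.com/a", "content": "Same  article text", "score": 0.4},
    {"url": "https://example.org/b", "content": "same article text", "score": 0.9},
    {"url": "https://example.net/c", "content": "A different page", "score": 0.2},
]
index = pick_representatives(docs)
print([d["url"] for d in index])
```

Only one URL from the duplicate pair makes it into the index here, which matches the "including only the indexed first document in a document index" language.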
Regardless of whether the primary version of a set of duplicate pages is treated as the representative document as suggested in this second patent (whatever that may mean exactly), I think it’s important to get a better understanding of what a primary version of a document might be.
The primary version patent provides some reasons why one of them might be considered a primary version:
(1) Including different versions of the same document does not provide additional useful information, and it does not benefit users.
(2) Search results that include different versions of the same document may crowd out diverse contents that should be included.
(3) Where there are multiple different versions of a document present in the search results, the user may not know which version is most authoritative, complete, or best to access, and thus may waste time accessing the different versions in order to compare them.
Those are the three reasons this duplicate pages patent says it is ideal to identify a primary version from different versions of a document that appears on the Web. The search engine also wants to furnish “the most appropriate and reliable search result.”
How does it work?
The patent tells us that one method of identifying a primary version is as follows.
The different versions of a document are identified from a number of different sources, such as online databases, websites, and library data systems.
For each document version, a priority of authority is selected based on:
(1) The metadata information associated with the document version, such as
Exclusive right to publish
(2) As a second step, the document versions are then checked for length qualification using a length measure. The version with a high priority of authority and a qualified length is deemed the primary version of the document.
If none of the document versions has both a high priority and a qualified length, then the primary version is selected based on the totality of information associated with each document version.
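The two-step selection could be sketched like this. The threshold values and the fallback ordering (priority first, then length) are assumptions of mine; the patent only says the fallback uses "the totality of information associated with each document version."

```python
# Hypothetical thresholds; the patent does not give concrete values.
QUALIFIED_PRIORITY = 5
QUALIFIED_LENGTH = 1000  # e.g., characters of body text


def select_primary(versions):
    """Return the primary version among versions of the same document."""
    qualified = [
        v for v in versions
        if v["priority"] >= QUALIFIED_PRIORITY and v["length"] >= QUALIFIED_LENGTH
    ]
    if qualified:
        return max(qualified, key=lambda v: v["priority"])
    # Fallback when no version has both a high priority and a qualified
    # length: approximated here by priority first, then length.
    return max(versions, key=lambda v: (v["priority"], v["length"]))


versions = [
    {"source": "publisher", "priority": 9, "length": 12000},
    {"source": "preprint server", "priority": 6, "length": 11500},
    {"source": "citation record", "priority": 3, "length": 200},
]
print(select_primary(versions)["source"])
```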
The patent tells us that scholarly works tend to work well with the process in this patent:
Because works of scholarly literature are subject to rigorous format requirements, documents such as journal articles, conference articles, academic papers and citation records of journal articles, conference articles, and academic papers have metadata information describing the content and source of the document. As a result, works of scholarly literature are good candidates for the identification subsystem.
Metadata that might be looked at during this process could include such things as:
Article identifiers such as Digital Object Identifier, PubMed Identifier, SICI, ISBN, and the like
Network location (e.g., URL)
The duplicate pages patent goes into more depth about the methodology behind determining the primary version of a document:
The priority rule generates a numeric value (e.g., a score) to reflect the authoritativeness, completeness, or best to access of a document version. In one example, the priority rule determines the priority of authority assigned to a document version by the source of the document version based on a source-priority list. The source-priority list comprises a list of sources, each source having a corresponding priority of authority. The priority of a source can be based on editorial selection, including consideration of extrinsic factors such as reputation of the source, size of source’s publication corpus, recency or frequency of updates, or any other factors. Each document version is thus associated with a priority of authority; this association can be maintained in a table, tree, or other data structures.
The patent includes a table illustrating the source-priority list.
The patent includes some alternative approaches as well. It tells us that “the priority measure for determining whether a document version has a qualified priority can be based on a qualified priority value.”
A qualified priority value is a threshold to determine whether a document version is authoritative, complete, or easy to access, depending on the priority rule. When the assigned priority of a document version is greater than or equal to the qualified priority value, the document is deemed to be authoritative, complete, or easy to access, depending on the priority rule. Alternatively, the qualified priority can be based on a relative measure, such as given the priorities of a set of document versions, only the highest priority is deemed as qualified priority.
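Here is a small sketch of those two ways of deciding whether a priority is "qualified": an absolute threshold, or a relative measure where only the top priority in the set counts. The source names and numbers in the source-priority list are invented for illustration; the patent says such a list could come from editorial selection.

```python
# Hypothetical source-priority list; real entries would come from editorial selection.
SOURCE_PRIORITY = {
    "publisher site": 10,
    "university repository": 7,
    "personal homepage": 3,
}


def assign_priority(source):
    """Look up a source's priority of authority; unknown sources get 0."""
    return SOURCE_PRIORITY.get(source, 0)


def qualified_by_threshold(sources, qualified_value):
    """Absolute measure: priority must be greater than or equal to the threshold."""
    return [s for s in sources if assign_priority(s) >= qualified_value]


def qualified_by_rank(sources):
    """Relative measure: only the highest priority in the set is qualified."""
    top = max(assign_priority(s) for s in sources)
    return [s for s in sources if assign_priority(s) == top]


sources = ["personal homepage", "publisher site", "university repository"]
print(qualified_by_threshold(sources, 7))
print(qualified_by_rank(sources))
```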
I was in a Google Hangout on Air within the last couple of years where a number of other SEOs (Ammon Johns, Eric Enge, and Jennifer Slegg) and I asked John Mueller and Andrey Lipattsev some questions, including some about duplicate pages. It seems to be something that still raises questions among SEOs.
The patent goes into more detail regarding determining which duplicate pages might be the primary document. We can’t tell whether that primary document might be treated as if it is at the canonical URL for all of the duplicate documents as suggested in the Dejan SEO article that I started with a link to in this post, but it is interesting seeing that Google has a way of deciding which version of a document might be the primary version. I didn’t go into much depth about qualified lengths being used to help identify the primary document, but the patent does spend some time going over that.
Is this a little-known ranking factor? The Google patent on identifying a primary version of duplicate documents does seem to find some importance in identifying what it believes to be the most important version among many duplicate documents. I’m not sure if there is anything here that most site owners can use to help them have their pages rank higher in search results, but it’s good seeing that Google may have explored this topic in more depth.
In general, the subject matter of this specification relates to identifying or generating augmentation queries, storing the augmentation queries, and identifying stored augmentation queries for use in augmenting user searches. An augmentation query can be a query that performs well in locating desirable documents identified in the search results. The performance of the query can be determined by user interactions. For example, if many users that enter the same query often select one or more of the search results relevant to the query, that query may be designated an augmentation query.
In addition to actual queries submitted by users, augmentation queries can also include synthetic queries that are machine generated. For example, an augmentation query can be identified by mining a corpus of documents and identifying search terms for which popular documents are relevant. These popular documents can, for example, include documents that are often selected when presented as search results. Yet another way of identifying an augmentation query is mining structured data, e.g., business telephone listings, and identifying queries that include terms of the structured data, e.g., business names.
These augmentation queries can be stored in an augmentation query data store. When a user submits a search query to a search engine, the terms of the submitted query can be evaluated and matched to terms of the stored augmentation queries to select one or more similar augmentation queries. The selected augmentation queries, in turn, can be used by the search engine to augment the search operation, thereby obtaining better search results. For example, search results obtained by a similar augmentation query can be presented to the user along with the search results obtained by the user query.
This past March, Google was granted a patent that involves giving quality scores to queries (the quote above is from that patent). The patent refers to high scoring queries as augmentation queries. It is interesting to see that searcher selection is one way that might be used to determine the quality of queries. So, when someone searches, Google may compare the SERPs they receive from the original query to augmentation query results based upon previous searches using the same query terms or synthetic queries. This evaluation against augmentation queries is based upon which search results have received more clicks in the past. Google may decide to add results from an augmentation query to the results for the query searched for to improve the overall search results.
How does Google find augmentation queries? One place for it to look is at query logs and click logs. As the patent tells us:
To obtain augmentation queries, the augmentation query subsystem can examine performance data indicative of user interactions to identify queries that perform well in locating desirable search results. For example, augmentation queries can be identified by mining query logs and click logs. Using the query logs, for example, the augmentation query subsystem can identify common user queries. The click logs can be used to identify which user queries perform best, as indicated by the number of clicks associated with each query. The augmentation query subsystem stores the augmentation queries mined from the query logs and/or the click logs in the augmentation query store.
This doesn’t mean that Google is using clicks to directly determine rankings. But it is deciding which augmentation queries might be worth using to provide SERPs that people may be satisfied with.
There are other things that Google may look at to decide which augmentation queries to use in a set of search results. The patent points out some other factors that may be helpful:
In some implementations, a synonym score, an edit distance score, and/or a transformation cost score can be applied to each candidate augmentation query. Similarity scores can also be determined based on the similarity of search results of the candidate augmentation queries to the search query. In other implementations, the synonym scores, edit distance scores, and other types of similarity scores can be applied on a term by term basis for terms in search queries that are being compared. These scores can then be used to compute an overall similarity score between two queries. For example, the scores can be averaged; the scores can be added; or the scores can be weighted according to the word structure (nouns weighted more than adjectives, for example) and averaged. The candidate augmentation queries can then be ranked based upon relative similarity scores.
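As a rough illustration of term-by-term scoring, the sketch below gives each query term its best match among a candidate's terms, averages those, and ranks the candidates. Everything specific here is an assumption: the synonym pairs, the 0.9 synonym score, and the use of Python's `difflib.SequenceMatcher` ratio as a stand-in for an edit distance score. It also skips the part-of-speech weighting the patent mentions.

```python
from difflib import SequenceMatcher

# Hypothetical synonym pairs.
SYNONYMS = {frozenset(("cheap", "inexpensive")), frozenset(("film", "movie"))}


def term_similarity(a, b):
    """Score two terms: exact match, assumed synonym score, or string similarity."""
    if a == b:
        return 1.0
    if frozenset((a, b)) in SYNONYMS:
        return 0.9  # assumed synonym score
    return SequenceMatcher(None, a, b).ratio()


def query_similarity(query, candidate):
    """Average, over the query's terms, of the best match in the candidate."""
    q_terms, c_terms = query.lower().split(), candidate.lower().split()
    if not q_terms or not c_terms:
        return 0.0
    best = [max(term_similarity(q, c) for c in c_terms) for q in q_terms]
    return sum(best) / len(best)


def rank_candidates(query, candidates):
    """Order candidate augmentation queries by similarity to the user query."""
    return sorted(candidates, key=lambda c: query_similarity(query, c), reverse=True)


candidates = [
    "weather forecast today",
    "inexpensive film tickets",
    "cheap movie tickets online",
]
print(rank_candidates("cheap movie tickets", candidates))
```

Averaging is only one of the options the patent lists; adding the scores or weighting nouns more heavily would change the ranking function but not the overall shape.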
I’ve seen white papers from Google before mentioning synthetic queries, which are queries performed by the search engine instead of human searchers. It makes sense for Google to be exploring query spaces in a manner like this, to see what results are like, and using information such as structured data as a source of those synthetic queries. I’ve written about synthetic queries before at least a couple of times, including in the post Does Google Search Google? How Google May Create and Use Synthetic Queries.
Implicit Signals of Query Quality
It is an interesting patent in that it talks about things such as long clicks and short clicks, and ranking web pages on the basis of such things. The patent refers to such things as “implicit signals of query quality.” More about that in the patent here:
In some implementations, implicit signals of query quality are used to determine if a query can be used as an augmentation query. An implicit signal is a signal based on user actions in response to the query. Example implicit signals can include click-through rates (CTR) related to different user queries, long click metrics, and/or click-through reversions, as recorded within the click logs. A click-through for a query can occur, for example, when a user of a user device, selects or “clicks” on a search result returned by a search engine. The CTR is obtained by dividing the number of users that clicked on a search result by the number of times the query was submitted. For example, if a query is input 100 times, and 80 persons click on a search result, then the CTR for that query is 80%.
A long click occurs when a user, after clicking on a search result, dwells on the landing page (i.e., the document to which the search result links) of the search result or clicks on additional links that are present on the landing page. A long click can be interpreted as a signal that the query identified information that the user deemed to be interesting, as the user either spent a certain amount of time on the landing page or found additional items of interest on the landing page.
A click-through reversion (also known as a “short click”) occurs when a user, after clicking on a search result and being provided the referenced document, quickly returns to the search results page from the referenced document. A click-through reversion can be interpreted as a signal that the query did not identify information that the user deemed to be interesting, as the user quickly returned to the search results page.
These example implicit signals can be aggregated for each query, such as by collecting statistics for multiple instances of use of the query in search operations, and can further be used to compute an overall performance score. For example, a query having a high CTR, many long clicks, and few click-through reversions would likely have a high-performance score; conversely, a query having a low CTR, few long clicks, and many click-through reversions would likely have a low-performance score.
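The CTR arithmetic from the patent's example, plus one possible way to roll the three signals into a single number, might look like this. The CTR calculation follows the patent's definition; the `performance_score` formula is purely my assumption, since the patent does not say how the signals are combined.

```python
def click_through_rate(clicks, submissions):
    """CTR as defined in the patent: clicks divided by query submissions."""
    return clicks / submissions if submissions else 0.0


def performance_score(submissions, clicks, long_clicks, reversions):
    """Assumed formula: CTR scaled by the share of long clicks minus the
    share of click-through reversions, floored at zero."""
    if submissions == 0 or clicks == 0:
        return 0.0
    ctr = click_through_rate(clicks, submissions)
    long_share = long_clicks / clicks
    short_share = reversions / clicks
    return max(0.0, ctr * (long_share - short_share))


# The patent's example: 100 submissions and 80 clicks give a CTR of 80%.
print(click_through_rate(80, 100))

# A query with many long clicks scores higher than one with many short clicks.
print(performance_score(100, 80, 60, 10) > performance_score(100, 80, 10, 60))
```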
The reasons for the process behind the patent are explained in the description section of the patent where we are told:
Often users provide queries that cause a search engine to return results that are not of interest to the users or do not fully satisfy the users’ need for information. Search engines may provide such results for a number of reasons, such as the query including terms having term weights that do not reflect the users’ interest (e.g., in the case when a word in a query that is deemed most important by the users is attributed less weight by the search engine than other words in the query); the queries being a poor expression of the information needed; or the queries including misspelled words or unconventional terminology.
A quality signal for a query term can be defined in this way:
the quality signal being indicative of the performance of the first query in identifying information of interest to users for one or more instances of a first search operation in a search engine; determining whether the quality signal indicates that the first query exceeds a performance threshold; and storing the first query in an augmentation query data store if the quality signal indicates that the first query exceeds the performance threshold.
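In code, that claim language boils down to a threshold check before a query is saved. This is a minimal sketch; the threshold value and the use of a plain set as the "augmentation query data store" are assumptions.

```python
# Hypothetical threshold; the patent does not give a value.
PERFORMANCE_THRESHOLD = 0.5


def store_augmentation_queries(query_scores, store=None):
    """Add each query whose quality signal exceeds the threshold to the store."""
    store = set() if store is None else store
    for query, score in query_scores.items():
        if score > PERFORMANCE_THRESHOLD:
            store.add(query)
    return store


scores = {"movie times": 0.72, "asdf movies": 0.08, "movie trailers": 0.5}
print(sorted(store_augmentation_queries(scores)))
```

A score equal to the threshold is left out here because the claim says the query must exceed it.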
The patent can be found at:
Inventors: Anand Shukla, Mark Pearson, Krishna Bharat and Stefan Buettcher
Assignee: Google LLC
US Patent: 9,916,366
Granted: March 13, 2018
Filed: July 28, 2015
Methods, systems, and apparatus, including computer program products, for generating or using augmentation queries. In one aspect, a first query stored in a query log is identified and a quality signal related to the performance of the first query is compared to a performance threshold. The first query is stored in an augmentation query data store if the quality signal indicates that the first query exceeds a performance threshold.
References Cited about Augmentation Queries
These were a number of references cited by the applicants of the patent, which looked interesting, so I looked them up to see if I could find them to read them and share them here.
Jane Yung-jen Hsu and Wen-tau Yih. 1997. Template-based information mining from HTML documents. In Proceedings of the fourteenth national conference on artificial intelligence and ninth conference on Innovative application of artificial intelligence (AAAI’97/IAAI’97). AAAI Press, pp. 256-262.
This is a continuation patent, which means that it was granted before, with the same description, and it now has new claims. When that happens, it can be worth looking at the old claims and the new claims to see how they have changed. I like that the new version seems to focus more strongly upon structured data. It tells us that it might use structured data in sites that appear for queries as synthetic queries, and if those meet the performance threshold, they may be added to the search results that appear for the original queries. The claims do seem to focus a little more on structured data as synthetic queries, but it doesn’t really change the claims that much. They haven’t changed enough to publish them side by side and compare them.
What Google Has Said about Structured Data and Rankings
Google spokespeople had been telling us that Structured Data doesn’t impact rankings directly, but what they have been saying does seem to have changed somewhat recently. In the Search Engine Roundtable post, Google: Structured Data Doesn’t Give You A Ranking Boost But Can Help Rankings we are told that just having structured data on a site doesn’t automatically boost the rankings of a page, but if the structured data for a page is used as a synthetic query, and it meets the performance threshold as an augmentation query, it might be shown in rankings, thus helping in rankings (as this patent tells us.)
Note that this isn’t new, and the continuation patent’s claims don’t appear to have changed that much, so structured data is still being used for synthetic queries, and is checked to see if those work as augmentation queries. This does seem to be a really good reason to make sure you are using the appropriate structured data for your pages.
My last post was Five Years of Google Ranking Signals, and I started that post by saying that there are other posts about ranking signals that have some issues. But, I don’t want to turn people away from looking at one recent post that did contain a lot of useful information.
Cyrus did a video with Ross Hudgens of Siege Media where they talked about those ranking signals, called Google Ranking Factors with Cyrus Shepard. I’m keeping this post short on purpose, to make the discussion about ranking the focus of this post, and the star. There is some really good information in the video and in the post from Cyrus. Cyrus takes a different approach to writing about ranking signals from what I wrote, but it’s worth the time visiting and listening and watching.
There are some other pages about Google Ranking Signals that don’t consider up-to-date information or sometimes use questionable critical thinking to argue that some of the signals that they include are actually something that Google considers. I’ve been blogging about patents from Google, Yahoo, Microsoft, and Apple since 2005, and have been exploring what those might say are ranking signals for over a decade.
Representatives from Google have stated that “Just because we have a patent on something, doesn’t mean we are using it.” The first time I heard them say that was after Go Daddy started advertising domain registrations of up to 10 years, because one Google patent (Information Retrieval Based on Historical Data) said that they might look at length of domain registration as a ranking signal, based on the thought that a “spammer would likely only register a domain for a period of one year.” Actually, many people register domains for one year and have their registrations on auto-renewal, so a one-year registration is not evidence that the person registering the domain is a spammer.
I’ve included some ranking signals that are a little older, but most of the things I’ve listed are from the past five years, often with blog posts I’ve written about them, and patents that go with them. This list is a compilation of blog posts that I have been working on for years, taking many hours of regular searching through patent filings, and reading blog posts from within the Search and SEO industries, and reading through many patents that I didn’t write about, and many that I have. If you have questions about any of the signals I’ve listed, please ask about them in the comments.
Some of the patents I have blogged about have not been implemented by Google yet, but could be. A company such as Google files a patent to protect the intellectual property behind their ideas, the work that their search engineers and testing teams put into those ideas. It is worth looking at, reading, and understanding many of these patents because they provide some insights into ideas that Google may have explored when developing ranking signals, and they may give you ideas of things that you may want to explore, and questions to keep in mind when you are working upon optimizing a site. Patents are made public to inspire people to innovate and invent and understand new ideas and inventions.
1. Domain Age and Rate of Linking
Google does have a patent called Document scoring based on document inception date, in which they tell us that they will often use the date that they first crawl a site, or the first time they see a document referenced in another site, as the age of that site. The patent also tells us that Google may look at the links pointed to a site, and calculate what the average rate of links pointed to a site may be and use that information to rank a site, based upon that linking.
2. Use of Keywords
Matt Cutts wrote a newsletter for librarians in which he explained how Google crawled the web, making an inverted index of the Web with terms found on documents from the Web that it would match up with query terms when people performed searches. It shows us the importance of keywords in queries and how finding pages that contain those keywords is an important part of performing searches. A copy of that newsletter can be found here: https://www.analistaseo.es/wp-content/uploads/2014/09/How-Google-Index-Rank.pdf
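For anyone who hasn't seen one, an inverted index can be sketched in a few lines: map each term to the set of documents that contain it, then intersect those sets for the terms in a query. This is a textbook toy version with made-up documents, not a description of how Google's index is actually built.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return IDs of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


pages = {
    "page1": "civil war battle maps",
    "page2": "civil engineering jobs",
    "page3": "maps of civil war battlefields",
}
index = build_inverted_index(pages)
print(sorted(search(index, "civil war")))
```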
3. Related Phrases
Google recently updated its first phrase-based indexing patent, which tells us in its claims that pages with more related phrases on them rank higher than pages with fewer related phrases on them. That patent is: Phrase-based searching in an information retrieval system. Related phrases are complete phrases that may predict the topic of the page they appear upon. Google might look at the queries that a page is optimized for, and look at the highest ranking pages for those query terms, and see which meaningful complete phrases frequently occur (or co-occur) on those high ranking pages.
Techniques are disclosed that locate implicitly defined semantic structures in a document, such as, for example, implicitly defined lists in an HTML document. The semantic structures can be used in the calculation of distance values between terms in the documents. The distance values may be used, for example, in the generation of ranking scores that indicate a relevance level of the document to a search query.
If a list on a page has a heading, the items in that list are all considered to be an equal distance away from that heading. The words contained under a main heading on a page are all considered to be an equal distance away from that main heading. All of the words on a page are considered to be an equal distance away from the title of that page. So, a page that is titled “Ford” which has the word “motors” on that page is considered to be relevant for the phrase “Ford Motors.” Here is an example of how that semantic closeness works with a heading and a list:
The patent tells us that it may look at words that have more than one meaning in knowledge bases (such as bank, which could mean a building money is stored in, or the ground on one side of a river, or what a plane does when it turns in the air.) The search engine may take terms from that knowledge base that show what meaning was intended and collect them as “Context Terms,” and it might look for those context terms when indexing pages those words are on, so that it indexes the correct meaning.
8. Language Models Using Ngrams
Google may give pages quality scores based upon language models created from those pages when it looks at the ngrams on the pages of a site. This is similar to the Google Books Ngram Viewer.
Here is an ngram analysis of a well-known phrase, using ngrams of 5 words each:
The quick brown fox jumps
quick brown fox jumps over
brown fox jumps over the
fox jumps over the lazy
jumps over the lazy dog
Ngrams from a complete page might be collected like that, and from a collection of good pages and bad pages, to build language models (and Google has done that with a lot of books, as we see from the Google Ngram Viewer covering a very large collection of books.) It would be possible to tell which pages are gibberish from such a set of language models. This Gibberish content patent also mentions a keyword stuffing score that it would try to identify.
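The ngram collection described above can be sketched in a few lines of Python. This is only a minimal illustration of splitting text into word ngrams; the function name `word_ngrams` is my own, and it does not show how Google builds or scores its language models.

```python
def word_ngrams(text, n=5):
    """Return every run of n consecutive words in text, in order."""
    words = text.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


phrase = "The quick brown fox jumps over the lazy dog"
for gram in word_ngrams(phrase, n=5):
    print(gram)
```

Run on a whole page instead of one sentence, the same function would produce the raw material for the kind of counts a language model is built from.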
If they do, the authoritative results may be merged into the original results. The way it describes authoritative results:
In general, an authoritative site is a site that the search system has determined to include particularly trusted, accurate, or reliable content. The search system can distinguish authoritative sites from low-quality sites that include resources with shallow content or that frequently include spam advertisements. Whether the search system considers a site to be authoritative will typically be query-dependent. For example, the search system can consider the site for the Centers for Disease Control, “cdc.gov,” to be an authoritative site for the query “cdc mosquito stop bites,” but may not consider the same site to be authoritative for the query “restaurant recommendations”. A search result that identifies a resource on a site that is authoritative for the query may be referred to as an authoritative search result.
11. How Well Database Answers Match Queries
This patent doesn’t seem to have been implemented yet. But it might, and is worth thinking about. I wrote the post How Google May Rank Websites Based Upon Their Databases Answering Queries, based upon the patent Resource identification from organic and structured content. It tells us that Google might look at searches on a site, and how a site might answer them, to see if they are similar to the queries that Google receives from searchers. If they are, it might rank results from those sites higher. The patent also shows us that it might include the database results from such sites within Google Search results. If you start seeing that happening, you will know that Google decided to implement this patent. Here is the screenshot from the patent:
Some patents provide a list of the “Advantages” of following the process they describe, as does this one. It describes the following advantages of its approach:
1) Events in a given location can be ranked so that popular or interesting events can be easily identified.
2) The ranking can be adjusted to ensure that highly-ranked events are diverse and different from one another.
3) Events matching a variety of event criteria can be ranked so that popular or interesting events can be easily identified.
4) The ranking can be provided to other systems or services that can use the ranking to enhance the user experience. For example, a search engine can use the ranking to identify the most popular events that are relevant to a received search query and present the most popular events to the user in response to the received query.
5) A recommendation engine can use the ranking to provide information identifying popular or interesting events to users that match the users’ interests.
14. The Amount of Weight from a Link is Based upon the Probability that Someone Might Click upon It
I came across an update to the reasonable surfer patent, which focused more upon anchor text used in links than the earlier version of the patent, and told us that the amount of weight (PageRank) that might pass through a link was based upon the likelihood that someone might click upon that link.
identifying: context relating to one or more words before or after the links, words in anchor text associated with the links, and a quantity of the words in the anchor text, the weight being determined based on whether the particular feature data corresponds to the stored feature data associated with the one or more links or whether the particular feature data..
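To make the idea of click-weighted links concrete, here is a rough sketch in Python. It assumes each link on a page already has a relative click score; in the patent those would come from features such as anchor text and surrounding words, but the numbers, URLs, and the function name `link_weights` below are made up for illustration.

```python
def link_weights(page_rank, click_scores):
    """Split a page's PageRank across its links in proportion to click scores."""
    total = sum(click_scores.values())
    if total == 0:
        return {url: 0.0 for url in click_scores}
    return {url: page_rank * score / total for url, score in click_scores.items()}


# Hypothetical scores: a prominent product link, an about link, a footer terms link.
scores = {"/products": 6, "/about": 3, "/terms": 1}
weights = link_weights(1.0, scores)
for url, weight in weights.items():
    print(url, round(weight, 2))
```

Under an equal split each of these three links would pass along a third of the weight. Here the link that is most likely to be clicked passes along most of it.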
A search engine that answers questions by crawling and indexing facts found within structured data on a site works differently than a search engine that looks at the words used in a query and tries to return documents that contain those same words, hoping that such a matching of strings might turn up an actual answer to the informational need that inspired the query in the first place. We can see how search using structured data works in this flowchart from a 2017 Google patent:
This newer patent tells us that it might solve that book search in this manner:
In particular, for each encoded data item associated with a given identified schema, the system searches the locations in the encoded data item identified by the schema as storing values for the specified keys to identify encoded data items that store values for the specified keys that satisfy the requirements specified in the query. For example, if the query is for semi-structured data items that have a value “Ernest Hemingway” for an “author” key and that have values in a range of “1948-1952” for a “year published” key, the system can identify encoded data items that store a value corresponding to “Ernest Hemingway” in the location identified in the schema associated with the encoded data item as storing the value for the “author” key and that store a value in the range from “1948-1952” in the location identified in the schema associated with the encoded data item as storing the value for the “year published” key. Thus, the system can identify encoded data items that satisfy the query efficiently, i.e., without searching encoded data items that do not include values for each key specified in the received query and without searching locations in the encoded data items that are not identified as storing values for the specified keys.
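The passage above can be sketched as a small Python example. This is my own simplified reading, not the patent's implementation: I am assuming a schema is just a mapping from key names to positions in a stored tuple, and the schema names (`book_v1`, `book_v2`, `film_v1`) and the `search` function are hypothetical.

```python
# Hypothetical schemas: each maps a key name to the position where its value is stored.
SCHEMAS = {
    "book_v1": {"title": 0, "author": 1, "year published": 2},
    "book_v2": {"author": 0, "title": 1, "year published": 2, "publisher": 3},
    "film_v1": {"title": 0, "director": 1},
}

# Encoded items: (schema name, values stored at the positions that schema names).
ITEMS = [
    ("book_v1", ("The Old Man and the Sea", "Ernest Hemingway", 1952)),
    ("book_v1", ("A Farewell to Arms", "Ernest Hemingway", 1929)),
    ("book_v2", ("Ernest Hemingway", "Across the River and into the Trees",
                 1950, "Charles Scribner's Sons")),
    ("book_v2", ("George Orwell", "Nineteen Eighty-Four", 1949, "Secker & Warburg")),
    ("film_v1", ("Casablanca", "Michael Curtiz")),
]


def search(items, schemas, requirements):
    """requirements maps a key to a predicate its stored value must satisfy."""
    # Keep only the schemas that have a location for every key in the query.
    usable = {name: locs for name, locs in schemas.items()
              if all(key in locs for key in requirements)}
    matches = []
    for schema_name, values in items:
        locs = usable.get(schema_name)
        if locs is None:
            continue  # this item cannot hold all of the queried keys, so skip it
        # Look only at the locations the schema gives for the queried keys.
        if all(test(values[locs[key]]) for key, test in requirements.items()):
            matches.append({key: values[pos] for key, pos in locs.items()})
    return matches


query = {
    "author": lambda value: value == "Ernest Hemingway",
    "year published": lambda value: 1948 <= value <= 1952,
}
results = search(ITEMS, SCHEMAS, query)
for book in results:
    print(book["title"], book["year published"])
```

The film item is never inspected, because its schema has no “author” or “year published” key, which is the efficiency the patent passage is pointing at.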
It was interesting seeing Google come out with a patent about searching semi-structured data which focused upon the use of JSON-LD. We see them providing an example of JSON on one of the Google Developers pages, at: Introduction to Structured Data
As it tells us on that page:
This documentation describes which fields are required, recommended, or optional for structured data with special meaning to Google Search. Most Search structured data uses schema.org vocabulary, but you should rely on the documentation on developers.google.com as definitive for Google Search behavior, rather than the schema.org documentation. Attributes or objects not described here are not required by Google Search, even if marked as required by schema.org.
I’ve used the analogy of how XML sitemaps are machine-readable, compared to HTML sitemaps. In the same way, JSON-LD shows off the facts on a site in a machine-readable way, as opposed to content that is in HTML format. As the patent tells us, that is its purpose:
In general, this specification describes techniques for extracting facts from collections of documents.
The patent discusses schemas that might be on a site, and key/value pairs that could be searched, and details about such a search of semi-structured data on a site:
The aspect further includes receiving a query for semi-structured data items, wherein the query specifies requirements for values for one or more keys; identifying schemas from the plurality of schemas that identify locations for values corresponding to each of the one or more keys; for each identified schema, searching the encoded data items associated with the schema to identify encoded data items that satisfy the query; and providing data identifying values from the encoded data items that satisfy the query in response to the query. Searching the encoded data items associated with the schema includes: searching, for each encoded data item associated with the schema, the locations in the encoded data item identified by the schema as storing values for the specified keys to identify whether the encoded data item stores values for the specified keys that satisfy the requirements specified in the query.
The patent providing details of the use of JSON-LD to provide a machine-readable set of facts on a site can be found here:
Storing semi-structured data
Inventors: Martin Probst
Assignee: Google Inc.
US Patent: 9,754,048
Granted: September 5, 2017
Filed: October 6, 2014
Methods, systems, and apparatus, including computer programs encoded on computer storage media, for storing semi-structured data. One of the methods includes maintaining a plurality of schemas; receiving a first semi-structured data item; determining that the first semi-structured data item does not match any of the schemas in the plurality of schemas; and in response to determining that the first semi-structured data item does not match any of the schemas in the plurality of schemas: generating a new schema, encoding the first semi-structured data item in the first data format to generate the first new encoded data item in accordance with the new schema, storing the first new encoded data item in the data item repository, and associating the first new encoded data item with the new schema.
By using Structured Data such as in Schema Vocabulary in JSON-LD formatting, you make sure that you provide precise facts in key/value pairs that provide an alternative to the HTML-based content on the pages of a site. Make sure that you follow the Structured Data General Guidelines from Google when you add it to a site. That page tells us that pages that don’t follow the guidelines may not rank as highly, or may become ineligible for rich results appearing for them in Google SERPs.
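As a small illustration of facts expressed as key/value pairs, here is a Python sketch that builds a JSON-LD block for a book using Schema.org vocabulary. The choice of book and the helper name `to_jsonld_script` are mine, and which properties Google requires or recommends for any given rich result should be checked against Google's own documentation rather than taken from this sketch.

```python
import json


def to_jsonld_script(data):
    """Wrap a dictionary of Schema.org key/value pairs in a JSON-LD script tag."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


book = {
    "@context": "https://schema.org",
    "@type": "Book",
    "name": "The Old Man and the Sea",
    "author": {"@type": "Person", "name": "Ernest Hemingway"},
    "datePublished": "1952",
}

markup = to_jsonld_script(book)
print(markup)
```

The printed block is what would be placed in the HTML of a page, alongside the human-readable content about the same book.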
I spoke at SMX Advanced this week on Schema markup and Structured Data, as part of an introduction to its use at Google.
I had the chance to visit Seattle, and tour some of it. I took some photos, but would like to go back sometime and take a few more, and see more of the city.
One of the places that I did want to see was Pike Place Market. It was a couple of blocks away from the hotel I stayed at (the Marriott Waterfront).
It is a combination fish and produce market, and is home to one of the earliest Starbucks.
I could see living near the market and shopping there regularly. It has a comfortable feel to it.
This is a view of the Farmers Market from the side. I wish I had the chance to come back later in the day, and see what it was like other than in the morning.
This was a nice little park next to Pike Place Market, which looked like a place to take your dog for a walk while in the area, and had a great view of Elliott Bay (the central part of Puget Sound).
This is a view of the waterfront from closer to the conference center.
You can see Mount Rainier from the top of the Conference Center.
My presentation for SMX Advanced 2018:
Schema, Structured Data & Scattered Databases Such as the World Wide Web. My role in this session is to introduce Schema and Structured Data and how Google is using them on the Web.
Google is possibly best known for the PageRank algorithm invented by founder Lawrence Page, whom it is named after. What looks like the second patent filed by someone at Google was the DIPRE (Dual Iterative Pattern Relation Expansion) patent, invented and filed by Sergey Brin. He didn’t name it after himself (BrinRank) like Page did with PageRank.
The provisional patent filed for this invention was the whitepaper, “Extracting Patterns and Relations from Scattered Databases such as the World Wide Web.” The process behind it is set out in the paper, and it starts with a list of 5 books, with their titles, authors, publishers, and years published. Unlike PageRank, it doesn’t involve crawling webpages and indexing links from page to page and anchor text. Instead, it involves collecting facts from page to page. When it finds pages that contain properties and attributes of these five books, it is supposed to collect similar facts about other books on the same site. Once it has completed that, it is supposed to move on to other sites, look for those same 5 books, and collect more books. The idea is to eventually know where all the books are on the Web, along with facts about those books that could be used to answer questions about them.
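The loop described above (start from known books, learn how pages present them, then use that to find more books) can be sketched roughly in Python. This is a heavily simplified illustration and not Brin's actual algorithm: it tracks only (title, author) pairs, it treats a pattern as the text before, between, and after the two values on a line, and the sample pages and function names are invented for the example.

```python
import re


def find_patterns(pages, known, width=4):
    """Record the text around each known (title, author) pair found on a line."""
    patterns = set()
    for page in pages:
        for line in page.splitlines():
            for title, author in known:
                t = line.find(title)
                if t == -1:
                    continue
                a = line.find(author, t + len(title))
                if a == -1:
                    continue
                prefix = line[max(0, t - width):t]
                middle = line[t + len(title):a]
                suffix = line[a + len(author):a + len(author) + width]
                if prefix and middle and suffix:  # skip patterns too loose to trust
                    patterns.add((prefix, middle, suffix))
    return patterns


def apply_patterns(pages, patterns):
    """Use each pattern as a template to pull new (title, author) pairs."""
    found = set()
    for prefix, middle, suffix in patterns:
        regex = re.compile(re.escape(prefix) + r"(.+?)" + re.escape(middle)
                           + r"(.+?)" + re.escape(suffix))
        for page in pages:
            for line in page.splitlines():
                for match in regex.finditer(line):
                    found.add((match.group(1), match.group(2)))
    return found


def dipre(pages, seeds, rounds=2):
    """Alternate between learning patterns and collecting pairs."""
    known = set(seeds)
    for _ in range(rounds):
        known |= apply_patterns(pages, find_patterns(pages, known))
    return known


pages = [
    "<li><b>The Robots of Dawn</b> by Isaac Asimov</li>\n"
    "<li><b>Startide Rising</b> by David Brin</li>",
    "<tr><td>Startide Rising</td><td>David Brin</td></tr>\n"
    "<tr><td>Foundation</td><td>Isaac Asimov</td></tr>",
]
books = dipre(pages, [("The Robots of Dawn", "Isaac Asimov")])
for title, author in sorted(books):
    print(title, "-", author)
```

In the first round the single seed teaches the list pattern on the first page, which turns up a second book. In the second round that second book is recognized in the table on the other page, which teaches a new pattern and turns up a third.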
This is where we see Google being concerned about structured data on the web, and how helpful knowing about it could be.
When I first started out doing in-house SEO, it was for a Delaware incorporation business, and geography was an important part of the queries that my pages were found for. I had started looking at patents, and ones such as this one on “Generating Structured Data” caught my attention. It focused on collecting data about local entities, or local businesses, and properties related to those. It was built by the team led by Andrew Hogue, who was in charge of the Annotation Framework at Google. That team was responsible for “The Fact Repository,” an early version of Google’s Knowledge Graph.
If you’ve heard of NAP consistency, and of mentions being important to local search, it is because local search was focusing on collecting structured data that could be used to answer questions about businesses. Patents about location prominence followed, which told us that a link counted as a mention, as did a patent on local authority, which determined which website was the authoritative one for a business. But it seemed to start with collecting structured data about businesses at places.
The DIPRE Algorithm focused upon crawling the web to find facts, and Google Maps built that into an approach that could be used to rank places and answer questions about them.
If you haven’t had a chance to use Google’s experimental table search, it is worth trying out. It can answer questions from data-based tables across the web, such as “what is the longest wooden pier in California”, which is the one in Oceanside, a town next to the one I live in. It is from the WebTables project at Google.
Database fields are sometimes referred to as “schema,” and table headers, which tell us what kind of data is in a table column, may also be referred to as “schema.” A data-based web table could be considered a small structured database, and Google’s WebTables project found that there was a lot of information that could be found in web tables on the Web.
Try out the first link above (the WebTables Project Slide) when you get the chance, and do some searches on Google’s table search. The second paper is one that described the WebTables project when it first started out, and the one that follows it describes some of the things that Google researchers learned from the Project. We’ve seen Structured Snippets like the one above grabbing facts to include in a snippet (in this case from a data table on the Wikipedia page about the Oceanside Pier.)
When a data table column contains the same data that another table contains, and the first doesn’t have a table header label, it might learn a label from the second table (and this is considered a way to learn semantics or meaning from tables). These are truly scattered databases across the World Wide Web, but through the use of crawlers, that information can be collected and become useful, as the DIPRE algorithm described.
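Here is a rough Python sketch of borrowing a header label from another table. It is not the WebTables project's actual method; I am assuming a simple rule where an unlabeled column takes the header of whichever labeled column shares the most values with it, and the `infer_header` name and the `min_overlap` threshold are my own.

```python
def infer_header(unlabeled_column, labeled_tables, min_overlap=0.5):
    """Borrow the header of the labeled column that shares the most values."""
    values = set(unlabeled_column)
    if not values:
        return None
    best_label, best_score = None, 0.0
    for table in labeled_tables:
        for header, column in table.items():
            overlap = len(values & set(column)) / len(values)
            if overlap > best_score:
                best_label, best_score = header, overlap
    return best_label if best_score >= min_overlap else None


# A labeled table found elsewhere on the web, stored as header -> column values.
labeled = [{
    "Country": ["Poland", "France", "Germany", "Spain"],
    "Capital": ["Warsaw", "Paris", "Berlin", "Madrid"],
}]

label = infer_header(["Warsaw", "Paris", "Berlin"], labeled)
print(label)
```

The threshold keeps a column from picking up a label when it only shares a stray value or two with another table.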
In 2005, the Official Google Blog published this short story, which told us about Google sometimes answering direct questions in response to queries at the top of Web results. I don’t remember when these first started appearing, but I do remember definition results from about a year earlier, where you could type “Define:” before a word, or ask “What is” before a word, and Google would show a definition. There was a patent that described how they were finding definitions from glossary pages, and how to ideally set up those glossaries so that your definitions might be the ones that end up as responses.
In 2012, Google introduced the Knowledge Graph, which told us that they would be focusing upon learning about specific people, places and things, and answering questions about those instead of just continuing to match keywords in queries to keywords in documents. They told us that this was a move to things instead of strings, like the books in Brin’s DIPRE or local entities in Google Maps.
We could start using the Web as a scattered database, with questions and answers from places such as Wikipedia tables helping to answer queries such as “What is the capital of Poland.”
And Knowledge bases such as Wikipedia, Freebase, IMDB and Yahoo Finance could be the sources of facts about properties and attributes about things such as movies and actors and businesses where Google could find answers to queries without having to find results that had the same keywords in the document as the query.
In 2011, the Schema.org site was launched as a joint project from Google, Yahoo, Bing, and Yandex. It provides machine-readable text that can be added to web pages. This text is intended only for machines to read, much like XML sitemaps are, and it provides an alternative channel of information to search engines about the entities pages are about, and the properties and attributes on those pages.
While Schema.org was introduced in 2011, it was built to be extendable, and to let subject matter experts add new schema, like this extension from GS1 (the inventors of barcodes in brick and mortar stores). If you haven’t tried out this demo from them, it is worth getting your hands on to see what is possible.
In 2014, Google published their Biperpedia paper, which tells us about how they might create ontologies from query streams (sessions about specific topics) by finding terms to extract data about from the Web. At one point in time, search engines would do focused crawls of the web starting at sources such as DMOZ, so that the index of the Web they were constructing contained pages about a wide range of categories. By using query stream information, they are crowdsourcing the building of resources to build ontologies about. This paper tells us that Biperpedia enabled them to build ontologies that were larger than what they had developed through Freebase, which may be partially why Freebase was replaced by Wikidata.
The Google+ group I’ve linked to above on the Schema Resources Page has members who work on Schema from Google, such as Dan Brickley, who is the head of schema for Google. Learning about extensions is a good idea, especially if you might consider participating in building new ones, and the community group has a mailing list, which lets you see and participate in discussions about the growth of Schema.
When Google patents talk about paid search, they refer to those paid results as “content” rather than as advertisements.
A recent patent from Google (Combining Content with Search Results) tells us about how Google might identify when organic search results might be about specific entities, such as brands. It may also recognize when paid results are about the same brands, or about products from those brands.
In the event that a set of search results contains high ranking organic results from a specific brand, and a paid search result from that same brand, the process described in the patent might allow for the creation of a combined content result of the organic result with the paid result.
Merging Local and Organic Results in the Past
When I saw this new patent, it brought back memories of when Google found a way to merge organic search results with local search results. The day after I wrote about that, in the following post, I received a call from a co-worker who asked me if I had any idea why a top ranking organic result for a client might have disappeared from Google’s search results.
I asked her what the query term was, and who the client was. I performed the search, and noticed that our client was ranking highly for that query term in a local result, but their organic result had disappeared. I pointed her to the blog post I wrote the day before, about Google possibly merging local and organic results, with the organic result disappearing, and the local result getting boosted in rankings. It seemed like that is what happened to our client, and I sent her a link to my post, which described that.
Google did merge that client’s organic listing with their local listing, but it appeared that was something that they ended up not doing too often. I didn’t see them do that too many more times.
I am wondering, will Google start merging together paid search results with organic search results? If they would do that for local and organic results, which rank things in different ways, it is possible that they might with organic and paid. The patent describes how.
The newly granted patent does tell us about how paid search works in Search results at Google:
Content slots can be allocated to content sponsors as part of a reservation system, or in an auction. For example, content sponsors can provide bids specifying amounts that the sponsors are respectively willing to pay for presentation of their content. In turn, an auction can be run, and the slots can be allocated to sponsors according, among other things, to their bids and/or the relevance of the sponsored content to content presented on a page hosting the slot or a request that is received for the sponsored content. The content can be provided to a user device such as a personal computer (PC), a smartphone, a laptop computer, a tablet computer, or some other user device.
Combining Paid and Organic Results
Here is the process behind this new patent involving merging paid results (content) and organic results:
A search query is received.
Search results responsive to the query are returned, including one associated with a brand.
Content items (paid search results) based at least in part on the query, are returned for delivery along with the search results responsive to the query.
This approach includes looking to see if eligible content items are associated with a same brand as the brand associated in the organic search results.
If there is a paid result and an organic result that are associated with each other, it may combine the organic search result and the eligible content item into a combined content item, and provide the combined content item as a search result responsive to the request.
When Google decides whether the eligible content item is associated with the same brand as an organic result, it is a matter of determining that the content item is sponsored by an owner of the brand.
A combined result (of the paid and the organic results covering the same brand) includes what the patent refers to as “a visual universal resource locator (VisURL).”
That combined item would include:
Text from the paid result
A link to a landing page from the paid result into the combined content item
The combined item may also include other information associated with the brand, such as:
A map to retail locations associated with brand retail presence.
Retail location information associated with the brand.
In addition to the brand owner, the organic result that could be combined might be from a retailer associated with the brand.
It can involve designating content from the sponsored item that is included in the combined content item as sponsored content (so it may show that content from the paid result as being an ad.)
It may also include “monetizing interactions with material that is included from the at least one eligible content item that is included in the combined content item based on user interactions with the material.” Additional items shown could include an image or logo associated with the brand, or one or more products associated with the brand, or combine additional links relevant to the result.
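The steps above can be sketched in Python to show the general shape of the process. This is only an illustration under my own assumptions: the class and field names, the `top_n` cutoff for what counts as a top organic result, the exact-match brand comparison, and the sample brands and URLs are all hypothetical, and the patent's “entity/brand determination engine” is certainly more involved than a string comparison.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class OrganicResult:
    vis_url: str            # the visible URL shown with the organic result
    title: str
    brand: Optional[str]    # None when the result is not tied to a brand


@dataclass
class ContentItem:
    text: str               # text from the paid result
    landing_page: str
    sponsor_brand: str      # brand owned by whoever sponsors the item


def combine_results(organic_results, content_items, top_n=3):
    """Return one combined item, or None when no brand matches."""
    for result in organic_results[:top_n]:
        if result.brand is None:
            continue
        for item in content_items:
            if item.sponsor_brand == result.brand:
                return {
                    "vis_url": result.vis_url,          # from the organic result
                    "title": result.title,              # from the organic result
                    "text": item.text,                  # from the paid result
                    "landing_page": item.landing_page,  # from the paid result
                    "sponsored_parts": ["text", "landing_page"],
                }
    return None


organic = [
    OrganicResult("www.example-reviews.test/shoes", "Best Running Shoes", None),
    OrganicResult("www.exampleshoes.test", "ExampleShoes Official Site", "ExampleShoes"),
]
paid = [
    ContentItem("Hats on sale today", "https://www.examplehats.test/sale", "ExampleHats"),
    ContentItem("20% off running shoes", "https://www.exampleshoes.test/sale", "ExampleShoes"),
]
combined = combine_results(organic, paid)
print(combined)
```

The `sponsored_parts` list stands in for the patent's idea of designating which material in the combined item came from the paid result, so that it can be labeled as sponsored.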
The patent behind this approach of combining paid and organic results was this one, granted in April:
Combining content with a search result
Inventors: Conrad Wai, Christopher Souvey, Lewis Denizen, Gaurav Garg, Awaneesh Verma, Emily Kay Moxley, Jeremy Silber, Daniel Amaral de Medeiros Rocha and Alexander Fischer
Assignee: Google LLC
US Patent: 9,947,026
Granted: April 17, 2018
Filed: May 12, 2016
Methods, systems, and apparatus include computer programs encoded on a computer-readable storage medium, including a method for providing content. A search query is received. Search results responsive to the query are identified, including identifying a first search result in a top set of search results that is associated with a brand. Based at least in part on the query, one or more eligible content items are identified for delivery along with the search results responsive to the query. A determination is made as to when at least one of the eligible content items is associated with a same brand as the brand associated with the first search result. The first search result and one of the determined at least one eligible content items are combined into a combined content item and providing the combined content item as a search result responsive to the request.
The patent does include details on things such as an “entity/brand determination engine,” which can be used to compare paid results with organic results, to see if they cover the same brand. This is one of the changes that indexing things instead of strings is bringing us.
The patent does have many other details, and until Google announces that they are introducing this, I suspect we won’t hear more details from them about it. Then again, they didn’t announce officially that they were merging organic and local results when they started doing that. Don’t be surprised if this becomes available at Google.