To understand search engine optimization (SEO), it helps to learn how search engines work. Search engines exist to help you find what you’re looking for online. To do so, they evaluate the innumerable sites and web elements that make up the world wide web and determine which sites best match any query.
The web itself is a collection of interlinked pages and resources that users access over a global network: the internet. Of course, you can access these documents directly by visiting the URL of a web page (its web address) in a web browser. But more often, people get to websites through a search engine. For business owners, this offers an invaluable opportunity.
What makes the web work?
Web pages are documents formatted using HTML, a markup language that allows for embedded hyperlinks connecting one page to another. This is the single most important concept for understanding how the web works.
Web pages include content like text, images, forms, videos, hyperlinks, and more. This content is what users are after. You go to a web page to read, watch, listen, or carry out tasks like buying a product or signing up for a newsletter. You navigate using links between pages.
These actions are possible because of the content and markup built into each web page. The structure of the web makes it easy to move from one page to the next, based on what you intend to do.
What is a website?
A website is a collection of web pages that all reside on the same domain and are typically owned and managed by the same organization. The Mailchimp homepage, for example, is accessible through the URL https://mailchimp.com/.
Of this URL, “mailchimp.com” is the domain. When you look at other URLs on this website, you’ll notice that they share the same domain, even though the full URL is different. For example:
- https://mailchimp.com/resources/
- https://mailchimp.com/why-mailchimp/
Mailchimp also uses links to direct visitors to other areas of the website. For example, from the navigation area at the top of each page, you can easily click through to another page on the site. That’s possible through internal links, which are links between pages on the same domain.
The difference between internal and external links
Links to a different domain are external links. (You’ll notice an external link in the author byline at the bottom of this article.)
At the bottom of every page, Mailchimp includes a footer section. This helps visitors navigate to particular pages using both internal and external links. In this case, the external links point to social media profile pages.
Most websites use more internal than external links. Usually, all the pages on a website link to other pages on the same website, creating a miniature web of interlinked documents within the site.
Internal links connect pages that relate to one another and exist on the same domain, but the power of the web has more to do with external links. External links build connections to web pages that exist and operate outside the confines of a single organization. They help form part of the network of billions of pages that exist on the web.
The reasons to use an external link vary. It could be that you include a statistic in an article, and you want to link to the source of the data on another website. This not only adds credibility to what you post, but it also contributes to the expansive network of the web.
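For a concrete sense of the distinction, here is a minimal Python sketch that labels a link as internal or external by comparing its domain to the domain of the page it appears on. The URLs are purely illustrative, and the simple comparison ignores subtleties such as subdomains (for example, www.mailchimp.com versus mailchimp.com).

```python
from urllib.parse import urlparse

def classify_link(page_url: str, link_url: str) -> str:
    """Label a link as internal or external by comparing domains."""
    page_domain = urlparse(page_url).netloc.lower()
    link_domain = urlparse(link_url).netloc.lower()
    # A link with no domain of its own (e.g. "/resources/") is relative,
    # so it stays on the current site and counts as internal.
    if not link_domain or link_domain == page_domain:
        return "internal"
    return "external"

print(classify_link("https://mailchimp.com/", "https://mailchimp.com/resources/"))  # internal
print(classify_link("https://mailchimp.com/", "https://twitter.com/mailchimp"))     # external
```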
What a search engine does
Search engines perform 3 main tasks:
- Crawling
- Indexing
- Delivering search results
In simple terms, crawling is the act of accessing web pages on the internet. Indexing means deriving meaning from the content of those pages and building a searchable record of that content and the relationships between pages. Delivering search results means interpreting a user’s search query, then delivering results from the index that best answer this query.
How crawling works
Crawling URLs is a task carried out by a computer program known as a crawler or a spider. The job of the crawler is to visit web pages and extract the HTML content it finds. One of the primary things a crawler looks for is links.
Every web page has a single unique identifier, its URL. Enter the URL into your browser address bar, and you’ll go to the web page. Web pages themselves consist of content that’s marked up in HTML.
HTML is a machine-readable language, so an external program like a crawler can visit a URL, extract the HTML, and access the content in a structured manner. Importantly, it can differentiate between text and hyperlinks.
When a crawler examines the HTML code for a page like this one, which contains the article you’re reading, it will find that each paragraph is wrapped in a piece of code called the paragraph element, or p-tag, at the beginning and at the end. This identifies a block of paragraph text—the p-tag at the start opens the paragraph element, and the p-tag at the end closes it. Although you don’t see this code unless you inspect the page, the crawler sees it and understands that this page contains text content that’s designed for visitors to read.
Links are also visible to crawlers, and interpreted by them, because of their HTML code. Programmers code links with an anchor element at the beginning and at the end. Links also include an “attribute” that provides the destination of the hyperlink, and “anchor text.” Anchor text is the linked text seen by readers, often displayed in browsers in blue with an underline.
It’s a straightforward task for a crawler to process this block of HTML and separate out the text from the link. However, on a single web page, there’s a lot more than a paragraph and a link. To see this sort of data yourself, visit any web page in your browser, right-click anywhere on the screen, then click “View Source” or “View Page Source.” On most pages, you’ll find hundreds of lines of code.
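As a rough illustration of what that parsing step looks like, here is a short sketch using Python’s built-in html.parser module to separate a paragraph’s readable text from the links it contains. The sample HTML and its destination URL are made up for the example; a real crawler handles far messier markup than this.

```python
from html.parser import HTMLParser

SAMPLE_HTML = (
    '<p>Read the full report on '
    '<a href="https://example.com/report">streaming trends</a> this year.</p>'
)

class PageParser(HTMLParser):
    """Separates visible text from hyperlinks, roughly as a crawler would."""
    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.links = []            # list of (destination URL, anchor text)
        self._current_href = None
        self._anchor_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # The href attribute holds the link's destination.
            self._current_href = dict(attrs).get("href")
            self._anchor_text = []

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href:
            self.links.append((self._current_href, "".join(self._anchor_text)))
            self._current_href = None

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._current_href is not None:
            self._anchor_text.append(data)

parser = PageParser()
parser.feed(SAMPLE_HTML)
print("".join(parser.text_parts))  # the paragraph's readable text
print(parser.links)                # [('https://example.com/report', 'streaming trends')]
```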
For every web page that a crawler encounters, it will parse the HTML, which means it breaks the HTML up into its component parts to process further. The crawler extracts all the links it finds on a given page, then schedules them for crawling. In effect, it builds itself a little feedback loop:
Crawl URL → Find links to URLs → Schedule URLs for crawling → Crawl URL
So you can give a crawler a single URL as a source to start crawling from, and it will keep going until it stops finding new URLs to crawl—this could be thousands or even millions of URLs later.
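A hedged sketch of that feedback loop might look like the following, where fetch_page and extract_links are hypothetical helpers standing in for the downloading and parsing steps described above.

```python
from collections import deque

def crawl(seed_url, fetch_page, extract_links, limit=1000):
    """Breadth-first crawl: fetch a URL, then queue any new links it contains."""
    seen = {seed_url}          # URLs already discovered
    queue = deque([seed_url])  # URLs scheduled for crawling
    pages = {}                 # URL -> raw HTML collected so far

    while queue and len(pages) < limit:
        url = queue.popleft()
        html = fetch_page(url)             # download the page's HTML
        pages[url] = html
        for link in extract_links(html, base_url=url):
            if link not in seen:           # schedule only URLs we haven't seen yet
                seen.add(link)
                queue.append(link)
    return pages
```

The loop stops only when it runs out of new URLs or hits the limit, which is exactly why one seed URL can lead to thousands or millions of crawled pages.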
In short, crawling is a method of discovery. Search engines determine what's out there by sending out web crawlers to find web pages using links as signposts for the next place to look.
This is why internal links on your website are important, as they allow search engine crawlers to discover all the pages on your site. Through external links, they'll discover other websites as they explore the network of interconnected pages that make up the web.
How indexing works
As search engines crawl the web, they build a repository of web pages they find, which they then use to generate their index.
Think about the index you would find in the back of a textbook when you were in school. If you wanted to learn about cellular structure, you could look in the index of a biology book and find the pages on this topic. Indexing web pages works similarly.
An index is useful because it allows for quick searching. Search engines like Google also need a fast way to retrieve information and deliver search results, so indexing is crucial.
Search engines take every web page they crawl and parse the HTML document to separate out all the links. They do this so they can store the destination URL that each link points to, along with the anchor text used. Similarly, they take all the text content found and split this out into a set of word occurrences.
Using this parsed data, they generate an inverted index by assigning the web page URL against each of the words on the page. Once they store a URL this way, it’s indexed. This means it has the potential to be in a set of search results.
For every URL that’s indexed, search engines store as many of these word-URL relationships as they deem relevant, along with the other associated metadata they’ve collected about the page. This is the data they use when determining which URLs show up in their search results.
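The following toy example shows what an inverted index looks like in miniature: each word maps to the set of URLs whose text contains it. The pages and URLs are invented for illustration, and real search engines store far more than this, including word positions, anchor text, and other metadata.

```python
from collections import defaultdict

def build_inverted_index(pages):
    """Map each word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index

pages = {
    "https://example.com/streaming-guide": "the best television streaming services compared",
    "https://example.com/tv-reviews": "television reviews for the best new shows",
}
index = build_inverted_index(pages)
print(index["streaming"])   # only the streaming-guide URL
print(index["television"])  # both URLs
```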
How search results get delivered
Crawling and indexing happen automatically and constantly. The index gets updated in real time. This collection and storage of data runs on its own, in the background—uninfluenced by searchers typing in queries.
However, delivering search results is entirely driven by user input through their search queries. If someone searches “best television streaming service,” the search engine matches each word with documents in its index.
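Reusing the toy inverted index built above, matching a query could look like this sketch, which simply collects every URL containing at least one query word. Real engines combine and weigh terms in far more sophisticated ways.

```python
def match_documents(query, index):
    """Return every indexed URL that contains at least one query word."""
    matches = set()
    for word in query.lower().split():
        matches |= index.get(word, set())
    return matches

print(match_documents("best television streaming service", index))
```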
But simply matching words with indexed pages results in billions of documents, so they need to determine how to show you the best matches first. This is where it gets tricky—and why SEO is important. How do search engines decide, out of billions of potential results, which ones to show? They use a ranking algorithm.
An algorithm is a set of rules that a computer program follows to perform a specific process. A ranking algorithm is really a large number of algorithms and processes, all working in unison.
The ranking algorithm looks for factors like these:
- Do all the words in the search query appear on the page?
- Do certain combinations of the words appear on the page (for example, “best” and “streaming”)?
- Do the words show up in the title of the page?
- Are the words present in the URL of the page?
These are basic examples, and there are hundreds of other factors that the ranking algorithm considers when determining which results to show. These are ranking factors.
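To make that list concrete, here is a deliberately simplified scoring sketch based on the factors above. The weights and the page fields are invented for illustration and bear no resemblance to any real search engine’s ranking algorithm.

```python
def score_page(query, page):
    """Toy relevance score using the basic factors listed above.
    `page` is a dict with 'url', 'title', and 'text' keys (illustrative only)."""
    words = query.lower().split()
    text = page["text"].lower()
    title = page["title"].lower()
    url = page["url"].lower()

    score = 0
    score += sum(2 for w in words if w in text)    # query word appears on the page
    if all(w in text for w in words):              # all query words appear together
        score += 5
    score += sum(3 for w in words if w in title)   # query word appears in the title
    score += sum(1 for w in words if w in url)     # query word appears in the URL
    return score

page = {
    "url": "https://example.com/best-streaming-services",
    "title": "The Best Television Streaming Services",
    "text": "Our guide to the best streaming service for every budget.",
}
print(score_page("best television streaming service", page))
```

A real ranking algorithm weighs hundreds of such signals, but the principle is the same: each candidate page gets a score, and the highest-scoring pages appear first.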
The reason Google became the dominant search engine across the globe is simple—its ranking algorithm was better than the ranking algorithms of its rivals.
Making sense out of complexity
Search engines are extremely complicated systems that process staggering amounts of data every single day. They apply complex algorithms to make sense of that data and satisfy searchers.
Thousands of the world’s best software engineers are working on ever more granular refinements and improvements, which makes companies like Google responsible for advancing some of the most sophisticated technology on the planet.
Technologies like machine learning, artificial intelligence, and natural language processing will continue to have more of an impact on search result delivery. You don’t need to understand all the complexity, but by applying a range of basic best practices, it’s possible to make your website discoverable for the words and phrases that your customers search.