HTML, a Rambling Introduction

HTML, or Hypertext Markup Language, is the format for describing the content and structure of a web page. At least in theory. Over time, it has haphazardly gained all sorts of syntax to govern the appearance of the web page (fonts, colors, and so on) that should probably be specified in CSS (Cascading Style Sheets) instead. When writing a web page from scratch, it's still common to use the appearance functionality; when it does the trick, it's often more convenient than CSS for relatively simple web sites.

You don't need to know very much HTML to use d3 effectively, but it's very helpful to know some of the history that explains the mish-mash of HTML features. This page therefore introduces bits and pieces of HTML in chronological order, interspersed with some history. For a more complete tutorial that focuses just on syntax, try the W3Schools tutorial. There are also lots of other good tutorials around the web.

Text-Only Browsers and Early HTML

The World Wide Web was created by Tim Berners-Lee at CERN in 1989. CERN does high-energy particle physics research, which produces reams of data, and therefore reams of documentation. The idea was to share this data within the academic community, and make it easy to navigate via hyperlinks. At the time, the internet was strictly academic (commercial use was forbidden until 1994), and strictly text-based. The first widely available web browser was text-only and looked like this:

See the little numbers interspersed through the text (e.g. [8])? Those are hyperlinks; type the number to follow them. Needless to say, there wasn't much formatting you could do in a text browser: give the document a title, specify paragraphs and line breaks, and indicate that some text should be emphasized in a couple of different ways. Most terminals supported bold and underline, but variations existed, so it made sense to indicate text should be emphasized, but not how it should be emphasized. The sample below is a slightly modernized version of early HTML, so that it will display properly in modern browsers:

<html>
  <head>
    <title>Sample Web Page</title>
  </head>
  <body>
    Here is some text, followed by a paragraph break.<p>
    This is a <a href="https://home.cern/science/computing/birth-web">link to the CERN page on the birth of the web</a>.
  </body>
</html>

The basic structure of HTML is still the same: it's just a plain text file that contains tags (the things between <> characters) to specify structure and formatting. Note that HTML tags are not case sensitive. Some people prefer to write them lowercase, while others prefer to write them uppercase. In most cases, an opening tag has a corresponding closing tag that begins with a / (e.g. <head> and </head>). Here's a list of the tags in the previous example, and what each of them means:

<html>...</html>: Every HTML file starts with an opening <html> and ends with a closing </html> tag. Everything else goes between the two. (Technically, this isn't quite right. You'll often see web pages that start with a <!DOCTYPE ...> tag before the opening <html> that specifies the version of HTML used by the page.)
<head>...</head>: This section specifies the "header" of a web page, as opposed to the content of it. You should always have a head section, and it should always contain a title tag, since that title will show up in the browser's title bar. It's also common to add tags to the header that tell the browser which CSS and javascript files to load and use. Web browsers load all the files referenced in the head section before trying to display the content in the body section, so specifying CSS and javascript files in the head ensures they are ready to go when the browser tries to display the page. Often this is a good thing, although current "best practice" is to put most script tags at the end of the body. (You'll encounter both; some scripts only work properly if they're loaded in the header; some others only work properly if they're loaded last, at the end of the body.)
<title>...</title>: This is the title of your web page. It shows up in your browser's title bar or tab bar, and in search results. It does not show up on the web page itself.
<body>...</body>: The content of your web page goes between the body tags.
<p>: This specifies the end of a paragraph. In keeping with the "specify structure, not appearance" philosophy, it would be up to the browser to decide whether to indent the first line of each paragraph, put a blank line between paragraphs, or do something else. Note that you can't specify paragraph breaks by using blank lines in the HTML file because web browsers ignore the amount of white space in HTML files just like most programming languages do. You use white space and indentation to make the HTML file itself more readable, but it doesn't affect the page's appearance. To be consistent with the philosophy of matching open and close tags, it's common in modern HTML to start a paragraph with a <p> tag and close it with a </p> tag, but browsers don't require that.
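
As a minimal sketch of the last two points (the title and text here are made up for illustration), the page below uses paired <p> and </p> tags and deliberately messy white space. A browser should display two paragraphs, each as a single run of text with single spaces between the words:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Paragraph Sample</title>
  </head>
  <body>
    <!-- Runs of spaces and line breaks in the source collapse to single spaces. -->
    <p>This    paragraph   has   extra   spaces
       and a line break in the source, but it is displayed
       as ordinary text.</p>

    <!-- The p tags separate the paragraphs; the blank line above does not. -->
    <p>This is a second paragraph.</p>
  </body>
</html>
```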

You can find the original list of HTML tags here. It includes various types of lists that remain quite useful, but does not include the html, head, and body tags. Those came a bit later as web pages became larger and more complex. The key point is that all the original tags specify document structure, not appearance. This makes sense, since the appearance was necessarily dictated by the capabilities of text-only terminals.

Graphical Web Browsers

Sharing hyperlinked text documents was a great idea, and the web took off pretty rapidly. Graphical UNIX workstations were also becoming more common, and within a few years it became conceivable (although hard) to connect Windows PCs and Macs to the internet. So people started experimenting with graphical web browsers. The first really successful one was NCSA Mosaic in 1993, developed by Marc Andreessen and Eric Bina. (NCSA is the National Center for Supercomputing Applications, based at UIUC. One of the main backbones of the early internet was NSFNet, which was created to allow researchers to use supercomputers remotely, without traveling to Urbana-Champaign or somewhere else. So NCSA was a major driver of the internet in the early days.)

Mosaic added various tags to control the appearance of web pages rather than the structure, such as the <i> and <b> tags for italic and bold. Later on, it added equivalent <em> (emphasis) and <strong> tags as appearance-agnostic specifiers of emphasis. In text browsers, i and em usually show up as underlined, since most terminals didn't support italics. See how an incoherent mess is starting to form?
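
To make the distinction concrete, here is a small illustrative fragment (the wording is invented for this example) that uses both kinds of tag. In a typical graphical browser the two lines look the same, which is part of why the overlap is confusing:

```html
<!-- Appearance-specific tags: the author dictates bold and italic. -->
This is <b>bold</b> and this is <i>italic</i>.<p>
<!-- Appearance-agnostic tags: the browser decides how to show the emphasis. -->
This is <strong>strongly emphasized</strong> and this is <em>emphasized</em>.
```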

A few of Mosaic's important features were support for Windows and Mac, support for images via the <img> tag, and eventually tables. For instance, here's the tag for the image above:

<img src="https://upload.wikimedia.org/wikipedia/en/b/b7/NCSA_Mosaic.PNG">

Just like a, the img tag uses an attribute to specify the location of the image file. If you look at the W3Schools documentation, you'll see there are other attributes you can use as well, for example to specify the image's width and height. Just like <p>, you don't need a closing tag, but you can use /> instead of just > if you want to be pedantic.
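
As a sketch of those extra attributes, here is the same tag with a width, a height, and alternative text added. The specific values are assumptions chosen for illustration, not the image's real dimensions:

```html
<!-- width and height are in pixels; alt is the text shown if the image can't be displayed. -->
<img src="https://upload.wikimedia.org/wikipedia/en/b/b7/NCSA_Mosaic.PNG"
     alt="Screenshot of the NCSA Mosaic browser"
     width="400" height="300">
```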

The Browser Wars

Andreessen left NCSA in short order, to co-found Netscape with Jim Clark. In 1994, they released Netscape Navigator (generally just called netscape), which was a newer, better browser than Mosaic. 1994 is also about when wired internet access started to become common in college dorms so that accessing web pages with graphics was feasible, and when it became fairly common for students to have PCs in their dorm rooms that were powerful enough to connect to the internet and display graphical web pages. It's also when the internet opened up to commercial use and started to become widely accessible from home, albeit via slow, dial-up modems.

Netscape 1.0 couldn't do much more than Mosaic could, but it was easier to install and more stable. And Netscape (the corporation) could start hyping the concept that everything could run inside the web browser. Not at all true at the time, but it was a direct challenge to Microsoft's Windows monopoly. If all your programs ran inside netscape, you could use any computer that could run netscape. It wouldn't need to be a Windows machine, and that made Netscape a competitive threat.

The upshot? Microsoft released Internet Explorer. Its aim wasn't to be good, just to displace netscape. Ship it with Windows, make it the default browser, and make it capable of displaying MS Word files. With luck, HTML, the web, and netscape wouldn't take hold. Cue the browser wars.

IE 1.0 didn't have much impact, but Netscape continued developing their browser and rapidly adding new tags to enrich HTML. The goal here was to enrich the newly possible graphical aspects of web browsers, but also to have HTML tags that only worked right in netscape. Microsoft responded in kind. Here's an article from 1997 about the browser wars' impact on HTML. Very few of these tags were added because they were a good idea; they were added to be something the other browser didn't have.

Why no HTML examples in this section? Because most of the tags and attributes are now considered cruft that should not be used. CSS has supplanted them. A few, like the universally reviled blink tag, don't even work any longer.

Browser Wars Pt. 2: Javascript, IE4, and the Document Object Model

One of netscape's gimmicky, useless features was a programming language called javascript. You could use javascript to make scrolling text or pull-down menus that didn't work very well. This was a marketing feature, not a well-planned feature with a clear use case, and it was created in 10 days. Microsoft responded with (crummy) javascript support in IE, along with a competing Visual Basic-derived language called VBScript. Microsoft was also the first to introduce support for Cascading Style Sheets (CSS), to once again separate appearance from structure, and for the document object model (DOM), to allow javascript and VBScript to modify a web page's contents. None of it worked very well, but contemporary versions of netscape were even worse; by 2000, Microsoft had won the browser wars.

It took many years, and the release of Google Chrome, to make most of these fancy new features reliably usable, but now "good" HTML might look something like this:

<html>
  <head>
    <title>Sample Modern Web Page</title>
    <link rel="stylesheet" href="URL1" />
    <link rel="stylesheet" href="URL2" />
    <script src="URL3"></script>
    <script src="URL4"></script>
  </head>
  <body>
    <h1 class="some_header_class" style="font-weight: bold">
      This is a header to which the "some_header_class" style will be applied and that will also be displayed in bold
    </h1>

    This is <span id="a_span" class="some_class some_other_class">some text to which the "some_class" and
      "some_other_class" styles will be applied, and which can be accessed from javascript and CSS by the id "a_span"
      that is unique to this tag.</span>

    This is <div id="a_div" class="some_class some_other_class">a block of text to which the same styles will be applied,
      and which can be accessed by the unique id "a_div".  Because this is a div and not a span, there will be a line break
      between "This is" and "a block" unless the styles that are applied override the default.</div>

    <invented_tag class="yet_another_class" style="color: red">
      You can use any tag you want, even if it isn't part of HTML.  The browser will largely ignore these tags,
      but you can still give them an id, a class, a style, and any other attributes you wish.  Styles will be
      applied based on the tag name, id, classes, and style attribute, and you will be able to manipulate the tag
      and its attributes from javascript via the DOM.
    </invented_tag>
  </body>
</html>

Lots of existing tags remain useful in modern HTML because they give convenient defaults for a document's appearance and structure. But a few tags and attributes have gained outsize importance:

<link rel="stylesheet" href="..." />: This tag specifies CSS stylesheets that the browser should load and apply to the document. It's common for a web page to load and use several different stylesheets to control various aspects of the appearance. You might modularize your own styles into several files, but you will often pull in style sheets from elsewhere on the web. For instance, Bootstrap is a library to create attractive buttons and other interface elements. For the most part it is implemented as very fancy style sheets. Your web page would include links to one or several bootstrap CSS files, and in addition you might link to a CSS file containing your own formatting instructions. You can also embed your CSS style rules directly in your HTML file using the <style> tag, but putting them in a separate file is usually cleaner. <link> tags usually go in the HEAD section of the HTML file so they are loaded before the browser tries to display the web page. After all, if your style files aren't downloaded and ready to go, your web page will look truly crappy.
style="..." attribute: You can add a style attribute to any tag to directly specify some formatting to apply to that tag in CSS syntax. This is useful if you need to specify one or two simple things for that tag only. For any formatting that is more complicated, or that you might apply to more than one tag, it's wise to separate it into a CSS file.
<script src="..."></script>: This tag says to load and execute a javascript file. It is often in the HEAD section as well, because any references in the BODY to this code will break if the code isn't already loaded. It's common to have multiple script tags as well. For instance, you are likely to have one tag to load d3, and another to load your own javascript file that uses d3. The explicit closing </script> tag is important here: unlike <link>, a script tag cannot be self-closed with /> in HTML, because you can also include javascript code directly between open <script> and close </script> tags. If you write <script src="..." />, the browser ignores the slash, keeps looking for a closing script tag, and treats everything up to it as script content. As with the style tag, it only makes sense to directly embed your javascript code inside the HTML file if it is extremely short.
<span> and <div>: Both of these tags just specify a section of your HTML file to which you can apply styles and that you can manipulate from javascript via the DOM. They are invisible on their own. By default, div is treated as a separate block, which means you'll typically get a line break before it. With span, you won't get the line break. You can override that behavior in either case through CSS. spans and divs are extremely useful with d3, because you will need to create placeholders in your web page where you can insert and modify visualizations. These are usually the tags to use for that purpose.
id attribute: You can add an id attribute to any tag to give it a unique name. No two tags on the same page should ever have the same id. You'll use this to directly refer to individual tags in both CSS and javascript to control them. The attribute is optional: you only need to use it for tags that you will need to reference from somewhere else.
class attribute: You can add a class attribute to any tag to specify a space-separated list of classes to which it belongs. Many tags can have the same class, and individual tags can have many classes. You can define CSS styles that apply to all tags that are part of a particular class, and you can reference all tags belonging to a particular class from javascript.
Custom tags: The browser will more or less ignore any tags it doesn't recognize, but still let you access them from CSS and javascript. In other words, tags that aren't part of HTML will be treated more or less like <span> tags: they are displayed inline, with no line break, unless your CSS says otherwise. Since you can apply style and manipulate HTML via tag names as well as ids and classes, this lets you create user-defined tags that improve readability. For instance, you might define a <navbar>...</navbar> tag; through judicious use of CSS and javascript, you can manipulate this snippet of HTML to look and behave like a navigation bar. This isn't central to the way d3 works, but more general web application frameworks like React, Vue, and Angular are built around the concept.
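
To tie the list together, here is a minimal sketch of how tag names, ids, and classes are referenced from CSS and javascript. It reuses the names from the sample page above; the style rules and the text are assumptions made up for this illustration. The CSS and javascript are embedded directly in the file only because they are so short:

```html
<!DOCTYPE html>
<html>
  <head>
    <title>Selector Sample</title>
    <style>
      /* Tag name selector: applies to every h1 on the page. */
      h1 { font-family: sans-serif; }
      /* Class selector: applies to every tag whose class list includes some_class. */
      .some_class { color: navy; }
      /* Id selector: applies only to the one tag with id="a_div". */
      #a_div { border: 1px solid gray; }
      /* Tags that aren't part of HTML can be styled by name too. */
      invented_tag { display: block; color: red; }
    </style>
  </head>
  <body>
    <h1>Header</h1>
    <span id="a_span" class="some_class">A span</span>
    <div id="a_div" class="some_class some_other_class">A div</div>
    <invented_tag>An invented tag</invented_tag>

    <!-- This script sits at the end of the body so the tags above already exist when it runs. -->
    <script>
      // Look up a single tag by its unique id and change its text.
      document.getElementById("a_span").textContent = "Text set from javascript";
      // Look up every tag with a given class and add a tooltip to each one.
      document.querySelectorAll(".some_class").forEach(function (tag) {
        tag.setAttribute("title", "has some_class");
      });
    </script>
  </body>
</html>
```

The same lookups by id, class, and tag name are what you would use to find the span or div that serves as a placeholder for a visualization.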

Chrome, GMail, and Google Maps

By 2004, Microsoft had not only won the browser wars, it had left IE to rot at version 6. It did not support recent HTML, CSS, or javascript standards very well at all, its javascript was slow, and it was riddled with security holes. Mozilla Firefox, the successor to netscape, was better but imperfect in all respects. So was Safari, which Apple released in 2003. But no complex web page would look and behave identically in the three browsers, and IE was the least functional but unavoidable. In the midst of this, Google released GMail, and a year later Google Maps. Both were genuinely interactive, responsive web applications of the sort Netscape was touting 10 years earlier. Opening a new e-mail or scrolling a map without loading a full new web page was revolutionary. Dragging the map to move it rather than clicking scroll buttons was even more amazing.

Suddenly web applications were a reality, which helped force a reckoning among browser makers. Microsoft started losing market share and realized it had to fix IE, and all the browser makers eventually accepted they had to agree on and implement common standards for HTML, CSS, and javascript. Then, in 2008, Google released Chrome, based on the same HTML rendering engine as Safari (the two have since diverged), but with a far faster javascript engine. The complexity of web applications like GMail and Google Maps could suddenly go up dramatically because the javascript code executed so much faster. This pushed further improvements to all the browsers, and lots of work to improve javascript.

In terms of HTML, not much has changed in the last 20 years. But the way it is used has. Even though nearly all the old tags and attributes still work, modern web design relies much more heavily on simple, structural elements, like div tags, combined with CSS for formatting and javascript for interactivity.
