Programming an HTML Downloader in C++

Written By: Alwyn Malachi Berkeley

- 25 Sep 2006 -

Description: This program requests the source code (HTML) of a webpage and then gives the user multiple options for viewing the web server's response. We will use the Winsock API to manipulate sockets on the system, and create reusable code along the way.

  1. Creating the Project
  2. main.cpp Headers and the Internet Namespace
  3. The Webpage Class
  4. The Internet Namespace's Functions
  5. GetDomain
  6. Custom_itoa
  7. GetWebpage
  8. SaveFile
  9. Internet.hpp and Writing main.cpp
  10. Compile and Conclusion

Implementing GetDomain

The GetDomain function has one purpose: to retrieve the domain name from a full URL. For example, if the URL of my personal homepage is entered as "http://www.malachix.com/index.php", then GetDomain will parse the URL and return only "www.malachix.com". Create yet another empty source code file. I'll explain this function in two parts, the top half and the bottom half. The source code for the first part is as follows:

#include <stdexcept> // std::invalid_argument
#include <string>    // std::string
using std::string;

const char* GetDomain(const string &strWebpageURLParam) {
        // extracts the domain from a URL
        // ex: www.malachix.com from http://www.malachix.com/index.php
        
                // we promised not to make changes to strWebpageURLParam so copy the
                // data to a temporary object for the remainder of this function
                string strTempURL = strWebpageURLParam;
 
                // find the double front slashes in the URL
                const string::size_type theDoubleSlashPosition = strTempURL.find("//", 0);
 
                // if there are double slashes in the URL then remove those
                // double slashes and everything before them
                if (theDoubleSlashPosition != string::npos) {
                        // substr with a single argument reads to the end of the string
                        strTempURL = strTempURL.substr(theDoubleSlashPosition + 2);
                }
 
                // since we're eliminating everything before "//" the start position is always 0
                const string::size_type theStartPosition = 0;
 
                // count the number of dots in the URL
                int intDotCount = 0;
                for (string::size_type X = 0; X < strTempURL.length(); X++) {
                        if (strTempURL.at(X) == '.') intDotCount++;
                }
 
                // find the front slash after the domain in the URL
                string::size_type theEndPosition = strTempURL.find_first_of("/", 0);
 
                // if there was no slash found then the user probably
                // typed the domain in by itself...
                if (theEndPosition == string::npos) {
                        // ...so the ending position is the URL's length
                        theEndPosition = strTempURL.length();
                }

The domain is a substring of the URL, so in this top half of the function we gather the information needed to determine where the domain begins and where it ends.

The first thing we need to do is copy the URL to a temporary string variable. We declared the URL parameter as a constant reference, promising not to alter it, so all of the trimming is done on the copy instead.

The second thing we need to do is remove the "http://" prefix from the URL string, because it is easier to find the domain when nothing precedes it. To remove the prefix we first look for the position of "//" in the URL. If the URL starts with "http://", then theDoubleSlashPosition will hold the position where the "//" begins. If there is no "//" in the URL, then theDoubleSlashPosition will hold a special value called string::npos (no position). When the "//" is found, we cut off the "//" and everything before it. We don't write a clause for the case where there is no "//" because in that case there is no prefix to remove.
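
To see the find/npos/substr steps on their own, here is a minimal sketch. StripScheme is a hypothetical helper name used only for this illustration; it is not part of the tutorial's project.

```cpp
#include <string>

// Hypothetical helper (not part of the tutorial's code): returns the URL
// with everything up to and including the first "//" removed.
std::string StripScheme(const std::string &url) {
    const std::string::size_type pos = url.find("//");
    if (pos == std::string::npos) {
        return url;             // no "//" found, so there is nothing to strip
    }
    return url.substr(pos + 2); // skip the two slashes and read to the end
}
```

Called with "http://www.malachix.com/index.php" this gives back "www.malachix.com/index.php", and a string without "//" such as "malachix.com" comes back unchanged.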

The third thing we do is set a variable called theStartPosition. This variable remembers where we will start extracting characters later on in the code. It is always 0 because the "http://" prefix has already been removed by this point, so the domain starts at the very beginning of the string.

The fourth thing we find is the number of dots/periods in the URL. The dot count tells us how the user wrote the URL, for instance whether the "www." was left off.
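
As an aside, the standard library can do the counting for us. The sketch below assumes a hypothetical helper named CountDots and uses std::count from the algorithm header in place of the hand-written loop; the result should be the same.

```cpp
#include <algorithm>
#include <string>

// Hypothetical helper (illustration only): number of '.' characters in text.
int CountDots(const std::string &text) {
    return static_cast<int>(std::count(text.begin(), text.end(), '.'));
}
```

Note that every dot in the string is counted, including any in the file name after the domain, so "www.malachix.com/index.php" counts as 3 dots.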

Lastly, we need to have an end position. The end position is wherever the first "/" appears in the temporary URL variable. The "/" would be the character right after the .com/.net/.org etc. If there is no "/" at all, the user probably typed the domain by itself, so the end position is simply the length of the string.

We now know where to start reading, where to stop reading, and how many dots/periods are in the URL, so we can put that information to use. The second half of the function, which uses the values we gathered, is below:
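
The end-position rule can be tried out in isolation. FindDomainEnd below is a hypothetical helper written just for this illustration, assuming the "http://" prefix has already been stripped from its argument.

```cpp
#include <string>

// Hypothetical helper (illustration only): index of the first '/' in text,
// or text.length() when the string contains no '/' at all.
std::string::size_type FindDomainEnd(const std::string &text) {
    const std::string::size_type pos = text.find('/');
    return (pos == std::string::npos) ? text.length() : pos;
}
```

For "www.malachix.com/index.php" the first "/" sits at index 16, which is exactly the length of "www.malachix.com". For a bare "malachix.com" there is no "/" and the helper falls back to the length, 12.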

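                // Using the dots we can determine that the URL begins with...
                // strDomain is static so that the pointer returned by c_str() is
                // still valid after the function returns; a plain local string
                // would be destroyed on return and leave a dangling pointer.
                // Its contents are overwritten by the next call to GetDomain.
                static string strDomain;
                switch (intDotCount) {
                        case 0: // no dots at all so the URL is invalid
                                throw std::invalid_argument("URL is invalid");
                        case 1: // ...nothing, just the domain, so add the "www."
                                strDomain = "www." + strTempURL.substr(theStartPosition, theEndPosition - theStartPosition);
                                return strDomain.c_str();
                        case 2: // ..."www"
                                strDomain = strTempURL.substr(theStartPosition, theEndPosition - theStartPosition);
                                return strDomain.c_str();
                        default: // must be the domain + some file extension like ".html"
                                strDomain = strTempURL.substr(theStartPosition, theEndPosition - theStartPosition);
                                return strDomain.c_str();
                }
        }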

In this second and last half of the function we are concerned with two things: extracting the domain, and returning it in the format "www.xxxxxxxx.com". Here is how we achieve that.

We use the number of dots we counted earlier to determine how the URL was written. If the URL has no dots in it then it must be an incorrect URL. If the URL has 1 dot in it then it must be formatted like "DomainName.com". If the URL has 2 dots then it must be formatted like "www.DomainName.com". If the URL has more than 2 dots then it must be written like "www.DomainName.com/SomePage.php" or something similar. By knowing these things we can logically determine whether or not we need to add a "www." onto the front of the domain.
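
The decision itself fits in a few lines. The sketch below uses a hypothetical helper named NeedsWwwPrefix that mirrors the dot-count rules described above; it assumes its argument already has the "http://" prefix removed.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>

// Hypothetical helper (illustration only): true when the dot-count rules
// say a "www." should be added to the front of the domain.
bool NeedsWwwPrefix(const std::string &strippedUrl) {
    const auto dots = std::count(strippedUrl.begin(), strippedUrl.end(), '.');
    if (dots == 0) {
        throw std::invalid_argument("URL is invalid"); // no dots at all
    }
    return dots == 1; // only "DomainName.com" is missing its "www."
}
```

One consequence of counting every dot in the string: "malachix.com/index.php" has 2 dots, so under these rules it is treated like "www.DomainName.com" and no "www." is added.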

As for retrieving the domain itself, we already have the values we need. The expression is the same in all three cases: simply read from theStartPosition to theEndPosition.

At this point you should be saying to yourself "WOW, now I get it! The dots, the positions, it all makes sense". Save this source file as "InternetHelperFunctions.cpp" and then move on to the second function.
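
One detail worth keeping in mind: the second argument of string::substr is a character count, not an end position. The two happen to be the same number here because we always start at position 0. A minimal sketch of reading between two positions in general, using a hypothetical helper named Slice:

```cpp
#include <string>

// Hypothetical helper (illustration only): returns the characters of text
// from index start up to, but not including, index end.
std::string Slice(const std::string &text,
                  std::string::size_type start,
                  std::string::size_type end) {
    return text.substr(start, end - start); // convert the end index to a count
}
```

Slice("www.malachix.com/index.php", 0, 16) yields "www.malachix.com". With a non-zero start the difference shows: Slice("abcdef", 2, 4) yields "cd", whereas calling substr(2, 4) directly on "abcdef" yields "cdef".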
