I'm trying to download a web page's HTML body, but my code does not work for every site over HTTP.

www.google.com works.
http://www.Blogger.com does not work.
Blogger.com does not work either: it returns only a notice that the page has moved to http...


C++
#include <winsock2.h>
#include <windows.h>
#include <iostream>
#include <string>
#include <locale>
#include "Web.h"

#pragma comment(lib,"ws2_32.lib")
using namespace std;


string Web::DownloadData(string url)
{
    //open website
    Connect(&url[0u]);

    //format website HTML
    for (size_t i=0; i<website_HTML.length(); ++i)
        website_HTML[i]= tolower(website_HTML[i],local);

    //display HTML
    cout <<website_HTML;

    cout<<"\n\n";

    return website_HTML;
}

void Web::Connect(char *url)
{
    WSADATA wsaData;
    SOCKET Socket;
    SOCKADDR_IN SockAddr;


    int lineCount=0;
    int rowCount=0;

    struct hostent *host;
    const size_t requestSize = 256;
    char *get_http= new char[requestSize];

        // sizeof(get_http) is the size of the pointer, not of the buffer
        memset(get_http, 0, requestSize);
        strcpy(get_http,"GET / HTTP/1.1\r\nHost: ");
        strcat(get_http,url);
        strcat(get_http,"\r\nConnection: close\r\n\r\n");

        if (WSAStartup(MAKEWORD(2,2), &wsaData) != 0)
        {
            cout << "WSAStartup failed.\n";
            system("pause");
            delete[] get_http;
            return;
        }

        Socket=socket(AF_INET,SOCK_STREAM,IPPROTO_TCP);
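        host = gethostbyname(url);
        if (host == NULL)
        {
            // url must be a bare host name such as "www.blogger.com";
            // a leading "http://" makes the lookup fail
            cout << "Could not resolve host.\n";
            closesocket(Socket);
            WSACleanup();
            delete[] get_http;
            return;
        }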

        SockAddr.sin_port=htons(80);
        SockAddr.sin_family=AF_INET;
        SockAddr.sin_addr.s_addr = *((unsigned long*)host->h_addr);

        cout << "Connecting to "<< url<<" ...\n";

        if(connect(Socket,(SOCKADDR*)(&SockAddr),sizeof(SockAddr)) != 0)
        {
            cout << "Could not connect";
            system("pause");
            closesocket(Socket);
            WSACleanup();
            delete[] get_http;
            return;
        }

        cout << "Connected.\n";
        send(Socket,get_http, strlen(get_http),0 );

        char buffer[10000];

        int nDataLength;
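            while ((nDataLength = recv(Socket,buffer,sizeof(buffer),0)) > 0)
            {
                // Only the first nDataLength bytes of buffer are valid after each recv
                for (int i = 0; i < nDataLength; ++i)
                {
                    // keep printable characters and line breaks, skip other control bytes
                    if (buffer[i] >= 32 || buffer[i] == '\n' || buffer[i] == '\r')
                        website_HTML+=buffer[i];
                }
            }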
        closesocket(Socket);
        WSACleanup();

        delete[] get_http;
}


Updated 4-Jul-17 8:03am

1 solution

So the code works and the connection is there but you just don't get what you want.

That is because you are communicating with the server at the raw HTTP level. The received data begins with the HTTP response (a status line and headers, then a blank line, then the body), and you have to parse that response yourself.
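As a rough illustration of what "analysing the response" can look like, here is a minimal sketch that splits the raw text into status code, header block, and body. The names HttpResponse and ParseResponse are hypothetical and not part of the posted Web class. The sketch assumes the whole response is already in one string and that it is parsed before DownloadData lowercases it, since the check for "HTTP/" is case-sensitive. It does not handle chunked transfer encoding.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical helper type: the pieces of a raw HTTP response.
struct HttpResponse
{
    int statusCode;        // e.g. 200 or 301; 0 if the status line is malformed
    std::string headers;   // header lines that follow the status line
    std::string body;      // everything after the blank line
};

// Splits raw response text into status code, header block, and body.
HttpResponse ParseResponse(const std::string &raw)
{
    HttpResponse result;
    result.statusCode = 0;

    // A blank line (CRLF CRLF) ends the header section.
    const std::string::size_type headEnd = raw.find("\r\n\r\n");
    const std::string head = raw.substr(0, headEnd);
    if (headEnd != std::string::npos)
        result.body = raw.substr(headEnd + 4);

    // The first line is the status line, e.g. "HTTP/1.1 301 Moved Permanently".
    const std::string::size_type lineEnd = head.find("\r\n");
    const std::string statusLine = head.substr(0, lineEnd);
    if (lineEnd != std::string::npos)
        result.headers = head.substr(lineEnd + 2);

    // The status code is the number after the first space.
    const std::string::size_type space = statusLine.find(' ');
    if (statusLine.compare(0, 5, "HTTP/") == 0 && space != std::string::npos)
        result.statusCode = std::atoi(statusLine.c_str() + space + 1);

    return result;
}
```

With something like this, Connect could store only the body in website_HTML when the status code is 200 and report anything else to the caller.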

If you don't want to do that yourself you have to use a library that does it for you (a web client library).

See also URL redirection - Wikipedia. It explains what is happening and shows an example of such a redirection with the HTTP header and the HTML content.
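To act on such a redirection, the client has to notice the 3xx status code and read the new address from the Location header. The following is only a sketch under a few assumptions: FindHeader and IsRedirect are hypothetical helper names, the header text is passed in as one string with CRLF line endings, and folded (multi-line) headers are ignored.

```cpp
#include <cctype>
#include <string>

// Returns the value of the named header, or "" if it is absent.
// Header names are compared case-insensitively.
std::string FindHeader(const std::string &headers, const std::string &name)
{
    std::string::size_type pos = 0;
    while (pos < headers.size())
    {
        // Cut out the next line.
        std::string::size_type end = headers.find("\r\n", pos);
        if (end == std::string::npos)
            end = headers.size();
        const std::string line = headers.substr(pos, end - pos);
        pos = end + 2;

        // The header name is everything before the first colon.
        const std::string::size_type colon = line.find(':');
        if (colon != name.size())
            continue;

        bool match = true;
        for (std::string::size_type i = 0; i < colon && match; ++i)
            match = std::tolower(static_cast<unsigned char>(line[i])) ==
                    std::tolower(static_cast<unsigned char>(name[i]));
        if (!match)
            continue;

        // Trim spaces and tabs around the value.
        const std::string::size_type first = line.find_first_not_of(" \t", colon + 1);
        if (first == std::string::npos)
            return "";
        const std::string::size_type last = line.find_last_not_of(" \t");
        return line.substr(first, last - first + 1);
    }
    return "";
}

// True for the status codes that announce a new address in the Location header.
bool IsRedirect(int statusCode)
{
    return statusCode == 301 || statusCode == 302 || statusCode == 303 ||
           statusCode == 307 || statusCode == 308;
}
```

If IsRedirect returns true, FindHeader(headers, "Location") gives the address to request next. Note that when that address starts with https://, a plain socket on port 80 is not enough, because the connection then needs TLS; that is one more reason to consider a web client library.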
This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)