I am trying to find out how to analyze raw data of a Gmail email and save its elements (i.e. Body, Subject, Date and attachments). I found code samples for parsing multi-part data which I think can be used.

Raw Gmail Data - See this explanation[^]

What I am looking for is a specific solution for Gmail raw data, which is a complex example of MIME with multiple types of embedded elements (images, HTML / rich text, attachments). My goal is to extract these elements and store them separately. I am not looking for an interactive protocol such as POP3 but for a static approach, meaning that with this raw data alone, one can get the entire email along with its elements, even when offline.

IDE - I am using Visual Studio Ultimate, C++ along with Win32 API.

What I have tried:

For example, this article[^] seems to have the building blocks for parsing such an email. However, I am looking for a solution dedicated to this kind of raw data, as it is quite complex, combining various elements and attachments all in one file (or block of data).


Here is my current code.

// Fragment of a function whose parameters are: LPCSTR szMailId, LPCSTR szMailBody
MIMELIB::CONTENT c;
while ((*szMailBody == ' ') || (*szMailBody == '\r') || (*szMailBody == '\n'))
{
    szMailBody++;
}
char deli[] = "<pre class=\"raw_message_text\" id=\"raw_message_text\">";
szMailBody = strstr(szMailBody, deli);
if (szMailBody == NULL)
    return; // Delimiter not found, nothing to parse
szMailBody += strlen(deli);

CStringA Body = szMailBody;
Body = Body.Left(Body.Find("<//pre><//div><//div><//div><//body><//html>"));
Body = Body.Mid(Body.Find("<html>"));

szMailBody = Body.GetString();
if (c.Parse(szMailBody) != MIMELIB::MIMEERR::OK)
    return;
// Get some headers
auto senderHdr = c.hval("From");
string strDate = c.hval("Date");    // Example Sat, 13 Jan 2018 07:54:39 -0500 (EST)
auto subjectHdr = c.hval("Subject");

auto a1 = c.hval("Content-Type", "boundary");
// Not a multi-part mail if empty
// Then use c.Decode() to get and decode the single part body
if (a1.empty())
    return;
vector<MIMELIB::CONTENT> Contents;
MIMELIB::ParseMultipleContent2(szMailBody, strlen(szMailBody), a1.c_str(), Contents);

for (size_t i = 0; i < Contents.size(); i++)
{
    string type = Contents[i].hval("Content-type");
    vector<char> d = Contents[i].GetData(); // Decodes from Base64 or Quoted-Printable
}
Updated 15-Jan-18 7:33am

In addition to your thread at the mentioned article, which seems to have solved your problem, and so that there is also an answer here:

There is no Gmail-specific "raw" data format. It is the standard format of mail messages as defined by RFC 2822: Internet Message Format[^] and related RFCs such as RFC 2045 to 2049 for the MIME extensions.

Those RFCs contain the necessary information to write a parser.
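
As one small illustration of what those RFCs specify, here is a minimal sketch of a decoder for the quoted-printable transfer encoding described in RFC 2045. The function names `HexValue` and `DecodeQuotedPrintable` are made up for this example and are not part of mimelib. It only handles soft line breaks and `=XY` escapes and passes malformed sequences through unchanged.

```cpp
#include <cstddef>
#include <string>

// Returns the value of a hex digit, or -1 if c is not a hex digit.
static int HexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper (not from mimelib): decodes quoted-printable text.
std::string DecodeQuotedPrintable(const std::string& in)
{
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        if (in[i] != '=')
        {
            out += in[i];
            continue;
        }
        // Soft line break: '=' directly followed by CR-LF is dropped
        if (i + 2 < in.size() && in[i + 1] == '\r' && in[i + 2] == '\n')
        {
            i += 2;
            continue;
        }
        // Be tolerant of a bare LF after '=' as well
        if (i + 1 < in.size() && in[i + 1] == '\n')
        {
            i += 1;
            continue;
        }
        // '=' followed by two hex digits encodes a single byte
        if (i + 2 < in.size())
        {
            int hi = HexValue(in[i + 1]);
            int lo = HexValue(in[i + 2]);
            if (hi >= 0 && lo >= 0)
            {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;
                continue;
            }
        }
        // Malformed sequence: keep the '=' as it is
        out += '=';
    }
    return out;
}
```

Calling `DecodeQuotedPrintable("content=3D\"text/html\"")` would give `content="text/html"`. The result is still in the character set named by the part's Content-Type header, so a charset conversion may be needed afterwards.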

[EDIT]
Example code using the mimelib.h file from the mentioned article. Compiled and tested with VS 2017. Requires /Zc:strictStrings-.

C++
#include "stdafx.h"

#include <windows.h>
#include <WinInet.h>
#include <string>
#include <sstream>
#include <vector>
#include <memory>
#include <intrin.h>

using namespace std;

#include "mimelib.h"

#pragma comment(lib, "crypt32")

MIMELIB::MIMEERR ParsePart(MIMELIB::CONTENT& c, const char* szPart = "")
{
    MIMELIB::MIMEERR merr = MIMELIB::MIMEERR::OK;
    auto boundary = c.hval("Content-Type", "boundary");
    // Single part
    if (boundary.empty())
    {
        std::string strPart = (szPart && *szPart) ? szPart : "1";
        auto typeHdr = c.hval("Content-Type");
        if (typeHdr.empty())
        {
            wprintf(L"Part %hs: Default (single)\n", strPart.c_str());
            typeHdr = "text/plain;";
        }
        else
        {
            wprintf(L"Part %hs: %hs\n", strPart.c_str(), typeHdr.c_str());
        }
        auto fileName = c.hval("Content-Disposition", "filename");
        if (fileName.empty())
        {
            // Create a file name from part and an extension that matches the content type
            std::string ext = "txt";
            auto subTypeS = typeHdr.find('/');
            auto subTypeE = typeHdr.find(';');
            if (subTypeS > 0 && subTypeE > subTypeS)
            {
                subTypeS++;
                ext = typeHdr.substr(subTypeS, subTypeE - subTypeS);
            }
            if (ext == "plain")
                ext = "txt";
            else if (ext == "octet-stream")
                ext = "bin";
            fileName = "Part";
            fileName += strPart;
            fileName += '.';
            fileName += ext;
        }
        // Get the decoded body of the part
        vector<char> partData;
        c.DecodeData(partData);
        // TODO: Decode fileName if it is inline encoded
        FILE *f;
        errno_t err = fopen_s(&f, fileName.c_str(), "wb");
        if (err)
        {
            char errBuf[128];
            strerror_s(errBuf, err);
            fwprintf(stderr, L" Failed to create file %hs: %hs\n", fileName.c_str(), errBuf);
        }
        else
        {
            fwrite(partData.data(), partData.size(), 1, f);
            fclose(f);
            wprintf(L" Saved part to file %hs\n", fileName.c_str());
        }
    }
    else
    {
        // Decoded part of mail (full mail with top level call)
        auto data = c.GetData();
        // Split it into the boundary separated parts 
        vector<MIMELIB::CONTENT> Contents;
        merr = MIMELIB::ParseMultipleContent2(data.data(), data.size(), boundary.c_str(), Contents);
        if (MIMELIB::MIMEERR::OK == merr)
        {
            int part = 1;
            for (auto & cp : Contents)
            {
                std::string strPart;
                if (szPart && *szPart)
                {
                    strPart = szPart;
                    strPart += '.';
                }
                char partBuf[16];
                _itoa_s(part, partBuf, 10);
                strPart += partBuf;
                ParsePart(cp, strPart.c_str());
                ++part;
            }
        }
    }
    return merr;
}

int main(int argc, char *argv[])
{
    if (argc < 2)
    {
        fwprintf(stderr, L"Usage: ParseMail <file>\n");
        return 1;
    }
    struct _stat st;
    if (_stat(argv[1], &st))
    {
        fwprintf(stderr, L"File %hs not found\n", argv[1]);
        return 1;
    }
    FILE *f = NULL;
    errno_t err = fopen_s(&f, argv[1], "rb");
    if (err)
    {
        char errBuf[128];
        strerror_s(errBuf, err);
        fwprintf(stderr, L"File %hs can't be opened: %hs\n", argv[1], errBuf);
        return 1;
    }
    char *buf = new char[st.st_size + 1];
    // Terminate after the bytes actually read (may be fewer than the file size)
    size_t nRead = fread(buf, 1, st.st_size, f);
    buf[nRead] = 0;
    fclose(f);

    MIMELIB::CONTENT c;
    MIMELIB::MIMEERR merr = c.Parse(buf);
    if (merr != MIMELIB::MIMEERR::OK)
    {
        fwprintf(stderr, L"Error parsing mail file %hs\n", argv[1]);
    }
    else
    {
        auto senderHdr = c.hval("From");
        auto dateHdr = c.hval("Date");
        auto subjectHdr = c.hval("Subject");
        wprintf(L"From: %hs\n", senderHdr.c_str());
        wprintf(L"Date: %hs\n", dateHdr.c_str());
        wprintf(L"Subject: %hs\n\n", subjectHdr.c_str());
        merr = ParsePart(c);
    }
    delete[] buf;
    return merr;
}

Example output for a multipart mail:
From: [redacted]
Date: Tue, 26 Sep 2017 09:44:15 +0200
Subject: =?ISO-8859-1?Q?WG=3A_Haftverzichtserkl=E4rung_f=FCr_[...]_Fa=2E_S
iS?=; =?ISO-8859-1?Q?_-_EMB_168_-_12=2E10=2E2017?=

Part 1.1: text/plain; charset="UTF-8"
 Saved part to file Part1.1.txt
Part 1.2: text/html; charset="UTF-8"
 Saved part to file Part1.2.html
Part 2: application/octet-stream; name="HaVerzSiS.pdf"
 Saved part to file HaVerzSiS.pdf
Part 3: image/jpeg; name="Liegeplatz FS EMB.jpg"
 Saved part to file Liegeplatz FS EMB.jpg
[/EDIT]
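
The Subject line in the output above is still in the RFC 2047 "encoded-word" form (`=?charset?Q?...?=`), and the same applies to the file name TODO in the code. A rough sketch for decoding a single Q-encoded word might look like the following. `DecodeQWord` and `HexDigit` are hypothetical helpers, not part of mimelib. The sketch does not handle the B (Base64) variant, does not join multiple words, and returns the raw bytes in the named charset without converting them.

```cpp
#include <cstddef>
#include <string>

// Returns the value of a hex digit, or -1 if c is not a hex digit.
static int HexDigit(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper: decodes one token of the form =?charset?Q?payload?=
// On success, charset receives the charset name and bytes the decoded payload.
bool DecodeQWord(const std::string& word, std::string& charset, std::string& bytes)
{
    if (word.size() < 8 ||
        word.compare(0, 2, "=?") != 0 ||
        word.compare(word.size() - 2, 2, "?=") != 0)
        return false;

    const std::size_t end = word.size() - 2;    // index of the closing "?="
    const std::size_t q1 = word.find('?', 2);   // '?' that ends the charset
    if (q1 == std::string::npos || q1 + 2 >= end)
        return false;
    if ((word[q1 + 1] != 'Q' && word[q1 + 1] != 'q') || word[q1 + 2] != '?')
        return false;

    charset = word.substr(2, q1 - 2);
    bytes.clear();
    for (std::size_t i = q1 + 3; i < end; ++i)
    {
        const char ch = word[i];
        if (ch == '_')
        {
            bytes += ' ';   // underscore stands for a space in Q encoding
        }
        else if (ch == '=' && i + 2 < end &&
                 HexDigit(word[i + 1]) >= 0 && HexDigit(word[i + 2]) >= 0)
        {
            bytes += static_cast<char>(HexDigit(word[i + 1]) * 16 + HexDigit(word[i + 2]));
            i += 2;
        }
        else
        {
            bytes += ch;
        }
    }
    return true;
}
```

For the second word of the Subject shown above, `=?ISO-8859-1?Q?_-_EMB_168_-_12=2E10=2E2017?=`, this gives the charset `ISO-8859-1` and the text ` - EMB 168 - 12.10.2017`.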
 
Comments
Michael Haephrati 9-Jan-18 5:34am    
Unfortunately, the thread correspondence didn't solve my problem. Can you point me to a source code example for parsing this format (RFC 2822: Internet Message Format)?
Jochen Arndt 9-Jan-18 5:59am    
I had to search for one too but "c++ mime parser" gets lots of results. I would look for a well known / often used library.

I also suggest extending the code from the article to find out where the parsing fails. Maybe your mail is not well formed (unlikely, because a mail client would complain about it), the article code is not RFC compliant, or it does not support a specific feature used by your mail.

You might also add the failing mail content to your question (anonymised and with shortened MIME data lines = only first and last line) and/or at the article. Then I and others may have a look.
Michael Haephrati 9-Jan-18 6:02am    
It's not about a specific mail but about mails in general, and I am reading the emails directly from gmail.com. If you use Gmail, you can try with any Gmail message and see.
Jochen Arndt 9-Jan-18 6:24am    
Maybe it is a problem of using the article code and/or getting the mail content. You have to pass it as it is (a std::string). That is: No conversions for line feeds (must be CR-LF) or character encodings (must be 7/8 bit).

For example save the content to a file and check it with a text and/or hex editor. If it looks OK, read the file into a char[] buffer, append a NULL byte and assign that to a std::string. Then it should work. If not, you should ask again in the article forum.
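
A small sketch of that check, assuming standard C++ only: read the file in binary mode so that no line-ending translation happens, then count line feeds that are not preceded by a carriage return. `ReadFileBinary` and `CountBareLineFeeds` are names invented for this example.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Reads the whole file unchanged (binary mode, so CR-LF is not translated).
// Returns false if the file cannot be opened.
bool ReadFileBinary(const std::string& path, std::string& out)
{
    std::ifstream in(path.c_str(), std::ios::binary);
    if (!in)
        return false;
    out.assign(std::istreambuf_iterator<char>(in), std::istreambuf_iterator<char>());
    return true;
}

// Counts LF characters that are not preceded by CR.
// A result other than zero suggests the line ends were converted somewhere.
std::size_t CountBareLineFeeds(const std::string& s)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i)
    {
        if (s[i] == '\n' && (i == 0 || s[i - 1] != '\r'))
            ++count;
    }
    return count;
}
```

If `CountBareLineFeeds` returns a non-zero value for the saved mail, the content was probably altered on its way into the file, and the parser may not find the header/body separator or the part boundaries.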
Michael Haephrati 9-Jan-18 6:58am    
If I save the data into a file and name it test.html, it is shown properly. Can I hire you privately to do such a job?
Here is a copy of a typical Gmail message (interestingly from you):
Delivered-To: anonymousnam AT gmail.com
Received: by 10.2.15.201 with SMTP id 70csp1706370jao;
        Fri, 12 Jan 2018 02:08:02 -0800 (PST)
X-Google-Smtp-Source: ACJfBovDfSZaL48gp2hiXRdWrkQ2fN4ADImypAfgO6nn3bL9YXe9pyOS1NCsj6nejU8n0AFGoP8W
X-Received: by 10.107.140.78 with SMTP id o75mr23661888iod.219.1515751682675;
        Fri, 12 Jan 2018 02:08:02 -0800 (PST)
ARC-Seal: i=1; a=rsa-sha256; t=1515751682; cv=none;
        d=google.com; s=arc-20160816;
        b=Wronqhb0qgSWSapehG2PSI00FvI+Y2o3/MG8O6czKU/v9/9ETX4ObQ6hBP4fzRoAko
         U8mhgMZqhMoIKVv7czqG2g0S/VxBBkmPNUv7JbLZdZISzsO9e46SdfbhSKJMEdrESxtW
         7tankKOzdVFh5kbOX0ZWJrrCO1/a15lEo5MBChjr0apydxskoXq7p2vNmafiC9pqKads
         jNK2+Pkc0Y2OeEfL67Vs8IlXN+u1y2TYn9A8uZbfdNmPL6zq3rJ31v9hDdyXM3p3oTg4
         YP0K1+el4JFRgF4zo0Pyg4gFl62QzIv/SfP12o6ihsOIJZ68eS7PoDHYfZvKQXLySAjH
         2l2w==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816;
        h=content-transfer-encoding:subject:date:to:from:mime-version
         :message-id:arc-authentication-results;
        bh=8sfAQGYMvxQ7PtXZ5Em7odijeBxnxUxWk1qh7LONjxM=;
        b=xHGbZSMYY74D6WFzT2SVpmOjKqALpbjgEopoaeGKE+2mUj77Is+gvHb/Q81aFnvjDY
         0xPKbsKUM6vPYCO9FU9QFumqP/XrYxEVQ5EOzxFk0SGV18QuzLGTkIxTVz97ARYLpnII
         M/gvPlSNX8hUfyzjh/0NsiD/64FMoCmLanRA0aQb+73TUHcKKwIEMhbwgQ9xizvShBKz
         hCMWDz92D6qhTI5Dhhfzjy/xm4j4TTNQxAW5rOgdm5LFg22tgTdkhenGSWUMtgOalstY
         0r4he7xoor6Ut0rT3QBZKr+5kywuyi3XZGVUmXvwKtss3ChDGcL4xqwos4HUzsW6i8YP
         1GFw==
ARC-Authentication-Results: i=1; mx.google.com;
       spf=pass (google.com: domain of anonymousnam AT codeproject.com designates 76.74.234.221 as permitted sender) smtp.mailfrom=anonymousnam AT codeproject.com
Return-Path: <anonymousnam AT odeproject.com>
Received: from mail.notifications.codeproject.com (mail.notifications.codeproject.com. [76.74.234.221])
        by mx.google.com with ESMTPS id l64si3747147iof.279.2018.01.12.02.08.02
        for <anonymousname AT gmail.com>
        (version=TLS1 cipher=AES128-SHA bits=128/128);
        Fri, 12 Jan 2018 02:08:02 -0800 (PST)
Received-SPF: pass (google.com: domain of anonymousname AT codeproject.com designates 76.74.234.221 as permitted sender) client-ip=76.74.234.221;
Authentication-Results: mx.google.com;
       spf=pass (google.com: domain of anonymousname AT codeproject.com designates 76.74.234.221 as permitted sender) smtp.mailfrom=anonymousname AT codeproject.com
Message-Id: <5a588902.43bb6b0a.f549b.f8cdSMTPIN_ADDED_MISSING@mx.google.com>
Received: from CP-WEB2 (cp-web2.codeproject.com [192.168.5.52]) by mail.notifications.codeproject.com (Postfix) with ESMTP id 2FB2E1E0DE8 for <anonymousname AT gmail.com>; Fri, 12 Jan 2018 04:55:19 -0500 (EST)
MIME-Version: 1.0
From: CodeProject Answers <anonymousname AT codeproject.com>
To: anonymousname AT gmail.com>
Date: 12 Jan 2018 05:08:01 -0500
Subject: CodeProject | A reply was posted to your comment
Content-Type: text/html; charset=us-ascii
Content-Transfer-Encoding: quoted-printable

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.=
w3.org/TR/html4/loose.dtd"><html><head>
<meta http-equiv=3D"Content-Type" content=3D"text/html; charset=3Dus-ascii"=
>
<meta name=3D"viewport" content=3D"width=3Ddevice-width">
</head><body style=3D"background-color: white;font-size: 14px; font-family:=
 'Segoe UI', Arial, Helvetica, sans-serif">
<style type=3D"text/css">
body,table,p,div { background-color:white; }
body,table,p,div { background-color:white; }
body, td, p,h1,h2,h6,h3,h4,li,blockquote,div{ font-size:14px; font-family: =
'Segoe UI', Arial, Helvetica, sans-serif; } =20
h1 {font-size: 26px; font-weight: bold; color: #498F00; margin-bottom:5px;m=
argin-top:0px;} =20
h2 { font-size: 24px; font-weight: 500; }
h4 { font-size: 16px; }
h3 {font-size: 11pt; font-weight:bold;} =20
h6 {font-size:6pt;color:#666;margin:0;} =20
table =09=09=09{ width: 100%;} =20
table.themed =09{ background-color:#FAFAFA; } =20
a =09=09=09=09{ text-decoration:none;} =20
a:hover =09=09{ text-decoration:underline;} =20
.tiny-text=09=09{ font-size: 12px; }
.desc =09=09=09{ color:#333333; font-size:12px;}
.themed td  =09{ padding:2px; } =20
.themed .alt-item { background-color:#FEF9E7; } =20
.header =09=09{ font-weight:bold; background-color:#FF9900; vertical-align:=
middle;} =20
.footer =09=09{ font-weight:bold; background-color: #488E00; color:White; v=
ertical-align:middle; }
.signature =09=09{ border-top: solid 1px #CCCCCC; padding-top:0px; margin-t=
op:10px; max-height:150px; overflow:auto;}

.content-list=09=09{ margin-bottom: 17px;}
.content-list-item=09{ margin:     10px 0; }
.doctype img=09=09{ vertical-align:bottom; padding-right:3px;}
.entry=09=09=09    { font-size: 14px; line-height:20px; margin: 0;}
.title=09=09=09    { font-size: 16px; font-weight:500; padding:0; }
.entry=09=09=09=09{ font-size: 14px; color:#666; }
.author, .author a  { font-size: 11px; font-weight:bold; }
.location=09=09    { font-size: 11px; font-weight:bold; color: #999}
.summary            { font-size: 12px; color: #999; padding: 0px 0 10px; }
.theme-fore         { color: #f90; }
.theme-back         { background-color: #f90; }
</style>
<table cellspacing=3D"1" cellpadding=3D"3" class=3D"header" border=3D"0" st=
yle=3D"background-color: #FF9900;width: 100%;font-weight: bold;vertical-ali=
gn: middle"><tbody><tr><td style=3D"font-size: 14px; font-family: 'Segoe UI=
', Arial, Helvetica, sans-serif">
<img border=3D"0" src=3D"https://www.codeproject.com/App_Themes/CodeProject=
/Img/logo225x40.gif" width=3D"225" height=3D"40"></td></tr></tbody></table>

<p style=3D"background-color: white;font-size: 14px; font-family: 'Segoe UI=
', Arial, Helvetica, sans-serif">Michael Haephrati has pos=
ted a reply to your comment about=20
"<a href=3D"https://www.codeproject.com/Answers/1224946/Conversion-to-Unico=
de-Cplusplus-Microsoft-UTF-Nati?cmt=3D969334#cmt969334" style=3D"text-decor=
ation: none">Conversion to Unicode (C++, Microsoft, UTF-16, Native Windows)=
</a>":</p>=20

<blockquote style=3D"font-size: 14px; font-family: 'Segoe UI', Arial, Helve=
tica, sans-serif">Apparently there is no expiration date to questions and w=
hen I looked for unanswered questions, I got here...</blockquote>

<hr class=3D"divider" noshade=3D"noshade" size=3D"1">
<div style=3D"background-color: white;font-size: 14px; font-family: 'Segoe =
UI', Arial, Helvetica, sans-serif"><a href=3D"https://www.codeproject.com" =
style=3D"text-decoration: none">CodeProject</a></div>
<div class=3D"small" style=3D"background-color: white;font-size: 14px; font=
-family: 'Segoe UI', Arial, Helvetica, sans-serif">Note: T=
his message has been sent from an unattended email box.</div>
</body></html>

So you can see that each message header is a field name followed by a colon and a value, and that a long header can continue on following lines that start with whitespace. The message body is separated from the headers by a blank line, and may be rich text, HTML or plain text. The mail RFC defines the standard header field names.
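
To make that structure concrete, here is a minimal sketch that splits such a message into header fields and body. It assumes CR-LF line ends, joins folded continuation lines onto the previous header, and silently skips lines without a colon. `ParsedMessage` and `SplitMessage` are names made up for this example.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

struct ParsedMessage
{
    std::vector<std::pair<std::string, std::string> > headers; // name, value
    std::string body;                                           // still encoded
};

// Hypothetical helper: splits a raw message at the first blank line.
ParsedMessage SplitMessage(const std::string& raw)
{
    ParsedMessage msg;
    std::size_t pos = 0;
    while (pos < raw.size())
    {
        std::size_t eol = raw.find("\r\n", pos);
        const bool found = (eol != std::string::npos);
        if (!found)
            eol = raw.size();
        const std::string line = raw.substr(pos, eol - pos);
        pos = found ? eol + 2 : raw.size();

        if (line.empty())
        {
            // Blank line: everything after it is the body
            msg.body = raw.substr(pos);
            return msg;
        }
        if ((line[0] == ' ' || line[0] == '\t') && !msg.headers.empty())
        {
            // Folded header: append the continuation to the previous value
            msg.headers.back().second += line;
            continue;
        }
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos)
            continue; // not a valid header line, ignored in this sketch
        std::string value = line.substr(colon + 1);
        const std::size_t first = value.find_first_not_of(" \t");
        value = (first == std::string::npos) ? std::string() : value.substr(first);
        msg.headers.push_back(std::make_pair(line.substr(0, colon), value));
    }
    return msg;
}
```

With the message above, the `Content-Type` and `Content-Transfer-Encoding` entries in `headers` tell you how `body` has to be decoded (here quoted-printable HTML).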
 
Comments
Jochen Arndt 12-Jan-18 6:06am    
Now we and mail harvesters run by spammers know your Gmail address.

You should make them both anonymous.
Richard MacCutchan 12-Jan-18 8:58am    
Yes, thanks for noticing.
Michael Haephrati 12-Jan-18 7:01am    
Is this your solution? The question is how such raw data can be converted into the email's ingredients, such as inline photos, attachments, etc.
Richard MacCutchan 12-Jan-18 9:06am    
Images, attachments etc are usually converted to base64 encoding. As mentioned by me and Jochen, all this information is freely available. Perhaps you could show some of your code and explain exactly what your problem is.
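
For illustration only, a minimal Base64 decoder could look like the sketch below. `DecodeBase64` is a name invented for this example. It skips line breaks and other characters outside the Base64 alphabet and stops at the first padding character, which is a simplification of what a strict decoder would do.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: decodes Base64 text into raw bytes.
std::vector<unsigned char> DecodeBase64(const std::string& in)
{
    std::vector<unsigned char> out;
    unsigned int buffer = 0; // collects 6-bit groups
    int bits = 0;            // number of not yet emitted bits in buffer
    for (std::string::size_type i = 0; i < in.size(); ++i)
    {
        const char ch = in[i];
        unsigned int v;
        if (ch >= 'A' && ch <= 'Z')      v = ch - 'A';
        else if (ch >= 'a' && ch <= 'z') v = ch - 'a' + 26;
        else if (ch >= '0' && ch <= '9') v = ch - '0' + 52;
        else if (ch == '+')              v = 62;
        else if (ch == '/')              v = 63;
        else if (ch == '=')              break;    // padding marks the end
        else                             continue; // skip CR, LF, spaces etc.

        buffer = (buffer << 6) | v;
        bits += 6;
        if (bits >= 8)
        {
            bits -= 8;
            out.push_back(static_cast<unsigned char>((buffer >> bits) & 0xFF));
        }
    }
    return out;
}
```

The decoded bytes of a part with `Content-Transfer-Encoding: base64` can then be written to a file in binary mode, using the file name from the part's Content-Disposition header if there is one.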
Michael Haephrati 12-Jan-18 10:52am    
I updated the question with my recent source code. This source code starts at the point where szMailBody contains the email's raw data. Using mimelib (https://www.codeproject.com/KB/cpp/1114232/mimelib.zip ) I am trying to extract all elements (embedded images, HTML body, attachments, etc.) from the raw data, assuming that this can be done statically, i.e. with no need for any interaction with the Gmail server.
The problem: I expect Contents to contain all elements, but it doesn't.

This content, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


