Scrape Emails From Gmail via Drush

The following steps scrape email addresses from a Gmail account. We first archive all emails into an .mbox file. Then we run our custom Drush command to scrape all addresses into a CSV file. Finally, we use a tool such as BriteVerify.com to keep only the valid addresses.

Archive All Emails

Go to Google Takeout and create an archive (.mbox file) of your Gmail account.

Scrap with Drush

Once you have the .mbox archive, run the following Drush command:

drush scrap-email --file-name=path/to/name-of-file.mbox

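This creates a CSV file in your current directory containing all the scraped addresses. The file is named output-emails-<date>.csv, where the date is the current month, day and two-digit year.
See the Appendix below for the scrap-email Drush command, so you can install it on your own machine.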
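If you want a quick cross-check of what the command should find before installing it, the same regular expression can be approximated with grep. This is only a minimal sketch, assuming a grep that supports -E and -o (GNU and BSD grep both do). The function name extract_emails is made up for illustration, and the sketch applies none of the filtering done by the Drush command.

```shell
# Print the unique email-like strings found in an mbox file.
# The pattern is the one used by the Drush command in the Appendix.
extract_emails() {
    grep -Eo '[a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+' "$1" | sort -u
}
```

Calling extract_emails path/to/name-of-file.mbox prints one address per line, which you can compare against the CSV file produced by Drush.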

Verify Emails

Our custom Drush command is not perfect and picks up some bad addresses. To clean out the bad emails, we used BriteVerify.

Appendix

Here is the full Drush command for scraping emails. Put it in a file named scrap.drush.inc. For installation instructions, see the post ‘Implementing Custom Drush Commands’.


<?php
// Same as error_reporting(E_ALL);
//ini_set('error_reporting', E_ALL);
ini_set('memory_limit', '850M');
set_time_limit(0);

function scrap_drush_command()
{
    $items = array();
    // The hyphenated command name still maps to the drush_scrap_email() callback.
    $items['scrap-email'] = array(
        'description' => "Scrapes all emails from a Google archive (.mbox) and stores them in output-emails-<date>.csv in the current dir",
        'arguments' => array(),
        'options' => array(
            'file-name' => 'Path to the Google archive file (.mbox). It can be relative to the current dir',
        ),
        'examples' => array(
            'drush scrap-email --file-name=my-gmail.mbox' => 'Scrapes all emails from my-gmail.mbox and stores them in output-emails-<date>.csv in the current dir',
        ),
        'aliases' => array('semail'),
        'bootstrap' => DRUSH_BOOTSTRAP_DRUSH, // No bootstrap at all.
    );
    return $items;
}

function drush_scrap_email()
{
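    $filepath = drush_get_option('file-name');
    if (empty($filepath)) {
        drush_die('Missing required option --file-name', 0);
    }
    // is_file() rejects directories, which file_exists() would accept.
    if (!is_file($filepath)) {
        // Fall back to a path relative to the current directory.
        $filepath = getcwd() . '/' . $filepath;
        if (!is_file($filepath)) {
            drush_die('File - ' . $filepath . ' doesn\'t exist', 0);
        }
    }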

    drush_log('begin scraping...', 'ok');

    // Read the whole archive into memory (this is why memory_limit is raised above).
    $f = fopen($filepath, 'rb') or die("Couldn't get handle for " . $filepath);
    $data = '';
    while (!feof($f)) {
        $data .= fgets($f, 4096);
    }
    fclose($f);

    drush_log('done reading string of size: ' . mb_strlen($data, '8bit') . '... start searching','ok');

    $pattern = "/([a-zA-Z0-9._-]+@[a-zA-Z0-9._-]+\.[a-zA-Z0-9._-]+)/";
    preg_match_all($pattern, $data, $matches);

    $all_emails = array_unique(array_values($matches[0]));
    $all_emails_filtered = array_filter($all_emails, 'filter_bad_emails');

    print_r($all_emails_filtered);
    drush_log('Count:' . count($all_emails_filtered),'ok');

    drush_log('writing...','ok');

    $date = date('m-j-y');
    $filename = 'output-emails-'.$date.'.csv';
    $filepath = getcwd() . '/' . $filename;

    $file = fopen($filepath, "w") or die("Couldn't get handle for " . $filepath);
    if ($file) {
        foreach($all_emails_filtered as $email){
            fputcsv($file, array($email));
        }
    }

    fclose($file);
    drush_print('done');
}

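// Callback for array_filter(): returns true for addresses worth keeping.
function filter_bad_emails($email)
{
    $char = $email[0];
    $email_tokens = explode('@', $email);
    $domain_name = array_pop($email_tokens);
    $ext_tokens = explode('.', $domain_name);
    $ext = array_pop($ext_tokens);

    // Addresses starting with punctuation or a digit are treated as junk.
    $bad_start = ($char == '-' || $char == '_' || $char == '.' || is_numeric($char));
    // Extension rules: too long, numeric, or the junk extensions 'c' and 'n'.
    $bad_ext = (strlen($ext) > 4 || is_numeric($ext) || $ext == 'c' || $ext == 'n');
    // Overly long matches and anything at mail.gmail.com are also dropped.
    $bad_other = (strlen($email) > 30 || $domain_name == 'mail.gmail.com');

    return !($bad_start || $bad_ext || $bad_other);
}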

Locally Untracked Files in Git While Making a Grails Plugin In-Place

At my current workplace, the Grails application is split into an application and a separate plugin. To avoid running package and maven-install every time a change in the plugin needs to be visible in the application, I decided to make the plugin in-place. To do so, I had to update BuildConfig.groovy as follows:

...
grails.plugin.location.nameOfPluginPlugin = '../../pluginLocation'
...
plugins {
    ...
    //uncomment plugin dependency
}
...

At the same time, I don’t want these changes to end up in other developers’ local repos, so I wanted to untrack the BuildConfig.groovy file. Here are the steps to untrack a file in a Git repo.

Ignoring File in Git

To ignore a file globally, follow these steps:

Step-1: Create a global Git ignore file that is user-specific and not tracked in the repository

git config --global core.excludesfile pathTo/.gitignore_global
 

Note: You can find different sample git ignore files per technology here

Step-2: Add BuildConfig.groovy to your global Git ignore file as follows

...
/relative/pathToApp/BuildConfig.groovy
...

This sets up the ignore rule, but because the file is already tracked, we still need to untrack it before the rule takes effect

Step-3: Untrack the file itself since it is already tracked

git rm --cached BuildConfig.groovy

This makes the file both ignored and untracked. However, committing and pushing this change removes the file from the shared repo for everyone else as well, which we don’t want

Ignoring vs Untracked

Ignore rules apply only to untracked files. In our case the file must stay tracked, so the ignoring approach described above will not work. Instead, we tell Git to assume the file is unchanged:

git update-index --assume-unchanged appname/grails-app/conf/BuildConfig.groovy  

With ‘assume unchanged’ set, Git stops checking the file for modifications, so my local edits no longer show up as changes in the repo. This makes the Grails plugin in-place only for me, since my changes are no longer detected and thus are never pushed to the shared main repository
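Since the flag is invisible in normal use, it helps to be able to list the flagged files and to clear the flag again when you really do need to commit a change to BuildConfig.groovy. A minimal sketch, to be run inside the repository: the two helper names are hypothetical, while git ls-files -v and git update-index --no-assume-unchanged are standard Git commands.

```shell
# List files flagged assume-unchanged.
# 'git ls-files -v' marks them with a lowercase status letter.
list_assumed_unchanged() {
    git ls-files -v | grep '^[[:lower:]]' | cut -c3-
}

# Clear the flag so Git notices changes to the file again.
track_changes_again() {
    git update-index --no-assume-unchanged "$1"
}
```

For example, track_changes_again appname/grails-app/conf/BuildConfig.groovy undoes the command shown above.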


Moving from Mercurial and BitBucket to Git and GitHub

While skeptical at first, after trying it out, GitHub has become my favorite version control service. This post records the steps I took to move from Mercurial and BitBucket to Git and GitHub while keeping history

Step 1 – Set up password caching by following the article ‘Set Up Git’

Step 2 – Install the git-hg module. It clones a Mercurial repository from BitBucket and converts it to a Git repository that you can push to GitHub.com

Followed article ‘Moving Your Mercurial To Rep To Git’

Step 3 – Clone the existing BitBucket repository and push it to GitHub as follows:

git-hg clone http://bitbucket.org/some/repo name-git-repo
cd name-git-repo
git remote add origin http://github.com/some/repo.git
git push origin master

Potential Issues

1. “fatal: remote part of refspec is not a valid name in .”
This happens when you have created a new branch that you would like to push to the GitHub repo, while the GitHub repo doesn’t contain the new branch yet. Solution:

git config push.default current

This changes the default ‘push’ behavior to push the current branch to a branch of the same name, creating it in the GitHub repo if it doesn’t exist. Here are the other options:

  • nothing : Do not push anything
  • matching : Push all matching branches (the default before Git 2.0; newer versions default to simple)
  • tracking : Push the current branch to whatever it is tracking (deprecated synonym for upstream)
  • current : Push the current branch to a branch of the same name


Shelving Uncommitted Changes in Mercurial

There may come a time when you realize that the changes you just made (uncommitted changes) need to be on a separate branch. That happened to me at my new workplace a few days ago. My new workplace uses branching extensively to ensure that code review takes place. Every feature or defect is done on a separate branch. After the work is completed, the developer initiates a pull request via BitBucket, so that another developer approves the changes before they are merged into the main trunk. For my first changes I forgot this and completed my work on the main trunk, only to learn later that it had to be on a separate branch.

The shelving feature in Mercurial came to my rescue. I was surprised to learn that a task that could have been time-consuming and challenging was accomplished so easily thanks to shelving. This post covers how to install the shelve extension and then use it to move uncommitted changes to a different branch.

Install Extension – hgshelve

You may not need this step, because some IDEs such as IntelliJ come with the hgshelve extension preinstalled and configured. However, if like me you prefer the Mercurial command line (hg), you have to install it yourself.

1. Clone the extension – hgshelve

>hg clone ssh://hg@bitbucket.org/tksoh/hgshelve

2. Configure the extension – hgshelve

To do so, update the ‘extensions’ section of your global Mercurial config file, .hgrc, in your home directory as follows:

[extensions]
 hgshelve=/path_to_dir_cloned_above_step_1/hgshelve.py

This completes the installation and configuration of the hgshelve extension. Now you can take advantage of the power that shelving brings

Using Shelve

1. Shelve Uncommitted Changes

First, I shelve my uncommitted changes while on the main branch by running:

>hg shelve --name DE4653_DE3847_Shelved

Once started, Mercurial asks you to confirm each change being added to the shelf, something like:

>$ hg shelve my_dir/my_file.ext
>examine changes to 'my_dir/my_file.ext'? [Ynsfdaq?]

The options – [Ynsfdaq?] stand for:

y - shelve this change
n - skip this change

s - skip remaining changes to this file
f - shelve remaining changes to this file

d - done, skip remaining changes and files
a - shelve all changes to all remaining files
q - quit, shelving no changes

? - display help

To review every change as it is added to the shelf, answer ‘y’ each time. To shelve all of the changes without confirming each one, answer ‘a’

This shelves the changes, and I am safe to switch branches

Verify Shelved Changes

1. To verify that your changes have been shelved, run:

>hg shelve --list
DE4653_DE3847_Shelved

This would list all of the shelved changes. In our case, there is only one

2. Another way to confirm that the changes have been shelved is to run ‘hg status’. If the changes were shelved, it should return nothing

3. If you are doing this for the first time and are afraid of losing your work, look at the .hg/shelve directory in the application dir (precisely, the Mercurial working directory). There you should find text files containing the changes for each shelf. You can back those up. You can also email them to a fellow developer to finish, for example when you have shelved uncommitted changes and cannot complete them because you are going out of town, or email them to yourself to finish at home. Putting these text files in the .hg/shelve folder on your home computer makes them available to unshelve and finish
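The backup described in step 3 can be scripted. Here is a minimal sketch, assuming the shelves live under .hg/shelve as described above and that tar is available. The function names and the archive path are made up for illustration.

```shell
# Pack the shelve directory of a repository into an archive.
# $1: repository root, $2: archive file to create
backup_shelves() {
    tar -czf "$2" -C "$1/.hg" shelve
}

# Unpack a shelve archive into another clone of the repository.
# $1: archive file, $2: repository root to restore into
restore_shelves() {
    mkdir -p "$2/.hg" && tar -xzf "$1" -C "$2/.hg"
}
```

For example, backup_shelves . ~/shelves.tar.gz at work, then restore_shelves ~/shelves.tar.gz . inside the clone at home. Note that restoring overwrites any shelf of the same name already present in the target.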

Unshelve Changes

2. Create New Branch. To create a new branch, run:

>hg branch DE4653_DE3847

This not only creates the new branch, it also automatically switches you to it. To check which branch you are on at any time, run ‘hg branch’

3. Unshelve Changes

When ready, to unshelve your changes, run:

>hg unshelve --name DE4653_DE3847_Shelved

This takes our previously shelved uncommitted changes and applies them to our working directory. To verify, you can run ‘hg st’ or ‘hg shelve --list’. The latter returns nothing in our case, because after unshelving, the shelf DE4653_DE3847_Shelved is destroyed.

Now we can commit the changes on the appropriate separate branch and create a pull request, thus accomplishing our goal.

4. Nudge Changes

This is not related to shelving, but it is part of the exercise of putting your changes on a separate branch. The last step, after unshelving and committing the changes, is to push your branch into the central repo. This allows others to review the code and approve the pull request (e.g. on BitBucket)

A plain ‘hg push’ pushes all branches in your local repo, which you may not want. You only need to push your new branch. For that purpose, use ‘hg nudge’, a push alias that sends only the revision you are currently on, together with its ancestors. Before it works, you need to update .hgrc, the global Mercurial configuration in your home directory, as follows:

...
[alias]
nudge = push --rev .

Afterwards, make sure you are on the proper branch and run:

>hg nudge

This pushes just the unshelved changes, as a separate branch, into the central repo

Bug – Unable to Shelve Added Files

It turns out that the current version of the ‘hgshelve’ extension has a bug that prevents shelving any files that have been added to the working directory and have status ‘A’.

The best solution I found is to use a different Mercurial extension, hgattic, which provides the shelve feature but does not have the bug. Make sure you grab the version from the repo that works with the newest version of Mercurial: https://bitbucket.org/sinbad/hgattic/

Everything works the same way, except that the commands are named slightly differently. It also adds some extra goodies, all of which you can learn about from the hgattic wiki
