<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://diliprajbaral.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://diliprajbaral.com/" rel="alternate" type="text/html" /><updated>2026-01-06T16:21:47+00:00</updated><id>https://diliprajbaral.com/feed.xml</id><title type="html">Dilip Raj Baral</title><subtitle>Software engineering notes, lessons, and opinions.</subtitle><author><name>Dilip Raj Baral</name></author><entry><title type="html">Quick SVN guide for Git users; SVN: The Git Way</title><link href="https://diliprajbaral.com/blog/quick-svn-guide-for-git-users-svn-the-git-way/" rel="alternate" type="text/html" title="Quick SVN guide for Git users; SVN: The Git Way" /><published>2018-01-13T00:00:00+00:00</published><updated>2018-01-13T00:00:00+00:00</updated><id>https://diliprajbaral.com/blog/quick-svn-guide-for-git-users-svn-the-git-way</id><content type="html" xml:base="https://diliprajbaral.com/blog/quick-svn-guide-for-git-users-svn-the-git-way/"><![CDATA[<p>Why would a Git user want to switch to SVN, you ask?</p>

<p>Well, sometimes you just don’t have a choice. Imagine working on a project that has been maintained in SVN for a decade. “But migrating an SVN codebase to Git is no big deal,” you might say. True, but there are things like CI/CD integrations to worry about too. Those aren’t huge hurdles either, but sometimes people take “Don’t fix what ain’t broke” a little too seriously.</p>

<p>Reasons aside, since I already had a solid grasp of version control concepts (distributed version control, for that matter), I didn’t want to go through SVN guides from scratch. While there were plenty of resources on the web about SVN-to-Git migration, I couldn’t find a quick, concise guide to help me start working with an SVN repo right away. If you are like me, you will find this article helpful. The following steps show you how to work with SVN the Git way.</p>

<h2 id="cloning-a-new-repo">Cloning a new repo</h2>
<p>Checking out a repo is similar to cloning in Git.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>svn checkout &lt;path-to-your-repo-branch&gt; &lt;path-to-checkout&gt;
</code></pre></div></div>

<h4 id="example">Example</h4>
<p>The following checks out your code to your current working directory.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>svn checkout https://mysvnrepo.com/myrepo/trunk <span class="nb">.</span>
</code></pre></div></div>

<h2 id="creating-a-new-topic-branch">Creating a new topic branch</h2>
<p>In SVN, branches (and tags) are nothing but copies of another branch: conceptually a literal copy of the files (internally, SVN makes cheap copies), unlike in Git, where a branch is just a pointer to a commit. This fact took me a while to digest and get used to.</p>

<p>The following command is the SVN equivalent of <code class="language-plaintext highlighter-rouge">git checkout -b branch</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>svn copy &lt;path-to-a-branch&gt; &lt;path-for-new-branch&gt; <span class="nt">-m</span> <span class="s2">"Message"</span>
</code></pre></div></div>

<h4 id="example-1">Example</h4>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>svn copy <span class="nt">--parents</span> https://mysvnrepo.com/myrepo/trunk https://mysvnrepo.com/myrepo/branches/feature-branch
svn switch https://mysvnrepo.com/myrepo/branches/feature-branch
</code></pre></div></div>

<h2 id="working-on-the-repo">Working on the repo</h2>
<h3 id="adding-new-files">Adding new files</h3>
<p>To add new files, you would use:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>svn add &lt;path-to-file&gt;
</code></pre></div></div>

<p>As for modified files, we don’t need to add them. We can commit straight away.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>svn commit <span class="nt">-m</span> <span class="s2">"Commit message"</span>
</code></pre></div></div>

<p>To commit only specific files, we need to list files after the commit message.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>svn commit <span class="nt">-m</span> <span class="s2">"Commit message"</span> &lt;path-to-file-1&gt; &lt;path-to-file-2&gt;
</code></pre></div></div>

<p>If we want to commit a single file, we can do the following too.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>svn commit &lt;path-to-file&gt; <span class="nt">-m</span> <span class="s2">"Commit message"</span>
</code></pre></div></div>

<h3 id="checking-out-new-changes">Checking out new changes</h3>
<p>The following is the SVN equivalent to <code class="language-plaintext highlighter-rouge">git fetch &amp;&amp; git merge</code> or <code class="language-plaintext highlighter-rouge">git pull</code>.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>svn update
</code></pre></div></div>

<h3 id="merging-your-feature-branch-to-trunk">Merging your feature branch to trunk</h3>
<p>Merging a branch in SVN is similar to how we do it in Git.</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>svn merge &lt;path-to-branch-to-merge&gt;
</code></pre></div></div>

<h4 id="example-2">Example</h4>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>svn update
svn switch https://mysvnrepo.com/myrepo/trunk
svn update
svn merge https://mysvnrepo.com/myrepo/branches/feature-branch
svn commit <span class="nt">-m</span> <span class="s2">"Merge feature branch to trunk"</span>
</code></pre></div></div>

<h3 id="deleting-feature-branch-after-merging">Deleting feature branch after merging</h3>
<p>To delete a feature branch (or any branch for that matter), <code class="language-plaintext highlighter-rouge">svn delete</code> is used.</p>

<h4 id="example-3">Example</h4>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code>svn delete https://mysvnrepo.com/myrepo/branches/feature-branch <span class="nt">-m</span> <span class="s2">"Delete feature branch after merging"</span>
</code></pre></div></div>]]></content><author><name>Dilip Raj Baral</name></author><category term="version-control" /><category term="svn" /><category term="git" /><category term="svn" /><category term="git" /><category term="version-control" /><category term="workflow" /><summary type="html"><![CDATA[This is a quick Subversion (SVN) guide for Git users. It helps you get started with SVN right away, the Git way.]]></summary></entry><entry><title type="html">Topic Modelling using LDA with MALLET</title><link href="https://diliprajbaral.com/blog/topic-modelling-lda-mallet/" rel="alternate" type="text/html" title="Topic Modelling using LDA with MALLET" /><published>2017-06-04T00:00:00+00:00</published><updated>2017-06-04T00:00:00+00:00</updated><id>https://diliprajbaral.com/blog/topic-modelling-lda-mallet</id><content type="html" xml:base="https://diliprajbaral.com/blog/topic-modelling-lda-mallet/"><![CDATA[<p>Machine Learning for Language Toolkit, in short MALLET, is a tool written in Java for applications of machine learning such as natural language processing, document classification, clustering, topic modeling, and information extraction to texts. To learn what MALLET has to offer in detail, <a href="http://mallet.cs.umass.edu/index.php">visit this page</a>.</p>

<p>In this post, we see how we can create topic models from a large collection of unlabeled text documents and use the model to infer topics in new documents.</p>

<p>Topic models use different algorithms to extract <em>topics</em> from a <em>corpus of texts</em>. MALLET uses Gibbs sampling based implementations of Latent Dirichlet Allocation (LDA), Pachinko Allocation and Hierarchical LDA. Check <a href="https://tedunderwood.com/2012/04/07/topic-modeling-made-just-simple-enough/">this page</a> if you want to know about topic modeling in detail.</p>

<h2 id="setting-up-mallet">Setting up MALLET</h2>

<p>Go to the MALLET <a href="http://mallet.cs.umass.edu/download.php">download page</a> and download the latest version of MALLET. At the time of writing this post, the latest version is 2.0.8.</p>

<h3 id="installation-on-windows">Installation on Windows</h3>

<p>Ideally, unzip MALLET into your <em>C:</em> drive. Your path to MALLET will then be something similar to <em>C:\mallet-2.0.8</em>. <strong>This directory is referred to as the MALLET directory from here onwards.</strong> Now you will be able to access MALLET from anywhere on the command prompt using <em>C:\mallet-2.0.8\bin\mallet</em>. To avoid typing the full path every time, we can set up an environment variable. To do so, go to <em>Start Menu &gt; Control Panel &gt; System &gt; Advanced System Settings &gt; Environment Variables</em>. Under the <em>User variables</em> section, select <em>PATH</em> and click <em>Edit</em>. Go to the end of the text, type <em>;</em> followed by <em>C:\mallet-2.0.8\bin\</em> and save the change. Now you will be able to access MALLET with just the <em>mallet</em> command. To verify it is working, type the following on the command prompt.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&gt; mallet --help
</code></pre></div></div>

<p>You should see a list of MALLET commands.</p>

<p><em><strong>Note</strong>: Windows uses the backslash (<code class="language-plaintext highlighter-rouge">\</code>) as a directory separator while *nix systems use the forward slash (<code class="language-plaintext highlighter-rouge">/</code>). Examples in this post were run on a *nix system (macOS), so forward slashes are used as directory separators. Remember to change them to backslashes when running the commands on the Windows Command Prompt.</em></p>

<h3 id="installation-on-nix-linux-freebsd-mac-os-x">Installation on *nix (Linux, FreeBSD, Mac OS X)</h3>

<p>Unzip MALLET. Typically, you would unzip to paths like <em>/usr/local/bin</em> or <em>/opt</em>. For this post, I have unzipped to <em>/usr/local/opt/mallet-2.0.8</em>. <strong>This path is referred to as the MALLET directory here onwards.</strong> To avoid typing the full path every time, we can set up a path variable. To do so, open <em>~/.bashrc</em> or <em>~/.bash_profile</em> (for <em>bash</em> shell) depending upon your distribution and add the following line.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>export PATH=$PATH:/usr/local/opt/mallet-2.0.8/bin
</code></pre></div></div>

<p>To put the changes into effect, type the following in your shell:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ . ~/[.bashrc | .bash_profile]
</code></pre></div></div>

<p>You can now access MALLET from anywhere. To verify that it works type:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mallet --help
</code></pre></div></div>

<p>It should list all the MALLET commands.</p>

<h2 id="working-with-mallet">Working with MALLET</h2>

<p>Topic modeling with MALLET is all about three simple steps:</p>

<ol>
  <li>Import data (documents) into MALLET format</li>
  <li>Train your model using the imported data</li>
  <li>Use the trained model to infer the topic composition of a new document</li>
</ol>

<p>In this tutorial, we will use the sample data that comes pre-packaged with MALLET. It is found in the <em>sample-data</em> directory inside the MALLET directory. Before proceeding further, change your current directory to the MALLET directory by typing:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cd [Your MALLET directory]
</code></pre></div></div>


<h2 id="importing-data">Importing Data</h2>

<p>There are two methods of importing data into MALLET format.</p>

<h3 id="importing-directories">Importing directories</h3>

<p>You would import a directory if the source data consists of many separate files. In this case, each file is considered one instance. The following command imports all files from the directory <em>sample-data/web/en</em> and converts them into a single MALLET file named <em>train.mallet</em> in your current directory.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mallet import-dir \
--input sample-data/web/en/ \
--output train.mallet \
--remove-stopwords TRUE \
--keep-sequence TRUE
</code></pre></div></div>

<p>Here, all options except <em>input</em> and <em>output</em> are optional. You can also pass more than one directory; separate the directory names with spaces.</p>

<p><em>remove-stopwords TRUE</em> removes words such as <em>a</em>, <em>an</em>, <em>the</em>, <em>if</em> and so on. By default, MALLET’s default English dictionary of <a href="https://en.wikipedia.org/wiki/Stop_words">stop words</a> is used. If you wish to supply your own list of stop words, customized for your application, you can do so by passing the file name to the <em>stoplist-file</em> option. The stoplist file contains stop words separated by spaces, tab characters, or line breaks.</p>

<p>The MALLET toolkit requires the <em>keep-sequence</em> option to be set to <em>TRUE</em> for topic modeling.</p>

<p>To see more options, type</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mallet import-dir --help
</code></pre></div></div>

<p>In this tutorial, we are using this method.</p>

<h3 id="importing-a-file">Importing a file</h3>

<p>You’d use this method if all of your data is in a single file, with one instance per line in the following format:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[instance_name] [label] [text without line breaks]
</code></pre></div></div>

<p><em>instance_name</em> uniquely identifies each instance. For topic modeling, <em>instance_name</em> and <em>label</em> can be the same.</p>
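<p>A small script can assemble such a file from in-memory documents. The following is a minimal sketch: the document list and the <em>train.txt</em> filename are made up for illustration.</p>

```python
# Assemble MALLET's one-instance-per-line format:
#   [instance_name] [label] [text without line breaks]
docs = [
    ("doc1", "news", "Stocks rallied on Friday after the jobs report."),
    ("doc2", "sport", "The home team\nwon the final in extra time."),
]

with open("train.txt", "w", encoding="utf-8") as f:
    for name, label, text in docs:
        flat = " ".join(text.split())   # collapse internal line breaks
        f.write(f"{name} {label} {flat}\n")
```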

<p>You’d type the following command.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mallet import-file \
--input [file_name] \
--output train.mallet
</code></pre></div></div>

<p>All the options that apply to <em>import-dir</em> also apply to <em>import-file</em>.</p>

<p><em><strong>Note</strong>: If you are importing an extremely large file or collection of files, you might get an ‘Exception in thread “main” java.lang.OutOfMemoryError: Java heap space’ error. This means you have hit MALLET’s memory limit, which is 1 GB by default. To raise the limit, open the file named mallet (or mallet.bat on Windows) in the ‘bin’ directory inside the MALLET directory with a text editor, find the line ‘MEMORY=1g’, and change ‘1g’ to a higher value such as ‘2g’ or ‘4g’, depending on your system’s RAM.</em></p>

<h2 id="training-the-model">Training the model</h2>

<p>After you have imported documents into MALLET format, you need to build a topic model. The following command takes the file <em>train.mallet</em>, which we created in the previous section, creates 5 topics <em>(topics.txt)</em>, and calculates the topic proportions for each instance <em>(topic-composition.txt)</em>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mallet train-topics \
--input train.mallet \
--inferencer-filename inferencer.mallet \
--num-topics 5 \
--output-topic-keys topics.txt \
--output-doc-topics topic-composition.txt
</code></pre></div></div>

<p>If you open <em>topics.txt</em>, you will see 5 lines. In each line, the first number is the topic number, the second number indicates the <em>weight</em> of that topic, and the words that follow are the most frequently occurring words in that topic.</p>

<p>The <em>topic-composition.txt</em> file lists the composition of each instance or document across the topics listed in <em>topics.txt</em>. In each line, the first value is the instance number, the second value is the instance or document name, and the numbers that follow are the weights of the corresponding topics in <em>topics.txt</em>.</p>
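<p>Given that layout, a line of <em>topics.txt</em> can be split into its parts with a few lines of Python. This is a sketch; the sample line below is invented, not real MALLET output.</p>

```python
def parse_topic_keys_line(line):
    """Split one topics.txt line: topic number, weight, then top words."""
    topic_id, weight, *words = line.split()
    return int(topic_id), float(weight), words

# A line shaped like MALLET's output: number, weight, most frequent words
tid, weight, words = parse_topic_keys_line("0\t0.5\tdata topic model text")
```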

<p>To see more options, type</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mallet train-topics --help
</code></pre></div></div>

<h3 id="deciding-the-number-of-topics">Deciding the number of topics</h3>

<p>There is no <em>natural</em> number of topics. To find a suitable number, we have to run <em>train-topics</em> with varying numbers of topics and see how the topic composition breaks down. If the majority of the words group into a very small number of topics, we need to increase the number of topics. On the other hand, if related words fall under different topics, the setting is too broad and we need to reduce the number of topics.</p>

<h2 id="inferring-topic-composition-of-new-documents">Inferring topic composition of new documents</h2>

<p>To infer the topic composition of new documents, you first need to import the new documents into MALLET format similar to what we did in the first section.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mallet [import-dir | import-file] \
--input [directory_name | file_name] \
--output new.mallet \
--remove-stopwords TRUE \
--keep-sequence TRUE \
--use-pipe-from train.mallet
</code></pre></div></div>

<p>Notice the <em>use-pipe-from</em> option. <strong>It is very important that you include this option at this stage.</strong> It ensures that the new data is compatible with the training data, i.e., that both use the same alphabet mappings.</p>

<p>Finally, the following command infers the topic composition of the new documents and stores it in <em>new-topic-composition.txt</em>.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mallet infer-topics \
--input new.mallet \
--inferencer inferencer.mallet \
--output-doc-topics new-topic-composition.txt
</code></pre></div></div>

<p>To see more options, type</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mallet infer-topics --help
</code></pre></div></div>


<p>Please leave your comments or any query you have in the comment section below. I will be happy to help.</p>]]></content><author><name>Dilip Raj Baral</name></author><category term="machine-learning" /><category term="nlp" /><category term="topic-modeling" /><category term="lda" /><category term="mallet" /><category term="topic-modeling" /><category term="nlp" /><category term="machine-learning" /><summary type="html"><![CDATA[Intro to MALLET and a practical guide to building topic models from text.]]></summary></entry><entry><title type="html">Tendencies-based collaborative filtering algorithm</title><link href="https://diliprajbaral.com/blog/tendencies-based-collaborative-filtering-algorithm-recommender-system/" rel="alternate" type="text/html" title="Tendencies-based collaborative filtering algorithm" /><published>2016-12-06T00:00:00+00:00</published><updated>2016-12-06T00:00:00+00:00</updated><id>https://diliprajbaral.com/blog/tendencies-based-collaborative-filtering-algorithm-recommender-system</id><content type="html" xml:base="https://diliprajbaral.com/blog/tendencies-based-collaborative-filtering-algorithm-recommender-system/"><![CDATA[<p>As part of my academic research project titled <em>Impact of Recommender System</em>, I got to study various collaborative filtering algorithms. I was supposed to study, implement, and compare them. Tendencies-based was the best among them in terms of accuracy and computational efficiency. It was proposed by Fidel Cacheda and his team of researchers from University of A Coruna in their paper titled <strong><em>Comparison of Collaborative Filtering Algorithms: Limitations of Current Techniques and Proposals for Scalable, High-Performance Recommender Systems.</em></strong> It was as accurate as other collaborative filtering algorithms like item-based, similarity fusion, and others, if not more accurate than them. It was the most computationally efficient.</p>

<h2 id="algorithm">Algorithm</h2>
<p>The tendencies-based algorithm, instead of looking for relations between users or items, looks at the differences between them.</p>

<p>Often, users with similar opinions rate items in different ways: some users mostly give positive ratings and rate only really bad items negatively, while others usually rate negatively and give positive ratings only to the best items. This algorithm deals with these variations using the concepts of user tendency and item tendency.</p>

<h3 id="notation">Notation</h3>
<p>$r_{ui}$ denotes the rating given by user <em>u</em> to item <em>i</em>. $\hat{r}_{ui}$ denotes the prediction made by the algorithm for the rating of item <em>i</em> by user <em>u</em>. $\mu_u$ denotes user mean rating and $\mu_i$ denotes item mean rating. $I_u$ is the set of items rated by user <em>u</em>, and $U_i$ is the set of users who rated item <em>i</em>.</p>

<h3 id="tendency-calculation">Tendency Calculation</h3>
<p><strong>Tendency of a user ($\tau_u$)</strong> tells whether a user tends to rate items positively. It is defined as the average difference between the user’s ratings and the corresponding item means.</p>

\[\tau_u = \frac{1}{|I_u|} \sum_{i \in I_u} (r_{ui} - \mu_i)\]

<p><strong>Tendency of an item ($\tau_i$)</strong> tells whether users consider it an especially good or especially bad item. It is defined analogously as the average difference between the item’s ratings and the corresponding user means.</p>

\[\tau_i = \frac{1}{|U_i|} \sum_{u \in U_i} (r_{ui} - \mu_u)\]

<h3 id="prediction-calculation">Prediction Calculation</h3>
<p>The algorithm defines four cases based on the signs of the user and item tendencies.</p>

<p>If both the user and the item have a positive tendency:</p>

\[\hat{r}_{ui} = \max(\mu_u + \tau_i, \mu_i + \tau_u)\]

<p>If both the user and the item have a negative tendency:</p>

\[\hat{r}_{ui} = \min(\mu_u + \tau_i, \mu_i + \tau_u)\]

<p>If the user and the item have tendencies of opposite signs, the prediction blends the two estimates:</p>

\[\hat{r}_{ui} = \beta(\mu_u + \tau_i) + (1 - \beta)(\mu_i + \tau_u)\]

<p>Here, $\beta \in [0, 1]$ is a parameter that controls the relative contribution of the user mean and the item mean.</p>

<p>As observed, a simple formula is used in each case, and the calculation is highly efficient: the training time complexity is <strong><em>O(mn)</em></strong>, and a rating can be predicted in <strong><em>O(1)</em></strong> time.</p>
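<p>As a concrete illustration, the cases above can be sketched in Python. This is a minimal sketch under stated assumptions: the NumPy ratings-matrix layout and the $\beta = 0.5$ default are illustrative choices, and the opposite-sign branch blends the two estimates with weight $\beta$, one common formulation.</p>

```python
import numpy as np

def fit_tendencies(R):
    """Compute means and tendencies from a users-x-items rating matrix.

    R uses np.nan for missing ratings.
    """
    user_mean = np.nanmean(R, axis=1)                    # mu_u
    item_mean = np.nanmean(R, axis=0)                    # mu_i
    tau_u = np.nanmean(R - item_mean, axis=1)            # avg of r_ui - mu_i
    tau_i = np.nanmean(R - user_mean[:, None], axis=0)   # avg of r_ui - mu_u
    return user_mean, item_mean, tau_u, tau_i

def predict(u, i, user_mean, item_mean, tau_u, tau_i, beta=0.5):
    """Predict r_ui from the sign-based tendency cases."""
    a = user_mean[u] + tau_i[i]   # user-mean-based estimate
    b = item_mean[i] + tau_u[u]   # item-mean-based estimate
    if tau_u[u] >= 0 and tau_i[i] >= 0:   # both tendencies positive
        return max(a, b)
    if tau_u[u] < 0 and tau_i[i] < 0:     # both tendencies negative
        return min(a, b)
    # opposite signs: blend the two estimates with weight beta
    return beta * a + (1 - beta) * b
```

In practice, a prediction on a 1-to-5 scale would also be clamped to the scale’s bounds.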

<p><a href="https://github.com/rajbdilip/tendencies-based-recommender-system">Implementation code can be downloaded from my GitHub repository.</a></p>]]></content><author><name>Dilip Raj Baral</name></author><category term="machine-learning" /><category term="recommender-systems" /><category term="algorithms" /><category term="collaborative-filtering" /><category term="recommender" /><category term="machine-learning" /><category term="algorithm" /><summary type="html"><![CDATA[Notes on tendencies-based collaborative filtering and why it is efficient and accurate.]]></summary></entry><entry><title type="html">Text search using Stochastic Diffusion Search</title><link href="https://diliprajbaral.com/blog/text-search-using-stochastic-diffusion-search/" rel="alternate" type="text/html" title="Text search using Stochastic Diffusion Search" /><published>2016-11-06T00:00:00+00:00</published><updated>2016-11-06T00:00:00+00:00</updated><id>https://diliprajbaral.com/blog/text-search-using-stochastic-diffusion-search</id><content type="html" xml:base="https://diliprajbaral.com/blog/text-search-using-stochastic-diffusion-search/"><![CDATA[<p>Stochastic Diffusion Search (SDS), a multi-agent population-based global search and optimization algorithm, is a distributed mode of computation utilizing interaction between simple agents. SDS shows off a strong mathematical framework. It is robust, has minimal convergence criteria and linear time complexity.</p>

<p>SDS has been applied to diverse problems such as text search, object recognition, feature tracking, mobile robot self-localization and site selection for wireless networks. As a part of my <em>Optimization Techniques</em> laboratory project, I implemented a text search using SDS.</p>

<h2 id="basic-sds-algorithm">Basic SDS Algorithm</h2>

<p>The SDS algorithm has many variations. The following is the basic version.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>1. For all agents do
2. INITIALIZE: Agent picks a random hypothesis 
3. TEST: Agent partially evaluates her hypothesis 
- If test criterion = TRUE, agent = Active (satisfied) 
- Else agent = Inactive (dissatisfied) 
4. DIFFUSE
- Inactive agent meets a randomly chosen agent 
- Inactive agent updates/changes hypothesis 
5. REPEAT until Halting criterion.
</code></pre></div></div>

<h2 id="sds-text-search-algorithm">SDS Text Search Algorithm</h2>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>INITIALIZATION PHASE
Each agent selects a haystack offset (hypothesis) at random.
WHILE (NOT all agents are active)
TESTING PHASE
Each agent randomly selects an offset less than the length of needle (needle offset) and matches the character in haystack at (haystack offset + needle offset) with the character in needle at needle offset
IF the letters match
Agent is active.
ELSE
Agent is inactive.
DIFFUSION PHASE
Each inactive agent selects another agent at random.
IF the selected agent is active
The inactive agent adopts the hypothesis (index) of that agent.
ELSE
The inactive agent selects a new index (hypothesis) at random.
END WHILE
Each agent's haystack offset is the starting index of needle.
</code></pre></div></div>
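<p>The phases above translate almost directly into Python. This is a minimal sketch, not the linked Java implementation: the agent count, the fixed seed, and the consensus readout at the end are illustrative choices.</p>

```python
import random

def sds_text_search(haystack, needle, n_agents=20, seed=0):
    """Minimal Stochastic Diffusion Search for substring matching."""
    rng = random.Random(seed)
    max_off = len(haystack) - len(needle)
    # INITIALIZATION: each agent picks a random haystack offset (hypothesis)
    hyps = [rng.randint(0, max_off) for _ in range(n_agents)]
    active = [False] * n_agents
    while not all(active):
        # TESTING: each agent checks one randomly chosen needle character
        for a in range(n_agents):
            k = rng.randrange(len(needle))            # needle offset
            active[a] = haystack[hyps[a] + k] == needle[k]
        # DIFFUSION: each inactive agent polls a randomly chosen agent
        for a in range(n_agents):
            if not active[a]:
                other = rng.randrange(n_agents)
                if active[other]:
                    hyps[a] = hyps[other]             # adopt its hypothesis
                else:
                    hyps[a] = rng.randint(0, max_off) # pick a new one
    # agents have converged; report the consensus offset
    return max(set(hyps), key=hyps.count)
```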

<h2 id="implementation">Implementation</h2>

<p>The following links to the Java implementation of the above algorithm on GitHub.</p>

<p><a href="https://github.com/rajbdilip/stochastic-diffusion-search">https://github.com/rajbdilip/stochastic-diffusion-search</a></p>

<h2 id="observation">Observation</h2>

<p>Stochastic Diffusion Search text search is linear and fast. Agents search, communicate with each other, and converge in very few iterations: around 100 iterations with 5 agents on a haystack of length around 500. Small increases in the number of agents tend to decrease the number of iterations required to converge to the solution, but it cannot be ensured that big increases will significantly reduce the number of iterations; the number of iterations can even increase.</p>

<p>However, if the needle is present at more than one offset, it cannot be guaranteed that this algorithm will find all of the occurrences, regardless of the number of agents used. Since SDS is stochastic, a different offset may be returned on each run.</p>

<h2 id="references">References</h2>

<p>al-Rifaie, Mohammad Majid, and John Mark Bishop. “Stochastic diffusion search review.” <em>Paladyn, Journal of Behavioral Robotics</em> 4.3 (2013): 155-173.</p>]]></content><author><name>Dilip Raj Baral</name></author><category term="algorithms" /><category term="search" /><category term="optimization" /><category term="stochastic-diffusion-search" /><category term="search" /><category term="optimization" /><category term="java" /><category term="algorithms" /><summary type="html"><![CDATA[Overview of Stochastic Diffusion Search and a simple algorithm outline for text search.]]></summary></entry><entry><title type="html">Social Media Integration into CiviCRM - CiviSocial</title><link href="https://diliprajbaral.com/blog/social-media-integration-into-civicrm/" rel="alternate" type="text/html" title="Social Media Integration into CiviCRM - CiviSocial" /><published>2016-10-01T00:00:00+00:00</published><updated>2016-10-01T00:00:00+00:00</updated><id>https://diliprajbaral.com/blog/social-media-integration-into-civicrm</id><content type="html" xml:base="https://diliprajbaral.com/blog/social-media-integration-into-civicrm/"><![CDATA[<p>So my proposal for <a href="https://developers.google.com/open-source/gsoc/">Google Summer of Code</a> 2016 was accepted and I was one of the 1206 Google student developers all around the globe. I got a chance to work on <a href="https://en.wikipedia.org/wiki/CiviCRM">CiviCRM</a> alongside its developers from all around the world. CiviCRM is a web-based, open source constituency relationship management specifically designed for the needs of non-profit, non-governmental, and advocacy groups, and serves as an association management system. Volunteers, activists, voters as well as more general sorts of business contacts such as employees, clients, or vendors can be managed using CiviCRM.</p>

<p>My project was titled “<a href="https://summerofcode.withgoogle.com/projects/#5737064465170432">Social Media Integration</a>” and aimed to boost the exposure of CiviCRM as a platform and make it even easier for people to connect. Specifically, I had to develop an extension for CiviCRM that would allow users to more easily fill out forms and sign petitions using social login. It would also allow event registrations in CiviCRM to be reflected as RSVPs for parallel Facebook events. Moreover, it would allow CiviCRM admins to integrate multiple social networks and pull any relevant user activity data.</p>

<p>The coding began on May 22, 2016 and went through August 23, 2016. By the end of the program, most of the project goals were met with a few pending updates. Exact features of the extension can be found <a href="https://github.com/rajbdilip/org.civicrm.civisocial#civisocial---social-media-integration">here</a>. The extension is hosted on <a href="https://github.com/rajbdilip/org.civicrm.civisocial">GitHub</a>. The installation and configuration instructions can be found <a href="https://github.com/rajbdilip/org.civicrm.civisocial/blob/master/README.md">here</a>.</p>

<p>I will further work on the extension to add more features. Any code contributions or feature suggestions are welcome.</p>]]></content><author><name>Dilip Raj Baral</name></author><category term="open-source" /><category term="civicrm" /><category term="project" /><category term="gsoc" /><category term="civicrm" /><category term="social-media" /><category term="integration" /><category term="crm" /><summary type="html"><![CDATA[Summary of my Google Summer of Code project building a CiviCRM social media integration extension.]]></summary></entry><entry><title type="html">Facebook PHP SDK 4.0 - Re-asking declined permissions</title><link href="https://diliprajbaral.com/blog/facebook-php-sdk-4-0-re-asking-declined-permissions/" rel="alternate" type="text/html" title="Facebook PHP SDK 4.0 - Re-asking declined permissions" /><published>2014-07-16T00:00:00+00:00</published><updated>2014-07-16T00:00:00+00:00</updated><id>https://diliprajbaral.com/blog/facebook-php-sdk-4-0-re-asking-declined-permissions</id><content type="html" xml:base="https://diliprajbaral.com/blog/facebook-php-sdk-4-0-re-asking-declined-permissions/"><![CDATA[<p><strong>UPDATE:</strong></p>

<p>The Facebook PHP SDK now provides the <em>getReRequestUrl()</em> method on the <em>FacebookRedirectLoginHelper</em> class to generate a URL that re-requests permissions a user has declined.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public string getReRequestUrl(string $redirectUrl, array $scope = [], string $separator = '&amp;')
</code></pre></div></div>

<p>Read the documentation <a href="https://developers.facebook.com/docs/php/FacebookRedirectLoginHelper/5.0.0#get-re-request-url">here</a>.</p>

<hr />

<p>So I was testing Facebook Login integration on <a href="http://www.treasherlocked.com/">www.treasherlocked.com</a>, a website I have been developing for a while. The web app required permissions like <em>email</em> and <em>user_location</em>, so it was programmed to re-ask for any permissions the user denied in the Login Dialog.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$facebook = new Facebook(APP_ID, APP_SECRET, REDIRECT_URI);
if ( $facebook-&gt;IsAuthenticated() ) {
    // Verify that all of the scopes have been granted
    if ( !$facebook-&gt;verifyScopes( unserialize(SCOPES) ) ) {
        header( "Location: " . $facebook-&gt;getLoginURL( $facebook-&gt;denied_scopes ) );
        exit;
    }
    ...
}
</code></pre></div></div>

<p><em>Note: $facebook is a custom class that I built and not a part of Facebook PHP SDK.</em></p>

<p>But it wasn’t showing the Login Dialog again when permissions were denied. Instead, Facebook kept redirecting to the <em>Redirect URI</em>, creating a redirect loop. I even <a href="http://stackoverflow.com/questions/24716168/facebook-php-sdk-4-0-cannot-re-ask-read-permissions-once-denied">asked on Stack Overflow</a> but got no answer. With the deadline approaching, I couldn’t afford to wait and kept looking for solutions. After hours of googling, I landed on a <a href="https://developers.facebook.com/docs/facebook-login/login-flow-for-web/v2.0#re-asking-declined-permissions">Facebook Login API doc</a> page that addresses exactly this issue: all that needs to be done is append an <em>auth_type=rerequest</em> parameter to the login URL’s query string. This feature had not yet been implemented in Facebook PHP SDK 4.0, though there was a <a href="https://github.com/facebook/facebook-php-sdk-v4/issues/146">proposal on GitHub</a> for it. So I took the liberty of forking the project and making a small change to the <em>getLoginUrl()</em> method’s prototype and definition in the <em>FacebookRedirectLoginHelper</em> class. The <em>getLoginUrl()</em> prototype then looked like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public function getLoginUrl($redirectUrl, $scope = array(), $rerequest = false, $version = null)
</code></pre></div></div>

<p>I sent a pull request to the project repo which was quickly merged. (<a href="https://www.sammyk.me/how-to-contribute-to-an-open-source-project-on-github">Read this article</a> if you want to know how to contribute to an open source project if you aren’t contributing already.)</p>
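<p>Under the hood, re-asking works by adding an <em>auth_type=rerequest</em> parameter to the OAuth dialog’s query string, which is what the SDK change does for you. As a minimal sketch without the SDK (the <em>buildRerequestLoginUrl()</em> helper, the app ID, and the redirect URI below are all hypothetical placeholders):</p>

```php
<?php
// Hypothetical sketch: build a Facebook Login Dialog URL that re-asks
// declined permissions. The auth_type=rerequest parameter is what tells
// Facebook to show the dialog again for previously denied scopes.
function buildRerequestLoginUrl($appId, $redirectUri, array $scopes)
{
    $params = array(
        'client_id'    => $appId,        // placeholder app ID
        'redirect_uri' => $redirectUri,  // placeholder callback URL
        'scope'        => implode(',', $scopes),
        'auth_type'    => 'rerequest',   // re-ask declined permissions
    );
    return 'https://www.facebook.com/dialog/oauth?' . http_build_query($params);
}

$url = buildRerequestLoginUrl('1234567890', 'https://example.com/callback',
    array('email', 'user_location'));
echo $url;
```

<p>In a real app you would not build this URL by hand; the SDK’s <em>getLoginUrl()</em> with the <em>$rerequest</em> flag (or, in later versions, <em>getReRequestUrl()</em>) assembles it for you.</p>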

<p>So, if you need to re-ask declined permissions, all you have to do is pass <em>true</em> as the third argument. The code in the callback script will look something like the following.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>&lt;?php
...
$helper = new FacebookRedirectLoginHelper();
if ($permissions_were_declined) {
    header("Location: " . $helper-&gt;getLoginUrl($redirect_uri, $declined_scopes, true));
    exit;
}
...
?&gt;
</code></pre></div></div>]]></content><author><name>Dilip Raj Baral</name></author><category term="web-dev" /><category term="php" /><category term="facebook" /><category term="php" /><category term="facebook" /><category term="sdk" /><category term="permissions" /><category term="oauth" /><summary type="html"><![CDATA[How to re-request declined Facebook Login permissions with the PHP SDK 4.0 helper method.]]></summary></entry></feed>