Trung H

Today two guys asked me about how to pass an interview for a Software Engineering position (as I was offered SE jobs from Palantir and Google, and failed at tons of others). My answer was simple: think hard of a good solution and speak out loud what you are thinking, try to code without bugs, know how to test it and know what you are talking about, especially different data structures.

What I mean by “know” is that you should be able to explain the concept in a simple way and to answer most of the questions related to the concept.

To give an example, in an interview question, you can be asked: “Given a stream of sales: (item_id, number_of_items), how would you recommend a top N best-selling items”. Then be prepared to expect the conversation to be like:

Possible Question: how do you solve the problem?
Possible Answer: I use a max heap to remain the order of items with most sales. For any query, I return the top N items out of the heap. It takes O(NlgK) where K is the number of items in the heap.
Possible Question: How do you update the heap when a new tuple is received?
Possible Answer: For each tuple received from the stream, I update the counter of item_id in the heap, it takes O(lgK).
Possible Question: How do you know which node in the heap to update?
Possible Answer: In order to know which node to update in the heap, I use a hashmap item_id->node_id to keep track of where the item is located in the heap.
Possible Question: How does a hashmap work?
Possbile Answer: It keeps a list of linked list of (key, value). For lookup/modification, it uses the hash function of the key to locate the linked link, then iterates through the linked list to search for the key or add the new item to the linked list.
Possible Question: Implement a hashmap?
Possible Answer:
class HashMap(object):
def put(k, v):
….
def get(k):
….
Possible Question: How to remain O(1) for get/put?
Possible Answer: Keep the hash function of the key to be uniformly distributed.
Possible Question: What happens when the number of items is larger than the size of the list.
Possible Answer: I double the size of the list. Then, redistribute the items to the new list based on the hash(key) % list.size
Possible Question: How do you design the hash function?
Possible Answer: Do s.t like this:

f(x)=(((31+x.t1)∗31+x.t2)∗31+x.t3)… as suggested in Effective Java by Joshua Bloch.
Possible Question: Why do you choose 31?
Possible Answer: Because 31 is prime
Possible Question: Why do you choose a prime number?
Possible Answer: Assuming

k=ab⇒f(x)=(((ab+t1)∗ab+t2)ab+t3…),

⇒f(x)=(((abab+t1∗ab+t2)ab+t3…)=(((ababab+t1abab+t2ab+t3)…). Then,

(t1=a,t3=abab) collides with

(t1=1,t3=aabab). Hence, the less factors

k has, the less collisions can happen.
Finally, interviewer will possibly say “cool”. Other possible questions include “how does the heap work?”, “implement a heap”, “what if I want top N-items in only last week”, “how to get top N from the heap in O(NlgN)”, “how does arraylist work?”, “implement LRU”,…

Make sure you know when interviewing for a grad level role:

Data structures: array, vector, trees, queue, stack, heap, hashmap, treemap.
Algorithms: sorts, recursive programming, dynamic programming, string matching, basic Graph algorithms (BFS, DFS, Dijkstra).
Concurrency: forking, deadlock, mutex, semaphore, Map/Combiner/Reduce.
Memory allocation and garbage collection.
Sql, database transaction and different isolation levels.
Bit operations: |, &, ^, <<, >>.
How disk works (seek, read, write).
Maintainability: different design patterns, git/svn, service-oriented architecture.
Scalability: load balancers, cache, proxy, index, shards, queue, replication.
Availability: share-nothing architecture (so SPOF), eventual consistency, RAIDs.
Consistency: vector clock, CAP theorem, FLP theorem, different levels of consistency, 2 phase, 3 phase commit, names of some concensus algorithms, consistent hashing, gossip protocols.
Basic concrete maths and probability.
Zen of Python (my bonus ;)).

Places to practice: Leetcode, Careercup, Glassdoor. Topcoder and Codeforces for more advanced algorithms.

Apart from technical performance, make sure you are polite, humble, and clear. Despite all of these effort, you still may fail for some unknown reasons. Not all interviewers are looking for the same thing or there will be other people who can do better than you. Just try your best and move on.

In this Saturday morning, I decide to have a look through an ensemble method that has recently driven many successes on Kaggle competitions called “Gradient Boosting Machine”. I then try to reimplement it in python so that I can understand it better in practice.

As its name indicates, GBM trains many models turn by turn and each new model gradually minimises the loss function of the whole system using Gradient Descent method. Assuming each individual model

i is a function

h(X;pi)(which we call “base function” or “base learner”) where

X is the input and

pi is the model parameter. Now let’s choose a loss function

L(y,y′) where

y is the training output,

y′ is the output from the model. In GBM,

y′=∑Mi=1βih(X;pi) where

M is the number of base learners. What we need to do now is to find:

β, P = arg min {β i, p i} M 1 L (y, \sum i = 1 M β i h (X; p i))

However this is not easy to achieve optimal parameters. Instead we can try a greedy approach that reduces the loss function stage-by-stage:

β m, p m = arg min β, p L (y, F m - 1 (X) + β h (X; p))

And then we update:

F m = F m - 1 (X) + β m h (X; p m)

In order to reduce

L(y,Fm), an obvious way is to step toward the direction where the gradient of

L descents:

- g m (X) = - [\partial L ( y , F ( X ) ) \partial F ( X )] F (X) = F m - 1 (X)

However what we want to find out is

βm and

pm so that

Fm∼Fm−1–gm. In another way,

βmh(X;pm) is most similar to

−gm:

β m, p m = arg min β, p \sum i = 1 N [- g m (x i) - β h (x i; p)] 2

We can then fine-tune

βm so that:

β m = arg min β' L (y, F m - 1 (X + β' h (x; p)))

I wrote a small python script to demonstrate a simple GBM trainer to learn

y=xsin(x) over the base function

y=xexp(x) using the loss function

L(y,y′)=∑i(yi–y′i)2:

from numpy import *

from matplotlib import *

from scipy.optimize import minimize

def gbm(L, dL, p0, M=10):

    """ Train an ensemble of M base leaners.
    @param L: callable loss function
    @param dL: callable loss derivative
    @param p0: initial guess of parameters
    @param M: number of base learners

    """

    p = minimize(lambda p: square(-dL(0) - p[0]*h(p[1:])).sum(), p0).x

    F = p[0]*h(p[1:])

    Fs, P, losses = [array(F)], [p], [L(F)]

    for i in xrange(M):

        p = minimize(lambda p: square(-dL(F) - p[0]*h(p[1:])).sum(), p0).x

        p[0] = minimize(lambda a: L(F + a*h(p[1:])), p[0]).x

        F += p[0]*h(p[1:])

        Fs.append(array(F))

        P.append(p)

        losses.append(L(F))

    return F, Fs, P, losses

X = arange(1, 10, .1)

Y = X*sin(X)

plot(X, Y)

h = lambda a: a[0]*exp(a[1]*X)

L = lambda F: square(Y - F).sum()

dL = lambda F: F - Y

a0 = asarray([1, 1, 1])

# Build an ensemble of 100 base leaners

F, Fs, P, losses = gbm(L, dL, a0, M=100)

f = figure(figsize=(10, 10))

_ = plot(X, Y)

_ = plot(X, zip(*Fs))

#plot losses after each base leaner is added

plot(losses)

At the end, I would like to recommend Scikit learn to you – a great Python library to work with GBM.

Bibliography
Friedman H J. Greedy Function Approximation: A Gradient Boosting Machine. IMS 1999 Reitz Lecture. URL:http://www-stat.stanford.edu/~jhf/ftp/trebst.pdf.

Trung H

How to pass a technical interview for a fresh Software Engineering role?

An introduction to Gradient Boosting Machine

About Me