-
Transforming Python Lists into Spark Dataframes
Data represented as dataframes is generally much easier to transform, filter, or write to a target source. In Spark, data loaded or queried from a source automatically arrives as a dataframe.
Here’s an example of loading, querying, and writing data using PySpark and SQL:
import pyspark

# Define your SparkContext and SparkSession
sc = pyspark.context.SparkContext(master='host', appName='Sample App')
session = pyspark.sql.session.SparkSession(sc)

"""
Load your data using:
- session.read.json('some/path/or/url')
- session.read.parquet('some/path/or/url')
- session.read.csv('some/path/or/url')
- session.read.text('some/path/or/url'), etc.
"""
data = session.read.json('some/path/or/url')

# Register the DataFrame as a temp view so it can be queried with SQL
data.createOrReplaceTempView("table")

# Apply some SQL query to the data, which results in a DataFrame
df = session.sql("""
    select col1, col2, sum(col3)
    from table
    where col4 = 'some_val'
    group by col1, col2
""")

# Write the query results to a target in your desired format (say, JSON)
df.write.json('target/path/')
The example above works conveniently if you can easily load your data as a dataframe using PySpark’s built-in functions. But sometimes you’re in a situation where your processed data ends up as a list of Python dictionaries, say when you weren’t required to use session.read and/or session.sql. How can you load that data as a Spark DataFrame in order to take advantage of its capabilities?
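One convenient way, sketched below, is to convert each dictionary into a pyspark.sql.Row and pass the list to the session’s createDataFrame method. The records, field names, and values here are hypothetical:

from pyspark.sql import Row

# Hypothetical processed data: a list of Python dictionaries
records = [
    {'name': 'alice', 'score': 10},
    {'name': 'bob', 'score': 20},
]

# Convert each dict to a Row so Spark can infer the schema from the keys,
# then build a DataFrame from the resulting list
rows = [Row(**record) for record in records]
df = session.createDataFrame(rows)

df.show()

This reuses the session object defined earlier; from here, df supports the same SQL queries and writers shown above.

-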
How the loss and cost functions got their forms in logistic regression
In logistic regression, the loss function $\mathcal{L}$ and the cost function $J$ take the forms

$$\mathcal{L}(\hat{y}, y) = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]$$

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)$$

where $\hat{y} = \sigma(w^T x + b)$ is the sigmoid of the linear superposition of the features $x$ represented in matrix form, with $w$ as the vector of the feature weights, and $b$ as the intercept term. The sigmoid of some function $z$ is defined as

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

To understand why $\mathcal{L}$ and $J$ take such forms, first note that $\hat{y}$ is the probability of the binary classification variable $y$ being equal to a positive example ($y = 1$), i.e. $\hat{y} = P(y = 1 \mid x)$.
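From there, a standard maximum-likelihood argument (sketched here) recovers the loss: both cases of the Bernoulli probability can be written as a single expression,

$$P(y \mid x) = \hat{y}^{\,y} \, (1 - \hat{y})^{1 - y}$$

and taking its negative logarithm gives

$$-\log P(y \mid x) = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right] = \mathcal{L}(\hat{y}, y)$$

so minimizing $J$ over the training set is equivalent to maximizing the likelihood of the observed labels.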
-
Using Breaks To Get More Deep Work Done
It’s a great thing to constantly have goals that require prolonged periods of deep concentration, and it’s something I always look forward to. Deep work gives us a sense of great accomplishment when we’re finished, as well as expanded expertise in the domains we’ve tackled along the way.
But of course, many of us don’t buy into the “delayed gratification” thing, perhaps due to biological and historical reasons. (Sadly, many of us never liked school, and would be happy never to go back.) Nature, by default, always follows the path of lowest energy and least resistance.
-
How Are Bubbles Formed?
I’ve always thought of a bubble as a compounded result of residual greed when the optimism of the many is perpetually validated.
This reminds me of what Warren Buffett tells us to do when we see compelling evidence of an impending bubble:
“Be fearful when others are greedy, and greedy when others are fearful.” – Warren Buffett
But only a few have the discipline to do this, because it’s easy to forget the principle when you’re constantly seduced by social proof of everyone you know getting rich, everywhere you look.
-
On Change vs. Being Who You Are
One of the things I’ve come to understand better lately is the concept of adapting how I behave to the context I’m in. I’m aware that I’ve been doing this, but I’ve also constantly questioned it.
I’ve been wondering if I might be betraying myself, or whether acting against my familiar behavioural inclinations is psychologically unhealthy in the long term.
-
Machine Learning: Entropy and Classification
A Simple Classification Example
Let’s say we have a dataset with categorical features $x_1$, $x_2$, $x_3$, and a binary target variable $y$:

| id | Feature $x_1$ | Feature $x_2$ | Feature $x_3$ | Target $y$ |
|----|---------------|---------------|---------------|------------|
| 1  | a             | c             | e             | …          |
| 2  | b             | d             | e             | …          |
| 3  | b             | d             | f             | …          |
| 4  | a             | d             | e             | …          |
| 5  | a             | c             | f             | …          |
| 6  | b             | d             | f             | …          |

The goal is to find the feature that best predicts the value of $y$.
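One way to measure this is with entropy and information gain, as a decision tree would. Below is a minimal Python sketch; the target labels are assumed (hypothetical), since the actual values of $y$ are not reproduced above:

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(rows, labels, feature_index):
    """Entropy reduction from splitting the dataset on one feature."""
    total = len(labels)
    base = entropy(labels)
    remainder = 0.0
    for value in {row[feature_index] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels)
                  if row[feature_index] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return base - remainder

# Feature values from the table above; the target labels are made up
# for illustration, since the real values are not shown.
rows = [('a', 'c', 'e'), ('b', 'd', 'e'), ('b', 'd', 'f'),
        ('a', 'd', 'e'), ('a', 'c', 'f'), ('b', 'd', 'f')]
labels = [1, 0, 0, 1, 1, 0]

for i in range(3):
    print(f"x{i + 1}: information gain = {information_gain(rows, labels, i):.3f}")

With these assumed labels, $x_1$ separates the classes perfectly (a gain of 1 bit), which is exactly the kind of feature this procedure selects.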
-
MySQL: Columns as Ordered Week Dates
Let’s say you have data containing some metrics and their values across an ordered set of dates in a week. Since most screens are wider than they are tall, it’s sometimes better to present the data with one metric per row and the dates in columns, rather than the other way around.
The usual way we show tables is like this:
| date       | Visitors | Orders | Revenue | Metric4 | etc. |
|------------|----------|--------|---------|---------|------|
| 2016-02-28 | 1423     | 19     | 900     | …       | …    |
| 2016-02-29 | 1534     | 38     | 2037    | …       | …    |
| 2016-03-01 | 2645     | 57     | 5612    | …       | …    |
| …          | …        | …      | …       | …       | …    |

Because most screens are in landscape mode and because we read from left to right, there are times when it makes sense to pivot the table as follows:
| metric   | 2016-02-28 | 2016-02-29 | 2016-03-01 | … |
|----------|------------|------------|------------|---|
| Visitors | 1423       | 1534       | 2645       | … |
| Orders   | 19         | 38         | 57         | … |
| Revenue  | 900        | 2037       | 5612       | … |
| Metric4  | …          | …          | …          | … |
| Metric5  | …          | …          | …          | … |
| etc.     | …          | …          | …          | … |

This may not be “tidy data” as defined by Hadley Wickham in his excellent paper, but pivoting this way results in easier navigation/scrolling when you have more metrics than dates.
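MySQL has no PIVOT operator, so one way to produce this layout is to unpivot the metric columns with UNION ALL and then spread the dates back out with conditional aggregation. A minimal sketch, assuming a hypothetical table daily_stats(stat_date, visitors, orders, revenue):

-- Unpivot the metrics into rows, then pivot the dates into columns
SELECT
    metric,
    MAX(CASE WHEN stat_date = '2016-02-28' THEN value END) AS `2016-02-28`,
    MAX(CASE WHEN stat_date = '2016-02-29' THEN value END) AS `2016-02-29`,
    MAX(CASE WHEN stat_date = '2016-03-01' THEN value END) AS `2016-03-01`
FROM (
    SELECT stat_date, 'Visitors' AS metric, visitors AS value FROM daily_stats
    UNION ALL SELECT stat_date, 'Orders', orders FROM daily_stats
    UNION ALL SELECT stat_date, 'Revenue', revenue FROM daily_stats
) AS unpivoted
GROUP BY metric;

Since SQL requires a fixed column list, each date column has to be hard-coded, or the statement generated dynamically (e.g. with a prepared statement).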
-
Deriving the Normal Equation
Consider a linear model

$$\hat{y} = X\theta$$

where

- $X$ is an $m \times n$ matrix of real numbers, with $m$ as the number of samples (or rows), and $n$ as the number of features (or columns),
- $\theta$ is an $n \times 1$ matrix (also called a vector) of coefficients $\theta_j$, and
- $y$ is an $m \times 1$ matrix of target variables $y^{(i)}$, one per $i$th sample.
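The derivation then proceeds by minimizing the squared error between $X\theta$ and $y$; a compressed sketch of where it ends up:

$$J(\theta) = (X\theta - y)^T (X\theta - y)$$

Setting the gradient with respect to $\theta$ to zero,

$$\nabla_\theta J = 2 X^T X \theta - 2 X^T y = 0 \quad\Longrightarrow\quad \theta = (X^T X)^{-1} X^T y$$

which is the normal equation.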
-
A Beginner's Guide on Using Data to Assess Business Performance
Running an online business that’s growing slower than projected is never an ideal scenario. What can tremendously help diagnose the problem is having data and knowing how to gain insights from it. It is only through the collection and analysis of data that you can free yourself from guesswork, start validating assumptions, and gain insight into how you should be operating your business.