-
Transforming Python Lists into Spark Dataframes
Data represented as dataframes is generally much easier to transform, filter, or write to a target source. In Spark, data loaded or queried from a source automatically arrives as a dataframe.
Here’s an example of loading, querying, and writing data using PySpark and SQL:
import pyspark

# Define your SparkContext and SparkSession
sc = pyspark.context.SparkContext(master='host', appName='Sample App')
session = pyspark.sql.session.SparkSession(sc)

"""
Load your data using:
- session.read.json('some/path/or/url')
- session.read.parquet('some/path/or/url')
- session.read.csv('some/path/or/url')
- session.read.text('some/path/or/url'), etc.
"""
data = session.read.json('some/path/or/url')

# Register the DataFrame as a temp view so it can be queried with SQL
data.createOrReplaceTempView("table")

# Apply some SQL query to the data, which results in a DataFrame
df = session.sql("""
    select col1, col2, sum(col3)
    from table
    where col4 = 'some_val'
    group by col1, col2
""")

# Write the query results to a target in your desired format (say, JSON)
df.write.json('target/path/')
The example above works conveniently if you can easily load your data as a dataframe using PySpark’s built-in functions. But sometimes you’re in a situation where your processed data ends up as a list of Python dictionaries, say when you weren’t required to use session.read and/or session.sql. How can you load that data as a Spark DataFrame in order to take advantage of its capabilities?
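One convenient way, sketched below, is to convert each dictionary into a pyspark.sql.Row and pass the list to the session’s createDataFrame method. The records, field names, and values here are hypothetical:

from pyspark.sql import Row

# Hypothetical processed data: a list of Python dictionaries
records = [
    {'name': 'alice', 'score': 10},
    {'name': 'bob', 'score': 20},
]

# Convert each dict to a Row so Spark can infer the schema from the keys,
# then build a DataFrame from the resulting list
rows = [Row(**record) for record in records]
df = session.createDataFrame(rows)

df.show()

This reuses the session object defined earlier; from here, df supports the same SQL queries and writers shown above.

-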
How the loss and cost functions got their forms in logistic regression
In logistic regression, the loss function $\mathcal{L}$ and the cost function $J$ take the forms

$$\mathcal{L}(\hat{y}, y) = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right]$$

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} \mathcal{L}\big(\hat{y}^{(i)}, y^{(i)}\big)$$

where $\hat{y} = \sigma(w^T x + b)$ is the sigmoid of the linear superposition of the features $x$ represented in matrix form, with $w$ as the vector of the feature weights, and $b$ as the intercept term. The sigmoid of some function $z$ is defined as

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

To understand why $\mathcal{L}$ and $J$ take such forms, first note that $\hat{y}$ is the probability of the binary classification variable $y$ being equal to a positive example ($y = 1$), i.e. $\hat{y} = P(y = 1 \mid x)$.
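From there, a standard maximum-likelihood argument (sketched here) recovers the loss: both cases of the Bernoulli probability can be written as a single expression,

$$P(y \mid x) = \hat{y}^{\,y} \, (1 - \hat{y})^{1 - y}$$

and taking its negative logarithm gives

$$-\log P(y \mid x) = -\left[\, y \log \hat{y} + (1 - y) \log (1 - \hat{y}) \,\right] = \mathcal{L}(\hat{y}, y)$$

so minimizing $J$ over the training set is equivalent to maximizing the likelihood of the observed labels.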
-
Using Breaks To Get More Deep Work Done
It’s a great thing to constantly have goals that require prolonged periods of deep concentration, and it’s something I always look forward to. Deep work gives us a sense of great accomplishment when we’re finished, as well as expanded expertise in the domains we’ve tackled along the way.
But of course, many of us don’t buy into the “delayed gratification” thing, perhaps due to biological and historical reasons. (Sadly, many of us never liked school, and would be happy never to go back.) Nature, by default, always follows the path of lowest energy and least resistance.
-
How Are Bubbles Formed?
I’ve always thought of a bubble as a compounded result of residual greed when the optimism of the many is perpetually validated.
This reminds me of what Warren Buffett tells us to do when we see compelling evidence of an impending bubble:
“Be fearful when others are greedy, and greedy when others are fearful.” – Warren Buffett
But only a few have the discipline to do this, because it’s easy to forget the principle when you’re constantly seduced by social proof of everyone you know getting rich, everywhere you look.
-
On Change vs. Being Who You Are
One of the things I’ve come to understand better lately is the concept of adapting how I behave to the context I’m in. I’m aware that I’ve been doing this, but I’ve also constantly questioned it.
I’ve been wondering if I might be betraying myself, or whether acting against my familiar behavioural inclinations is psychologically unhealthy in the long term.
-
Machine Learning: Entropy and Classification
A Simple Classification Example
Let’s say we have a dataset with categorical features $x_1$, $x_2$, $x_3$, and a binary target variable $y$:

| id | Feature $x_1$ | Feature $x_2$ | Feature $x_3$ | Target $y$ |
|----|---------------|---------------|---------------|------------|
| 1  | a             | c             | e             | …          |
| 2  | b             | d             | e             | …          |
| 3  | b             | d             | f             | …          |
| 4  | a             | d             | e             | …          |
| 5  | a             | c             | f             | …          |
| 6  | b             | d             | f             | …          |

The goal is to find the feature that best predicts the value of $y$.
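One way to measure this is with entropy and information gain, as a decision tree would. Below is a minimal Python sketch; the target labels are assumed (hypothetical), since the actual values of $y$ are not reproduced above:

import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(labels).values()
    )

def information_gain(rows, labels, feature_index):
    """Entropy reduction from splitting the dataset on one feature."""
    total = len(labels)
    base = entropy(labels)
    remainder = 0.0
    for value in {row[feature_index] for row in rows}:
        subset = [lab for row, lab in zip(rows, labels)
                  if row[feature_index] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return base - remainder

# Feature values from the table above; the target labels are made up
# for illustration, since the real values are not shown.
rows = [('a', 'c', 'e'), ('b', 'd', 'e'), ('b', 'd', 'f'),
        ('a', 'd', 'e'), ('a', 'c', 'f'), ('b', 'd', 'f')]
labels = [1, 0, 0, 1, 1, 0]

for i in range(3):
    print(f"x{i + 1}: information gain = {information_gain(rows, labels, i):.3f}")

With these assumed labels, $x_1$ separates the classes perfectly (a gain of 1 bit), which is exactly the kind of feature this procedure selects.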
-
MySQL: Columns as Ordered Week Dates
Let’s say you have data containing some metrics and their values across an ordered set of dates in a week. Since most screens are wider than they are tall, it’s sometimes better to present the data with one metric per row and the dates in columns, rather than the other way around.
The usual way we show tables is like this:
| date       | Visitors | Orders | Revenue | Metric4 | etc. |
|------------|----------|--------|---------|---------|------|
| 2016-02-28 | 1423     | 19     | 900     | …       | …    |
| 2016-02-29 | 1534     | 38     | 2037    | …       | …    |
| 2016-03-01 | 2645     | 57     | 5612    | …       | …    |
| …          | …        | …      | …       | …       | …    |

Because most screens are in landscape mode and because we read from left to right, there are times when it makes sense to pivot the table as follows:
| metric   | 2016-02-28 | 2016-02-29 | 2016-03-01 | … |
|----------|------------|------------|------------|---|
| Visitors | 1423       | 1534       | 2645       | … |
| Orders   | 19         | 38         | 57         | … |
| Revenue  | 900        | 2037       | 5612       | … |
| Metric4  | …          | …          | …          | … |
| Metric5  | …          | …          | …          | … |
| etc.     | …          | …          | …          | … |

This may not be “tidy data” as defined by Hadley Wickham in his excellent paper, but pivoting this way results in easier navigation/scrolling when you have more metrics than dates.
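MySQL has no PIVOT operator, so one way to produce this layout is to unpivot the metric columns with UNION ALL and then spread the dates back out with conditional aggregation. A minimal sketch, assuming a hypothetical table daily_stats(stat_date, visitors, orders, revenue):

-- Unpivot the metrics into rows, then pivot the dates into columns
SELECT
    metric,
    MAX(CASE WHEN stat_date = '2016-02-28' THEN value END) AS `2016-02-28`,
    MAX(CASE WHEN stat_date = '2016-02-29' THEN value END) AS `2016-02-29`,
    MAX(CASE WHEN stat_date = '2016-03-01' THEN value END) AS `2016-03-01`
FROM (
    SELECT stat_date, 'Visitors' AS metric, visitors AS value FROM daily_stats
    UNION ALL SELECT stat_date, 'Orders', orders FROM daily_stats
    UNION ALL SELECT stat_date, 'Revenue', revenue FROM daily_stats
) AS unpivoted
GROUP BY metric;

Since SQL requires a fixed column list, each date column has to be hard-coded, or the statement generated dynamically (e.g. with a prepared statement).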
-
Deriving the Normal Equation
Consider a linear model

$$\hat{y} = X\theta$$

where

- $X$ is an $m \times n$ matrix of real numbers, with $m$ as the number of samples (or rows), and $n$ as the number of features (or columns),
- $\theta$ is an $n \times 1$ matrix (also called a vector) of coefficients $\theta_j$, and
- $y$ is an $m \times 1$ matrix of target variables $y^{(i)}$, one per $i$th sample.
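The derivation then proceeds by minimizing the squared error between $X\theta$ and $y$; a compressed sketch of where it ends up:

$$J(\theta) = (X\theta - y)^T (X\theta - y)$$

Setting the gradient with respect to $\theta$ to zero,

$$\nabla_\theta J = 2 X^T X \theta - 2 X^T y = 0 \quad\Longrightarrow\quad \theta = (X^T X)^{-1} X^T y$$

which is the normal equation.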
-
A Beginner's Guide on Using Data to Assess Business Performance
Running an online business that’s growing slower than projected is never an ideal scenario. What can tremendously help diagnose the problem is having data and knowing how to gain insights from it. It is only through the collection and analysis of data that you can free yourself from guesswork, start validating assumptions, and gain insight into how you should be operating your business.