In Pursuit of Hubris

TIL: Git Blame with Following

Sat, 13 Apr 2024 12:46:00 -0700

Developers usually run git blame through GUI tools, such as GitHub's blame view or GitLens blame in VS Code.

Although GUI tools are intuitive, the Git CLI has much more powerful tooling for getting closer to the real story behind your code.

There are many scenarios where the CLI is valuable. The first is ignoring whitespace changes.

For example, if you format your C++ codebase with clang-format or your JavaScript codebase with prettier, you haven’t actually changed what the code does, yet blame now lists you as the author of tons of lines.

The git blame -w option ignores this type of whitespace-only change.

The other great option is -C, which looks for code moved or copied between files in a commit.

For example, if you move a function from one file to another during a refactor, plain git blame simply shows you as the author in the new file, but the -C option follows that movement and shows the last person who actually changed those lines.

-C is extremely helpful when I need to find the original author of some lines after file renames or refactors, to learn more about the background and context behind the code.

According to the git blame documentation, you can pass -C up to three times to ask Git to try even harder:

-C[<num>]
           In addition to -M, detect lines moved or copied from other files that were modified in the same commit.
           This is useful when you reorganize your program and move code around across files.
           When this option is given twice, the command additionally looks for copies from other files in the commit that creates the file.
           When this option is given three times, the command additionally looks for copies from other files in any commit.

(It’s a bit of an odd design.)

Let’s take the access.rb file of the ActiveModel module in the Rails framework as an example:

git blame activemodel/lib/active_model/access.rb

Figure 1: Vanilla git blame

OK, it appears that Jonathan Hefner wrote all of this code. Let’s look at the same file with git blame -w -C -C -C activemodel/lib/active_model/access.rb

Figure 2: git blame -w -C -C -C

Now we can see that Git has followed this code from file to file over the course of multiple renames. It turns out that Jonathan Hefner is only the most recent person to rename the file, and Guillermo Iguaran is the original author.

If we want to know the history of this file, it’s much better to ask Guillermo than Jonathan, which is beyond what the GUI blame or plain git blame reveals.

TIL: Git Conditional Configs

Sun, 07 Apr 2024 12:38:00 -0700

Every Git user has probably been asked to configure Git the first time they used it:

git config --global user.name "Ramsay Leung"
git config --global user.email ramsayleung@gmail.com

The commands above simply add the user.name and user.email values to your ~/.gitconfig file:

> cat ~/.gitconfig
[user]
    name = Ramsay Leung
    email = ramsayleung@gmail.com
[core]
    quotepath = false
[init]
    defaultBranch = master

You can also pass the --local argument to write the config values to .git/config in whatever project you’re currently in.

If you contribute to both work and open source projects on the same laptop and need different Git config values for each (for example, a company email address for work projects and a personal email address for open source projects), what should you do?

You could set the work-specific config as the global config, then set the personal config with --local for each personal project separately. It works, but it is tedious and easy to mess up.

Fortunately, starting with Git 2.13, Git supports conditional configuration includes, which let you set up different configs for different repositories.

If you add the following config to your global config file:

[includeIf "gitdir:~/projects/oss/"]
    path = ~/.gitconfig-oss

[includeIf "gitdir:~/projects/work/"]
    path = ~/.gitconfig-work

Then Git will look in the ~/.gitconfig-oss file for values only if the repository you are currently working in is under ~/projects/oss/.

Caution: If you forget the “/” at the end of the gitdir path, e.g. “~/projects/oss”, the conditional config won’t apply to the repositories inside that directory!

Therefore, you can have a “work” directory with work-specific config and an “oss” directory with values for your open source projects, and so on.

Git also supports conditions other than gitdir. For example, you can use a branch name as an include condition with onbranch:

; include only if we are in a worktree where foo-branch is
; currently checked out
[includeIf "onbranch:foo-branch"]
    path = foo.inc

Check out the Git docs for more details.

Rewind your Github summary

Mon, 01 Jan 2024 16:16:00 -0800

1 Goodbye 2023

As I bid farewell to 2023, a year marked by numerous changes and personal evolution, I find myself recollecting the multitude of experiences that unfolded.

My 2023 journey was nothing short of fascinating and exciting, prompting me to revisit the year from various angles.

After seeing hordes of posts on social media generated by GitHub Contributions Chart, I thought I could also build an app that summarizes my GitHub contributions for each year, for friends to have fun with.

I spent my entire four-day New Year vacation building this app, named GitHub Summary.

This project led me through a series of first-time experiences: my first time trying the Tailwind CSS framework, my first time using and deploying a project on Vercel, my first time building a project with Next.js, and my first time developing a public project in React (yes, I’ve tried to learn React hundreds of times, but never got a chance to use it in a real project).

2 Happy 2024

While I hoped I could have completed this project by the close of 2023 to share summaries with friends, life’s timeline had other plans.

Now, as we step into 2024, I am thrilled to publish the GitHub Summary.

It’s never too late to showcase creative work, and this project is poised to generate insightful summaries not just for the past year but for the adventures that await in 2024.

Wishing everyone a Happy New Year! Feel free to explore GitHub Summary: https://github-summary.vercel.app/

How to share resource between CDK stacks

Wed, 28 Jun 2023 09:41:00 -0700

1 Introduction

1.1 IaC

Infrastructure as code (IaC) is the managing and provisioning of infrastructure through code instead of manual processes such as clicking buttons or adding and editing roles in the AWS console.

1.2 AWS CloudFormation

AWS CloudFormation is the original IaC tool for AWS, released in 2011. It uses template files to automate and manage the setup of AWS resources.

1.3 AWS CDK

AWS Cloud Development Kit (CDK) is a product provided by AWS that makes it easier for developers to manage their infrastructure with familiar programming languages like TypeScript, Python, Java, etc.

CDK stands on the shoulders of CloudFormation: it provides tools for developers by leveraging CloudFormation.

A stack is a collection of AWS resources that you can manage as a single unit, like a box.

For instance, this box could include all the resources required to run an application or Lambda service, such as S3 Buckets (storage), Roles (authorization), Lambda Function (computing), API Gateway (access point), Alarm, Monitoring, etc.

2 Problem

I am currently working on a project that requires two stacks: one stack (GlueStack) that defines a list of AWS Glue tables, and another stack (ServiceStack) that defines a Lambda service and its associated resources.

S3 bucket names have to be globally unique within a partition, which means unique across the whole AWS customer base.

You cannot create an S3 bucket with a name that is already in use by another AWS customer or by your own account.

So it’s safer to let CloudFormation generate a random bucket name whenever a developer needs to create an S3 bucket.

However, this raises a new problem: since the S3 bucket name is a randomly generated string, if GlueStack needs to read the bucket created by ServiceStack, how can I share the bucket name between the two stacks?

After all, the two stacks are isolated, separate collections of resources.

3 Solution

Fortunately, CDK offers a construct named CfnOutput to export a deployed resource, so that the consumer of the resource is able to import it.

Define the required resource in ServiceStack (producer), for instance, an S3 bucket:

import { Bucket } from 'aws-cdk-lib/aws-s3';

const s3Bucket = new Bucket(this, 'MyBucketId', {});

Export the resource by specifying the value and exportName:

import { CfnOutput } from 'aws-cdk-lib';

// export the generated bucket name to other stack
new CfnOutput(this, 'exportRequiredS3Bucket', {
    value: s3Bucket.bucketName,
    exportName: 'exportRequiredS3Bucket',
});

Import the required resource in GlueStack (consumer):

import { Fn } from 'aws-cdk-lib';

const requiredS3BucketName = Fn.importValue('exportRequiredS3Bucket');

If we take a closer look at the synthesized CFN template for ServiceStack, we find:

"Outputs": {
    "exportRequiredS3Bucket": {
        "Value": {
            "Ref": "MyBucketId737FC949"
        },
        "Export": {
            "Name": "exportRequiredS3Bucket"
        }
    }
}

The synthesized CFN template for GlueStack:

{
    "Fn::ImportValue": "exportRequiredS3Bucket"
}

This is how to share a value between two stacks.

4 Loose coupling solution

Updated on 2023-12-02

People learn from mistakes.

After applying this practice in my project, I recently learned that sharing resources across stacks this way is not good practice.

By using export/import, I tightly couple my stacks, with a commitment that I can never update the exported resource unless I remove that coupling first.

It becomes a disaster¹ whenever I need to update or delete the S3 bucket: CloudFormation raises an error, complaining with something like “ServiceStack cannot be deleted as it’s in use by GlueStack”.

A better practice I learned is to couple ServiceStack and GlueStack loosely by sharing a constant:

Define a constant variable somewhere:

export const Constants = {
    // S3 bucket names must be lowercase and globally unique
    MyBucketName: 'test-bucket'
}

Refine the definition of s3Bucket:

import { Bucket } from 'aws-cdk-lib/aws-s3';

const s3Bucket = new Bucket(this, 'MyBucketId', {
    bucketName: Constants.MyBucketName,
});

Refer to the bucket in GlueStack by MyBucketName instead of the CDK exported reference:
```
const requiredS3BucketName = Constants.MyBucketName;
```

Therefore, these two stacks are not directly coupled, but they reference the same constant.

CloudFormation then won’t prevent you from updating the S3 bucket, as there is no direct relation between the two stacks anymore.

This is the benefit of loose coupling.

5 Reference

https://stackoverflow.com/questions/63350346/delete-resource-with-references ↩︎

Topological Sort

Sun, 22 May 2022 10:34:00 +0800

1 Definition

In computer science, a topological sort or topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering.

It sounds pretty academic, but I am sure you are using topological sort unconsciously every single day.

2 Application

Many real world situations can be modeled as a graph with directed edges where some events must occur before others. Then a topological sort gives an order in which to perform these events, for instance:

2.1 College class prerequisites

You must take course b first if you want to take course a. For example, at your alma mater, a student must complete PHYS:1511 (College Physics) or PHYS:1611 (Introductory Physics I) before taking College Physics II.

The courses can be represented by vertices, and there is an edge from College Physics to College Physics II, since PHYS:1511 must be finished before a student can enroll in College Physics II.

2.2 Job scheduling

Scheduling a sequence of jobs or tasks based on their dependencies. The jobs are represented by vertices, and there is an edge from x to y if job x must be completed before job y can be started.

In the context of a CI/CD pipeline, the relationships between jobs can be represented by a directed graph (specifically, a directed acyclic graph). For example, in a CI pipeline, the build job should finish before the test job and the lint job start.

2.3 Program build dependencies

You want to figure out in which order you should compile all the program’s dependencies so that you will never try and compile a dependency for which you haven’t first built all of its dependencies.

A typical example is GNU Make: you specify your targets in a makefile, and Make parses the makefile and figures out which target should be built first. Suppose you have a makefile like this:

# Makefile for analysis report

output/figure_1.png: data/input_file_1.csv scripts/generate_histogram.py
	python scripts/generate_histogram.py -i data/input_file_1.csv -o output/figure_1.png

output/figure_2.png: data/input_file_2.csv scripts/generate_histogram.py
	python scripts/generate_histogram.py -i data/input_file_2.csv -o output/figure_2.png

output/report.pdf: report/report.tex output/figure_1.png output/figure_2.png
	cd report/ && pdflatex report.tex && mv report.pdf ../output/report.pdf

Make builds a DAG internally and uses a topological sort to figure out which target should be executed first.

3 Directed Acyclic Graph

Back to the definition, we say that a topological ordering of a directed graph is a linear ordering of its vertices, but not all directed graphs have a topological ordering.

A topological ordering is possible if and only if the graph has no directed cycles, that is, if it’s a directed acyclic graph(DAG).

Let us see some examples:

The definition says that only a directed acyclic graph has a topological ordering, but why? What happens if we try to find a topological ordering of a directed graph that contains a cycle? Let’s take figure 3 as an example.

The problem has no solution for such a graph: every vertex on the cycle would have to come before itself. This is why directed cycles are forbidden.

4 Kahn’s Algorithm

There are several algorithms for topological sorting. Kahn’s algorithm is one of them, and it is based on breadth-first search.

The intuition behind Kahn’s algorithm is pretty straightforward:

Repeatedly remove nodes without any dependencies from the graph and add them to the topological ordering.

As nodes without dependencies are removed from the graph, the nodes that depended on the removed nodes may become free of dependencies themselves.

We keep removing nodes without dependencies from the graph until all nodes are processed, or a cycle is detected.

The number of dependencies of a node is its in-degree.

Let’s take a quick example of how to find out a topological ordering of a given graph with Kahn’s algorithm.

Now we should understand how Kahn’s algorithm works. Let’s have a look at a C++ implementation of Kahn’s algorithm:

#include <deque>
#include <vector>
// Kahn's algorithm
// `adj` is a directed acyclic graph represented as an adjacency list.
std::vector<int>
findTopologicalOrder(const std::vector<std::vector<int>> &adj) {
  int n = adj.size();

  std::vector<int> in_degree(n, 0);

  for (int i = 0; i < n; i++) {
    for (const auto &to_vertex : adj[i]) {
      in_degree[to_vertex]++;
    }
  }

  // queue contains nodes with no incoming edges
  std::deque<int> queue;
  for (int i = 0; i < n; i++) {
    if (in_degree[i] == 0) {
      queue.push_back(i);
    }
  }

  std::vector<int> order(n, 0);

  int index = 0;
  while (queue.size() > 0) {
    int cur = queue.front();
    queue.pop_front();
    order[index++] = cur;

    for (const auto &next : adj[cur]) {
      if (--in_degree[next] == 0) {
        queue.push_back(next);
      }
    }
  }

  // there is no cycle
  if (n == index) {
    return order;
  } else {
    // return an empty list if there is a cycle
    return std::vector<int>{};
  }
}
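
As a usage sketch, here is the same idea applied to the CI example from section 2.2, written in Rust. The vertex numbering and the deploy job are assumptions made up for illustration; only build, test and lint come from the text above.

```rust
use std::collections::VecDeque;

/// Kahn's algorithm over an adjacency list; returns `None` if a cycle exists.
fn topological_order(adj: &[Vec<usize>]) -> Option<Vec<usize>> {
    let n = adj.len();
    let mut in_degree = vec![0usize; n];
    for edges in adj {
        for &to in edges {
            in_degree[to] += 1;
        }
    }

    // Start with every vertex that has no incoming edge.
    let mut queue: VecDeque<usize> = (0..n).filter(|&v| in_degree[v] == 0).collect();
    let mut order = Vec::with_capacity(n);

    while let Some(cur) = queue.pop_front() {
        order.push(cur);
        for &next in &adj[cur] {
            in_degree[next] -= 1;
            if in_degree[next] == 0 {
                queue.push_back(next);
            }
        }
    }

    // Vertices left unprocessed mean the graph contains a cycle.
    if order.len() == n { Some(order) } else { None }
}

/// Hypothetical CI pipeline: 0 = build, 1 = test, 2 = lint, 3 = deploy.
/// build -> test, build -> lint, test -> deploy, lint -> deploy.
fn ci_pipeline() -> Vec<Vec<usize>> {
    vec![vec![1, 2], vec![3], vec![3], vec![]]
}
```

Calling topological_order(&ci_pipeline()) puts build first and deploy last, while a graph with a cycle such as 0 → 1 → 0 gives None.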

5 Bonus

When a pregnant woman takes calcium pills, she must make sure also that her diet is rich in vitamin D, since this vitamin makes the absorption of calcium possible.

After reading this demonstration of topological ordering, you (and I) too should take a certain vitamin, metaphorically speaking, to help absorb it. The vitamin D I picked for you (and myself) is two LeetCode problems, which involve the most typical use case of topological ordering, college class prerequisites:

About Me

Mon, 21 Feb 2022 00:00:00 +0000

About me

I’m Ramsay, a software engineer who makes a living by pressing keys, an amateur cook, an Emacs deadhead and a Linux enthusiast.

The slogan of this site is In pursuit of Simplicity, because I prefer simple over complex.

Projects

I’ve contributed multiple projects to the open source community. Right now I mainly focus on RSpotify, since my energy is limited.

RSpotify: A Spotify Web API wrapper implemented in Rust.

How To Design A Reliable Distributed Timer

Thu, 05 Aug 2021 09:19:36 +0000

1 Preface

I have been maintaining a legacy distributed timer for my employer for months. Some important payment businesses rely on it: it handles 1 billion tasks every day, with at most 20k tasks added per second.

Even though it’s old and full of black magic, it also contains insightful, well-designed code. This article summarizes what I extracted from that old, running timer. It won’t include any of the code that is actually running (perhaps pseudocode, and a lot of figures; as the adage says, a picture is worth a thousand words).

If you are curious about the reason, I personally suggest watching the TV series Silicon Valley; Richard has given us a good example and answer.

2 Design

2.1 Algorithm

There are several data structures and algorithms for implementing a timer, such as the red-black tree, the min-heap and the timing wheel. The most efficient and widely used one is the timing wheel, and it’s the one we focus on.

A timing wheel based timer can be modelled as two internal operations: per-tick bookkeeping and expiry processing.

Per-tick bookkeeping: happens on every ‘tick’ of the timer clock. If the unit of granularity for setting timers is T units of time (e.g. 1 second), then per-tick bookkeeping will happen every T units of time. It checks whether any outstanding timers have expired, and if so it removes them and invokes expiry processing.
Expiry processing: is responsible for invoking the user-supplied callback (or other user requested action, depending on your model).

2.1.1 Simple Timing Wheels

The simple timing wheel keeps one large timing wheel. The timing wheel below has 8 slots, and each slot holds the tasks that expire at that tick. Suppose every slot represents one second (one tick per second) and the current slot is slot 1: if we want to add a task that should be triggered 2s later, the task is inserted into slot 3.

per-tick bookkeeping: O(1)

What happens if we want to add a task that should be launched 20s later? We have no way to do so, since there are only 8 slots. So if timer tasks can have long periods, we have to maintain a large timing wheel with tons of slots, and the memory it requires grows in proportion to the longest supported period.
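
A minimal sketch of the simple timing wheel in Rust, assuming tasks are identified by plain strings and one tick moves the wheel by one slot. SimpleWheel and its methods are names invented for this illustration, not taken from the timer described here.

```rust
/// A simple timing wheel: one slot per tick, so the longest supported
/// delay is `slots.len() - 1` ticks.
struct SimpleWheel {
    slots: Vec<Vec<String>>, // each slot holds the tasks expiring at that tick
    current: usize,          // index of the slot the clock currently points at
}

impl SimpleWheel {
    fn new(size: usize) -> Self {
        assert!(size > 0, "a wheel needs at least one slot");
        SimpleWheel { slots: vec![Vec::new(); size], current: 0 }
    }

    /// Schedule `task` to fire `delay` ticks from now; fails if the delay
    /// does not fit in the wheel.
    fn add(&mut self, delay: usize, task: &str) -> Result<(), String> {
        if delay == 0 || delay >= self.slots.len() {
            return Err(format!("delay {} does not fit in {} slots", delay, self.slots.len()));
        }
        let slot = (self.current + delay) % self.slots.len();
        self.slots[slot].push(task.to_string());
        Ok(())
    }

    /// Per-tick bookkeeping: advance one slot and hand back everything in it.
    fn tick(&mut self) -> Vec<String> {
        self.current = (self.current + 1) % self.slots.len();
        std::mem::take(&mut self.slots[self.current])
    }
}
```

With 8 slots, add(2, ...) succeeds and the task comes back on the second tick, while add(20, ...) is rejected, which is the limitation described above.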

2.1.2 Hashed Timing Wheel

The Hashed Timing Wheel is an improvement on the simple timing wheel. As mentioned before, a simple wheel consumes a lot of resources if the timer period is comparatively large. Instead of using one slot per time unit, we can use a form of hashing. Construct a circular buffer with a fixed number of slots (such as 8 slots). If the current slot is 0 and we want to store a task due 3s later, we insert it into slot 3; if we want to store a task due 9s later, we insert it into slot 1 (9 % 8 = 1).

per-tick bookkeeping: O(1) - O(N)

It’s a trade-off strategy: we trade time for space.
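
A sketch of the hashed variant under the same assumptions (string tasks, invented names). One detail the description leaves implicit is how to tell apart tasks that land in the same slot but belong to different rotations; this sketch assumes each task stores its absolute expiry tick (keeping a remaining-rounds counter is another common choice).

```rust
/// A hashed timing wheel: a fixed ring of slots, where each task also
/// remembers its absolute expiry tick so later rotations can be told apart.
struct HashedWheel {
    slots: Vec<Vec<(u64, String)>>, // (expiry tick, task name)
    now: u64,                       // current absolute tick
}

impl HashedWheel {
    fn new(size: usize) -> Self {
        assert!(size > 0, "a wheel needs at least one slot");
        HashedWheel { slots: vec![Vec::new(); size], now: 0 }
    }

    /// Any delay is accepted; the slot is the expiry tick modulo the ring size.
    fn add(&mut self, delay: u64, task: &str) {
        let expiry = self.now + delay;
        let slot = (expiry % self.slots.len() as u64) as usize;
        self.slots[slot].push((expiry, task.to_string()));
    }

    /// Per-tick bookkeeping: scan the current slot and keep the tasks that
    /// belong to a later rotation. This scan is why the cost ranges from
    /// O(1) to O(N).
    fn tick(&mut self) -> Vec<String> {
        self.now += 1;
        let slot = (self.now % self.slots.len() as u64) as usize;
        let now = self.now;
        let (due, later): (Vec<_>, Vec<_>) = std::mem::take(&mut self.slots[slot])
            .into_iter()
            .partition(|(expiry, _)| *expiry <= now);
        self.slots[slot] = later;
        due.into_iter().map(|(_, task)| task).collect()
    }
}
```

With 8 slots, a task due in 1 tick and a task due in 9 ticks share slot 1, but only the first one is returned on the first tick; the second one waits for the next rotation.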

2.1.3 Hierarchical Timing Wheels

Both the simple timing wheel and the hashed timing wheel come with a drawback, in either space efficiency or time efficiency. Back in 1987, after studying a number of different approaches for the efficient management of timers, Varghese and Lauck published a paper that introduced Hierarchical Timing Wheels.

To make a long story short, I won’t dive deep into hierarchical timing wheels. You can easily understand them through a real-life reference: the old water meter.

When the first-level wheel (seconds wheel) completes one full rotation, it makes the second-level wheel (minutes wheel) tick one slot, and the same goes for the third level (hours wheel). Therefore, we can represent a day (60*60*24 seconds) with 60+60+24 slots. If we want to represent a month, we only need to add a fourth-level wheel (month wheel) with 30 slots.

per-tick bookkeeping: O(1)
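
To make the water meter analogy concrete, here is a small sketch that splits a delay into its position on each level of a three-level day wheel. It assumes the delay is measured from the zero position of every wheel; wheel_slots and slot_counts are illustrative names.

```rust
/// Split a delay in seconds (less than one day) into the slot it occupies
/// on each level: (hour slot, minute slot, second slot).
fn wheel_slots(delay_secs: u32) -> Option<(u32, u32, u32)> {
    if delay_secs >= 24 * 60 * 60 {
        return None; // does not fit in a three-level day wheel
    }
    Some((delay_secs / 3600, (delay_secs % 3600) / 60, delay_secs % 60))
}

/// Slots needed by the three-level wheel versus one slot per second.
fn slot_counts() -> (u32, u32) {
    (60 + 60 + 24, 60 * 60 * 24)
}
```

A delay of 3725 seconds maps to hour slot 1, minute slot 2 and second slot 5, and the three wheels need 144 slots in total instead of 86400.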

2.2 Per-tick bookkeeping

After introducing the timing wheel algorithms, let’s go back to the topic of designing a reliable distributed timer. It’s essential to decide how to store timer tasks. Taking implementation complexity and the time-space trade-off into account, we choose the Hashed Timing Wheel algorithm.

There are several internal components developed by my employer. One of them is named TableKV, a high-availability (99.999% ~ 99.9999%) NoSQL service. TableKV supports at most 100m buckets (the terminology is table), and every table comes with full ACID transaction support. You could simply replace TableKV with Redis, as it provides similar bucket functionality.

2.2.1 Insert task into slot

We are going to implement the Hashed Timing Wheel algorithm with TableKV. Suppose there are 100m buckets and the current time is 2021-08-05 23:17:33 +08 (the UNIX timestamp is 1628176653). A timer task that is going to be triggered 10s later has start_time = 1628176653 + 10, and one that is going to be triggered 100000010s later has start_time = 1628176653 + 10 + 100000000. Both tasks will be stored in bucket start_time % 100000000 = 28176663.

2.2.2 Pull task out from slot

As the clock ticks to 2021-08-05 23:17:43 +08 (1628176663), we need to pull tasks out of the slot by calculating the bucket number: current_timestamp (1628176663) % 100000000 = 28176663. After locating the bucket, we find all tasks in bucket 28176663 with start_time <= current_timestamp, which gives us all the tasks that are expected to expire. A task from a later rotation of the wheel has a larger start_time, so it stays in the bucket.
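
The bucket arithmetic can be sketched as follows. The modulus is the one used in the text; the function names are made up, and the comparison assumes that a task whose start_time equals the current timestamp counts as due.

```rust
/// Modulus used by the bucket calculation in the text.
const BUCKETS: u64 = 100_000_000;

/// Bucket that stores a task with the given start time (Unix seconds).
fn bucket_of(start_time: u64) -> u64 {
    start_time % BUCKETS
}

/// Start times in the current bucket that are actually due. Tasks from a
/// later rotation share the bucket but have a larger `start_time`.
fn due_tasks(bucket_contents: &[u64], current_timestamp: u64) -> Vec<u64> {
    bucket_contents
        .iter()
        .copied()
        .filter(|&start_time| start_time <= current_timestamp)
        .collect()
}
```

Both example tasks (10s later and 100000010s later) land in bucket 28176663, but only the first one passes the filter when the clock reaches 1628176663.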

2.3 Global clock and lock

As mentioned before, when the clock ticks to current_time, we fetch all expired tasks. When our service runs on a distributed system, we usually have multiple hosts (physical machines or Docker containers), each with its own current_time. There is no guarantee that the clocks of all hosts are synchronized against the same network time server, so the clocks might be subtly different. Which current_time is correct?

In order to get the correct time, it’s necessary to maintain a monotonic global clock (of course, it’s not the only way to go; there are several ways to handle time and order). Since all we care about is the Unix timestamp, we can maintain a global system clock represented by a Unix timestamp. All machines request the global clock every second to get the current time, and then fetch the expired tasks.

Well, are we done? Not yet. A new issue breaks into our design: if all machines can fetch the expired tasks, these tasks will be processed more than once, which will cause serious problems. We also need a mutex lock to guarantee that only one machine can fetch the expired tasks. You can implement both the global clock and the mutex lock with one neat strategy, an optimistic lock:

1. All machines fetch the global timestamp (timestamp A) with its version.
2. All machines try to increase the timestamp (to timestamp B) and update the version (optimistic locking). Only one machine will succeed because of optimistic locking.
3. The machine that acquired the mutex is authorized to fetch expired tasks with timestamp A. The other machines, which failed to acquire the mutex, are suspended and wait for 1 second.
4. Loop back to step 1 with timestamp B.

We can encapsulate the role that keeps acquiring the lock and fetching expired data as an individual component named scheduler.
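
The optimistic-lock steps can be sketched with a versioned row and a compare-and-set update. ClockStore is a hypothetical in-memory stand-in for the shared store; in a real deployment the version check and the write must be a single atomic operation provided by the database.

```rust
/// A versioned global clock row, as it might be stored in a shared table.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ClockRow {
    timestamp: u64,
    version: u64,
}

/// Hypothetical in-memory stand-in for the shared store.
struct ClockStore {
    row: ClockRow,
}

impl ClockStore {
    /// Step 1: read the timestamp together with its version.
    fn read(&self) -> ClockRow {
        self.row
    }

    /// Step 2: compare-and-set. The write succeeds only if nobody else has
    /// bumped the version since `seen` was read.
    fn try_advance(&mut self, seen: ClockRow) -> bool {
        if self.row.version != seen.version {
            return false; // another scheduler won this round
        }
        self.row = ClockRow {
            timestamp: seen.timestamp + 1,
            version: seen.version + 1,
        };
        true
    }
}
```

Two schedulers that read the same version both call try_advance; the first call succeeds and the second returns false, so only one of them goes on to fetch tasks for that timestamp.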

2.4 Expiry processing

Expiry processing is responsible for invoking the user-supplied callback or other user-requested action. In distributed computing, it’s common to execute a procedure by RPC (Remote Procedure Call). In our case, an RPC request is sent from the timer service to the callback service when a timer task expires. Thus, the caller (user) needs to explicitly tell the timer which service to call, and with what parameters, when the timer task is triggered.

We can pack and serialize this meta information and parameter data into binary data, and send it to the timer. When pulling data out of a slot, the timer can reconstruct the Request/Response/Client types and fill them with the user-defined data. The next step is a piece of cake: just execute the call.

There may be many expired tasks that need to be triggered. In order to handle as many tasks as possible, you can create a thread pool, process pool or coroutine pool to execute RPCs concurrently.

2.5 Decoupling

Suppose the callback service performs tons of operations and takes a hundred milliseconds. Even though you have created a thread/process/coroutine pool to handle the timer tasks, the pool will inevitably stall, resulting in decreased throughput.

For this heavyweight processing case, a message queue is a great answer. Message queues can significantly simplify the coding of decoupled services, while improving performance, reliability and scalability. It’s common to combine message queues with the Pub/Sub messaging pattern: the timer publishes task data as messages, and the timer also subscribes to the same topic, using the message queue as a buffer. Then, in the subscriber, the RPC client sends the request to the callback service.

After introducing the message queue, we can outline the state machine of a timer task:

Thanks to the message queue, we are able to buffer, retry or batch work, and to smooth spiky workloads.

2.6 High availability guarantee

2.6.1 Missed expiry tasks

A task expiry may be missed because the scheduler process was shut down or crashed, or because of other unknown problems. One important job is to locate these missed tasks and re-execute them. Since we use the global `current_timestamp` to fetch expired data, we can have another scheduler that uses `delay_10min_timestamp` to fetch missed data.

In order to find the needle in the haystack, we need to set a range (from delay_10min to the current time), and then run a batch find across buckets. After finding the missed tasks, the timer publishes them as messages to the message queue. Other open source timer projects, like Quartz, also provide a way to handle missed (misfired) tasks: Misfire instructions.

If your NoSQL component doesn’t support finding across buckets, you can also query every bucket in the range one by one.
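
A sketch of the one-by-one variant: compute the bucket number for every second in the range and query each of them. The modulus matches the one used for inserting tasks; the function names and the saturating subtraction are choices made for this illustration.

```rust
/// Modulus used by the bucket calculation in the text.
const BUCKETS: u64 = 100_000_000;

/// Buckets covering every second in `[from, to]` (Unix seconds, inclusive),
/// in time order. The modulo makes the range wrap around the ring.
fn buckets_in_range(from: u64, to: u64) -> Vec<u64> {
    (from..=to).map(|ts| ts % BUCKETS).collect()
}

/// Buckets the recovery scheduler would re-check: the last `delay_secs`
/// seconds up to and including `current_timestamp`.
fn missed_task_buckets(current_timestamp: u64, delay_secs: u64) -> Vec<u64> {
    buckets_in_range(current_timestamp.saturating_sub(delay_secs), current_timestamp)
}
```

For a 10-minute delay this is 601 buckets per scan. Each bucket still has to be filtered by start_time, because tasks from later rotations share the same bucket numbers.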

2.6.2 Callback service error

Distributed systems are shared-nothing systems: they communicate via message passing over a network (asynchronously or synchronously), but the network is unreliable. When invoking the user-supplied callback, the RPC request might fail if the network is cut off for a while or the callback service is temporarily down.

Retries are a technique that helps us deal with transient errors, i.e. errors that are temporary and are likely to disappear soon. Retries help us achieve resiliency by allowing the system to send a request repeatedly until it gets an explicit response (success or failure). By leveraging the message queue, you get the ability to retry for free. Meanwhile, the timer can also handle user-requested retries, where the callback service answers: it’s not the proper time to execute, retry later.
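
When retries are not delegated to the message queue, the idea can be sketched as a small helper. The helper name and the fixed attempt limit are assumptions; a production version would usually also wait between attempts.

```rust
/// Call `request` until it succeeds or `max_attempts` is reached.
/// The request is always tried at least once; the closure receives the
/// 1-based attempt number. Returns the last error if every attempt fails.
fn retry<T, E>(
    max_attempts: u32,
    mut request: impl FnMut(u32) -> Result<T, E>,
) -> Result<T, E> {
    let mut attempt = 1;
    loop {
        match request(attempt) {
            Ok(value) => return Ok(value),
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => attempt += 1, // transient error: try again
        }
    }
}
```

A callback that fails twice and then succeeds is called three times in total, and a callback that keeps failing gives up after max_attempts calls.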

3 Conclusion

After a long way, we are finally here. The final full architecture would look like this:

The whole process:

1. Add a timer task, with the specified meta info and task info.
2. Insert the task into a bucket using the hashed timing wheel algorithm (with task_state set to pending).
3. The fetch_current scheduler tries to acquire the lock and get the global current time.
4. The scheduler that acquired the lock fetches the expired tasks.
5. Return the expected data.
6 & 7. Publish the task data as messages to the MQ with a thread pool, and then set task_state to delivered.
8. The message subscriber pulls messages from the MQ.
9. Send the RPC request to the callback service (set task_state to success or fail).
10. Retry (if necessary).
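
The task_state values mentioned in the process description suggest a small state machine. This is a sketch under assumptions: the event names are invented, and the retry edge (a failed task is published again and becomes delivered) is an assumption, since the steps do not spell it out.

```rust
/// Task states named in the process description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum TaskState {
    Pending,
    Delivered,
    Success,
    Fail,
}

/// Events that move a task between states (names are illustrative).
#[derive(Debug, Clone, Copy)]
enum Event {
    PublishedToQueue,
    CallbackSucceeded,
    CallbackFailed,
    Retry,
}

/// Returns the next state, or `None` when the event is not valid in `state`.
fn transition(state: TaskState, event: Event) -> Option<TaskState> {
    match (state, event) {
        (TaskState::Pending, Event::PublishedToQueue) => Some(TaskState::Delivered),
        (TaskState::Delivered, Event::CallbackSucceeded) => Some(TaskState::Success),
        (TaskState::Delivered, Event::CallbackFailed) => Some(TaskState::Fail),
        // Assumption: a failed task is re-published, so it is delivered again.
        (TaskState::Fail, Event::Retry) => Some(TaskState::Delivered),
        _ => None,
    }
}
```

Rejecting invalid transitions (for example, retrying a task that already succeeded) is one simple way to keep a task from being executed twice.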

I hope you have fun and profit.

Let's make everything iterable

Thu, 29 Apr 2021 11:48:00 +0800

Iterate through pagination in the Rest API

1 Preface

About 4 months ago, icewind1991 created an exciting PR that adds Stream/Iterator based versions of methods with paginated results, which makes endpoints in Rspotify much more ergonomic to use, and Mario completed this PR.

In order to know what this PR brought us, we have to go back to the original story: the paginated results in Spotify’s REST API.

2 Original Story

Take artist_albums as an example: it gets Spotify catalog information about an artist’s albums.

The HTTP response body for this endpoint contains an array of simplified album objects wrapped in a paging object. The limit field controls the number of album objects to return, and the offset field sets the index of the first album to return.

So the endpoint designed in Rspotify looks like this:

/// Paging object
///
/// [Reference](https://developer.spotify.com/documentation/web-api/reference/#object-pagingobject)
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq, Eq)]
pub struct Page<T> {
    pub href: String,
    pub items: Vec<T>,
    pub limit: u32,
    pub next: Option<String>,
    pub offset: u32,
    pub previous: Option<String>,
    pub total: u32,
}

/// Get Spotify catalog information about an artist's albums.
///
/// Parameters:
/// - artist_id - the artist ID, URI or URL
/// - album_type - 'album', 'single', 'appears_on', 'compilation'
/// - market - limit the response to one particular country.
/// - limit  - the number of albums to return
/// - offset - the index of the first album to return
/// [Reference](https://developer.spotify.com/documentation/web-api/reference/#endpoint-get-an-artists-albums)
pub fn artist_albums<'a>(
    &'a self,
    artist_id: &'a ArtistId,
    album_type: Option<&'a AlbumType>,
    market: Option<&'a Market>,
) -> ClientResult<Page<SimplifiedAlbum>>;

Suppose you have fetched the first page of an artist’s albums and want to get the data of the next page. You have to parse a URL:

{
    "next": "https://api.spotify.com/v1/browse/categories?offset=2&limit=20"
}

You have to parse the URL, extract the limit and offset parameters, and call the artist_albums endpoint again with limit set to 20 and offset set to 2.

We have to manually fetch the data again and again until all the data has been consumed. It is not elegant, but it works.
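
The manual step of pulling offset and limit out of the next URL can be sketched with the standard library only. parse_offset_limit is a made-up helper, not part of Rspotify, and it assumes a plain query string without percent-encoding or a fragment.

```rust
/// Extract `offset` and `limit` from the query string of a `next` URL.
/// Returns `None` if either parameter is missing or not a number.
fn parse_offset_limit(next_url: &str) -> Option<(u32, u32)> {
    let (_, query) = next_url.split_once('?')?;
    let mut offset: Option<u32> = None;
    let mut limit: Option<u32> = None;
    for pair in query.split('&') {
        match pair.split_once('=') {
            Some(("offset", value)) => offset = value.parse().ok(),
            Some(("limit", value)) => limit = value.parse().ok(),
            _ => {} // ignore unrelated parameters
        }
    }
    Some((offset?, limit?))
}
```

For the next URL shown above, this returns an offset of 2 and a limit of 20, which are the values to pass to the next call.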

3 Iterator Story

Now that we know the background, let’s jump to the iterator version of the pagination endpoints.

First of all, the iterator pattern allows us to perform some task on a sequence of items in turn. An iterator is responsible for the logic of iterating over each item and determining when the sequence has finished.

If you want to know more about iterators, Jon Gjengset has made a brilliant tutorial that demonstrates iterators in Rust.

All iterators implement a trait named Iterator that is defined in the standard library. The definition of the trait looks like this:

pub trait Iterator {
    type Item;

    fn next(&mut self) -> Option<Self::Item>;

    // methods with default implementations elided
}

By implementing the Iterator trait on our own types, we can have iterators that do anything we want. The mechanism we want for iterating over paginated results looks like this:

Now let’s dive into the code. We need to implement Iterator for our own type; the pseudocode looks like this:

impl<T> Iterator for PageIterator<Request>
{
    type Item = ClientResult<Page<T>>;

    fn next(&mut self) -> Option<Self::Item> {
        match call endpoints with offset and limit {
            Ok(page) if page.items.is_empty() => {
                // we are done here
                None
            }
            Ok(page) => {
                offset += page.items.len() as u32;
                Some(Ok(page))
            }
            Err(e) => Some(Err(e)),
        }
    }
}

In order to iterate over paginated results from different endpoints, we need a generic type to represent the different endpoints. The Fn trait comes to mind: it is implemented by closures and plain functions, so one generic parameter can stand for any endpoint call that takes a page size and an offset.

Then the next version of pseudocode looks like:

impl<T, Request> Iterator for PageIterator<Request>
where
    Request: Fn(u32, u32) -> ClientResult<Page<T>>,
{
    type Item = ClientResult<Page<T>>;

    fn next(&mut self) -> Option<Self::Item> {
        match (function_pointer)(offset and limit) {
            Ok(page) if page.items.is_empty() => {
                // we are done here
                None
            }
            Ok(page) => {
                offset += page.items.len() as u32;
                Some(Ok(page))
            }
            Err(e) => Some(Err(e)),
        }
    }
}
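To make the pseudocode concrete, here is a self-contained sketch that compiles with the standard library alone. It is not the library's actual code: Page and ClientResult are simplified stand-ins, and the struct fields, the done flag and the paginate constructor are assumptions added for this example.

```rust
/// Simplified, hypothetical stand-ins for the library's types.
#[derive(Debug, PartialEq)]
struct Page<T> {
    items: Vec<T>,
}
type ClientResult<T> = Result<T, String>;

/// Holds the request closure plus the pagination state (assumed layout).
struct PageIterator<Request> {
    req: Request,
    offset: u32,
    page_size: u32,
    done: bool,
}

impl<T, Request> Iterator for PageIterator<Request>
where
    Request: Fn(u32, u32) -> ClientResult<Page<T>>,
{
    type Item = ClientResult<Page<T>>;

    fn next(&mut self) -> Option<Self::Item> {
        if self.done {
            return None;
        }
        // Call the endpoint with the page size and the current offset.
        match (self.req)(self.page_size, self.offset) {
            Ok(page) if page.items.is_empty() => {
                // An empty page means there is nothing left to fetch.
                self.done = true;
                None
            }
            Ok(page) => {
                self.offset += page.items.len() as u32;
                Some(Ok(page))
            }
            Err(e) => Some(Err(e)),
        }
    }
}

/// Hypothetical constructor that starts at offset 0.
fn paginate<T, Request>(req: Request, page_size: u32) -> PageIterator<Request>
where
    Request: Fn(u32, u32) -> ClientResult<Page<T>>,
{
    PageIterator { req, offset: 0, page_size, done: false }
}
```

Passing a closure that slices an in-memory Vec of five numbers, with a page size of 2, yields three pages holding 2, 2 and 1 items. As in the pseudocode, an error does not end the iteration, so a caller that wants to stop at the first failure can collect into a Result, which short-circuits.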

Now our iterator story has iterated to the end. The current full version of the code is here; check it out if you are interested :)

4 Stream Story

Are we done? Not yet. Let’s turn to the stream story.

The stream story is mostly similar to the iterator story, except that an iterator is synchronous and a stream is asynchronous.

The Stream trait can yield multiple values before completing, similar to the Iterator trait.

trait Stream {
    /// The type of the value yielded by the stream.
    type Item;

    /// Attempt to resolve the next item in the stream.
    /// Returns `Poll::Pending` if not ready, `Poll::Ready(Some(x))` if a value
    /// is ready, and `Poll::Ready(None)` if the stream has completed.
    fn poll_next(self: Pin<&mut Self>, cx: &mut Context<'_>)
	-> Poll<Option<Self::Item>>;
}

Since we already know the iterator, let’s keep the stream story short. We use the async-stream crate, whose macros act as syntactic sugar and spare us clumsy type declarations and notation.

We use the stream! macro to generate an anonymous type implementing the Stream trait. The Item associated type is the type of the values yielded from the stream, which is ClientResult<T> in this case.

The full stream version is shorter and clearer:

/// This is used to handle paginated requests automatically.
pub fn paginate<T, Fut, Request>(
    req: Request,
    page_size: u32,
) -> impl Stream<Item = ClientResult<T>>
where
    T: Unpin,
    Fut: Future<Output = ClientResult<Page<T>>>,
    Request: Fn(u32, u32) -> Fut,
{
    use async_stream::stream;
    let mut offset = 0;
    stream! {
	loop {
	    let page = req(page_size, offset).await?;
	    offset += page.items.len() as u32;
	    for item in page.items {
		yield Ok(item);
	    }
	    if page.next.is_none() {
		break;
	    }
	}
    }
}

5 Appendix

Whew! It took longer than I expected. Iterators are one of the Rust features inspired by functional programming languages, and they contribute to Rust’s ability to clearly express high-level ideas at low-level performance.

It’s good to leverage iterators wherever possible. Now we can happily say that no endpoint needs a manual loop anymore; they are all iterable and rusty.

Thanks again to Mario and icewind1991 for their work :)

Serde Tricks

Sun, 13 Dec 2020 22:29:00 +0800

The lesson learned from refactoring rspotify

1 Preface

Recently, Mario and I have been working on refactoring rspotify, trying to improve performance, documentation, error handling and the data model, and to reduce compile time, all to make it easier to use. (For those who have never heard of rspotify, it is a Spotify HTTP SDK implemented in Rust.)

I am partly focusing on polishing the data model, based on the issue created by Koxiaet.

Since rspotify is an API client for Spotify, it has to handle the requests and responses of the Spotify HTTP API.

Generally speaking, the data model is about how to structure the response data. It uses Serde to parse the JSON responses from the HTTP API into Rust structs, and I have learnt a lot of Serde tricks from the refactoring.

2 Serde Lesson

2.1 Deserialize JSON map to Vec based on its value.

An actions object, which contains a disallows object, allows the user interface to be updated based on which playback actions are available within the current context.

The response JSON data from HTTP API:

{
    ...
	"disallows": {
	    "resuming": true
	}
    ...
}

The original model representing actions was:

#[derive(Clone, Debug, Serialize, PartialEq, Eq)]
pub struct Actions {
    pub disallows: HashMap<DisallowKey, bool>
}

#[derive(Clone, Serialize, Deserialize, Copy, PartialEq, Eq, Debug, Hash, ToString)]
#[serde(rename_all = "snake_case")]
#[strum(serialize_all = "snake_case")]
pub enum DisallowKey {
    InterruptingPlayback,
    Pausing,
    Resuming,
    ...
}

And Koxiaet gave great advice about how to polish Actions:

Actions::disallows can be replaced with a Vec or HashSet by removing all entries whose value is false, which will result in a simpler API.

To be honest, I was not that familiar with Serde before. After digging into its official documentation for a while, it seems there is no built-in way to convert a JSON map to a Vec based on the map’s values.

After reading the Custom serialization section of the documentation, a simple solution came to mind, so I wrote my first custom deserialize function.

I created a helper struct, OriginalActions, inside the deserialize function, and converted its HashMap to a Vec by filtering on the values.

#[derive(Clone, Debug, Serialize, PartialEq, Eq)]
pub struct Actions {
    pub disallows: Vec<DisallowKey>,
}

impl<'de> Deserialize<'de> for Actions {
    fn deserialize<D>(deserializer: D) -> Result<Self, D::Error>
    where
	D: Deserializer<'de>,
    {
	#[derive(Deserialize)]
	struct OriginalActions {
	    pub disallows: HashMap<DisallowKey, bool>,
	}

	let original_actions = OriginalActions::deserialize(deserializer)?;
	Ok(Actions {
	    disallows: original_actions
		.disallows
		.into_iter()
		.filter(|(_, value)| *value)
		.map(|(key, _)| key)
		.collect(),
	})
    }
}

The types should be familiar if you’ve used Serde before.

If you’re not used to Rust, the function signature will likely look a little strange. What it says is that deserializer will be something that implements Serde’s Deserializer trait, and that any references to memory will live for the 'de lifetime.
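The filtering step at the heart of that deserialize function can be tried on its own, without Serde. In this sketch the trimmed-down DisallowKey enum and the disallowed function are assumptions for illustration, and the final sort is an addition that the code above does not have:

```rust
use std::collections::HashMap;

/// Trimmed-down, hypothetical copy of the enum, with ordering derived
/// so the result can be sorted.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash, PartialOrd, Ord)]
enum DisallowKey {
    InterruptingPlayback,
    Pausing,
    Resuming,
}

/// Keeps only the keys whose value is `true`.
fn disallowed(map: HashMap<DisallowKey, bool>) -> Vec<DisallowKey> {
    let mut keys: Vec<DisallowKey> = map
        .into_iter()
        .filter(|(_, value)| *value)
        .map(|(key, _)| key)
        .collect();
    // HashMap iteration order is unspecified, so sort for a stable result.
    keys.sort();
    keys
}
```

One design point this makes visible: because a HashMap has no defined iteration order, the order of the resulting Vec is arbitrary unless it is sorted, which matters if two Actions values are compared with ==.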

2.2 Deserialize Unix milliseconds timestamp to Datetime

A currently playing object contains information about the currently playing item. Its timestamp field is an integer representing the Unix millisecond timestamp at which the data was fetched.

The response JSON data from HTTP API:

{
    ...
	"timestamp": 1490252122574,
    "progress_ms": 44272,
    "is_playing": true,
    "currently_playing_type": "track",
    "actions": {
	"disallows": {
	    "resuming": true
	}
    }
    ...
}

The original model was:

/// Currently playing object
///
/// [Reference](https://developer.spotify.com/documentation/web-api/reference/player/get-the-users-currently-playing-track/)
#[derive(Clone, Debug, Serialize, Deserialize, PartialEq, Eq)]
pub struct CurrentlyPlayingContext {
    pub timestamp: u64,
    pub progress_ms: Option<u32>,
    pub is_playing: bool,
    pub item: Option<PlayingItem>,
    pub currently_playing_type: CurrentlyPlayingType,
    pub actions: Actions,
}

As before, Koxiaet made a great point about timestamp and progress_ms (I will talk about the latter later):

CurrentlyPlayingContext::timestamp should be a chrono::DateTime, which could be easier to use.

The polished struct looks like:

#[derive(Clone, Debug, Serialize, Deserialize, PartialEq, Eq)]
pub struct CurrentlyPlayingContext {
    pub context: Option<Context>,
    #[serde(
	deserialize_with = "from_millisecond_timestamp",
	serialize_with = "to_millisecond_timestamp"
    )]
    pub timestamp: DateTime<Utc>,
    pub progress_ms: Option<u32>,
    pub is_playing: bool,
    pub item: Option<PlayingItem>,
    pub currently_playing_type: CurrentlyPlayingType,
    pub actions: Actions,
}

Using the deserialize_with attribute tells Serde to use custom deserialization code for the timestamp field. The from_millisecond_timestamp code is:

/// Deserialize Unix millisecond timestamp to `DateTime`
pub(in crate) fn from_millisecond_timestamp<'de, D>(d: D) -> Result<DateTime<Utc>, D::Error>
where
    D: de::Deserializer<'de>,
{
    d.deserialize_u64(DateTimeVisitor)
}

The code calls d.deserialize_u64, passing in a struct. The passed-in struct implements Serde’s Visitor and looks like:

// Visitor to help deserialize a Unix millisecond timestamp to `chrono::DateTime`
struct DateTimeVisitor;

impl<'de> de::Visitor<'de> for DateTimeVisitor {
    type Value = DateTime<Utc>;
    fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
	write!(
	    formatter,
	    "a Unix millisecond timestamp representing a DateTime"
	)
    }
    fn visit_u64<E>(self, v: u64) -> Result<Self::Value, E>
    where
	E: de::Error,
    {
	...
    }
}

The struct DateTimeVisitor doesn’t have any fields; it is just a type that implements the custom visitor, which takes care of parsing the u64.

Since I did not find a way to construct a DateTime directly from a Unix millisecond timestamp, I had to figure out how to handle the construction. It turns out that there is a way to construct a DateTime from seconds and nanoseconds:

use chrono::{DateTime, TimeZone, NaiveDateTime, Utc};

let dt = DateTime::<Utc>::from_utc(NaiveDateTime::from_timestamp(61, 0), Utc);

Thus, all I need to do is convert the milliseconds to seconds and nanoseconds:

fn visit_u64<E>(self, v: u64) -> Result<Self::Value, E>
where
    E: de::Error,
{
    let second = (v - v % 1000) / 1000;
    let nanosecond = ((v % 1000) * 1000000) as u32;
    // Seconds derived from a u64 millisecond value always fit in an i64, so the cast is safe
    let dt = DateTime::<Utc>::from_utc(
	NaiveDateTime::from_timestamp(second as i64, nanosecond),
	Utc,
    );
    Ok(dt)
}
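The arithmetic can be checked in isolation with the timestamp from the JSON above. This is only a sketch: split_millis and join_millis are hypothetical helper names, and join_millis is one possible inverse, not necessarily how the library serializes the value.

```rust
/// Splits a Unix timestamp in milliseconds into whole seconds and the
/// leftover part expressed in nanoseconds.
fn split_millis(millis: u64) -> (i64, u32) {
    let seconds = (millis / 1000) as i64;
    // At most 999 ms remain, i.e. 999_000_000 ns, which fits in a u32.
    let nanoseconds = ((millis % 1000) * 1_000_000) as u32;
    (seconds, nanoseconds)
}

/// Hypothetical inverse: rebuilds the millisecond timestamp.
fn join_millis(seconds: i64, nanoseconds: u32) -> i64 {
    seconds * 1000 + i64::from(nanoseconds / 1_000_000)
}
```

For 1490252122574 the split is 1490252122 seconds and 574000000 nanoseconds, and joining the two parts gives back the original number.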

The to_millisecond_timestamp function is similar to from_millisecond_timestamp, but it’s easier to implement; check this PR for more detail.

2.3 Deserialize milliseconds to Duration

The simplified episode object contains the simplified episode information, and the duration_ms field is an integer, which represents the episode length in milliseconds.

The response JSON data from HTTP API:

{
    ...
    "audio_preview_url" : "https://p.scdn.co/mp3-preview/83bc7f2d40e850582a4ca118b33c256358de06ff",
    "description" : "Följ med Tobias Svanelid till Sveriges äldsta tegelkyrka",
    "duration_ms" : 2685023,
    "explicit" : false,
    ...
}

The original model was

#[derive(Clone, Debug, Serialize, Deserialize, PartialEq, Eq)]
pub struct SimplifiedEpisode {
    pub audio_preview_url: Option<String>,
    pub description: String,
    pub duration_ms: u32,
    ...
}

As before, Koxiaet pointed out that

SimplifiedEpisode::duration_ms should be replaced with a duration of type Duration, since a built-in Duration type works better than a primitive type.

Since I had already worked with Serde’s custom deserialization, it was not a hard job any more. I easily figured out how to deserialize a u64 to a Duration:

#[derive(Clone, Debug, Serialize, Deserialize, PartialEq, Eq)]
pub struct SimplifiedEpisode {
    pub audio_preview_url: Option<String>,
    pub description: String,
    #[serde(
	deserialize_with = "from_duration_ms",
	serialize_with = "to_duration_ms",
	rename = "duration_ms"
    )]
    pub duration: Duration,
    ...
}

/// Visitor to help deserialize a duration represented in milliseconds to `std::time::Duration`
struct DurationVisitor;
impl<'de> de::Visitor<'de> for DurationVisitor {
    type Value = Duration;
    fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
	write!(formatter, "a number of milliseconds representing a std::time::Duration")
    }
    fn visit_u64<E>(self, v: u64) -> Result<Self::Value, E>
    where
	E: de::Error,
    {
	Ok(Duration::from_millis(v))
    }
}

/// Deserialize `std::time::Duration` from milliseconds (represented as u64)
pub(in crate) fn from_duration_ms<'de, D>(d: D) -> Result<Duration, D::Error>
where
    D: de::Deserializer<'de>,
{
    d.deserialize_u64(DurationVisitor)
}
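The conversion itself is a single standard-library call, which this sketch tries with the duration_ms value from the JSON above. The function names are made up for the example; duration_to_ms is an assumed inverse and returns the u128 that Duration::as_millis produces.

```rust
use std::time::Duration;

/// Converts the `duration_ms` integer from the payload into a `Duration`.
fn duration_from_ms(ms: u64) -> Duration {
    Duration::from_millis(ms)
}

/// Hypothetical inverse for serialization. Anything finer than a
/// millisecond is truncated by `as_millis`.
fn duration_to_ms(duration: Duration) -> u128 {
    duration.as_millis()
}
```

With 2685023 as input, the Duration holds 2685 whole seconds plus 23 milliseconds, and converting back returns 2685023.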

Now life is easier than before.

2.4 Deserialize milliseconds to Option<Duration>

Let’s go back to the CurrentlyPlayingContext model. Since we have replaced milliseconds (represented as u32) with Duration, it makes sense to replace all millisecond fields with Duration.

But hold on, it seems the progress_ms field is a bit different.

The progress_ms field is either absent or a number of milliseconds. A u32 handles the milliseconds, but because the value might not be present in the response, the field is an Option<u32>, so it won’t work with from_duration_ms.

Thus, it’s necessary to figure out how to handle the Option type, and the answer is in the documentation, the deserialize_option function:

Hint that the Deserialize type is expecting an optional value.

This allows deserializers that encode an optional value as a nullable value to convert the null value into None and a regular value into Some(value).

#[derive(Clone, Debug, Serialize, Deserialize, PartialEq, Eq)]
pub struct CurrentlyPlayingContext {
    pub context: Option<Context>,
    #[serde(
	deserialize_with = "from_millisecond_timestamp",
	serialize_with = "to_millisecond_timestamp"
    )]
    pub timestamp: DateTime<Utc>,
    #[serde(default)]
    #[serde(
	deserialize_with = "from_option_duration_ms",
	serialize_with = "to_option_duration_ms",
	rename = "progress_ms"
    )]
    pub progress: Option<Duration>,
}

/// Deserialize `Option<Duration>` from milliseconds (represented as u64)
pub(in crate) fn from_option_duration_ms<'de, D>(d: D) -> Result<Option<Duration>, D::Error>
where
    D: de::Deserializer<'de>,
{
    d.deserialize_option(OptionDurationVisitor)
}

As before, OptionDurationVisitor is an empty struct that implements the Visitor trait. The key point is that, in order to work with deserialize_option, OptionDurationVisitor has to implement the visit_none and visit_some methods:

impl<'de> de::Visitor<'de> for OptionDurationVisitor {
    type Value = Option<Duration>;
    fn expecting(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
	write!(
	    formatter,
	    "an optional number of milliseconds representing a std::time::Duration"
	)
    }
    fn visit_none<E>(self) -> Result<Self::Value, E>
    where
	E: de::Error,
    {
	Ok(None)
    }

    fn visit_some<D>(self, deserializer: D) -> Result<Self::Value, D::Error>
    where
	D: de::Deserializer<'de>,
    {
	Ok(Some(deserializer.deserialize_u64(DurationVisitor)?))
    }
}

The visit_none method returns Ok(None), so the progress value in the struct will be None. The visit_some method delegates the parsing logic to DurationVisitor via the deserialize_u64 call, so deserializing Some(u64) works just like deserializing a u64.
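Stripped of the Serde machinery, the two visitor methods amount to mapping over an Option. The sketch below shows that correspondence with plain standard-library types; progress_from_ms is a hypothetical name and is not part of the code above.

```rust
use std::time::Duration;

/// Mirrors the visitor: a missing value stays `None`, a present value is
/// converted the same way a plain `u64` would be.
fn progress_from_ms(progress_ms: Option<u64>) -> Option<Duration> {
    match progress_ms {
        // Corresponds to `visit_none`.
        None => None,
        // Corresponds to `visit_some` delegating to the u64 conversion.
        Some(ms) => Some(Duration::from_millis(ms)),
    }
}
```

The match is spelled out to line up with the two visitor methods; in everyday code progress_ms.map(Duration::from_millis) does the same thing.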

2.5 Deserialize enum from number

An AudioAnalysisSection model contains a mode field, which indicates the modality (major or minor) of a track, the type of scale from which its melodic content is derived. This field will contain a 0 for minor, a 1 for major, or a -1 for no result.

The response JSON data from HTTP API:

{
    ...
	"mode": 0,
    "mode_confidence": 0.414,
    ...
}

The original struct representing AudioAnalysisSection looked like this, with the mode field stored as an f32 (an integer type such as i8 would have been a better choice for this case):

#[derive(Clone, Debug, Serialize, Deserialize, PartialEq)]
pub struct AudioAnalysisSection {
    ...
	pub mode: f32,
    pub mode_confidence: f32,
    ...
}

Koxiaet made a great point about the mode field:

AudioAnalysisSection::mode and AudioFeatures::mode are f32s but should be Option<Mode>s where enum Mode { Major, Minor }, as it is more useful.

In this case, we don’t need the Option type. In order to deserialize an enum from a number, we first need to define a C-like enum:

pub enum Modality {
    #[serde(rename = "0")]
    Minor = 0,
    #[serde(rename = "1")]
    Major = 1,
    #[serde(rename = "-1")]
    NoResult = -1,
}

pub struct AudioAnalysisSection {
    ...
	pub mode: Modality,
    pub mode_confidence: f32,
    ...
}

And then, what’s the next step? It seems Serde doesn’t natively allow C-like enums to be formatted as integers rather than strings in JSON:

working version:
{
    ...
	"mode": "0",
    "mode_confidence": 0.414,
    ...
}

failed version:
{
    ...
	"mode": 0,
    "mode_confidence": 0.414,
    ...
}

But the failed version is exactly what we want. I know that Serde’s official documentation has a solution for this case: the serde_repr crate provides alternative derive macros that derive the same Serialize and Deserialize traits but delegate to the underlying representation of a C-like enum.

Since we are trying to reduce the compile time of rspotify, we are cautious about introducing new dependencies. So a hand-written pair of deserialize and serialize functions is a better choice; it just needs to match on the number and convert it to the corresponding enum value.

/// Deserialize/Serialize `Modality` to integer(0, 1, -1).
pub(in crate) mod modality {
    use super::enums::Modality;
    use serde::{de, Deserialize, Serializer};

    pub fn deserialize<'de, D>(d: D) -> Result<Modality, D::Error>
    where
	D: de::Deserializer<'de>,
    {
	let v = i8::deserialize(d)?;
	match v {
	    0 => Ok(Modality::Minor),
	    1 => Ok(Modality::Major),
	    -1 => Ok(Modality::NoResult),
	    _ => Err(de::Error::invalid_value(
		de::Unexpected::Signed(v.into()),
		&"valid value: 0, 1, -1",
	    )),
	}
    }

    pub fn serialize<S>(x: &Modality, s: S) -> Result<S::Ok, S::Error>
    where
	S: Serializer,
    {
	match x {
	    Modality::Minor => s.serialize_i8(0),
	    Modality::Major => s.serialize_i8(1),
	    Modality::NoResult => s.serialize_i8(-1),
	}
    }
}
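The same number-to-variant mapping can also be expressed with the standard conversion traits, which makes it easy to test without Serde. This is a sketch of an alternative formulation, not what the module above does; the String error type is an assumption chosen to keep the example short.

```rust
use std::convert::TryFrom;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Modality {
    Minor = 0,
    Major = 1,
    NoResult = -1,
}

impl TryFrom<i8> for Modality {
    // A plain String keeps the sketch free of extra error types.
    type Error = String;

    fn try_from(value: i8) -> Result<Self, Self::Error> {
        match value {
            0 => Ok(Modality::Minor),
            1 => Ok(Modality::Major),
            -1 => Ok(Modality::NoResult),
            other => Err(format!("invalid value {}, expected 0, 1 or -1", other)),
        }
    }
}

impl From<Modality> for i8 {
    fn from(mode: Modality) -> i8 {
        // The explicit discriminants make this cast the inverse of `try_from`.
        mode as i8
    }
}
```

With these impls, a deserialize function could read an i8 and call Modality::try_from on it, while a serialize function could write i8::from(mode).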

3 Move into module

Update:

2021-01-15

from(to)_millisecond_timestamp have been moved into their own module, millisecond_timestamp, and renamed to deserialize & serialize
from(to)_duration_ms have been moved into their own module, duration_ms, and renamed to deserialize & serialize
from(to)_option_duration_ms have been moved into their own module, option_duration_ms, and renamed to deserialize & serialize

4 Summary

To be honest, it’s the first time I have needed custom work like this, and it took me some time to understand how Serde works. In the end, the investment paid off; it works great now.

Serde is such an awesome deserialize/serialize framework. I have learnt a lot from it and still have a lot to learn.

rspotify has come to async/await

Fri, 28 Feb 2020 01:27:00 +0800

1 Preface

Today, I am excited to introduce the v0.9 release, which I have been working on for the past few weeks and which now adds async/await support!

2 The road to async/await

What is rspotify: for those who have never heard of rspotify before, rspotify is a Spotify Web API wrapper implemented in Rust.

With async/await’s forthcoming stabilization and reqwest now supporting async/await, I think it’s time to let rspotify leverage the power of async/await. To be honest, I was not familiar with async/await before, because with my Java background I was used to multiple threads and synchronous code (yes, I know Java has futures too).

After reading some good learning resources, such as the Async book and Zero-cost Async IO, I started to step into the world of async/await. async/await is a way to write functions that can “pause”, return control to the runtime, and then pick up from where they left off.
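The “pause and pick up” behaviour can be observed with nothing but the standard library. The sketch below is a toy, not a real runtime: YieldOnce, NoopWaker, block_on_counting and fetch_twice are all made-up names, and the executor simply polls in a loop and counts how often it had to poll.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// A future that pauses exactly once before completing.
struct YieldOnce {
    yielded: bool,
}

impl Future for YieldOnce {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        if self.yielded {
            Poll::Ready(())
        } else {
            self.yielded = true;
            // Ask to be polled again, then hand control back to the caller.
            cx.waker().wake_by_ref();
            Poll::Pending
        }
    }
}

/// Waker that does nothing; enough for a loop that polls unconditionally.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor: polls until the future is ready and counts the polls.
fn block_on_counting<F: Future>(fut: F) -> (F::Output, u32) {
    let mut fut = Box::pin(fut);
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut polls = 0;
    loop {
        polls += 1;
        if let Poll::Ready(output) = fut.as_mut().poll(&mut cx) {
            return (output, polls);
        }
    }
}

/// An async function with two pause points.
async fn fetch_twice() -> u32 {
    YieldOnce { yielded: false }.await;
    YieldOnce { yielded: false }.await;
    42
}
```

Running block_on_counting(fetch_twice()) returns 42 after three polls: the function pauses at each await, and every later poll resumes it from the point where it left off.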

I think perhaps the most important part of async/await is the runtime, which defines how to schedule the functions.

Now, by leveraging the async/await support of reqwest, rspotify can send HTTP requests and handle responses asynchronously.

Furthermore, I not only refactored the old blocking endpoint functions into async/await versions, but also kept the old blocking endpoint functions behind a new additional feature, blocking, so other developers can choose the API to their taste.

3 Overview

album example:


use rspotify::client::Spotify;
use rspotify::oauth2::SpotifyClientCredentials;

#[tokio::main]
async fn main() {
    // Set client_id and client_secret in .env file or
    // export CLIENT_ID="your client_id"
    // export CLIENT_SECRET="secret"
    let client_credential = SpotifyClientCredentials::default().build();

    // Or set client_id and client_secret explicitly
    // let client_credential = SpotifyClientCredentials::default()
    //     .client_id("this-is-my-client-id")
    //     .client_secret("this-is-my-client-secret")
    //     .build();
    let spotify = Spotify::default()
	.client_credentials_manager(client_credential)
	.build();
    let birdy_uri = "spotify:album:0sNOF9WDwhWunNAHPD3Baj";
    let albums = spotify.album(birdy_uri).await;
    println!("{:?}", albums);
}

The default API is now async, and the previous synchronous API has moved to the blocking module.

Note that the v0.9 release of rspotify is going to be a huge breaking change because of the support for async/await, which definitely breaks backward compatibility.

So I decided to make another breaking change in the next release: refactoring the project structure to shorten the import path:

before:

use rspotify::spotify::client::Spotify;
use rspotify::spotify::oauth2::SpotifyClientCredentials;

after:

use rspotify::client::Spotify;
use rspotify::oauth2::SpotifyClientCredentials;

The spotify module is unnecessary and inelegant, so I just removed it.

4 Conclusion

rspotify v0.9 is now available! There is documentation, examples and an issue tracker!

Please provide any feedback, as I would love to improve this library any way I can! Thanks so much to @Alexander for actively participating in the refactoring work to support async/await.