Tuesday, October 30, 2018

Normalizing Filenames and Data with Bash

URLify: convert letter sequences into safe URLs with hex equivalents.

This is my 155th column. That means I've been writing for Linux Journal for:


$ echo "155/12" | bc
12

No, wait, that's not right. Let's try that again:


$ echo "scale=2;155/12" | bc
12.91

Yeah, that many years. Almost 13 years of writing about shell scripts and lightweight programming within the Linux environment. I've covered a lot of ground, but I want to go back to something that's fairly basic and talk about filenames and the web.

It used to be that if you had filenames that had spaces in them, bad things would happen: "my mom's cookies.html" was a recipe for disaster, not good cookies—um, and not those sorts of web cookies either!

As the web evolved, however, encoding of special characters became the norm, and every web browser had to be able to manage it, for better or worse. So spaces became either "+" or %20 sequences, and nearly everything else that wasn't a regular alphanumeric character was replaced by a "%" followed by its two-digit hex ASCII code.

In other words, "my mom's cookies.html" turned into "my+mom%27s+cookies.html" or "my%20mom%27s%20cookies.html". Many symbols took on a second life too, so "&" and "=" and "?" all got their own meanings, which meant that they needed to be protected if they were part of an original filename too. And what if you had a "%" in your original filename? Ah yes, the recursive nature of encoding things....
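If you're curious where values like %20 and %27 come from, here's a quick sketch that looks them up. The function name hex_of is my own invention for illustration, and it assumes plain single-byte ASCII characters; it leans on the standard printf behavior of treating an argument that starts with a quote as a character code.

```shell
#!/usr/bin/env bash
# hex_of is a hypothetical helper: print the %XX form of one ASCII character.
hex_of() {
  # An argument that begins with a single quote makes printf use the
  # numeric code of the character that follows it.
  printf '%%%02X' "'$1"
}

# The characters called out above: space, apostrophe, &, =, ? and % itself.
for ch in ' ' "'" '&' '=' '?' '%'; do
  printf '[%s] becomes %s\n' "$ch" "$(hex_of "$ch")"
done
```

Notice that "%" gets its own code, %25, which is how an encoded string can safely carry a literal percent sign.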

So purely as an exercise in scripting, let's write a script that converts any string you hand it into a "web-safe" sequence. Before starting, however, pull out a piece of paper and jot down how you'd solve it.

Normalizing Filenames for the Web

My strategy is going to be easy: pull the string apart into individual characters, analyze each character to identify if it's an alphanumeric, and if it's not, convert it into its hexadecimal ASCII equivalent, prefacing it with a "%" as needed.
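To make the per-character decision concrete, here's one way it might look for a single character. This is only a sketch under my own assumptions: encode_char is a made-up name, the input is one ASCII character, and "alphanumeric" means exactly a-z, A-Z and 0-9, so even a period gets converted.

```shell
#!/usr/bin/env bash
# encode_char is a hypothetical helper: pass alphanumerics through,
# turn any other single ASCII character into % plus two hex digits.
encode_char() {
  case "$1" in
    [a-zA-Z0-9]) printf '%s' "$1" ;;        # alphanumeric: keep as-is
    *)           printf '%%%02X' "'$1" ;;   # everything else: hex code
  esac
}

encode_char "a"; echo    # a
encode_char " "; echo    # %20
encode_char "."; echo    # %2E
```

Real-world URL encoders usually leave a few punctuation characters such as "." and "-" alone, so treat this strict version as a starting point rather than the final word.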

There are a number of ways to break a string into its individual letters, but let's use Bash string variable manipulations, recalling that ${#var} returns the number of characters in variable $var, and that ${var:x:1} will return just the letter in $var at position x. Quick now, does indexing start at zero or one?

Here's my initial loop to break $original into its component letters:
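A minimal sketch of a loop along those lines follows. It's a reconstruction under stated assumptions, not the original listing: the wrapper function split_chars is a hypothetical name added so the loop is easy to try, while $original, ${#var} and ${var:x:1} come from the description above.

```shell
#!/usr/bin/env bash
# split_chars is a hypothetical wrapper; it prints each character of its
# argument on a line of its own.
split_chars() {
  local original=$1
  local i
  # ${#original} is the string length; positions run from 0 to length-1.
  for (( i = 0; i < ${#original}; i++ )); do
    printf '%s\n' "${original:i:1}"
  done
}

split_chars "my mom"
```

Starting the counter at 0 is what picks up the first letter, which also settles the indexing question.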



from Linux Journal - The Original Magazine of the Linux Community https://ift.tt/2qj7gL9
