Run a Python script as a shell script: more than a shebang

Normally we use a shebang line to run a Python script like a shell script. But for more complicated scripts we may want to do more logic than just hinting which interpreter should run the target script.
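
For reference, the plain shebang approach is just a one-line hint at the top of the file (a minimal example; the env-based interpreter lookup is the usual convention):

#!/usr/bin/env python3
# the shebang only tells the kernel which interpreter to launch; no extra logic is possible here
import sys
print("plain shebang script, running on " + sys.version)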

For example, a common approach is to wrap the Python script in a shell string and pass that string to the target Python/SQL interpreter:

echo "world!" | python -c "
import sys
piped_in = sys.stdin.read()
print('hello ' + piped_in)
"

or using a bash heredoc (cat with a quoted EOF) to avoid variable expansion:

py_script=$(cat << 'EOF'
import sys
piped_in = sys.stdin.read()
print('hello ' + piped_in)
EOF
)
echo "world!" | python -c "$py_script"

or just saving the Python/SQL content to a real temporary script file and then running it from the shell script, as sketched below.
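
A minimal sketch of that temporary-file variant (the mktemp usage and the cleanup trap are my assumptions, not part of the original workflow):

#!/bin/sh
# write the python code to a throwaway file, run it, then clean up on exit
tmp_py=$(mktemp) || exit 1
trap 'rm -f "$tmp_py"' EXIT
cat > "$tmp_py" << 'EOF'
import sys
piped_in = sys.stdin.read()
print('hello ' + piped_in)
EOF
echo "world!" | python "$tmp_py"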

But this way the Python content is recognized as a string inside a shell script, which makes it hard to edit and read. With the bilingual-comment trick shown below, we can do the same thing without having to write the Python code inside a shell string:

#!/bin/sh

# Shell commands follow
# Next line is bilingual: it starts a comment in Python, and is a no-op in shell
""":"

# Find a suitable python interpreter (adapt for your specific needs) 
for cmd in python3.5 python3 /opt/myspecialpython/bin/python3.5.99 ; do
   command -v "$cmd" > /dev/null && exec "$cmd" "$0" "$@"
done

echo "OMG Python not found, exiting!!!!!11!!eleven" >2

exit 2

":"""
# Previous line is bilingual: it ends a comment in Python, and is a no-op in shell
# Shell commands end here
# Python script follows (example commands shown)

import sys
print("running Python!")
print(sys.version)
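
Assuming the file above is saved as runme.sh (the name is just for illustration), both of these work and print "running Python!" followed by the interpreter version:

sh runme.sh        # shell runs the top block, finds a python, and re-execs it on this same file
python3 runme.sh   # python sees the shell block as a harmless string literal and skips to the code below it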

Be aware that exec here is the shell builtin built on the exec system call (the same fork/exec concept from operating systems): it replaces the current process with the new command instead of spawning a child, so nothing after a successful exec ever runs.
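
A tiny sketch of that behaviour (the messages are made up):

#!/bin/sh
echo "before exec"
exec echo "this echo replaced the shell process"
echo "never printed"  # unreachable: the shell process was replaced by the exec'd command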

We use this trick to submit some PySpark scripts1 too:

#!/usr/bin/env bash
# encoding=utf-8
# Shell commands follow
# Next line is bilingual: it starts a comment in Python, and is a no-op in shell
""":"
# exec replaces the current process;
# this shell script is run with the source command,
# and spark-submit only treats .py files as Python scripts,
# so we copy this file to a .py file first
cp "${BASH_SOURCE[0]}" "${BASH_SOURCE[0]}.py"
exec spark-submit \
    --num-executors 20 --executor-memory 9g \
    --executor-cores 8 --driver-memory 3g \
    --conf "spark.pyspark.python=/usr/bin/python3.6" \
    --conf "spark.pyspark.driver.python=/usr/bin/python3.6" \
    "${BASH_SOURCE[0]}.py"
":"""
# Previous line is bilingual: it ends a comment in Python, and is a no-op in shell
# Shell commands end here
# Python script follows
import os, sys
os.environ['PYSPARK_PYTHON'] = "/usr/bin/python3.6"
os.environ['PYSPARK_DRIVER_PYTHON'] = "/usr/bin/python3.6"
import json
import requests
from datetime import datetime
from pyspark import *
from pyspark.sql import *
from pyspark.sql.types import *
from pyspark.sql.functions import *
sparkSession = SparkSession.builder \
      .master('yarn') \
      .appName("tmp_data_preprocessing") \
      .config("hive.exec.dynamic.partition", "true") \
      .config("hive.exec.dynamic.partition.mode", "nonstrict") \
      .enableHiveSupport() \
      .getOrCreate()
d_sample = sparkSession.sql(
"""
-- the real pyspark sql was removed
select 1
""")
fields = d_sample.schema.fields
print(fields)
d_sample = d_sample.toJSON()
def to_test_sql(x):
    d = json.loads(x)
    return "SELECT " + ", ".join(
        # None should print as null, and without quotes
        ("cast({value!%s} as {type}) as {name}" % ("r" if d.get(f.name) is not None else "s")).format(
            value=d.get(f.name) if d.get(f.name) is not None else 'null',
            name=f.name,
            type=f.dataType.typeName())
        for f in fields
        # for idx, (name, value) in enumerate(d.items())
    )
test_sqls = d_sample.map(to_test_sql).collect()
print(" UNION ".join(test_sqls))

Footnotes


  1. Here is a PySpark script for exporting Hive table data as SQL for local environment testing.