This page is synchronized from doc/Pandas.md. Last modified on 2025-12-09 00:30 CET by Trase Admin.
Pandas
Here are a few tips to make your Pandas code more robust and readable!
Always prefer column operations to iterrows and friends
Pandas is exceptionally fast if you do operations on columns.
You will also usually find that your code is shorter, more readable, and less prone to ordering bugs if you stick to working on whole columns.
The Series class has a huge range of functions that you can run: check them out!
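As a minimal sketch of the difference, the example below computes the same column twice: once with iterrows and once as a single column operation. The dataframe and its column names (price, quantity, total) are made up for illustration.

```python
import pandas as pd

# Hypothetical data, purely for illustration
df = pd.DataFrame({"price": [2.0, 3.5, 4.0], "quantity": [10, 4, 5]})

# Row-by-row version: more code, slower, and the result depends on
# appending values in exactly the same order as the rows
totals = []
for _, row in df.iterrows():
    totals.append(row["price"] * row["quantity"])
df_loop = df.assign(total=totals)

# Column version: one expression applied to whole columns at once
df_vec = df.assign(total=df["price"] * df["quantity"])
```

Both versions produce the same dataframe, but the column version is a single line and leaves no room for the rows and the results to get out of step.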
Avoid NaNs
Missing data in Pandas dataframes, and NaNs in Python generally, are very weird objects.
Their behaviour is often unclear or surprising (see The Weird World of Missing Values in Pandas).
Another example is that NaN is never equal to NaN, except when using pd.merge.
It's best to avoid them!
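The equality behaviour can be seen in a small sketch. The dataframes and column names here (key, left_value, right_value) are hypothetical; the point is only that NaN fails an equality test but NaN keys are still matched to each other by pd.merge.

```python
import numpy as np
import pandas as pd

# NaN is not equal to itself, in plain Python and in a Series comparison
nan = float("nan")
nan_equals_itself = nan == nan

s = pd.Series([1.0, np.nan])
elementwise = (s == s).tolist()

# ...but pd.merge treats NaN keys on both sides as a match
left = pd.DataFrame({"key": [np.nan, 1.0], "left_value": [1, 2]})
right = pd.DataFrame({"key": [np.nan, 2.0], "right_value": [10, 20]})
merged = pd.merge(left, right, on="key")
```

Here merged contains a single row, the one where both keys are NaN, even though a direct comparison of those keys returns False.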
They can often creep in if you use pd.read_csv: unless you explicitly tell it not to, it will try to auto-detect missing values. This is particularly sad for Norway, with ISO code "NA"!
Always pass keep_default_na=False to this function.
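A small sketch of the Norway problem, reading an in-memory CSV so that it runs without any files. The CSV contents and column names (country, iso) are invented for the example.

```python
import io

import pandas as pd

# Hypothetical CSV contents; io.StringIO stands in for a real file
csv_text = "country,iso\nNorway,NA\nBrazil,BR\n"

# Default behaviour: the string "NA" is silently converted to NaN
default = pd.read_csv(io.StringIO(csv_text))

# With keep_default_na=False the value is kept as the string "NA"
explicit = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
```

In the default read, Norway's ISO code has become a missing value; in the explicit read it is still the string "NA".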
Avoid auto-detection
It is generally better to be explicit than to try to infer behaviour from the data. Explicit code is more robust to errors and easier to read. That means:
- Do not auto-detect encodings in code: instead pass the encoding in explicitly, e.g.
pd.read_csv(encoding="latin-1", ...)
- Do not auto-detect CSV separators: pass the separator in explicitly, e.g.
pd.read_csv(sep=",", ...)
- Do not auto-detect missing values: handle these cases in your code explicitly
- Do not auto-detect types: cast explicitly to the types you are expecting
- Explicitly select the columns you are expecting rather than just reading them all in, e.g.
pd.read_csv(usecols=["country"], ...)
If you use pd.read_csv or similar, always stick to these defaults:
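import pandas as pd
# Read everything as strings, with explicit separator, encoding and columns
df = pd.read_csv(filename, sep=";", dtype=str, encoding="latin-1", keep_default_na=False, usecols=["exporter", "vol"])
# Handle missing values explicitly: drop rows whose exporter is the literal string "NA"
df = df[df["exporter"] != "NA"].copy()
df = df.astype({"vol": int})  # cast explicitly to the expected type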
Also see our documentation on CSV Files in Trase.
Learn to use groupby and merge
They will make your code a lot faster and more readable!
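As a rough illustration, the sketch below sums volumes per exporter with groupby and then attaches a country to each exporter with merge. The dataframes, the country column and its values are all hypothetical.

```python
import pandas as pd

# Hypothetical data, purely for illustration
flows = pd.DataFrame(
    {"exporter": ["A", "B", "A", "C"], "vol": [10, 5, 7, 3]}
)
exporters = pd.DataFrame(
    {"exporter": ["A", "B", "C"], "country": ["BR", "BR", "AR"]}
)

# groupby: total volume per exporter, with no explicit loop
vol_per_exporter = flows.groupby("exporter", as_index=False)["vol"].sum()

# merge: look up the country of each exporter; validate raises an error
# if either side unexpectedly contains duplicate keys
result = pd.merge(
    vol_per_exporter,
    exporters,
    on="exporter",
    how="left",
    validate="one_to_one",
)
```

Passing validate is optional, but it fits the advice above about being explicit: the assumption that each exporter appears once on each side is checked rather than assumed.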