scipy.stats.nanmedian(array, axis=0)
calculates the median of the array elements along the specified axis, ignoring NaN (not-a-number) values.
Parameters:
array: input array or object containing the elements, possibly including NaN values, whose median is to be calculated.
axis: axis along which the median is computed. By default, axis = 0.
Returns: median of the array elements (ignoring NaN values) based on the set parameters.
Code # 1:
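The first code block was lost in formatting; below is a minimal sketch that reproduces the output shown. It uses np.nanmedian, since scipy.stats.nanmedian has been removed from modern SciPy and NumPy's version is the drop-in replacement (the input array here is illustrative, not the original):

```python
import numpy as np

# A 1-D array containing a NaN value (illustrative data, not the original)
arr = [3, 1, 5, np.nan]

# np.nanmedian ignores the NaN; np.median does not
print("median using nanmedian:", np.nanmedian(arr))          # 3.0
print("median without handling nan value:", np.median(arr))  # nan
```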

Output:
median using nanmedian: 3.0
median without handling nan value: nan
Code # 2: with multi-dimensional data
# median
from scipy import median
from scipy.stats import nanmedian
import numpy as np

arr1 = [[1, 3, 27],
        [3, np.nan, 6],
        [np.nan, 6, 3],
        [3, 6, np.nan]]

print("median is:", median(arr1))
print("median handling nan:", nanmedian(arr1))

# using axis = 0
print("median is with axis = 0:", median(arr1, axis=0))
print("median handling nan with axis = 0:", nanmedian(arr1, axis=0))

# using axis = 1
print("median is with axis = 1:", median(arr1, axis=1))
print("median handling nan with axis = 1:", nanmedian(arr1, axis=1))
Output:
median is: nan
median handling nan: 3.0
median is with axis = 0: [nan nan nan]
median handling nan with axis = 0: [3. 6. 6.]
median is with axis = 1: [3. nan nan nan]
median handling nan with axis = 1: [3. 4.5 4.5 4.5]
How do you find the median of a list in Python? The list can be of any size and the numbers are not guaranteed to be in any particular order.
If the list contains an even number of elements, the function should return the average of the middle two.
Here are some examples (sorted for display purposes):
median([1]) == 1
median([1, 1]) == 1
median([1, 1, 2, 4]) == 1.5
median([0, 2, 5, 6, 8, 9, 9]) == 6
median([0, 0, 0, 0, 4, 4, 6, 8]) == 2
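A straightforward implementation that satisfies all of these cases (sort first, then take the middle element, or average the middle two for even lengths) might look like this sketch:

```python
def median(lst):
    """Return the median of a list of numbers, given in any order."""
    s = sorted(lst)
    n = len(s)
    mid = n // 2
    if n % 2 == 1:
        return s[mid]                    # odd length: the middle element
    return (s[mid - 1] + s[mid]) / 2.0   # even length: average of middle two

print(median([1]))                    # 1
print(median([1, 1, 2, 4]))           # 1.5
print(median([0, 2, 5, 6, 8, 9, 9]))  # 6
```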
The simplest way to get row counts per group is by calling .size()
, which returns a Series
:
df.groupby(["col1","col2"]).size()
Usually you want this result as a DataFrame
(instead of a Series
) so you can do:
df.groupby(["col1", "col2"]).size().reset_index(name="counts")
If you want to find out how to calculate the row counts and other statistics for each group, continue reading below.
Consider the following example dataframe:
In [2]: df
Out[2]:
  col1 col2  col3  col4  col5  col6
0    A    B  0.20 -0.61  0.49  1.49
1    A    B -1.53 -1.01  0.39  1.82
2    A    B -0.44  0.27  0.72  0.11
3    A    B  0.28 -1.32  0.38  0.18
4    C    D -0.12  0.59  0.81  0.66
5    C    D  0.13 -1.65  1.64  0.50
6    C    D  1.42 -0.11  0.18  0.44
7    E    F  0.00  1.42  0.26  1.17
8    E    F  0.91 -0.47  1.35  0.34
9    G    H  1.48 -0.63  1.14  0.17
First let's use .size() to get the row counts:
In [3]: df.groupby(["col1", "col2"]).size()
Out[3]:
col1 col2
A B 4
C D 3
E F 2
G H 1
dtype: int64
Then let's use .size().reset_index(name="counts") to get the row counts:
In [4]: df.groupby(["col1", "col2"]).size().reset_index(name="counts")
Out[4]:
col1 col2 counts
0 A B 4
1 C D 3
2 E F 2
3 G H 1
When you want to calculate statistics on grouped data, it usually looks like this:
In [5]: (df
...: .groupby(["col1", "col2"])
...: .agg({
...: "col3": ["mean", "count"],
...: "col4": ["median", "min", "count"]
...: }))
Out[5]:
            col4             col3
          median   min count     mean count
col1 col2
A    B    -0.810 -1.32     4 -0.372500     4
C    D    -0.110 -1.65     3  0.476667     3
E    F     0.475 -0.47     2  0.455000     2
G    H    -0.630 -0.63     1  1.480000     1
The result above is a little annoying to deal with because of the nested column labels, and also because the row counts are on a per-column basis.
To gain more control over the output I usually split the statistics into individual aggregations that I then combine using join
. It looks like this:
In [6]: gb = df.groupby(["col1", "col2"])
...: counts = gb.size().to_frame(name="counts")
...: (counts
...: .join(gb.agg({"col3": "mean"}).rename(columns={"col3": "col3_mean"}))
...: .join(gb.agg({"col4": "median"}).rename(columns={"col4": "col4_median"}))
...: .join(gb.agg({"col4": "min"}).rename(columns={"col4": "col4_min"}))
...: .reset_index()
...: )
...:
Out[6]:
  col1 col2  counts  col3_mean  col4_median  col4_min
0    A    B       4  -0.372500       -0.810     -1.32
1    C    D       3   0.476667       -0.110     -1.65
2    E    F       2   0.455000        0.475     -0.47
3    G    H       1   1.480000       -0.630     -0.63
The code used to generate the test data is shown below:
In [1]: import numpy as np
...: import pandas as pd
...:
...: keys = np.array([
...: ["A", "B"],
...: ["A", "B"],
...: ["A", "B"],
...: ["A", "B"],
...: ["C", "D"],
...: ["C", "D"],
...: ["C", "D"],
...: ["E", "F"],
...: ["E", "F"],
...: ["G", "H"]
...: ])
...:
...: df = pd.DataFrame(
...: np.hstack([keys,np.random.randn(10,4).round(2)]),
...: columns = ["col1", "col2", "col3", "col4", "col5", "col6"]
...: )
...:
   ...: df[["col3", "col4", "col5", "col6"]] = \
   ...:     df[["col3", "col4", "col5", "col6"]].astype(float)
...:
Disclaimer:
If some of the columns that you are aggregating have null values, then you really want to be looking at the group row counts as an independent aggregation for each column. Otherwise you may be misled as to how many records are actually being used to calculate things like the mean because pandas will drop NaN
entries in the mean calculation without telling you about it.
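A small illustration of that pitfall, on hypothetical data: "count" only counts non-null entries, so pairing each statistic with its own per-column count exposes how many values actually fed the mean:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": ["A", "A", "A", "B"],
    "val": [1.0, np.nan, 3.0, 4.0],
})

out = df.groupby("key").agg(
    counts=("val", "size"),      # rows in the group, NaN included
    val_mean=("val", "mean"),    # NaN entries silently dropped here
    val_count=("val", "count"),  # non-null rows actually used by the mean
)
print(out)
# group A has 3 rows, but its mean is computed from only 2 non-null values
```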
To begin, note that quantiles is just the most general term for things like percentiles, quartiles, and medians. You specified five bins in your example, so you are asking qcut
for quintiles.
So, when you ask for quintiles with qcut, the bins will be chosen so that you have the same number of records in each bin. You have 30 records, so you should have 6 in each bin (your output should look like this, although the breakpoints will differ due to the random draw):
pd.qcut(factors, 5).value_counts()
[-2.578, -0.829]    6
(-0.829, -0.36]     6
(-0.36, 0.366]      6
(0.366, 0.868]      6
(0.868, 2.617]      6
Conversely, for cut
you will see something more uneven:
pd.cut(factors, 5).value_counts()
(-2.583, -1.539]    5
(-1.539, -0.5]      5
(-0.5, 0.539]       9
(0.539, 1.578]      9
(1.578, 2.617]      2
That's because cut will choose the bins to be evenly spaced according to the values themselves and not the frequency of those values. Hence, because you drew from a random normal, you'll see higher frequencies in the inner bins and fewer in the outer. This is essentially going to be a tabular form of a histogram (which you would expect to be fairly bell-shaped with 30 records).
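The contrast can be sketched directly (with a fixed seed so the draw is reproducible; the exact breakpoints will differ from the ones above):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
factors = rng.standard_normal(30)

# qcut: equal-frequency bins -> every quintile holds the same number of records
print(pd.qcut(factors, 5).value_counts())  # six records per bin

# cut: equal-width bins -> the inner bins of a normal draw hold more records
print(pd.cut(factors, 5).value_counts())
```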
Here are some benchmarks for the various answers to this question. There were some surprising results, including wildly different performance depending on the string being tested.
Some functions were modified to work with Python 3 (mainly by replacing /
with //
to ensure integer division). If you see something wrong, want to add your function, or want to add another test string, ping @ZeroPiraeus in the Python chatroom.
In summary: there's about a 50x difference between the best- and worst-performing solutions for the large set of example data supplied by the OP here (via this comment). David Zhang's solution is the clear winner, outperforming all others by around 5x for the large example set.
A couple of the answers are very slow in extremely large "no match" cases. Otherwise, the functions seem to be equally matched or clear winners depending on the test.
Here are the results, including plots made using matplotlib and seaborn to show the different distributions:
Corpus 1 (supplied examples, small set)
mean performance:
0.0003 david_zhang
0.0009 zero
0.0013 antti
0.0013 tigerhawk_2
0.0015 carpetpython
0.0029 tigerhawk_1
0.0031 davidism
0.0035 saksham
0.0046 shashank
0.0052 riad
0.0056 piotr
median performance:
0.0003 david_zhang
0.0008 zero
0.0013 antti
0.0013 tigerhawk_2
0.0014 carpetpython
0.0027 tigerhawk_1
0.0031 davidism
0.0038 saksham
0.0044 shashank
0.0054 riad
0.0058 piotr
Corpus 2 (supplied examples, large set)
mean performance:
0.0006 david_zhang
0.0036 tigerhawk_2
0.0036 antti
0.0037 zero
0.0039 carpetpython
0.0052 shashank
0.0056 piotr
0.0066 davidism
0.0120 tigerhawk_1
0.0177 riad
0.0283 saksham
median performance:
0.0004 david_zhang
0.0018 zero
0.0022 tigerhawk_2
0.0022 antti
0.0024 carpetpython
0.0043 davidism
0.0049 shashank
0.0055 piotr
0.0061 tigerhawk_1
0.0077 riad
0.0109 saksham
Corpus 3 (edge cases)
mean performance:
0.0123 shashank
0.0375 david_zhang
0.0376 piotr
0.0394 carpetpython
0.0479 antti
0.0488 tigerhawk_2
0.2269 tigerhawk_1
0.2336 davidism
0.7239 saksham
3.6265 zero
6.0111 riad
median performance:
0.0107 tigerhawk_2
0.0108 antti
0.0109 carpetpython
0.0135 david_zhang
0.0137 tigerhawk_1
0.0150 shashank
0.0229 saksham
0.0255 piotr
0.0721 davidism
0.1080 zero
1.8539 riad
The tests and raw results are available here.
To understand what yield
does, you must understand what generators are. And before you can understand generators, you must understand iterables.
When you create a list, you can read its items one by one. Reading its items one by one is called iteration:
>>> mylist = [1, 2, 3]
>>> for i in mylist:
... print(i)
1
2
3
mylist
is an iterable. When you use a list comprehension, you create a list, and so an iterable:
>>> mylist = [x*x for x in range(3)]
>>> for i in mylist:
... print(i)
0
1
4
Everything you can use "for... in..." on is an iterable: lists, strings, files...
These iterables are handy because you can read them as much as you wish, but you store all the values in memory and this is not always what you want when you have a lot of values.
Generators are iterators, a kind of iterable you can only iterate over once. Generators do not store all the values in memory, they generate the values on the fly:
>>> mygenerator = (x*x for x in range(3))
>>> for i in mygenerator:
... print(i)
0
1
4
It is just the same except you used () instead of []. BUT, you cannot perform for i in mygenerator a second time, since generators can only be used once: they calculate 0, then forget about it and calculate 1, and finish by calculating 4, one by one.
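You can see this one-shot behaviour directly: a second pass over the same generator produces nothing, because it is already exhausted:

```python
mygenerator = (x * x for x in range(3))
print(list(mygenerator))  # [0, 1, 4]
print(list(mygenerator))  # [] -- the generator is already exhausted
```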
yield is a keyword that is used like return, except the function will return a generator.
>>> def create_generator():
... mylist = range(3)
... for i in mylist:
... yield i*i
...
>>> mygenerator = create_generator() # create a generator
>>> print(mygenerator) # mygenerator is an object!
<generator object create_generator at 0xb7555c34>
>>> for i in mygenerator:
... print(i)
0
1
4
Here it's a useless example, but it's handy when you know your function will return a huge set of values that you will only need to read once.
To master yield, you must understand that when you call the function, the code you have written in the function body does not run. The function only returns the generator object; this is a bit tricky.
Then, your code will continue from where it left off each time for
uses the generator.
Now the hard part:
The first time the for calls the generator object created from your function, it will run the code in your function from the beginning until it hits yield, then it'll return the first value of the loop. Then, each subsequent call will run another iteration of the loop you have written in the function and return the next value. This will continue until the generator is considered empty, which happens when the function runs without hitting yield. That can be because the loop has come to an end, or because you no longer satisfy an "if/else".
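Stepping through by hand with next() makes this visible: each call resumes the function body until the next yield, and StopIteration signals that the body finished without yielding again:

```python
def create_generator():
    for i in range(3):
        yield i * i

gen = create_generator()  # no body code has run yet
print(next(gen))  # 0 -- runs the body until the first yield
print(next(gen))  # 1 -- resumes after the yield, loops once more
print(next(gen))  # 4
try:
    next(gen)     # the loop is finished: the generator is now empty
except StopIteration:
    print("empty")
```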
Generator:
# Here you create the method of the node object that will return the generator
def _get_child_candidates(self, distance, min_dist, max_dist):
    # Here is the code that will be called each time you use the generator object:
    # If there is still a child of the node object on its left
    # AND if the distance is ok, return the next child
    if self._leftchild and distance - max_dist < self._median:
        yield self._leftchild
    # If there is still a child of the node object on its right
    # AND if the distance is ok, return the next child
    if self._rightchild and distance + max_dist >= self._median:
        yield self._rightchild
    # If the function arrives here, the generator will be considered empty;
    # there are no more than two values: the left and the right children
Caller:
# Create an empty list and a list with the current object reference
result, candidates = list(), [self]
# Loop on candidates (they contain only one element at the beginning)
while candidates:
    # Get the last candidate and remove it from the list
    node = candidates.pop()
    # Get the distance between obj and the candidate
    distance = node._get_dist(obj)
    # If distance is ok, then you can fill the result
    if distance <= max_dist and distance >= min_dist:
        result.extend(node._values)
    # Add the children of the candidate to the candidates' list
    # so the loop will keep running until it has looked
    # at all the children of the children of the children, etc. of the candidate
    candidates.extend(node._get_child_candidates(distance, min_dist, max_dist))
return result
This code contains several smart parts:
The loop iterates on a list, but the list expands while the loop is being iterated. It's a concise way to go through all these nested data even if it's a bit dangerous, since you can end up with an infinite loop. In this case, candidates.extend(node._get_child_candidates(distance, min_dist, max_dist)) exhausts all the values of the generator, but while keeps creating new generator objects which will produce different values from the previous ones since it's not applied to the same node.
The extend()
method is a list object method that expects an iterable and adds its values to the list.
Usually we pass a list to it:
>>> a = [1, 2]
>>> b = [3, 4]
>>> a.extend(b)
>>> print(a)
[1, 2, 3, 4]
But in your code, it gets a generator, which is good because:
- You don't need to read the values twice.
- You may have a lot of children and you don't want them all stored in memory.
And it works because Python does not care if the argument of a method is a list or not. Python expects iterables, so it will work with strings, lists, tuples, and generators! This is called duck typing and is one of the reasons why Python is so cool. But this is another story, for another question...
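For instance, extend() happily consumes a generator expression or a string, not just a list (a quick sketch):

```python
a = [1, 2]
a.extend(x * x for x in range(3))  # a generator expression works...
a.extend("hi")                     # ...and so does any other iterable
print(a)  # [1, 2, 0, 1, 4, 'h', 'i']
```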
You can stop here, or read a little bit to see an advanced use of a generator:
>>> class Bank(): # Let's create a bank, building ATMs
... crisis = False
... def create_atm(self):
... while not self.crisis:
... yield "$100"
>>> hsbc = Bank() # When everything's ok the ATM gives you as much as you want
>>> corner_street_atm = hsbc.create_atm()
>>> print(corner_street_atm.next())
$100
>>> print(corner_street_atm.next())
$100
>>> print([corner_street_atm.next() for cash in range(5)])
["$100", "$100", "$100", "$100", "$100"]
>>> hsbc.crisis = True # Crisis is coming, no more money!
>>> print(corner_street_atm.next())
<type "exceptions.StopIteration">
>>> wall_street_atm = hsbc.create_atm() # It's even true for new ATMs
>>> print(wall_street_atm.next())
<type "exceptions.StopIteration">
>>> hsbc.crisis = False # The trouble is, even post-crisis the ATM remains empty
>>> print(corner_street_atm.next())
<type "exceptions.StopIteration">
>>> brand_new_atm = hsbc.create_atm() # Build a new one to get back in business
>>> for cash in brand_new_atm:
... print cash
$100
$100
$100
$100
$100
$100
$100
$100
$100
...
Note: for Python 3, use print(corner_street_atm.__next__()) or print(next(corner_street_atm))
It can be useful for various things like controlling access to a resource.
The itertools module contains special functions to manipulate iterables. Ever wish to duplicate a generator?
Chain two generators? Group values in a nested list with a one-liner? Map / zip without creating another list?
Then just import itertools
.
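For instance, itertools.tee duplicates a generator and itertools.chain glues two together (a quick sketch on throwaway generators):

```python
import itertools

gen = (x * x for x in range(3))
a, b = itertools.tee(gen)   # two independent copies of one generator
print(list(a), list(b))     # [0, 1, 4] [0, 1, 4]

evens = (x for x in range(4) if x % 2 == 0)
odds = (x for x in range(4) if x % 2 == 1)
print(list(itertools.chain(evens, odds)))  # [0, 2, 1, 3]
```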
An example? Let's see the possible orders of arrival for a four-horse race:
>>> horses = [1, 2, 3, 4]
>>> races = itertools.permutations(horses)
>>> print(races)
<itertools.permutations object at 0xb754f1dc>
>>> print(list(itertools.permutations(horses)))
[(1, 2, 3, 4),
(1, 2, 4, 3),
(1, 3, 2, 4),
(1, 3, 4, 2),
(1, 4, 2, 3),
(1, 4, 3, 2),
(2, 1, 3, 4),
(2, 1, 4, 3),
(2, 3, 1, 4),
(2, 3, 4, 1),
(2, 4, 1, 3),
(2, 4, 3, 1),
(3, 1, 2, 4),
(3, 1, 4, 2),
(3, 2, 1, 4),
(3, 2, 4, 1),
(3, 4, 1, 2),
(3, 4, 2, 1),
(4, 1, 2, 3),
(4, 1, 3, 2),
(4, 2, 1, 3),
(4, 2, 3, 1),
(4, 3, 1, 2),
(4, 3, 2, 1)]
Iteration is a process implying iterables (implementing the __iter__() method) and iterators (implementing the __next__() method).
Iterables are any objects you can get an iterator from. Iterators are objects that let you iterate on iterables.
There is more about it in this article about how for
loops work.
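You can drive the protocol by hand: iter() calls __iter__() to get an iterator, and next() calls that iterator's __next__():

```python
mylist = [1, 2, 3]
iterator = iter(mylist)  # calls mylist.__iter__() and returns an iterator
print(next(iterator))    # 1 -- calls iterator.__next__()
print(next(iterator))    # 2
print(next(iterator))    # 3
```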
You might be interested in the SciPy Stats package. It has the percentile function you're after and many other statistical goodies.
percentile()
is available in numpy
too.
import numpy as np
a = np.array([1,2,3,4,5])
p = np.percentile(a, 50) # return the 50th percentile, i.e. the median
print(p)
3.0
This ticket leads me to believe they won't be integrating percentile() into numpy anytime soon.
into numpy anytime soon.
Python 3.4 has statistics.median
:
Return the median (middle value) of numeric data.
When the number of data points is odd, return the middle data point. When the number of data points is even, the median is interpolated by taking the average of the two middle values:
>>> median([1, 3, 5])
3
>>> median([1, 3, 5, 7])
4.0
Usage:
import statistics
items = [6, 1, 8, 2, 3]
statistics.median(items)
#>>> 3
It's pretty careful with types, too:
statistics.median(map(float, items))
#>>> 3.0
from decimal import Decimal
statistics.median(map(Decimal, items))
#>>> Decimal("3")
Something important when dealing with outliers is that one should try to use estimators that are as robust as possible. The mean of a distribution will be biased by outliers, but the median, for example, will be much less so.
Building on eumiro's answer:
def reject_outliers(data, m=2.):
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / mdev if mdev else 0.
    return data[s < m]
Here I have replaced the mean with the more robust median, and the standard deviation with the median absolute distance to the median. I then scaled the distances by their (again) median value so that m is on a reasonable relative scale.
Note that for the data[s<m]
syntax to work, data
must be a numpy array.
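A quick usage sketch of the function above on a numpy array with one obvious outlier (the mdev == 0 fallback is changed to an array of zeros here so the boolean mask keeps its shape):

```python
import numpy as np

def reject_outliers(data, m=2.):
    # distance of each point from the median, scaled by the median of those distances
    d = np.abs(data - np.median(data))
    mdev = np.median(d)
    s = d / mdev if mdev else np.zeros(len(d))
    return data[s < m]

data = np.array([1., 2., 3., 4., 5., 100.])
print(reject_outliers(data))  # [1. 2. 3. 4. 5.] -- the 100. is rejected
```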
>>> k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]
>>> import itertools
>>> k.sort()
>>> list(k for k,_ in itertools.groupby(k))
[[1, 2], [3], [4], [5, 6, 2]]
itertools often offers the fastest and most powerful solutions to this kind of problem, and is well worth getting intimately familiar with!
Edit: as I mention in a comment, normal optimization efforts are focused on large inputs (the big-O approach) because it's so much easier that it offers good returns on effort. But sometimes (essentially for "tragically crucial bottlenecks" in deep inner loops of code that's pushing the boundaries of performance limits) one may need to go into much more detail, providing probability distributions, deciding which performance measures to optimize (maybe the upper bound or the 90th centile is more important than an average or median, depending on one's apps), performing possibly-heuristic checks at the start to pick different algorithms depending on input data characteristics, and so forth.
Careful measurements of "point" performance (code A vs code B for a specific input) are a part of this extremely costly process, and the standard library module timeit helps here. However, it's easier to use it at a shell prompt. For example, here's a short module to showcase the general approach for this problem; save it as nodup.py:
import itertools

k = [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]

def doset(k, map=map, list=list, set=set, tuple=tuple):
    return map(list, set(map(tuple, k)))

def dosort(k, sorted=sorted, xrange=xrange, len=len):
    ks = sorted(k)
    return [ks[i] for i in xrange(len(ks)) if i == 0 or ks[i] != ks[i-1]]

def dogroupby(k, sorted=sorted, groupby=itertools.groupby, list=list):
    ks = sorted(k)
    return [i for i, _ in itertools.groupby(ks)]

def donewk(k):
    newk = []
    for i in k:
        if i not in newk:
            newk.append(i)
    return newk

# sanity check that all functions compute the same result and don't alter k
if __name__ == '__main__':
    savek = list(k)
    for f in doset, dosort, dogroupby, donewk:
        resk = f(k)
        assert k == savek
        print '%10s %s' % (f.__name__, sorted(resk))
Note the sanity check (performed when you just do python nodup.py
) and the basic hoisting technique (make constant global names local to each function for speed) to put things on equal footing.
Now we can run checks on the tiny example list:
$ python -mtimeit -s'import nodup' 'nodup.doset(nodup.k)'
100000 loops, best of 3: 11.7 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dosort(nodup.k)'
100000 loops, best of 3: 9.68 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dogroupby(nodup.k)'
100000 loops, best of 3: 8.74 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.donewk(nodup.k)'
100000 loops, best of 3: 4.44 usec per loop
confirming that the quadratic approach has small-enough constants to make it attractive for tiny lists with few duplicated values. With a short list without duplicates:
$ python -mtimeit -s'import nodup' 'nodup.donewk([[i] for i in range(12)])'
10000 loops, best of 3: 25.4 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dogroupby([[i] for i in range(12)])'
10000 loops, best of 3: 23.7 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.doset([[i] for i in range(12)])'
10000 loops, best of 3: 31.3 usec per loop
$ python -mtimeit -s'import nodup' 'nodup.dosort([[i] for i in range(12)])'
10000 loops, best of 3: 25 usec per loop
the quadratic approach isn't bad, but the sort and groupby ones are better. Etc, etc.
If (as the obsession with performance suggests) this operation is at a core inner loop of your pushing-the-boundaries application, it's worth trying the same set of tests on other representative input samples, possibly detecting some simple measure that could heuristically let you pick one or the other approach (but the measure must be fast, of course).
It's also well worth considering keeping a different representation for k: why does it have to be a list of lists rather than a set of tuples in the first place? If the duplicate removal task is frequent, and profiling shows it to be the program's performance bottleneck, keeping a set of tuples all the time and getting a list of lists from it only if and where needed might be faster overall, for example.
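A sketch of that alternative representation: keep a set of tuples as the canonical store (so duplicates can never get in) and materialize a list of lists only at the boundary where it is required:

```python
# canonical store: a set of tuples -- duplicates can never get in
store = set()
for item in [[1, 2], [4], [5, 6, 2], [1, 2], [3], [4]]:
    store.add(tuple(item))

# materialize a list of lists only where a list is actually needed
as_lists = [list(t) for t in sorted(store)]
print(as_lists)  # [[1, 2], [3], [4], [5, 6, 2]]
```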
(Works with python2.x):
def median(lst):
    n = len(lst)
    s = sorted(lst)
    return (sum(s[n//2-1:n//2+1])/2.0, s[n//2])[n % 2] if n else None
>>> median([5, 5, 3, 4, 0, 1])
3.5
>>> from numpy import median
>>> median([1, 4, 1, 1, 1, 3])
1.0
For python3.x, use statistics.median
:
>>> from statistics import median
>>> median([5, 2, 3, 8, 9, 2])
4.0
Levenshtein Python extension and C library.
https://github.com/ztane/python-Levenshtein/
The Levenshtein Python C extension module contains functions for fast computation of:
- Levenshtein (edit) distance, and edit operations
- string similarity
- approximate median strings, and generally string averaging
- string sequence and set similarity
It supports both normal and Unicode strings.
$ pip install python-Levenshtein
...
$ python
>>> import Levenshtein
>>> help(Levenshtein.ratio)
ratio(...)
Compute similarity of two strings.
ratio(string1, string2)
The similarity is a number between 0 and 1; it's usually equal to or
somewhat higher than difflib.SequenceMatcher.ratio(), because it's
based on real minimal edit distance.
Examples:
>>> ratio("Hello world!", "Holly grail!")
0.58333333333333337
>>> ratio("Brian", "Jesus")
0.0
>>> help(Levenshtein.distance)
distance(...)
Compute absolute Levenshtein distance of two strings.
distance(string1, string2)
Examples (it's hard to spell Levenshtein correctly):
>>> distance("Levenshtein", "Lenvinsten")
4
>>> distance("Levenshtein", "Levensthein")
2
>>> distance("Levenshtein", "Levenshten")
1
>>> distance("Levenshtein", "Levenshtein")
0