Is Adam's optimization susceptible to Local Minima?2019 Community Moderator ElectionWhich Optimization method to use?Can overfitting occur in Advanced Optimization algorithms?Why is vanishing gradient a problem?Does a neural network continue to change after SGD stops improving?local minima vs saddle points in deep learningNeural Network: how to interpret this loss graph?Linear Regression OptimizationWhy are optimization algorithms slower at critical points?How does Gradient Descent and Backpropagation work together?Understanding general approach to updating optimization function parameters
What defenses are there against being summoned by the Gate spell?
Do Phineas and Ferb ever actually get busted in real time?
Do any Labour MPs support no-deal?
Representing power series as a function - what to do with the constant after integration?
What do you call a Matrix-like slowdown and camera movement effect?
How can I fix this gap between bookcases I made?
Download, install and reboot computer at night if needed
Today is the Center
Email Account under attack (really) - anything I can do?
Find original functions from a composite function
Should I join office cleaning event for free?
I probably found a bug with the sudo apt install function
What is the offset in a seaplane's hull?
A Journey Through Space and Time
Why Is Death Allowed In the Matrix?
DOS, create pipe for stdin/stdout of command.com(or 4dos.com) in C or Batch?
How can the DM most effectively choose 1 out of an odd number of players to be targeted by an attack or effect?
Copenhagen passport control - US citizen
How can I hide my bitcoin transactions to protect anonymity from others?
Is it possible to do 50 km distance without any previous training?
When blogging recipes, how can I support both readers who want the narrative/journey and ones who want the printer-friendly recipe?
Why linear maps act like matrix multiplication?
Suffixes -unt and -ut-
Disadvantages of online checking accounts?
Is Adam's optimization susceptible to Local Minima?
2019 Community Moderator ElectionWhich Optimization method to use?Can overfitting occur in Advanced Optimization algorithms?Why is vanishing gradient a problem?Does a neural network continue to change after SGD stops improving?local minima vs saddle points in deep learningNeural Network: how to interpret this loss graph?Linear Regression OptimizationWhy are optimization algorithms slower at critical points?How does Gradient Descent and Backpropagation work together?Understanding general approach to updating optimization function parameters
$begingroup$
# Neural Network Architecture
no_hid_layers = 1
hid = 3
no_out = 1
# Xavier Ininitialization of weights w
w1 = np.random.randn(hid, n+1)*np.sqrt(2/(hid+n+1))
w2 = np.random.randn(no_out, hid+1)*np.sqrt(2/(no_out+hid+1))
# Sigmoid Activation Function
def g(x):
sig = 1/(1+np.exp(-x))
return sig
def frwrd_prop(X, w1, w2):
z2 = w1 @ X.T
z2 = norm(z2, axis=0)
a2 = np.insert(g(z2), 0, 1, axis=0)
h = g((w2@a2))
return (h,a2)
# Calculating Cost and Gradient
def Cost(X, y, w1, w2, lmbda=0):
# Initializing Cost J and Gradients dw
J = 0
dw1 = np.zeros(w1.shape)
dw2 = np.zeros(w2.shape)
# Forward Propagation to calculate the value of the output
h, a2 = frwrd_prop(X, w1, w2)
# Calculate the Cost Function J
J = -(np.sum(y.T*np.log(h) + (1-y).T*np.log(1-h)) - lmbda/2*(np.sum(np.sum(w1[:,1:].T@w1[:,1:])) + np.sum(w2[:,1:].T@w2[:,1:])))/m
# Applying Back Propagation to calculate the Gradients dw
D3 = h-y
D2 = (w2.T@D3)*a2*(1-a2)
dw1[:,0] = (D2[1:]@X)[:,0]/m
dw2[:,0] = (D3@a2.T)[:,0]/m
dw1[:, 1:] = ((D2[1:]@X)[:,1:] + lmbda*w1[:,1:])/m
dw2[:, 1:] = ((D3@a2.T)[:,1:] + lmbda*w2[:,1:])/m
# Gradient clipping
if(abs(np.linalg.norm(dw1))>4.5):
dw1 = dw1*4.5/(np.linalg.norm(dw1))
if(abs(np.linalg.norm(dw2))>4.5):
dw1 = dw1*4.5/(np.linalg.norm(dw2))
return (J, dw1, dw2)
# Adam's Optimization technique for training w
def Train(w1, w2, maxIter=50):
# Algorithm
a, b1, b2, e = 0.001, 0.9, 0.999, 10**(-8)
V1 = np.zeros(w1.shape)
V2 = np.zeros(w2.shape)
S1 = np.zeros(w1.shape)
S2 = np.zeros(w2.shape)
for i in range(maxIter):
J, dw1, dw2 = Cost(X, y, w1, w2)
V1 = b1*V1 + (1-b1)*dw1
S1 = b2*S1 + (1-b2)*(dw1**2)
V2 = b1*V2 + (1-b1)*dw2
S2 = b2*S2 + (1-b2)*(dw2**2)
if i!=0:
V1 = V1/(1-b1**i)
S1 = S1/(1-b2**i)
V2 = V2/(1-b1**i)
S2 = S2/(1-b2**i)
w1 = w1 - a*V1/(np.sqrt(S1)+e)*dw1
w2 = w2 - a*V2/(np.sqrt(S2)+e)*dw2
print("tttIteration : ", i+1, " tCost : ", J)
return (w1, w2)
# Training Neural Network
w1, w2 = Train(w1,w2)
I'm using Adam's Optimization to converge Gradient Descent to a global minima but the cost is becoming stagnant (not changing) after around 15 iterations(the number is not fixed). The initial cost due to random initialization of weights is changing very minutely before becoming constant. And this is giving training accuracy from 45% to 70% for different runs of the exact same code. Can you help me with the reason behind this?
optimization gradient-descent loss-function
New contributor
Arka Patra is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
$endgroup$
add a comment |
$begingroup$
# Neural Network Architecture
no_hid_layers = 1
hid = 3
no_out = 1
# Xavier Ininitialization of weights w
w1 = np.random.randn(hid, n+1)*np.sqrt(2/(hid+n+1))
w2 = np.random.randn(no_out, hid+1)*np.sqrt(2/(no_out+hid+1))
# Sigmoid Activation Function
def g(x):
sig = 1/(1+np.exp(-x))
return sig
def frwrd_prop(X, w1, w2):
z2 = w1 @ X.T
z2 = norm(z2, axis=0)
a2 = np.insert(g(z2), 0, 1, axis=0)
h = g((w2@a2))
return (h,a2)
# Calculating Cost and Gradient
def Cost(X, y, w1, w2, lmbda=0):
# Initializing Cost J and Gradients dw
J = 0
dw1 = np.zeros(w1.shape)
dw2 = np.zeros(w2.shape)
# Forward Propagation to calculate the value of the output
h, a2 = frwrd_prop(X, w1, w2)
# Calculate the Cost Function J
J = -(np.sum(y.T*np.log(h) + (1-y).T*np.log(1-h)) - lmbda/2*(np.sum(np.sum(w1[:,1:].T@w1[:,1:])) + np.sum(w2[:,1:].T@w2[:,1:])))/m
# Applying Back Propagation to calculate the Gradients dw
D3 = h-y
D2 = (w2.T@D3)*a2*(1-a2)
dw1[:,0] = (D2[1:]@X)[:,0]/m
dw2[:,0] = (D3@a2.T)[:,0]/m
dw1[:, 1:] = ((D2[1:]@X)[:,1:] + lmbda*w1[:,1:])/m
dw2[:, 1:] = ((D3@a2.T)[:,1:] + lmbda*w2[:,1:])/m
# Gradient clipping
if(abs(np.linalg.norm(dw1))>4.5):
dw1 = dw1*4.5/(np.linalg.norm(dw1))
if(abs(np.linalg.norm(dw2))>4.5):
dw1 = dw1*4.5/(np.linalg.norm(dw2))
return (J, dw1, dw2)
# Adam's Optimization technique for training w
def Train(w1, w2, maxIter=50):
# Algorithm
a, b1, b2, e = 0.001, 0.9, 0.999, 10**(-8)
V1 = np.zeros(w1.shape)
V2 = np.zeros(w2.shape)
S1 = np.zeros(w1.shape)
S2 = np.zeros(w2.shape)
for i in range(maxIter):
J, dw1, dw2 = Cost(X, y, w1, w2)
V1 = b1*V1 + (1-b1)*dw1
S1 = b2*S1 + (1-b2)*(dw1**2)
V2 = b1*V2 + (1-b1)*dw2
S2 = b2*S2 + (1-b2)*(dw2**2)
if i!=0:
V1 = V1/(1-b1**i)
S1 = S1/(1-b2**i)
V2 = V2/(1-b1**i)
S2 = S2/(1-b2**i)
w1 = w1 - a*V1/(np.sqrt(S1)+e)*dw1
w2 = w2 - a*V2/(np.sqrt(S2)+e)*dw2
print("tttIteration : ", i+1, " tCost : ", J)
return (w1, w2)
# Training Neural Network
w1, w2 = Train(w1,w2)
I'm using Adam's Optimization to converge Gradient Descent to a global minima but the cost is becoming stagnant (not changing) after around 15 iterations(the number is not fixed). The initial cost due to random initialization of weights is changing very minutely before becoming constant. And this is giving training accuracy from 45% to 70% for different runs of the exact same code. Can you help me with the reason behind this?
optimization gradient-descent loss-function
New contributor
Arka Patra is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
$endgroup$
$begingroup$
Welcome to SE.DataScience! Adam and similar optimizers (Nestrov, Nadam, etc.) are all converging to a local minimum, no global optimum is guaranteed. This high variability could be due to (1) too much parameters, (2) too few training samples, (3) bugs in implementation, etc.. As you see, there are many causes for this symptom. You better provide an executable code with all theimports for a fast assessment.
$endgroup$
– Esmailian
4 hours ago
$begingroup$
@Esmailian Hello and Thank you. Is there any way to prevent the gradient from falling into local minima? I think Geoffrey Hinton produced a paper on that but I'm not sure which one. And if that's not possible how to resolve the issue? Besides few training examples or more features is an issue when overfitting but low training accuracy seems to be an issue of underfitting and doesn't the training accuracy be more for less number of features because the weights will adjust more accurately if there's less training example? P.S. I'm writing this in python and have only imported Pandas and NumPy.
$endgroup$
– Arka Patra
2 hours ago
$begingroup$
Is there any way to prevent the gradient from falling into local minima? No. One optimizer may perform better, but all fall into local minima. The high instability of accuracy cannot be attributed to over- or under-fitting surely yet. Please place a code that can be executed with no modification.
$endgroup$
– Esmailian
2 hours ago
$begingroup$
kaggle.com/starkark31/ann-titanic-survival/code Here's a link to the kernel. @Esmailian
$endgroup$
– Arka Patra
2 hours ago
add a comment |
$begingroup$
# Neural Network Architecture
no_hid_layers = 1
hid = 3
no_out = 1
# Xavier Ininitialization of weights w
w1 = np.random.randn(hid, n+1)*np.sqrt(2/(hid+n+1))
w2 = np.random.randn(no_out, hid+1)*np.sqrt(2/(no_out+hid+1))
# Sigmoid Activation Function
def g(x):
sig = 1/(1+np.exp(-x))
return sig
def frwrd_prop(X, w1, w2):
z2 = w1 @ X.T
z2 = norm(z2, axis=0)
a2 = np.insert(g(z2), 0, 1, axis=0)
h = g((w2@a2))
return (h,a2)
# Calculating Cost and Gradient
def Cost(X, y, w1, w2, lmbda=0):
# Initializing Cost J and Gradients dw
J = 0
dw1 = np.zeros(w1.shape)
dw2 = np.zeros(w2.shape)
# Forward Propagation to calculate the value of the output
h, a2 = frwrd_prop(X, w1, w2)
# Calculate the Cost Function J
J = -(np.sum(y.T*np.log(h) + (1-y).T*np.log(1-h)) - lmbda/2*(np.sum(np.sum(w1[:,1:].T@w1[:,1:])) + np.sum(w2[:,1:].T@w2[:,1:])))/m
# Applying Back Propagation to calculate the Gradients dw
D3 = h-y
D2 = (w2.T@D3)*a2*(1-a2)
dw1[:,0] = (D2[1:]@X)[:,0]/m
dw2[:,0] = (D3@a2.T)[:,0]/m
dw1[:, 1:] = ((D2[1:]@X)[:,1:] + lmbda*w1[:,1:])/m
dw2[:, 1:] = ((D3@a2.T)[:,1:] + lmbda*w2[:,1:])/m
# Gradient clipping
if(abs(np.linalg.norm(dw1))>4.5):
dw1 = dw1*4.5/(np.linalg.norm(dw1))
if(abs(np.linalg.norm(dw2))>4.5):
dw1 = dw1*4.5/(np.linalg.norm(dw2))
return (J, dw1, dw2)
# Adam's Optimization technique for training w
def Train(w1, w2, maxIter=50):
# Algorithm
a, b1, b2, e = 0.001, 0.9, 0.999, 10**(-8)
V1 = np.zeros(w1.shape)
V2 = np.zeros(w2.shape)
S1 = np.zeros(w1.shape)
S2 = np.zeros(w2.shape)
for i in range(maxIter):
J, dw1, dw2 = Cost(X, y, w1, w2)
V1 = b1*V1 + (1-b1)*dw1
S1 = b2*S1 + (1-b2)*(dw1**2)
V2 = b1*V2 + (1-b1)*dw2
S2 = b2*S2 + (1-b2)*(dw2**2)
if i!=0:
V1 = V1/(1-b1**i)
S1 = S1/(1-b2**i)
V2 = V2/(1-b1**i)
S2 = S2/(1-b2**i)
w1 = w1 - a*V1/(np.sqrt(S1)+e)*dw1
w2 = w2 - a*V2/(np.sqrt(S2)+e)*dw2
print("tttIteration : ", i+1, " tCost : ", J)
return (w1, w2)
# Training Neural Network
w1, w2 = Train(w1,w2)
I'm using Adam's Optimization to converge Gradient Descent to a global minima but the cost is becoming stagnant (not changing) after around 15 iterations(the number is not fixed). The initial cost due to random initialization of weights is changing very minutely before becoming constant. And this is giving training accuracy from 45% to 70% for different runs of the exact same code. Can you help me with the reason behind this?
optimization gradient-descent loss-function
New contributor
Arka Patra is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
$endgroup$
# Neural Network Architecture
no_hid_layers = 1
hid = 3
no_out = 1
# Xavier Ininitialization of weights w
w1 = np.random.randn(hid, n+1)*np.sqrt(2/(hid+n+1))
w2 = np.random.randn(no_out, hid+1)*np.sqrt(2/(no_out+hid+1))
# Sigmoid Activation Function
def g(x):
sig = 1/(1+np.exp(-x))
return sig
def frwrd_prop(X, w1, w2):
z2 = w1 @ X.T
z2 = norm(z2, axis=0)
a2 = np.insert(g(z2), 0, 1, axis=0)
h = g((w2@a2))
return (h,a2)
# Calculating Cost and Gradient
def Cost(X, y, w1, w2, lmbda=0):
# Initializing Cost J and Gradients dw
J = 0
dw1 = np.zeros(w1.shape)
dw2 = np.zeros(w2.shape)
# Forward Propagation to calculate the value of the output
h, a2 = frwrd_prop(X, w1, w2)
# Calculate the Cost Function J
J = -(np.sum(y.T*np.log(h) + (1-y).T*np.log(1-h)) - lmbda/2*(np.sum(np.sum(w1[:,1:].T@w1[:,1:])) + np.sum(w2[:,1:].T@w2[:,1:])))/m
# Applying Back Propagation to calculate the Gradients dw
D3 = h-y
D2 = (w2.T@D3)*a2*(1-a2)
dw1[:,0] = (D2[1:]@X)[:,0]/m
dw2[:,0] = (D3@a2.T)[:,0]/m
dw1[:, 1:] = ((D2[1:]@X)[:,1:] + lmbda*w1[:,1:])/m
dw2[:, 1:] = ((D3@a2.T)[:,1:] + lmbda*w2[:,1:])/m
# Gradient clipping
if(abs(np.linalg.norm(dw1))>4.5):
dw1 = dw1*4.5/(np.linalg.norm(dw1))
if(abs(np.linalg.norm(dw2))>4.5):
dw1 = dw1*4.5/(np.linalg.norm(dw2))
return (J, dw1, dw2)
# Adam's Optimization technique for training w
def Train(w1, w2, maxIter=50):
# Algorithm
a, b1, b2, e = 0.001, 0.9, 0.999, 10**(-8)
V1 = np.zeros(w1.shape)
V2 = np.zeros(w2.shape)
S1 = np.zeros(w1.shape)
S2 = np.zeros(w2.shape)
for i in range(maxIter):
J, dw1, dw2 = Cost(X, y, w1, w2)
V1 = b1*V1 + (1-b1)*dw1
S1 = b2*S1 + (1-b2)*(dw1**2)
V2 = b1*V2 + (1-b1)*dw2
S2 = b2*S2 + (1-b2)*(dw2**2)
if i!=0:
V1 = V1/(1-b1**i)
S1 = S1/(1-b2**i)
V2 = V2/(1-b1**i)
S2 = S2/(1-b2**i)
w1 = w1 - a*V1/(np.sqrt(S1)+e)*dw1
w2 = w2 - a*V2/(np.sqrt(S2)+e)*dw2
print("tttIteration : ", i+1, " tCost : ", J)
return (w1, w2)
# Training Neural Network
w1, w2 = Train(w1,w2)
I'm using Adam's Optimization to converge Gradient Descent to a global minima but the cost is becoming stagnant (not changing) after around 15 iterations(the number is not fixed). The initial cost due to random initialization of weights is changing very minutely before becoming constant. And this is giving training accuracy from 45% to 70% for different runs of the exact same code. Can you help me with the reason behind this?
optimization gradient-descent loss-function
optimization gradient-descent loss-function
New contributor
Arka Patra is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
New contributor
Arka Patra is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
New contributor
Arka Patra is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
asked 4 hours ago
Arka PatraArka Patra
62
62
New contributor
Arka Patra is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
New contributor
Arka Patra is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
Arka Patra is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.
$begingroup$
Welcome to SE.DataScience! Adam and similar optimizers (Nestrov, Nadam, etc.) are all converging to a local minimum, no global optimum is guaranteed. This high variability could be due to (1) too much parameters, (2) too few training samples, (3) bugs in implementation, etc.. As you see, there are many causes for this symptom. You better provide an executable code with all theimports for a fast assessment.
$endgroup$
– Esmailian
4 hours ago
$begingroup$
@Esmailian Hello and Thank you. Is there any way to prevent the gradient from falling into local minima? I think Geoffrey Hinton produced a paper on that but I'm not sure which one. And if that's not possible how to resolve the issue? Besides few training examples or more features is an issue when overfitting but low training accuracy seems to be an issue of underfitting and doesn't the training accuracy be more for less number of features because the weights will adjust more accurately if there's less training example? P.S. I'm writing this in python and have only imported Pandas and NumPy.
$endgroup$
– Arka Patra
2 hours ago
$begingroup$
Is there any way to prevent the gradient from falling into local minima? No. One optimizer may perform better, but all fall into local minima. The high instability of accuracy cannot be attributed to over- or under-fitting surely yet. Please place a code that can be executed with no modification.
$endgroup$
– Esmailian
2 hours ago
$begingroup$
kaggle.com/starkark31/ann-titanic-survival/code Here's a link to the kernel. @Esmailian
$endgroup$
– Arka Patra
2 hours ago
add a comment |
$begingroup$
Welcome to SE.DataScience! Adam and similar optimizers (Nestrov, Nadam, etc.) are all converging to a local minimum, no global optimum is guaranteed. This high variability could be due to (1) too much parameters, (2) too few training samples, (3) bugs in implementation, etc.. As you see, there are many causes for this symptom. You better provide an executable code with all theimports for a fast assessment.
$endgroup$
– Esmailian
4 hours ago
$begingroup$
@Esmailian Hello and Thank you. Is there any way to prevent the gradient from falling into local minima? I think Geoffrey Hinton produced a paper on that but I'm not sure which one. And if that's not possible how to resolve the issue? Besides few training examples or more features is an issue when overfitting but low training accuracy seems to be an issue of underfitting and doesn't the training accuracy be more for less number of features because the weights will adjust more accurately if there's less training example? P.S. I'm writing this in python and have only imported Pandas and NumPy.
$endgroup$
– Arka Patra
2 hours ago
$begingroup$
Is there any way to prevent the gradient from falling into local minima? No. One optimizer may perform better, but all fall into local minima. The high instability of accuracy cannot be attributed to over- or under-fitting surely yet. Please place a code that can be executed with no modification.
$endgroup$
– Esmailian
2 hours ago
$begingroup$
kaggle.com/starkark31/ann-titanic-survival/code Here's a link to the kernel. @Esmailian
$endgroup$
– Arka Patra
2 hours ago
$begingroup$
Welcome to SE.DataScience! Adam and similar optimizers (Nestrov, Nadam, etc.) are all converging to a local minimum, no global optimum is guaranteed. This high variability could be due to (1) too much parameters, (2) too few training samples, (3) bugs in implementation, etc.. As you see, there are many causes for this symptom. You better provide an executable code with all the
imports for a fast assessment.$endgroup$
– Esmailian
4 hours ago
$begingroup$
Welcome to SE.DataScience! Adam and similar optimizers (Nestrov, Nadam, etc.) are all converging to a local minimum, no global optimum is guaranteed. This high variability could be due to (1) too much parameters, (2) too few training samples, (3) bugs in implementation, etc.. As you see, there are many causes for this symptom. You better provide an executable code with all the
imports for a fast assessment.$endgroup$
– Esmailian
4 hours ago
$begingroup$
@Esmailian Hello and Thank you. Is there any way to prevent the gradient from falling into local minima? I think Geoffrey Hinton produced a paper on that but I'm not sure which one. And if that's not possible how to resolve the issue? Besides few training examples or more features is an issue when overfitting but low training accuracy seems to be an issue of underfitting and doesn't the training accuracy be more for less number of features because the weights will adjust more accurately if there's less training example? P.S. I'm writing this in python and have only imported Pandas and NumPy.
$endgroup$
– Arka Patra
2 hours ago
$begingroup$
@Esmailian Hello and Thank you. Is there any way to prevent the gradient from falling into local minima? I think Geoffrey Hinton produced a paper on that but I'm not sure which one. And if that's not possible how to resolve the issue? Besides few training examples or more features is an issue when overfitting but low training accuracy seems to be an issue of underfitting and doesn't the training accuracy be more for less number of features because the weights will adjust more accurately if there's less training example? P.S. I'm writing this in python and have only imported Pandas and NumPy.
$endgroup$
– Arka Patra
2 hours ago
$begingroup$
Is there any way to prevent the gradient from falling into local minima? No. One optimizer may perform better, but all fall into local minima. The high instability of accuracy cannot be attributed to over- or under-fitting surely yet. Please place a code that can be executed with no modification.
$endgroup$
– Esmailian
2 hours ago
$begingroup$
Is there any way to prevent the gradient from falling into local minima? No. One optimizer may perform better, but all fall into local minima. The high instability of accuracy cannot be attributed to over- or under-fitting surely yet. Please place a code that can be executed with no modification.
$endgroup$
– Esmailian
2 hours ago
$begingroup$
kaggle.com/starkark31/ann-titanic-survival/code Here's a link to the kernel. @Esmailian
$endgroup$
– Arka Patra
2 hours ago
$begingroup$
kaggle.com/starkark31/ann-titanic-survival/code Here's a link to the kernel. @Esmailian
$endgroup$
– Arka Patra
2 hours ago
add a comment |
0
active
oldest
votes
Your Answer
StackExchange.ifUsing("editor", function ()
return StackExchange.using("mathjaxEditing", function ()
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix)
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
);
);
, "mathjax-editing");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "557"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Arka Patra is a new contributor. Be nice, and check out our Code of Conduct.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f48825%2fis-adams-optimization-susceptible-to-local-minima%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
0
active
oldest
votes
0
active
oldest
votes
active
oldest
votes
active
oldest
votes
Arka Patra is a new contributor. Be nice, and check out our Code of Conduct.
Arka Patra is a new contributor. Be nice, and check out our Code of Conduct.
Arka Patra is a new contributor. Be nice, and check out our Code of Conduct.
Arka Patra is a new contributor. Be nice, and check out our Code of Conduct.
Thanks for contributing an answer to Data Science Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdatascience.stackexchange.com%2fquestions%2f48825%2fis-adams-optimization-susceptible-to-local-minima%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
$begingroup$
Welcome to SE.DataScience! Adam and similar optimizers (Nestrov, Nadam, etc.) are all converging to a local minimum, no global optimum is guaranteed. This high variability could be due to (1) too much parameters, (2) too few training samples, (3) bugs in implementation, etc.. As you see, there are many causes for this symptom. You better provide an executable code with all the
imports for a fast assessment.$endgroup$
– Esmailian
4 hours ago
$begingroup$
@Esmailian Hello and Thank you. Is there any way to prevent the gradient from falling into local minima? I think Geoffrey Hinton produced a paper on that but I'm not sure which one. And if that's not possible how to resolve the issue? Besides few training examples or more features is an issue when overfitting but low training accuracy seems to be an issue of underfitting and doesn't the training accuracy be more for less number of features because the weights will adjust more accurately if there's less training example? P.S. I'm writing this in python and have only imported Pandas and NumPy.
$endgroup$
– Arka Patra
2 hours ago
$begingroup$
Is there any way to prevent the gradient from falling into local minima? No. One optimizer may perform better, but all fall into local minima. The high instability of accuracy cannot be attributed to over- or under-fitting surely yet. Please place a code that can be executed with no modification.
$endgroup$
– Esmailian
2 hours ago
$begingroup$
kaggle.com/starkark31/ann-titanic-survival/code Here's a link to the kernel. @Esmailian
$endgroup$
– Arka Patra
2 hours ago