Is Adam's optimization susceptible to Local Minima?













$begingroup$


# Neural Network Architecture

import numpy as np

# Note: n, m, X, y and norm() are not defined in this excerpt.

no_hid_layers = 1
hid = 3
no_out = 1

# Xavier Initialization of weights w

w1 = np.random.randn(hid, n+1)*np.sqrt(2/(hid+n+1))
w2 = np.random.randn(no_out, hid+1)*np.sqrt(2/(no_out+hid+1))

# Sigmoid Activation Function
def g(x):
    sig = 1/(1+np.exp(-x))
    return sig

def frwrd_prop(X, w1, w2):
    z2 = w1 @ X.T
    z2 = norm(z2, axis=0)
    a2 = np.insert(g(z2), 0, 1, axis=0)
    h = g((w2@a2))
    return (h,a2)

# Calculating Cost and Gradient

def Cost(X, y, w1, w2, lmbda=0):
    # Initializing Cost J and Gradients dw
    J = 0
    dw1 = np.zeros(w1.shape)
    dw2 = np.zeros(w2.shape)
    # Forward Propagation to calculate the value of the output
    h, a2 = frwrd_prop(X, w1, w2)
    # Calculate the Cost Function J
    J = -(np.sum(y.T*np.log(h) + (1-y).T*np.log(1-h)) - lmbda/2*(np.sum(np.sum(w1[:,1:].T@w1[:,1:])) + np.sum(w2[:,1:].T@w2[:,1:])))/m
    # Applying Back Propagation to calculate the Gradients dw
    D3 = h-y
    D2 = (w2.T@D3)*a2*(1-a2)
    dw1[:,0] = (D2[1:]@X)[:,0]/m
    dw2[:,0] = (D3@a2.T)[:,0]/m
    dw1[:, 1:] = ((D2[1:]@X)[:,1:] + lmbda*w1[:,1:])/m
    dw2[:, 1:] = ((D3@a2.T)[:,1:] + lmbda*w2[:,1:])/m
    # Gradient clipping
    if(abs(np.linalg.norm(dw1))>4.5):
        dw1 = dw1*4.5/(np.linalg.norm(dw1))
    if(abs(np.linalg.norm(dw2))>4.5):
        dw1 = dw1*4.5/(np.linalg.norm(dw2))
    return (J, dw1, dw2)

# Adam's Optimization technique for training w

def Train(w1, w2, maxIter=50):
    # Algorithm
    a, b1, b2, e = 0.001, 0.9, 0.999, 10**(-8)
    V1 = np.zeros(w1.shape)
    V2 = np.zeros(w2.shape)
    S1 = np.zeros(w1.shape)
    S2 = np.zeros(w2.shape)
    for i in range(maxIter):
        J, dw1, dw2 = Cost(X, y, w1, w2)
        V1 = b1*V1 + (1-b1)*dw1
        S1 = b2*S1 + (1-b2)*(dw1**2)
        V2 = b1*V2 + (1-b1)*dw2
        S2 = b2*S2 + (1-b2)*(dw2**2)
        if i!=0:
            V1 = V1/(1-b1**i)
            S1 = S1/(1-b2**i)
            V2 = V2/(1-b1**i)
            S2 = S2/(1-b2**i)
        w1 = w1 - a*V1/(np.sqrt(S1)+e)*dw1
        w2 = w2 - a*V2/(np.sqrt(S2)+e)*dw2
        print("\t\t\tIteration : ", i+1, " \tCost : ", J)
    return (w1, w2)

# Training Neural Network

w1, w2 = Train(w1,w2)


I'm using Adam optimization to drive gradient descent toward a global minimum, but the cost becomes stagnant (stops changing) after around 15 iterations (the exact number varies). Starting from the cost given by the random weight initialization, it changes only very slightly before becoming constant. This gives a training accuracy anywhere from 45% to 70% across different runs of the exact same code. Can you help me understand the reason behind this?
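Editorial sketch, not part of the original post: the gradient-clipping step in `Cost` rescales an array when its norm exceeds 4.5, but the second branch as posted assigns to `dw1` using the norm of `dw2`. For comparison, a minimal version that clips each gradient array by its own norm could look like the following. The helper name `clip_by_norm` and the sample arrays are hypothetical; only the threshold 4.5 comes from the post.

```python
import numpy as np

def clip_by_norm(grad, max_norm=4.5):
    """Rescale grad so its Frobenius norm is at most max_norm (hypothetical helper)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        return grad * (max_norm / norm)
    return grad

# Made-up gradients for illustration only.
dw1 = np.array([[3.0, 4.0], [0.0, 12.0]])  # Frobenius norm 13, above the threshold
dw2 = np.array([[0.3, 0.4]])               # norm 0.5, below the threshold

dw1_c = clip_by_norm(dw1)  # rescaled to norm 4.5
dw2_c = clip_by_norm(dw2)  # returned unchanged
print(np.linalg.norm(dw1_c), np.linalg.norm(dw2_c))
```

Each array is clipped independently here, so a large `dw2` never changes `dw1`. Whether this difference contributes to the stagnating cost is not established by the post.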

















$endgroup$











  • $begingroup$
    Welcome to SE.DataScience! Adam and similar optimizers (Nesterov, Nadam, etc.) all converge to a local minimum; no global optimum is guaranteed. This high variability could be due to (1) too many parameters, (2) too few training samples, (3) bugs in the implementation, etc. As you see, there are many possible causes for this symptom. It would be best to provide executable code with all the imports for a fast assessment.
    $endgroup$
    – Esmailian
    4 hours ago
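Editorial sketch, not part of the original thread: since implementation bugs are listed as one possible cause, it may help to compare the posted `Train` loop against the usual textbook form of the Adam update. In that form the running averages are kept uncorrected between steps, the bias correction uses a step counter starting at 1, and the step is `lr * m_hat / (sqrt(v_hat) + eps)` with no extra multiplication by the gradient. The function name `adam_step` and the toy quadratic objective below are assumptions made for illustration.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count (hypothetical helper)."""
    m = b1 * m + (1 - b1) * grad       # running mean of gradients
    v = b2 * v + (1 - b2) * grad**2    # running mean of squared gradients
    m_hat = m / (1 - b1**t)            # bias-corrected copies, not stored back
    v_hat = v / (1 - b2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy objective for illustration: f(w) = sum((w - 3)^2), minimized at w = 3.
w = np.zeros(2)
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 5001):
    grad = 2 * (w - 3)
    w, m, v = adam_step(w, grad, m, v, t, lr=0.01)
print(w)  # ends close to 3 in both coordinates
```

The posted loop differs on each of these points: it overwrites `V1`, `S1`, `V2`, `S2` with their corrected values, uses the 0-based loop index in the correction, and multiplies the step by `dw1` or `dw2`. This sketch only shows the reference form; it does not establish which difference, if any, explains the behavior described in the question.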











  • $begingroup$
    @Esmailian Hello, and thank you. Is there any way to prevent gradient descent from falling into local minima? I think Geoffrey Hinton produced a paper on that, but I'm not sure which one. And if that's not possible, how can the issue be resolved? Besides, having few training examples or many features is an issue for overfitting, but low training accuracy seems to be an issue of underfitting. Shouldn't the training accuracy be higher with fewer features, since the weights will adjust more accurately if there are fewer training examples? P.S. I'm writing this in Python and have only imported Pandas and NumPy.
    $endgroup$
    – Arka Patra
    2 hours ago










  • $begingroup$
    "Is there any way to prevent the gradient from falling into local minima?" No. One optimizer may perform better than another, but all can fall into local minima. The high instability in accuracy cannot yet be attributed with certainty to over- or under-fitting. Please post code that can be executed with no modification.
    $endgroup$
    – Esmailian
    2 hours ago
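Editorial sketch, not part of the original thread: the point that gradient-based optimizers settle into whichever minimum their starting point leads to can be seen on a one-dimensional non-convex function. The example uses plain gradient descent rather than Adam to keep it short, and the function `f(x) = x^4 - 3x^2 + x`, the learning rate, and the starting points are all chosen for illustration.

```python
def f(x):
    """Toy non-convex objective with two minima (illustrative choice)."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    """Plain gradient descent from the starting point x0."""
    x = x0
    for _ in range(steps):
        x = x - lr * grad_f(x)
    return x

x_right = gradient_descent(2.0)   # settles in the shallower minimum, near x = 1.13
x_left = gradient_descent(-2.0)   # settles in the deeper minimum, near x = -1.30

print(x_right, f(x_right))
print(x_left, f(x_left))
```

Both runs use identical code and differ only in initialization, yet they finish at different costs. With randomly initialized weights, fixing the seed (for example with `np.random.seed`) makes a run repeatable, but it does not change which minimum that particular initialization reaches.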










  • $begingroup$
    kaggle.com/starkark31/ann-titanic-survival/code Here's a link to the kernel. @Esmailian
    $endgroup$
    – Arka Patra
    2 hours ago















optimization gradient-descent loss-function















asked 4 hours ago by Arka Patra













